* [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms
@ 2020-12-18 21:10 Megha Dey
  2020-12-18 21:10 ` [RFC V1 1/7] x86: Probe assembler capabilities for VAES and VPCLMULQDQ support Megha Dey
                   ` (7 more replies)
  0 siblings, 8 replies; 28+ messages in thread
From: Megha Dey @ 2020-12-18 21:10 UTC (permalink / raw)
  To: herbert, davem
  Cc: linux-crypto, linux-kernel, ravi.v.shankar, tim.c.chen,
	andi.kleen, dave.hansen, megha.dey, wajdi.k.feghali,
	greg.b.tucker, robert.a.kasten, rajendrakumar.chinnaiyan,
	tomasz.kantecki, ryan.d.saffores, ilya.albrekht, kyung.min.park,
	tony.luck, ira.weiny, x86

Optimize crypto algorithms using VPCLMULQDQ and VAES AVX512 instructions
(first implemented on Intel's Icelake client and Xeon CPUs).

These algorithms take advantage of the AVX512 registers to keep the CPU
busy and increase memory bandwidth utilization. They provide substantial
(2-10x) improvements over existing crypto algorithms when update data size
is greater than 128 bytes and do not have any significant impact when used
on small amounts of data.

However, these algorithms may also incur a frequency penalty and cause
collateral damage to other workloads running on the same core (co-scheduled
threads). These frequency drops are also known as bin drops, where 1 bin
drop is around 100 MHz. With the SPEC CPU and ffmpeg benchmarks, a 0-1 bin
drop (0-100 MHz) is observed on an Icelake desktop and 0-2 bin drops
(0-200 MHz) are observed on an Icelake server.

The AVX512 optimizations are disabled by default to avoid impact on other
workloads. In order to use these optimized algorithms:
1. At compile time:
   a. User must enable the CONFIG_CRYPTO_AVX512 option
   b. Toolchain (assembler) must support VAES or VPCLMULQDQ instructions
2. At run time:
   a. User must set the module parameter use_avx512 at boot time
   (<module_name>.use_avx512 = 1) or post boot using sysfs
   (echo 1 > /sys/module/<module_name>/parameters/use_avx512), except for
   aesni ctr and gcm, which require boot time initialization because of
   commit 0fbafd06bdde ("crypto: aesni - fix failing setkey for
   rfc4106-gcm-aesni"). The gating pattern is sketched below.
   b. Platform must support the VPCLMULQDQ or VAES features
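
For reference, the runtime gating in each glue module boils down to the
pattern below. This is a condensed sketch of the crct10dif glue change in
patch 2: the wrapper name crc_t10dif_update_arch() is made up for
illustration, the length/crypto_simd_usable() checks are omitted, and the
sketch assumes the existing includes of crct10dif-pclmul_glue.c.

static bool use_avx512;
module_param(use_avx512, bool, 0644);
MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");

static u16 crc_t10dif_update_arch(u16 crc, const u8 *data, unsigned int len)
{
	u16 res;

	/*
	 * Take the AVX512 path only when it was compiled in, the CPU
	 * advertises VPCLMULQDQ (masked off via disabled-features.h when
	 * the assembler cannot emit it) and the user has opted in.
	 */
	kernel_fpu_begin();
	if (IS_ENABLED(CONFIG_CRYPTO_CRCT10DIF_AVX512) &&
	    cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ) &&
	    use_avx512)
		res = crct10dif_pcl_avx512(crc, data, len);
	else
		res = crc_t10dif_pcl(crc, data, len);
	kernel_fpu_end();

	return res;
}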

N.B. It is unclear whether these coarse-grained controls (a global module
parameter) would meet all user needs. Perhaps some per-thread control might
be useful? Looking for guidance here.
 
Other implementations of these crypto algorithms are possible, which would
result in lower crypto performance but would not cause collateral damage
from frequency drops (AVX512L vs AVX512VL).

The following crypto algorithms are optimized using AVX512 registers:

1. "by16" implementation of T10 Data Integrity Field CRC16 (CRC T10 DIF)
   The "by16" means the main loop processes 256 bytes (16 * 16 bytes) at
   a time in CRC T10 DIF calculation. This algorithm is optimized using
   the VPCLMULQDQ instruction which is the encoded 512 bit version of
   PCLMULQDQ instruction. On an Icelake desktop, with constant frequency
   set, the "by16" CRC T10 DIF AVX512 optimization shows about 1.5X
   improvement when the bytes per update size is 1KB or above as measured
   by the tcrypt module.

2. GHASH computations with vectorized instructions
   The VPCLMULQDQ instruction is used to accelerate the most time-consuming
   part of GHASH, carry-less multiplication. For best parallelism and
   deeper out-of-order execution, the main loop of the code works on 16 x 16
   byte blocks at a time and performs reduction every 48 x 16 byte blocks
   (see the sketch after this list). Optimized GHASH computations show a 4x
   to 10x speedup when the bytes per update is 256B or above.

3. "by16" implementation of the AES CTR mode using VAES instructions
   "by16" means that 16 independent blocks (each 128 bits) can be ciphered
   simultaneously. On an Icelake desktop, with constant frequency set, the
   "by16" AES CTR mode shows about 2X improvement when the bytes per update
   size is 256B or above as measured by the tcrypt module.

4. AES GCM using VPCLMULQDQ instructions
   Using AVX512 registers, an average increase of 2X is observed when the
   bytes per update size is 256B or above, as measured by the tcrypt module.
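
The common thread behind the four algorithms above is batching: many
independent carry-less multiplies (or AES rounds) are issued across wide
ZMM registers and the comparatively expensive polynomial reduction is
deferred. A rough, plain-C sketch of the batched GHASH form only
(struct gf128, gf128_mul() and the assumption that the block count is a
multiple of 48 are simplifications for illustration, not kernel
interfaces):

/* 128-bit field element; gf128_mul() stands in for a full carry-less
 * multiply plus reduction over GF(2^128). */
struct gf128 { u64 lo, hi; };

struct gf128 gf128_mul(struct gf128 a, struct gf128 b);	/* assumed helper */

static struct gf128 gf128_xor(struct gf128 a, struct gf128 b)
{
	return (struct gf128){ a.lo ^ b.lo, a.hi ^ b.hi };
}

/*
 * h_pow[0] = H^48 ... h_pow[47] = H^1 (the 48 precomputed subkeys).
 * Serial GHASH:  Y_i = (Y_{i-1} ^ X_i) * H
 * Batched form:  Y   = (Y_old ^ X_1)*H^48 ^ X_2*H^47 ^ ... ^ X_48*H
 */
static struct gf128 ghash_batched(const struct gf128 h_pow[48],
				  const struct gf128 *x, int nblocks,
				  struct gf128 y)
{
	int i, j;

	for (i = 0; i < nblocks; i += 48) {
		/* fold the running digest into the first block of the batch */
		struct gf128 acc = gf128_mul(gf128_xor(y, x[i]), h_pow[0]);

		/*
		 * 47 further independent multiplies; the assembly issues
		 * these as 16-blocks-per-iteration VPCLMULQDQ operations
		 * and performs a single reduction per 48-block batch.
		 */
		for (j = 1; j < 48; j++)
			acc = gf128_xor(acc, gf128_mul(x[i + j], h_pow[j]));

		y = acc;
	}
	return y;
}

The "by16" CTR and CRC T10 DIF loops follow the same wide-batch structure,
with AES rounds or CRC folds in place of the GHASH multiplies.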

Patch 1 checks for assembler support for the VAES and VPCLMULQDQ instructions
Patch 2 introduces CRC T10 DIF calculation with VPCLMULQDQ instructions
Patch 3 introduces optimized GHASH computation with VPCLMULQDQ instructions
Patch 4 adds a new speed test for optimized GHASH computations
Patch 5 introduces the "by16" version of AES CTR mode using VAES instructions
Patch 6 fixes coding style in an existing if/else block
Patch 7 introduces the AES GCM mode using VPCLMULQDQ instructions

There is a complex sign-off chain in patches 2 and 3. The original
implementation (non-kernel) was done by Intel's IPsec team. Kyung Min Park
is the author of patch 2 and co-author of patch 3 along with me.

Also, most of this code is related to the crypto subsystem. The x86 mailing
list is copied here because of patch 1.

Cc: x86@kernel.org
Reviewed-by: Tony Luck <tony.luck@intel.com>

Kyung Min Park (3):
  crypto: crct10dif - Accelerated CRC T10 DIF with vectorized
    instruction
  crypto: ghash - Optimized GHASH computations
  crypto: tcrypt - Add speed test for optimized GHASH computations

Megha Dey (4):
  x86: Probe assembler capabilities for VAES and VPCLMULQDQ support
  crypto: aesni - AES CTR x86_64 "by16" AVX512 optimization
  crypto: aesni - fix coding style for if/else block
  crypto: aesni - AVX512 version of AESNI-GCM using VPCLMULQDQ

 arch/x86/Kconfig.assembler                   |   10 +
 arch/x86/crypto/Makefile                     |    4 +
 arch/x86/crypto/aes_ctrby16_avx512-x86_64.S  |  856 ++++++++++++
 arch/x86/crypto/aesni-intel_avx512-x86_64.S  | 1788 ++++++++++++++++++++++++++
 arch/x86/crypto/aesni-intel_glue.c           |  122 +-
 arch/x86/crypto/avx512_vaes_common.S         | 1633 +++++++++++++++++++++++
 arch/x86/crypto/crct10dif-avx512-asm_64.S    |  482 +++++++
 arch/x86/crypto/crct10dif-pclmul_glue.c      |   24 +-
 arch/x86/crypto/ghash-clmulni-intel_avx512.S |   68 +
 arch/x86/crypto/ghash-clmulni-intel_glue.c   |   39 +-
 arch/x86/include/asm/disabled-features.h     |   14 +-
 crypto/Kconfig                               |   59 +
 crypto/tcrypt.c                              |    5 +
 13 files changed, 5091 insertions(+), 13 deletions(-)
 create mode 100644 arch/x86/crypto/aes_ctrby16_avx512-x86_64.S
 create mode 100644 arch/x86/crypto/aesni-intel_avx512-x86_64.S
 create mode 100644 arch/x86/crypto/avx512_vaes_common.S
 create mode 100644 arch/x86/crypto/crct10dif-avx512-asm_64.S
 create mode 100644 arch/x86/crypto/ghash-clmulni-intel_avx512.S

-- 
2.7.4



* [RFC V1 1/7] x86: Probe assembler capabilities for VAES and VPCLMULQDQ support
  2020-12-18 21:10 [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms Megha Dey
@ 2020-12-18 21:10 ` Megha Dey
  2021-01-16 16:54   ` Ard Biesheuvel
  2020-12-18 21:10 ` [RFC V1 2/7] crypto: crct10dif - Accelerated CRC T10 DIF with vectorized instruction Megha Dey
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Megha Dey @ 2020-12-18 21:10 UTC (permalink / raw)
  To: herbert, davem
  Cc: linux-crypto, linux-kernel, ravi.v.shankar, tim.c.chen,
	andi.kleen, dave.hansen, megha.dey, wajdi.k.feghali,
	greg.b.tucker, robert.a.kasten, rajendrakumar.chinnaiyan,
	tomasz.kantecki, ryan.d.saffores, ilya.albrekht, kyung.min.park,
	tony.luck, ira.weiny, x86

This is a preparatory patch to introduce the optimized crypto algorithms
using AVX512 instructions, which require VAES and VPCLMULQDQ support.

Check for VAES and VPCLMULQDQ assembler support using AVX512 registers.

Cc: x86@kernel.org
Signed-off-by: Megha Dey <megha.dey@intel.com>
---
 arch/x86/Kconfig.assembler | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
index 26b8c08..9ea0bc8 100644
--- a/arch/x86/Kconfig.assembler
+++ b/arch/x86/Kconfig.assembler
@@ -1,6 +1,16 @@
 # SPDX-License-Identifier: GPL-2.0
 # Copyright (C) 2020 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
 
+config AS_VAES_AVX512
+	def_bool $(as-instr,vaesenc %zmm0$(comma)%zmm1$(comma)%zmm1) && 64BIT
+	help
+	  Supported by binutils >= 2.30 and LLVM integrated assembler
+
+config AS_VPCLMULQDQ
+	def_bool $(as-instr,vpclmulqdq \$0$(comma)%zmm2$(comma)%zmm6$(comma)%zmm4)
+	help
+	  Supported by binutils >= 2.30 and LLVM integrated assembler
+
 config AS_AVX512
 	def_bool $(as-instr,vpmovm2b %k1$(comma)%zmm5)
 	help
-- 
2.7.4



* [RFC V1 2/7] crypto: crct10dif - Accelerated CRC T10 DIF with vectorized instruction
  2020-12-18 21:10 [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms Megha Dey
  2020-12-18 21:10 ` [RFC V1 1/7] x86: Probe assembler capabilities for VAES and VPCLMULQDQ support Megha Dey
@ 2020-12-18 21:10 ` Megha Dey
  2021-01-16 17:00   ` Ard Biesheuvel
  2020-12-18 21:11 ` [RFC V1 3/7] crypto: ghash - Optimized GHASH computations Megha Dey
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Megha Dey @ 2020-12-18 21:10 UTC (permalink / raw)
  To: herbert, davem
  Cc: linux-crypto, linux-kernel, ravi.v.shankar, tim.c.chen,
	andi.kleen, dave.hansen, megha.dey, wajdi.k.feghali,
	greg.b.tucker, robert.a.kasten, rajendrakumar.chinnaiyan,
	tomasz.kantecki, ryan.d.saffores, ilya.albrekht, kyung.min.park,
	tony.luck, ira.weiny

From: Kyung Min Park <kyung.min.park@intel.com>

Update the crc_pcl function that calculates T10 Data Integrity Field
CRC16 (CRC T10 DIF) to use the VPCLMULQDQ instruction. VPCLMULQDQ with
AVX-512F adds an EVEX-encoded 512-bit version of the PCLMULQDQ instruction.
The advantage comes from packing multiples of 4 x 128 bits of data into the
AVX512 registers, reducing instruction latency.

The glue code in the crct10dif module overrides the existing PCLMULQDQ
version with the VPCLMULQDQ version when the following criteria are met:
At compile time:
1. CONFIG_CRYPTO_AVX512 is enabled
2. toolchain (assembler) supports VPCLMULQDQ instructions
At runtime:
1. VPCLMULQDQ and AVX512VL features are supported on a platform (currently
   only Icelake)
2. If compiled as built-in module, crct10dif_pclmul.use_avx512 is set at
   boot time or /sys/module/crct10dif_pclmul/parameters/use_avx512 is set
   to 1 after boot.
   If compiled as loadable module, use_avx512 module parameter must be set:
   modprobe crct10dif_pclmul use_avx512=1

A typical run of tcrypt with CRC T10 DIF calculation with the PCLMULQDQ
instruction and the VPCLMULQDQ instruction shows the following results:
For bytes per update >= 1KB, we see an average improvement of 46% (~1.4x).
For bytes per update < 1KB, we see an average improvement of 13%.
The test was performed on an Icelake-based platform with a constant
frequency set for the CPU.

Detailed results for a variety of block sizes and update sizes are in
the table below.

---------------------------------------------------------------------------
|            |            |         cycles/operation         |            |
|            |            |       (the lower the better)     |            |
|    byte    |   bytes    |----------------------------------| percentage |
|   blocks   | per update |   CRC T10 DIF  |  CRC T10 DIF    | loss/gain  |
|            |            | with PCLMULQDQ | with VPCLMULQDQ |            |
|------------|------------|----------------|-----------------|------------|
|      16    |     16     |        77      |        106      |   -27.0    |
|      64    |     16     |       411      |        390      |     5.4    |
|      64    |     64     |        71      |         85      |   -16.0    |
|     256    |     16     |      1224      |       1308      |    -6.4    |
|     256    |     64     |       393      |        407      |    -3.4    |
|     256    |    256     |        93      |         86      |     8.1    |
|    1024    |     16     |      4564      |       5020      |    -9.0    |
|    1024    |    256     |       486      |        475      |     2.3    |
|    1024    |   1024     |       221      |        148      |    49.3    |
|    2048    |     16     |      8945      |       9851      |    -9.1    |
|    2048    |    256     |       982      |        951      |     3.3    |
|    2048    |   1024     |       500      |        369      |    35.5    |
|    2048    |   2048     |       413      |        265      |    55.8    |
|    4096    |     16     |     17885      |      19351      |    -7.5    |
|    4096    |    256     |      1828      |       1713      |     6.7    |
|    4096    |   1024     |       968      |        805      |    20.0    |
|    4096    |   4096     |       739      |        475      |    55.6    |
|    8192    |     16     |     48339      |      41556      |    16.3    |
|    8192    |    256     |      3494      |       3342      |     4.5    |
|    8192    |   1024     |      1959      |       1462      |    34.0    |
|    8192    |   4096     |      1561      |       1036      |    50.7    |
|    8192    |   8192     |      1540      |       1004      |    53.4    |
---------------------------------------------------------------------------

This work was inspired by the CRC T10 DIF AVX512 optimization published
in Intel Intelligent Storage Acceleration Library.
https://github.com/intel/isa-l/blob/master/crc/crc16_t10dif_by16_10.asm

Co-developed-by: Greg Tucker <greg.b.tucker@intel.com>
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
Co-developed-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
Signed-off-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
Signed-off-by: Megha Dey <megha.dey@intel.com>
---
 arch/x86/crypto/Makefile                  |   1 +
 arch/x86/crypto/crct10dif-avx512-asm_64.S | 482 ++++++++++++++++++++++++++++++
 arch/x86/crypto/crct10dif-pclmul_glue.c   |  24 +-
 arch/x86/include/asm/disabled-features.h  |   8 +-
 crypto/Kconfig                            |  23 ++
 5 files changed, 535 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/crypto/crct10dif-avx512-asm_64.S

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index a31de0c..bf0b0fc 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -80,6 +80,7 @@ crc32-pclmul-y := crc32-pclmul_asm.o crc32-pclmul_glue.o
 
 obj-$(CONFIG_CRYPTO_CRCT10DIF_PCLMUL) += crct10dif-pclmul.o
 crct10dif-pclmul-y := crct10dif-pcl-asm_64.o crct10dif-pclmul_glue.o
+crct10dif-pclmul-$(CONFIG_CRYPTO_CRCT10DIF_AVX512) += crct10dif-avx512-asm_64.o
 
 obj-$(CONFIG_CRYPTO_POLY1305_X86_64) += poly1305-x86_64.o
 poly1305-x86_64-y := poly1305-x86_64-cryptogams.o poly1305_glue.o
diff --git a/arch/x86/crypto/crct10dif-avx512-asm_64.S b/arch/x86/crypto/crct10dif-avx512-asm_64.S
new file mode 100644
index 0000000..07c9371
--- /dev/null
+++ b/arch/x86/crypto/crct10dif-avx512-asm_64.S
@@ -0,0 +1,482 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* Copyright(c) 2020 Intel Corporation.
+ *
+ * Implement CRC T10 DIF calculation with AVX512 instructions. (x86_64)
+ *
+ * This is CRC T10 DIF calculation with AVX512 instructions. It requires
+ * the support of Intel(R) AVX512F and VPCLMULQDQ instructions.
+ */
+
+#include <linux/linkage.h>
+
+.text
+#define		init_crc	%edi
+#define		buf		%rsi
+#define		len		%rdx
+#define		VARIABLE_OFFSET 16*2+8
+
+/*
+ * u16 crct10dif_pcl_avx512(u16 init_crc, const u8 *buf, size_t len);
+ */
+.align 16
+SYM_FUNC_START(crct10dif_pcl_avx512)
+
+	shl		$16, init_crc
+	/*
+	 * The code flow is exactly same as a 32-bit CRC. The only difference
+	 * is before returning eax, we will shift it right 16 bits, to scale
+	 * back to 16 bits.
+	 */
+	sub		$(VARIABLE_OFFSET), %rsp
+
+	vbroadcasti32x4 SHUF_MASK(%rip), %zmm18
+
+	/* For sizes less than 256 bytes, we can't fold 256 bytes at a time. */
+	cmp		$256, len
+	jl		.less_than_256
+
+	/* load the initial crc value */
+	vmovd		init_crc, %xmm10
+
+	/*
+	 * crc value does not need to be byte-reflected, but it needs to be
+	 * moved to the high part of the register because data will be
+	 * byte-reflected and will align with initial crc at correct place.
+	 */
+	vpslldq		$12, %xmm10, %xmm10
+
+	/* receive the initial 64B data, xor the initial crc value. */
+	vmovdqu8	(buf), %zmm0
+	vmovdqu8	16*4(buf), %zmm4
+	vpshufb		%zmm18, %zmm0, %zmm0
+	vpshufb		%zmm18, %zmm4, %zmm4
+	vpxorq		%zmm10, %zmm0, %zmm0
+	vbroadcasti32x4	rk3(%rip), %zmm10
+
+	sub		$256, len
+	cmp		$256, len
+	jl		.fold_128_B_loop
+
+	vmovdqu8	16*8(buf), %zmm7
+	vmovdqu8	16*12(buf), %zmm8
+	vpshufb		%zmm18, %zmm7, %zmm7
+	vpshufb		%zmm18, %zmm8, %zmm8
+	vbroadcasti32x4 rk_1(%rip), %zmm16
+	sub		$256, len
+
+.fold_256_B_loop:
+	add		$256, buf
+	vmovdqu8	(buf), %zmm3
+	vpshufb		%zmm18, %zmm3, %zmm3
+	vpclmulqdq	$0x00, %zmm16, %zmm0, %zmm1
+	vpclmulqdq	$0x11, %zmm16, %zmm0, %zmm2
+	vpxorq		%zmm2, %zmm1, %zmm0
+	vpxorq		%zmm3, %zmm0, %zmm0
+
+	vmovdqu8	16*4(buf), %zmm9
+	vpshufb		%zmm18, %zmm9, %zmm9
+	vpclmulqdq	$0x00, %zmm16, %zmm4, %zmm5
+	vpclmulqdq	$0x11, %zmm16, %zmm4, %zmm6
+	vpxorq		%zmm6, %zmm5, %zmm4
+	vpxorq		%zmm9, %zmm4, %zmm4
+
+	vmovdqu8	16*8(buf), %zmm11
+	vpshufb		%zmm18, %zmm11, %zmm11
+	vpclmulqdq	$0x00, %zmm16, %zmm7, %zmm12
+	vpclmulqdq	$0x11, %zmm16, %zmm7, %zmm13
+	vpxorq		%zmm13, %zmm12, %zmm7
+	vpxorq		%zmm11, %zmm7, %zmm7
+
+	vmovdqu8	16*12(buf), %zmm17
+	vpshufb		%zmm18, %zmm17, %zmm17
+	vpclmulqdq	$0x00, %zmm16, %zmm8, %zmm14
+	vpclmulqdq	$0x11, %zmm16, %zmm8, %zmm15
+	vpxorq		%zmm15, %zmm14, %zmm8
+	vpxorq		%zmm17, %zmm8, %zmm8
+
+	sub		$256, len
+	jge		.fold_256_B_loop
+
+	/* Fold 256 into 128 */
+	add		$256, buf
+	vpclmulqdq	$0x00, %zmm10, %zmm0, %zmm1
+	vpclmulqdq	$0x11, %zmm10, %zmm0, %zmm2
+	vpternlogq	$0x96, %zmm2, %zmm1, %zmm7
+
+	vpclmulqdq	$0x00, %zmm10, %zmm4, %zmm5
+	vpclmulqdq	$0x11, %zmm10, %zmm4, %zmm6
+	vpternlogq	$0x96, %zmm6, %zmm5, %zmm8
+
+	vmovdqa32	%zmm7, %zmm0
+	vmovdqa32	%zmm8, %zmm4
+
+	add		$128, len
+	jmp		.fold_128_B_register
+
+	/*
+	 * At this section of the code, there is 128*x + y (0 <= y < 128) bytes
+	 * of buffer. The fold_128_B_loop will fold 128B at a time until we have
+	 * 128 + y Bytes of buffer.
+	 * Fold 128B at a time. This section of the code folds 8 xmm registers
+	 * in parallel.
+	 */
+.fold_128_B_loop:
+	add		$128, buf
+	vmovdqu8	(buf), %zmm8
+	vpshufb		%zmm18, %zmm8, %zmm8
+	vpclmulqdq	$0x00, %zmm10, %zmm0, %zmm2
+	vpclmulqdq	$0x11, %zmm10, %zmm0, %zmm1
+	vpxorq		%zmm1, %zmm2, %zmm0
+	vpxorq		%zmm8, %zmm0, %zmm0
+
+	vmovdqu8	16*4(buf), %zmm9
+	vpshufb		%zmm18, %zmm9, %zmm9
+	vpclmulqdq	$0x00, %zmm10, %zmm4, %zmm5
+	vpclmulqdq	$0x11, %zmm10, %zmm4, %zmm6
+	vpxorq		%zmm6, %zmm5, %zmm4
+	vpxorq		%zmm9, %zmm4, %zmm4
+
+	sub		$128, len
+	jge		.fold_128_B_loop
+
+	add		$128, buf
+
+	/*
+	 * At this point, the buffer pointer is pointing at the last y Bytes
+	 * of the buffer, where 0 <= y < 128. The 128B of folded data is in
+	 * 8 of the xmm registers: xmm0 - xmm7.
+	 */
+.fold_128_B_register:
+	/* fold the 8 128b parts into 1 xmm register with different constant. */
+	vmovdqu8	rk9(%rip), %zmm16
+	vmovdqu8	rk17(%rip), %zmm11
+	vpclmulqdq	$0x00, %zmm16, %zmm0, %zmm1
+	vpclmulqdq	$0x11, %zmm16, %zmm0, %zmm2
+	vextracti64x2	$3, %zmm4, %xmm7
+
+	vpclmulqdq	$0x00, %zmm11, %zmm4, %zmm5
+	vpclmulqdq	$0x11, %zmm11, %zmm4, %zmm6
+	vmovdqa		rk1(%rip), %xmm10
+	vpternlogq	$0x96, %zmm5, %zmm2, %zmm1
+	vpternlogq	$0x96, %zmm7, %zmm6, %zmm1
+
+	vshufi64x2      $0x4e, %zmm1, %zmm1, %zmm8
+	vpxorq          %ymm1, %ymm8, %ymm8
+	vextracti64x2   $1, %ymm8, %xmm5
+	vpxorq          %xmm8, %xmm5, %xmm7
+
+	/*
+	 * Instead of 128, we add 128 - 16 to the loop counter to save one
+	 * instruction from the loop. Instead of a cmp instruction, we use
+	 * the negative flag with the jl instruction.
+	 */
+	add		$(128 - 16), len
+	jl		.final_reduction_for_128
+
+	/*
+	 * Now we have 16 + y bytes left to reduce. 16 bytes are in register
+	 * xmm7 and the rest is in memory. If y >= 16, we can fold 16 bytes
+	 * at a time; continue folding 16B at a time.
+	 */
+.16B_reduction_loop:
+	vpclmulqdq	$0x11, %xmm10, %xmm7, %xmm8
+	vpclmulqdq	$0x00, %xmm10, %xmm7, %xmm7
+	vpxor		%xmm8, %xmm7, %xmm7
+	vmovdqu		(buf), %xmm0
+	vpshufb		%xmm18, %xmm0, %xmm0
+	vpxor		%xmm0, %xmm7, %xmm7
+	add		$16, buf
+	sub		$16, len
+
+	/*
+	 * Instead of a cmp instruction, we utilize the flags with the jge
+	 * instruction equivalent of: cmp len, 16-16. Check if there is any
+	 * more 16B in the buffer to be able to fold.
+	 */
+	jge		.16B_reduction_loop
+
+	/*
+	 * now we have 16+z bytes left to reduce, where 0 <= z < 16.
+	 * first, we reduce the data in the xmm7 register.
+	 */
+.final_reduction_for_128:
+	add		$16, len
+	je		.128_done
+
+	/*
+	 * Here we are handling data that is less than 16 bytes. Since we know
+	 * that there was data before the pointer, we can offset the input
+	 * pointer back so that exactly 16 bytes are loaded.
+	 * After that, the registers need to be adjusted.
+	 */
+.get_last_two_xmms:
+	vmovdqa		%xmm7, %xmm2
+	vmovdqu		-16(buf, len), %xmm1
+	vpshufb		%xmm18, %xmm1, %xmm1
+
+	/*
+	 * get rid of the extra data that was loaded before.
+	 * load the shift constant
+	 */
+	lea		16 + pshufb_shf_table(%rip), %rax
+	sub		len, %rax
+	vmovdqu		(%rax), %xmm0
+
+	vpshufb		%xmm0, %xmm2, %xmm2
+	vpxor		mask1(%rip), %xmm0, %xmm0
+	vpshufb		%xmm0, %xmm7, %xmm7
+	vpblendvb	%xmm0, %xmm2, %xmm1, %xmm1
+
+	vpclmulqdq	$0x11, %xmm10, %xmm7, %xmm8
+	vpclmulqdq	$0x00, %xmm10, %xmm7, %xmm7
+	vpxor		%xmm8, %xmm7, %xmm7
+	vpxor		%xmm1, %xmm7, %xmm7
+
+.128_done:
+	/* compute crc of a 128-bit value. */
+	vmovdqa		rk5(%rip), %xmm10
+	vmovdqa		%xmm7, %xmm0
+
+	vpclmulqdq	$0x01, %xmm10, %xmm7, %xmm7
+	vpslldq		$8, %xmm0, %xmm0
+	vpxor		%xmm0, %xmm7, %xmm7
+
+	vmovdqa		%xmm7, %xmm0
+	vpand		mask2(%rip), %xmm0, %xmm0
+	vpsrldq		$12, %xmm7, %xmm7
+	vpclmulqdq	$0x10, %xmm10, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+
+	/* barrett reduction */
+.barrett:
+	vmovdqa		rk7(%rip), %xmm10
+	vmovdqa		%xmm7, %xmm0
+	vpclmulqdq	$0x01, %xmm10, %xmm7, %xmm7
+	vpslldq		$4, %xmm7, %xmm7
+	vpclmulqdq	$0x11, %xmm10, %xmm7, %xmm7
+
+	vpslldq		$4, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+	vpextrd		$1, %xmm7, %eax
+
+.cleanup:
+	/* scale the result back to 16 bits. */
+	shr		$16, %eax
+	add		$(VARIABLE_OFFSET), %rsp
+	ret
+
+.align 16
+.less_than_256:
+	/* check if there is enough buffer to be able to fold 16B at a time. */
+	cmp		$32, len
+	jl		.less_than_32
+
+	/* If there is, load the constants. */
+	vmovdqa		rk1(%rip), %xmm10
+
+	/*
+	 * get the initial crc value and align it to its correct place.
+	 * And load the plaintext and byte-reflect it.
+	 */
+	vmovd		init_crc, %xmm0
+	vpslldq		$12, %xmm0, %xmm0
+	vmovdqu		(buf), %xmm7
+	vpshufb		%xmm18, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+
+	/* update the buffer pointer */
+	add		$16, buf
+
+	/* subtract 32 instead of 16 to save one instruction from the loop */
+	sub		$32, len
+
+	jmp		.16B_reduction_loop
+
+.align 16
+.less_than_32:
+	/*
+	 * mov initial crc to the return value. This is necessary for
+	 * zero-length buffers.
+	 */
+	mov		init_crc, %eax
+	test		len, len
+	je		.cleanup
+
+	vmovd		init_crc, %xmm0
+	vpslldq		$12, %xmm0, %xmm0
+
+	cmp		$16, len
+	je		.exact_16_left
+	jl		.less_than_16_left
+
+	vmovdqu		(buf), %xmm7
+	vpshufb		%xmm18, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+	add		$16, buf
+	sub		$16, len
+	vmovdqa		rk1(%rip), %xmm10
+	jmp		.get_last_two_xmms
+
+.align 16
+.less_than_16_left:
+	/*
+	 * use stack space to load data less than 16 bytes, zero-out the 16B
+	 * in the memory first.
+	 */
+	vpxor		%xmm1, %xmm1, %xmm1
+	mov		%rsp, %r11
+	vmovdqa		%xmm1, (%r11)
+
+	cmp		$4, len
+	jl		.only_less_than_4
+
+	mov		len, %r9
+	cmp		$8, len
+	jl		.less_than_8_left
+
+	mov		(buf), %rax
+	mov		%rax, (%r11)
+	add		$8, %r11
+	sub		$8, len
+	add		$8, buf
+.less_than_8_left:
+	cmp		$4, len
+	jl		.less_than_4_left
+
+	mov		(buf), %eax
+	mov		%eax, (%r11)
+	add		$4, %r11
+	sub		$4, len
+	add		$4, buf
+
+.less_than_4_left:
+	cmp		$2, len
+	jl		.less_than_2_left
+
+	mov		(buf), %ax
+	mov		%ax, (%r11)
+	add		$2, %r11
+	sub		$2, len
+	add		$2, buf
+.less_than_2_left:
+	cmp		$1, len
+	jl		.zero_left
+
+	mov		(buf), %al
+	mov		%al, (%r11)
+
+.zero_left:
+	vmovdqa		(%rsp), %xmm7
+	vpshufb		%xmm18, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+
+	lea		16 + pshufb_shf_table(%rip), %rax
+	sub		%r9, %rax
+	vmovdqu		(%rax), %xmm0
+	vpxor		mask1(%rip), %xmm0, %xmm0
+
+	vpshufb		%xmm0,%xmm7, %xmm7
+	jmp		.128_done
+
+.align 16
+.exact_16_left:
+	vmovdqu		(buf), %xmm7
+	vpshufb		%xmm18, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+	jmp		.128_done
+
+.only_less_than_4:
+	cmp		$3, len
+	jl		.only_less_than_3
+
+	mov		(buf), %al
+	mov		%al, (%r11)
+
+	mov		1(buf), %al
+	mov		%al, 1(%r11)
+
+	mov		2(buf), %al
+	mov		%al, 2(%r11)
+
+	vmovdqa		(%rsp), %xmm7
+	vpshufb		%xmm18, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+
+	vpsrldq		$5, %xmm7, %xmm7
+	jmp		.barrett
+
+.only_less_than_3:
+	cmp		$2, len
+	jl		.only_less_than_2
+
+	mov		(buf), %al
+	mov		%al, (%r11)
+
+	mov		1(buf), %al
+	mov		%al, 1(%r11)
+
+	vmovdqa		(%rsp), %xmm7
+	vpshufb		%xmm18, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+
+	vpsrldq		$6, %xmm7, %xmm7
+	jmp		.barrett
+
+.only_less_than_2:
+	mov		(buf), %al
+	mov		%al, (%r11)
+
+	vmovdqa		(%rsp), %xmm7
+	vpshufb		%xmm18, %xmm7, %xmm7
+	vpxor		%xmm0, %xmm7, %xmm7
+
+	vpsrldq		$7, %xmm7, %xmm7
+	jmp		.barrett
+SYM_FUNC_END(crct10dif_pcl_avx512)
+
+.section        .data
+.align 32
+rk_1:		.quad 0xdccf000000000000
+rk_2:		.quad 0x4b0b000000000000
+rk1:		.quad 0x2d56000000000000
+rk2:		.quad 0x06df000000000000
+rk3:		.quad 0x9d9d000000000000
+rk4:		.quad 0x7cf5000000000000
+rk5:		.quad 0x2d56000000000000
+rk6:		.quad 0x1368000000000000
+rk7:		.quad 0x00000001f65a57f8
+rk8:		.quad 0x000000018bb70000
+rk9:		.quad 0xceae000000000000
+rk10:		.quad 0xbfd6000000000000
+rk11:		.quad 0x1e16000000000000
+rk12:		.quad 0x713c000000000000
+rk13:		.quad 0xf7f9000000000000
+rk14:		.quad 0x80a6000000000000
+rk15:		.quad 0x044c000000000000
+rk16:		.quad 0xe658000000000000
+rk17:		.quad 0xad18000000000000
+rk18:		.quad 0xa497000000000000
+rk19:		.quad 0x6ee3000000000000
+rk20:		.quad 0xe7b5000000000000
+rk_1b:		.quad 0x2d56000000000000
+rk_2b:		.quad 0x06df000000000000
+		.quad 0x0000000000000000
+		.quad 0x0000000000000000
+
+.align 16
+mask1:
+	.octa	0x80808080808080808080808080808080
+
+.align 16
+mask2:
+	.octa	0x00000000FFFFFFFFFFFFFFFFFFFFFFFF
+
+.align 16
+SHUF_MASK:
+	.octa	0x000102030405060708090A0B0C0D0E0F
+
+.align 16
+pshufb_shf_table:	.octa 0x8f8e8d8c8b8a89888786858483828100
+			.octa 0x000e0d0c0b0a09080706050403020100
+			.octa 0x0f0e0d0c0b0a09088080808080808080
+			.octa 0x80808080808080808080808080808080
diff --git a/arch/x86/crypto/crct10dif-pclmul_glue.c b/arch/x86/crypto/crct10dif-pclmul_glue.c
index 71291d5a..26a6350 100644
--- a/arch/x86/crypto/crct10dif-pclmul_glue.c
+++ b/arch/x86/crypto/crct10dif-pclmul_glue.c
@@ -35,6 +35,16 @@
 #include <asm/simd.h>
 
 asmlinkage u16 crc_t10dif_pcl(u16 init_crc, const u8 *buf, size_t len);
+#ifdef CONFIG_CRYPTO_CRCT10DIF_AVX512
+asmlinkage u16 crct10dif_pcl_avx512(u16 init_crc, const u8 *buf, size_t len);
+#else
+static u16 crct10dif_pcl_avx512(u16 init_crc, const u8 *buf, size_t len)
+{ return 0; }
+#endif
+
+static bool use_avx512;
+module_param(use_avx512, bool, 0644);
+MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");
 
 struct chksum_desc_ctx {
 	__u16 crc;
@@ -56,7 +66,12 @@ static int chksum_update(struct shash_desc *desc, const u8 *data,
 
 	if (length >= 16 && crypto_simd_usable()) {
 		kernel_fpu_begin();
-		ctx->crc = crc_t10dif_pcl(ctx->crc, data, length);
+		if (IS_ENABLED(CONFIG_CRYPTO_CRCT10DIF_AVX512) &&
+		    cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ) &&
+		    use_avx512)
+			ctx->crc = crct10dif_pcl_avx512(ctx->crc, data, length);
+		else
+			ctx->crc = crc_t10dif_pcl(ctx->crc, data, length);
 		kernel_fpu_end();
 	} else
 		ctx->crc = crc_t10dif_generic(ctx->crc, data, length);
@@ -75,7 +90,12 @@ static int __chksum_finup(__u16 crc, const u8 *data, unsigned int len, u8 *out)
 {
 	if (len >= 16 && crypto_simd_usable()) {
 		kernel_fpu_begin();
-		*(__u16 *)out = crc_t10dif_pcl(crc, data, len);
+		if (IS_ENABLED(CONFIG_CRYPTO_CRCT10DIF_AVX512) &&
+		    cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ) &&
+		    use_avx512)
+			*(__u16 *)out = crct10dif_pcl_avx512(crc, data, len);
+		else
+			*(__u16 *)out = crc_t10dif_pcl(crc, data, len);
 		kernel_fpu_end();
 	} else
 		*(__u16 *)out = crc_t10dif_generic(crc, data, len);
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 5861d34..1192dea 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -56,6 +56,12 @@
 # define DISABLE_PTI		(1 << (X86_FEATURE_PTI & 31))
 #endif
 
+#if defined(CONFIG_AS_VPCLMULQDQ)
+# define DISABLE_VPCLMULQDQ	0
+#else
+# define DISABLE_VPCLMULQDQ	(1 << (X86_FEATURE_VPCLMULQDQ & 31))
+#endif
+
 #ifdef CONFIG_IOMMU_SUPPORT
 # define DISABLE_ENQCMD	0
 #else
@@ -82,7 +88,7 @@
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
-			 DISABLE_ENQCMD)
+			 DISABLE_ENQCMD|DISABLE_VPCLMULQDQ)
 #define DISABLED_MASK17	0
 #define DISABLED_MASK18	0
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
diff --git a/crypto/Kconfig b/crypto/Kconfig
index a367fcf..b090f14 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -613,6 +613,29 @@ config CRYPTO_CRC32C_VPMSUM
 	  (vpmsum) instructions, introduced in POWER8. Enable on POWER8
 	  and newer processors for improved performance.
 
+config CRYPTO_AVX512
+	bool "AVX512 hardware acceleration for crypto algorithms"
+	depends on X86
+	depends on 64BIT
+	help
+	  This option will compile in AVX512 hardware accelerated crypto
+	  algorithms. These optimized algorithms provide substantial (2-10x)
+	  improvements over the existing crypto algorithms for large data
+	  sizes. However, they may also incur a frequency penalty
+	  (a.k.a. "bin drops") and cause collateral damage to other
+	  workloads running on the same core.
+
+# CRYPTO_CRCT10DIF_AVX512 defaults to Y but depends on CRYPTO_AVX512 so
+# that a single option (CRYPTO_AVX512) can enable multiple algorithms when
+# they are supported. Specifically, if the platform and/or toolchain does
+# not support VPCLMULQDQ, this algorithm should not be enabled as part of
+# the set that CRYPTO_AVX512 selects.
+config CRYPTO_CRCT10DIF_AVX512
+	bool
+	default y
+	depends on CRYPTO_AVX512
+	depends on CRYPTO_CRCT10DIF_PCLMUL
+	depends on AS_VPCLMULQDQ
 
 config CRYPTO_CRC32C_SPARC64
 	tristate "CRC32c CRC algorithm (SPARC64)"
-- 
2.7.4



* [RFC V1 3/7] crypto: ghash - Optimized GHASH computations
  2020-12-18 21:10 [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms Megha Dey
  2020-12-18 21:10 ` [RFC V1 1/7] x86: Probe assembler capabilities for VAES and VPCLMULQDQ support Megha Dey
  2020-12-18 21:10 ` [RFC V1 2/7] crypto: crct10dif - Accelerated CRC T10 DIF with vectorized instruction Megha Dey
@ 2020-12-18 21:11 ` Megha Dey
  2020-12-19 17:03   ` Ard Biesheuvel
  2020-12-18 21:11 ` [RFC V1 4/7] crypto: tcrypt - Add speed test for optimized " Megha Dey
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Megha Dey @ 2020-12-18 21:11 UTC (permalink / raw)
  To: herbert, davem
  Cc: linux-crypto, linux-kernel, ravi.v.shankar, tim.c.chen,
	andi.kleen, dave.hansen, megha.dey, wajdi.k.feghali,
	greg.b.tucker, robert.a.kasten, rajendrakumar.chinnaiyan,
	tomasz.kantecki, ryan.d.saffores, ilya.albrekht, kyung.min.park,
	tony.luck, ira.weiny

From: Kyung Min Park <kyung.min.park@intel.com>

Optimize GHASH computations with the 512-bit wide VPCLMULQDQ instruction.
The new instruction allows working on 4 x 16 byte blocks at a time.
For best parallelism and deeper out-of-order execution, the main loop of
the code works on 16 x 16 byte blocks at a time and performs reduction
every 48 x 16 byte blocks. Such an approach needs 48 precomputed GHASH
subkeys, and the precompute operation has been optimized as well to
leverage 512-bit registers, parallel carry-less multiply and reduction.
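
For reference, the subkey schedule amounts to repeated GHASH multiplications
by H. A plain-C sketch (not the PRECOMPUTE macro itself; struct gf128 and
gf128_mul() are hypothetical stand-ins for the VPCLMULQDQ sequences, as in
the sketch in the cover letter):

/* h_pow[0] = H^48 ... h_pow[47] = H^1, matching the HashKey_48..HashKey_1
 * slots filled by PRECOMPUTE in avx512_vaes_common.S. */
static void ghash_precompute_subkeys(struct gf128 h_pow[48], struct gf128 h)
{
	int i;

	h_pow[47] = h;				/* H^1 */
	for (i = 46; i >= 0; i--)		/* H^2 ... H^48 */
		h_pow[i] = gf128_mul(h_pow[i + 1], h);
}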

The VPCLMULQDQ instruction is used to accelerate the most time-consuming
part of GHASH, carry-less multiplication. VPCLMULQDQ with AVX-512F adds
an EVEX-encoded 512-bit version of the PCLMULQDQ instruction.

The glue code in the ghash_clmulni_intel module overrides the existing
PCLMULQDQ version with the VPCLMULQDQ version when the following criteria
are met:
At compile time:
1. CONFIG_CRYPTO_AVX512 is enabled
2. toolchain (assembler) supports VPCLMULQDQ instructions
At runtime:
1. VPCLMULQDQ and AVX512VL features are supported on a platform (currently
   only Icelake)
2. If compiled as built-in module, ghash_clmulni_intel.use_avx512 is set at
   boot time or /sys/module/ghash_clmulni_intel/parameters/use_avx512 is set
   to 1 after boot.
   If compiled as loadable module, use_avx512 module parameter must be set:
   modprobe ghash_clmulni_intel use_avx512=1

With the new implementation, the tcrypt GHASH speed test shows about a 4x
to 10x speedup for GHASH calculation compared to the original
implementation with PCLMULQDQ when the bytes per update size is 256 bytes
or above. Detailed results for a variety of block sizes and update
sizes are in the table below. The test was performed on an Icelake-based
platform with a constant frequency set for the CPU.

The average performance improvement of the AVX512 version over the current
implementation is as follows:
For bytes per update >= 1KB, we see an average improvement of 882% (~8.8x).
For bytes per update < 1KB, we see an average improvement of 370% (~3.7x).

A typical run of tcrypt with GHASH calculation with PCLMULQDQ instruction
and VPCLMULQDQ instruction shows the following results.

---------------------------------------------------------------------------
|            |            |         cycles/operation         |            |
|            |            |       (the lower the better)     |            |
|    byte    |   bytes    |----------------------------------| percentage |
|   blocks   | per update |   GHASH test   |   GHASH test    | loss/gain  |
|            |            | with PCLMULQDQ | with VPCLMULQDQ |            |
|------------|------------|----------------|-----------------|------------|
|      16    |     16     |       144      |        233      |   -38.0    |
|      64    |     16     |       535      |        709      |   -24.5    |
|      64    |     64     |       210      |        146      |    43.8    |
|     256    |     16     |      1808      |       1911      |    -5.4    |
|     256    |     64     |       865      |        581      |    48.9    |
|     256    |    256     |       682      |        170      |   301.0    |
|    1024    |     16     |      6746      |       6935      |    -2.7    |
|    1024    |    256     |      2829      |        714      |   296.0    |
|    1024    |   1024     |      2543      |        341      |   645.0    |
|    2048    |     16     |     13219      |      13403      |    -1.3    |
|    2048    |    256     |      5435      |       1408      |   286.0    |
|    2048    |   1024     |      5218      |        685      |   661.0    |
|    2048    |   2048     |      5061      |        565      |   796.0    |
|    4096    |     16     |     40793      |      27615      |    47.8    |
|    4096    |    256     |     10662      |       2689      |   297.0    |
|    4096    |   1024     |     10196      |       1333      |   665.0    |
|    4096    |   4096     |     10049      |       1011      |   894.0    |
|    8192    |     16     |     51672      |      54599      |    -5.3    |
|    8192    |    256     |     21228      |       5284      |   301.0    |
|    8192    |   1024     |     20306      |       2556      |   694.0    |
|    8192    |   4096     |     20076      |       2044      |   882.0    |
|    8192    |   8192     |     20071      |       2017      |   895.0    |
---------------------------------------------------------------------------

This work was inspired by the AES GCM mode optimization published
in Intel Optimized IPSEC Cryptographic library.
https://github.com/intel/intel-ipsec-mb/lib/avx512/gcm_vaes_avx512.asm

Co-developed-by: Greg Tucker <greg.b.tucker@intel.com>
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
Co-developed-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
Signed-off-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
Co-developed-by: Megha Dey <megha.dey@intel.com>
Signed-off-by: Megha Dey <megha.dey@intel.com>
---
 arch/x86/crypto/Makefile                     |    1 +
 arch/x86/crypto/avx512_vaes_common.S         | 1211 ++++++++++++++++++++++++++
 arch/x86/crypto/ghash-clmulni-intel_avx512.S |   68 ++
 arch/x86/crypto/ghash-clmulni-intel_glue.c   |   39 +-
 crypto/Kconfig                               |   12 +
 5 files changed, 1329 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/crypto/avx512_vaes_common.S
 create mode 100644 arch/x86/crypto/ghash-clmulni-intel_avx512.S

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index bf0b0fc..0a86cfb 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -70,6 +70,7 @@ blake2s-x86_64-y := blake2s-core.o blake2s-glue.o
 
 obj-$(CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL) += ghash-clmulni-intel.o
 ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o
+ghash-clmulni-intel-$(CONFIG_CRYPTO_GHASH_CLMUL_NI_AVX512) += ghash-clmulni-intel_avx512.o
 
 obj-$(CONFIG_CRYPTO_CRC32C_INTEL) += crc32c-intel.o
 crc32c-intel-y := crc32c-intel_glue.o
diff --git a/arch/x86/crypto/avx512_vaes_common.S b/arch/x86/crypto/avx512_vaes_common.S
new file mode 100644
index 0000000..f3ee898
--- /dev/null
+++ b/arch/x86/crypto/avx512_vaes_common.S
@@ -0,0 +1,1211 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright © 2020 Intel Corporation.
+ *
+ * Collection of macros which can be used by any crypto code using VAES,
+ * VPCLMULQDQ and AVX512 optimizations.
+ */
+
+#include <linux/linkage.h>
+#include <asm/inst.h>
+
+#define zmm31y ymm31
+#define zmm30y ymm30
+#define zmm29y ymm29
+#define zmm28y ymm28
+#define zmm27y ymm27
+#define zmm26y ymm26
+#define zmm25y ymm25
+#define zmm24y ymm24
+#define zmm23y ymm23
+#define zmm22y ymm22
+#define zmm21y ymm21
+#define zmm20y ymm20
+#define zmm19y ymm19
+#define zmm18y ymm18
+#define zmm17y ymm17
+#define zmm16y ymm16
+#define zmm15y ymm15
+#define zmm13y ymm13
+#define zmm12y ymm12
+#define zmm11y ymm11
+#define zmm10y ymm10
+#define zmm9y  ymm9
+#define zmm8y  ymm8
+#define zmm7y  ymm7
+#define zmm6y  ymm6
+#define zmm5y  ymm5
+#define zmm4y  ymm4
+#define zmm3y  ymm3
+#define zmm2y  ymm2
+#define zmm1y  ymm1
+#define zmm0y  ymm0
+
+#define zmm31x xmm31
+#define zmm30x xmm30
+#define zmm29x xmm29
+#define zmm28x xmm28
+#define zmm27x xmm27
+#define zmm26x xmm26
+#define zmm25x xmm25
+#define zmm24x xmm24
+#define zmm23x xmm23
+#define zmm22x xmm22
+#define zmm21x xmm21
+#define zmm20x xmm20
+#define zmm19x xmm19
+#define zmm18x xmm18
+#define zmm17x xmm17
+#define zmm16x xmm16
+#define zmm15x xmm15
+#define zmm14x xmm14
+#define zmm13x xmm13
+#define zmm12x xmm12
+#define zmm11x xmm11
+#define zmm10x xmm10
+#define zmm9x  xmm9
+#define zmm8x  xmm8
+#define zmm7x  xmm7
+#define zmm6x  xmm6
+#define zmm5x  xmm5
+#define zmm4x  xmm4
+#define zmm3x  xmm3
+#define zmm2x  xmm2
+#define zmm1x  xmm1
+#define zmm0x xmm0
+
+#define ymm5y ymm5
+#define ymm4y ymm4
+#define ymm3y ymm3
+#define ymm2y ymm2
+#define ymm1y ymm1
+
+#define ymm12x xmm12
+#define ymm11x xmm11
+#define ymm7x xmm7
+#define ymm6x xmm6
+#define ymm5x xmm5
+#define ymm4x xmm4
+#define ymm3x xmm3
+#define ymm2x xmm2
+#define ymm1x xmm1
+
+#define xmm14z zmm14
+#define xmm10z zmm10
+#define xmm2z zmm2
+#define xmm0z zmm0
+#define xmm5z zmm5
+#define xmm4z zmm4
+#define xmm3z zmm3
+#define xmm1z zmm1
+#define xmm6z zmm6
+#define xmm7z zmm7
+#define xmm8z zmm8
+#define xmm9z zmm9
+
+#define xmm11y ymm11
+#define xmm9y ymm9
+#define xmm5y ymm5
+#define xmm4y ymm4
+#define xmm3y ymm3
+#define xmm2y ymm2
+#define xmm1y ymm1
+#define xmm0y ymm0
+
+#define xmm14x xmm14
+#define xmm8x xmm8
+#define xmm7x xmm7
+#define xmm6x xmm6
+#define xmm5x xmm5
+#define xmm4x xmm4
+#define xmm3x xmm3
+#define xmm2x xmm2
+#define xmm1x xmm1
+#define xmm0x xmm0
+
+#define xmm0z  zmm0
+#define xmm0y  ymm0
+#define xmm0x  xmm0
+
+#define stringify(reg,y)       reg##y
+#define str(reg,y)	       stringify(reg,y)
+#define concat(reg,y)	       str(reg,y)
+
+#define YWORD(reg)     concat(reg, y)
+#define XWORD(reg)     concat(reg, x)
+#define ZWORD(reg)     concat(reg, z)
+#define DWORD(reg)     concat(reg, d)
+#define WORD(reg)      concat(reg, w)
+#define BYTE(reg)      concat(reg, b)
+
+#define arg1	%rdi
+#define arg2	%rsi
+#define arg3	%rdx
+#define arg4	%rcx
+#define arg5	%r8
+#define arg6	%r9
+
+#define STACK_LOCAL_OFFSET	  64
+#define LOCAL_STORAGE		  (48*16)	 //space for up to 128 AES blocks
+#define STACK_FRAME_SIZE_GHASH	  (STACK_LOCAL_OFFSET + LOCAL_STORAGE)
+
+#define HashKey_48	(16*0)
+#define HashKey_47	(16*1)
+#define HashKey_46	(16*2)
+#define HashKey_45	(16*3)
+#define HashKey_44	(16*4)
+#define HashKey_43	(16*5)
+#define HashKey_42	(16*6)
+#define HashKey_41	(16*7)
+#define HashKey_40	(16*8)
+#define HashKey_39	(16*9)
+#define HashKey_38	(16*10)
+#define HashKey_37	(16*11)
+#define HashKey_36	(16*12)
+#define HashKey_35	(16*13)
+#define HashKey_34	(16*14)
+#define HashKey_33	(16*15)
+#define HashKey_32	(16*16)
+#define HashKey_31	(16*17)
+#define HashKey_30	(16*18)
+#define HashKey_29	(16*19)
+#define HashKey_28	(16*20)
+#define HashKey_27	(16*21)
+#define HashKey_26	(16*22)
+#define HashKey_25	(16*23)
+#define HashKey_24	(16*24)
+#define HashKey_23	(16*25)
+#define HashKey_22     (16*26)
+#define HashKey_21     (16*27)
+#define HashKey_20     (16*28)
+#define HashKey_19     (16*29)
+#define HashKey_18     (16*30)
+#define HashKey_17     (16*31)
+#define HashKey_16	(16*32)
+#define HashKey_15	(16*33)
+#define HashKey_14	(16*34)
+#define HashKey_13	(16*35)
+#define HashKey_12	(16*36)
+#define HashKey_11	(16*37)
+#define HashKey_10	(16*38)
+#define HashKey_9      (16*39)
+#define HashKey_8      (16*40)
+#define HashKey_7      (16*41)
+#define HashKey_6      (16*42)
+#define HashKey_5      (16*43)
+#define HashKey_4      (16*44)
+#define HashKey_3      (16*45)
+#define HashKey_2      (16*46)
+#define HashKey_1      (16*47)
+#define HashKey      (16*47)
+
+.data
+
+.align 16
+ONE:
+.octa	0x00000000000000000000000000000001
+
+.align 16
+POLY:
+.octa	0xC2000000000000000000000000000001
+
+.align 16
+TWOONE:
+.octa	0x00000001000000000000000000000001
+
+/*
+ * Order of these constants should not change.
+ * ALL_F should follow SHIFT_MASK, ZERO should follow ALL_F
+ */
+.align 16
+SHIFT_MASK:
+.octa	0x0f0e0d0c0b0a09080706050403020100
+
+ALL_F:
+.octa	0xffffffffffffffffffffffffffffffff
+
+ZERO:
+.octa	0x00000000000000000000000000000000
+
+.align 16
+ONEf:
+.octa	0x01000000000000000000000000000000
+
+.align 64
+SHUF_MASK:
+.octa	0x000102030405060708090A0B0C0D0E0F
+.octa	0x000102030405060708090A0B0C0D0E0F
+.octa	0x000102030405060708090A0B0C0D0E0F
+.octa	0x000102030405060708090A0B0C0D0E0F
+
+.align 64
+byte_len_to_mask_table:
+.quad	0x0007000300010000
+.quad	0x007f003f001f000f
+.quad	0x07ff03ff01ff00ff
+.quad	0x7fff3fff1fff0fff
+.quad	0xffff
+
+.align 64
+byte64_len_to_mask_table:
+.octa	0x00000000000000010000000000000000
+.octa	0x00000000000000070000000000000003
+.octa	0x000000000000001f000000000000000f
+.octa	0x000000000000007f000000000000003f
+.octa	0x00000000000001ff00000000000000ff
+.octa	0x00000000000007ff00000000000003ff
+.octa	0x0000000000001fff0000000000000fff
+.octa	0x0000000000007fff0000000000003fff
+.octa	0x000000000001ffff000000000000ffff
+.octa	0x000000000007ffff000000000003ffff
+.octa	0x00000000001fffff00000000000fffff
+.octa	0x00000000007fffff00000000003fffff
+.octa	0x0000000001ffffff0000000000ffffff
+.octa	0x0000000007ffffff0000000003ffffff
+.octa	0x000000001fffffff000000000fffffff
+.octa	0x000000007fffffff000000003fffffff
+.octa	0x00000001ffffffff00000000ffffffff
+.octa	0x00000007ffffffff00000003ffffffff
+.octa	0x0000001fffffffff0000000fffffffff
+.octa	0x0000007fffffffff0000003fffffffff
+.octa	0x000001ffffffffff000000ffffffffff
+.octa	0x000007ffffffffff000003ffffffffff
+.octa	0x00001fffffffffff00000fffffffffff
+.octa	0x00007fffffffffff00003fffffffffff
+.octa	0x0001ffffffffffff0000ffffffffffff
+.octa	0x0007ffffffffffff0003ffffffffffff
+.octa	0x001fffffffffffff000fffffffffffff
+.octa	0x007fffffffffffff003fffffffffffff
+.octa	0x01ffffffffffffff00ffffffffffffff
+.octa	0x07ffffffffffffff03ffffffffffffff
+.octa	0x1fffffffffffffff0fffffffffffffff
+.octa	0x7fffffffffffffff3fffffffffffffff
+.octa	0xffffffffffffffff
+
+.align 64
+mask_out_top_block:
+.octa	0xffffffffffffffffffffffffffffffff
+.octa	0xffffffffffffffffffffffffffffffff
+.octa	0xffffffffffffffffffffffffffffffff
+.octa	0x00000000000000000000000000000000
+
+.align 64
+ddq_add_1234:
+.octa	0x00000000000000000000000000000001
+.octa	0x00000000000000000000000000000002
+.octa	0x00000000000000000000000000000003
+.octa	0x00000000000000000000000000000004
+
+.align 64
+ddq_add_5678:
+.octa	0x00000000000000000000000000000005
+.octa	0x00000000000000000000000000000006
+.octa	0x00000000000000000000000000000007
+.octa	0x00000000000000000000000000000008
+
+.align 64
+ddq_add_4444:
+.octa	0x00000000000000000000000000000004
+.octa	0x00000000000000000000000000000004
+.octa	0x00000000000000000000000000000004
+.octa	0x00000000000000000000000000000004
+
+.align 64
+ddq_add_8888:
+.octa	0x00000000000000000000000000000008
+.octa	0x00000000000000000000000000000008
+.octa	0x00000000000000000000000000000008
+.octa	0x00000000000000000000000000000008
+
+.align 64
+ddq_addbe_1234:
+.octa	0x01000000000000000000000000000000
+.octa	0x02000000000000000000000000000000
+.octa	0x03000000000000000000000000000000
+.octa	0x04000000000000000000000000000000
+
+.align 64
+ddq_addbe_4444:
+.octa	0x04000000000000000000000000000000
+.octa	0x04000000000000000000000000000000
+.octa	0x04000000000000000000000000000000
+.octa	0x04000000000000000000000000000000
+
+.align 64
+ddq_addbe_8888:
+.octa	0x08000000000000000000000000000000
+.octa	0x08000000000000000000000000000000
+.octa	0x08000000000000000000000000000000
+.octa	0x08000000000000000000000000000000
+
+.align 64
+POLY2:
+.octa	0xC20000000000000000000001C2000000
+.octa	0xC20000000000000000000001C2000000
+.octa	0xC20000000000000000000001C2000000
+.octa	0xC20000000000000000000001C2000000
+
+.align 16
+byteswap_const:
+.octa	0x000102030405060708090A0B0C0D0E0F
+
+.text
+
+/* Save register content for the caller */
+#define FUNC_SAVE_GHASH()			\
+	mov	%rsp, %rax;		\
+	sub	$STACK_FRAME_SIZE_GHASH, %rsp;\
+	and	$~63, %rsp;		\
+	mov	%r12, 0*8(%rsp);	\
+	mov	%r13, 1*8(%rsp);	\
+	mov	%r14, 2*8(%rsp);	\
+	mov	%r15, 3*8(%rsp);	\
+	mov	%rax, 4*8(%rsp);	\
+	mov	%rax, 4*8(%rsp);	\
+	mov	%rax, %r14;		\
+	mov	%rbp, 5*8(%rsp);	\
+	mov	%rbx, 6*8(%rsp);	\
+
+/* Restore register content for the caller */
+#define FUNC_RESTORE_GHASH()		  \
+	mov	5*8(%rsp), %rbp;	\
+	mov	6*8(%rsp), %rbx;	\
+	mov	0*8(%rsp), %r12;	\
+	mov	1*8(%rsp), %r13;	\
+	mov	2*8(%rsp), %r14;	\
+	mov	3*8(%rsp), %r15;	\
+	mov	4*8(%rsp), %rsp;	\
+
+/*
+ * GHASH schoolbook multiplication: four 64x64-bit carry-less multiplies of
+ * GH and HK followed by a reduction with the POLY2 constant.
+ */
+#define GHASH_MUL(GH, HK, T1, T2, T3, T4, T5)			\
+	vpclmulqdq	$0x11, HK, GH, T1;			\
+	vpclmulqdq	$0x00, HK, GH, T2;			\
+	vpclmulqdq	$0x01, HK, GH, T3;			\
+	vpclmulqdq	$0x10, HK, GH, GH;			\
+	vpxorq		T3, GH, GH;				\
+	vpsrldq		$8, GH, T3;				\
+	vpslldq		$8, GH, GH;				\
+	vpxorq		T3, T1, T1;				\
+	vpxorq		T2, GH, GH;				\
+	vmovdqu64	POLY2(%rip), T3;			\
+	vpclmulqdq	$0x01, GH, T3, T2;			\
+	vpslldq		$8, T2, T2;				\
+	vpxorq		T2, GH, GH;				\
+	vpclmulqdq	$0x00, GH, T3, T2;			\
+	vpsrldq		$4, T2, T2;				\
+	vpclmulqdq	$0x10, GH, T3, GH;			\
+	vpslldq		$4, GH, GH;				\
+	vpternlogq	$0x96, T2, T1, GH;
+
+/*
+ * Precomputation of hash keys. These precomputed keys
+ * are saved in memory and reused for as many 8-block sets
+ * as necessary.
+ */
+#define PRECOMPUTE(GDATA, HK, T1, T2, T3, T4, T5, T6, T7, T8) \
+\
+	vmovdqa64	HK, T5; \
+	vinserti64x2	$3, HK, ZWORD(T7), ZWORD(T7); \
+	GHASH_MUL(T5, HK, T1, T3, T4, T6, T2) \
+	vmovdqu64	T5, HashKey_2(GDATA); \
+	vinserti64x2	$2, T5, ZWORD(T7), ZWORD(T7); \
+	GHASH_MUL(T5, HK, T1, T3, T4, T6, T2) \
+	vmovdqu64	T5, HashKey_3(GDATA); \
+	vinserti64x2	$1, T5, ZWORD(T7), ZWORD(T7); \
+	GHASH_MUL(T5, HK, T1, T3, T4, T6, T2) \
+	vmovdqu64	T5, HashKey_4(GDATA); \
+	vinserti64x2	$0, T5, ZWORD(T7), ZWORD(T7); \
+	vshufi64x2	$0x00, ZWORD(T5), ZWORD(T5), ZWORD(T5); \
+	vmovdqa64	ZWORD(T7), ZWORD(T8); \
+	GHASH_MUL(ZWORD(T7), ZWORD(T5), ZWORD(T1), ZWORD(T3), ZWORD(T4), ZWORD(T6), ZWORD(T2)) \
+	vmovdqu64	ZWORD(T7), HashKey_8(GDATA); \
+	vshufi64x2	$0x00, ZWORD(T7), ZWORD(T7), ZWORD(T5); \
+	GHASH_MUL(ZWORD(T8), ZWORD(T5), ZWORD(T1), ZWORD(T3), ZWORD(T4), ZWORD(T6), ZWORD(T2)) \
+	vmovdqu64 ZWORD(T8), HashKey_12(GDATA); \
+	GHASH_MUL(ZWORD(T7), ZWORD(T5), ZWORD(T1), ZWORD(T3), ZWORD(T4), ZWORD(T6), ZWORD(T2)) \
+	vmovdqu64 ZWORD(T7), HashKey_16(GDATA); \
+	GHASH_MUL(ZWORD(T8), ZWORD(T5), ZWORD(T1), ZWORD(T3), ZWORD(T4), ZWORD(T6), ZWORD(T2)) \
+	vmovdqu64 ZWORD(T8), HashKey_20(GDATA); \
+	GHASH_MUL(ZWORD(T7), ZWORD(T5), ZWORD(T1), ZWORD(T3), ZWORD(T4), ZWORD(T6), ZWORD(T2)) \
+	vmovdqu64 ZWORD(T7), HashKey_24(GDATA); \
+	GHASH_MUL(ZWORD(T8), ZWORD(T5), ZWORD(T1), ZWORD(T3), ZWORD(T4), ZWORD(T6), ZWORD(T2)) \
+	vmovdqu64 ZWORD(T8), HashKey_28(GDATA); \
+	GHASH_MUL(ZWORD(T7), ZWORD(T5), ZWORD(T1), ZWORD(T3), ZWORD(T4), ZWORD(T6), ZWORD(T2)) \
+	vmovdqu64 ZWORD(T7), HashKey_32(GDATA); \
+	GHASH_MUL(ZWORD(T8), ZWORD(T5), ZWORD(T1), ZWORD(T3), ZWORD(T4), ZWORD(T6), ZWORD(T2)) \
+	vmovdqu64 ZWORD(T8), HashKey_36(GDATA); \
+	GHASH_MUL(ZWORD(T7), ZWORD(T5), ZWORD(T1), ZWORD(T3), ZWORD(T4), ZWORD(T6), ZWORD(T2)) \
+	vmovdqu64 ZWORD(T7), HashKey_40(GDATA); \
+	GHASH_MUL(ZWORD(T8), ZWORD(T5), ZWORD(T1), ZWORD(T3), ZWORD(T4), ZWORD(T6), ZWORD(T2)) \
+	vmovdqu64 ZWORD(T8), HashKey_44(GDATA); \
+	GHASH_MUL(ZWORD(T7), ZWORD(T5), ZWORD(T1), ZWORD(T3), ZWORD(T4), ZWORD(T6), ZWORD(T2)) \
+	vmovdqu64 ZWORD(T7), HashKey_48(GDATA);
+
+#define VHPXORI4x128(REG,TMP)					\
+	vextracti64x4	$1, REG, YWORD(TMP);			\
+	vpxorq		YWORD(TMP), YWORD(REG), YWORD(REG);	\
+	vextracti32x4	$1, YWORD(REG), XWORD(TMP);		\
+	vpxorq		XWORD(TMP), XWORD(REG), XWORD(REG);
+
+#define VCLMUL_REDUCE(OUT, POLY, HI128, LO128, TMP0, TMP1)	\
+	vpclmulqdq	$0x01, LO128, POLY, TMP0;		\
+	vpslldq		$8, TMP0, TMP0;				\
+	vpxorq		TMP0, LO128, TMP0;			\
+	vpclmulqdq	$0x00, TMP0, POLY, TMP1;		\
+	vpsrldq		$4, TMP1, TMP1;				\
+	vpclmulqdq	$0x10, TMP0, POLY, OUT;			\
+	vpslldq		$4, OUT, OUT;				\
+	vpternlogq	$0x96, HI128, TMP1, OUT;
+
+/*
+ * GHASH 1 to 16 blocks of the input buffer.
+ *  - It performs reduction at the end.
+ *  - It can take intermediate GHASH sums as input.
+ */
+#define GHASH_1_TO_16(KP, OFFSET, GHASH, T1, T2, T3, T4, T5, T6, T7, T8, T9, AAD_HASH_IN, CIPHER_IN0, CIPHER_IN1, CIPHER_IN2, CIPHER_IN3, NUM_BLOCKS, BOOL, INSTANCE_TYPE, ROUND, HKEY_START, PREV_H, PREV_L, PREV_M1, PREV_M2) \
+.set	reg_idx, 0;	\
+.set	blocks_left, NUM_BLOCKS;	\
+.ifc INSTANCE_TYPE, single_call; \
+	.if BOOL == 1; \
+	.set	hashk, concat(HashKey_, NUM_BLOCKS);	\
+	.else; \
+	.set	hashk, concat(HashKey_, NUM_BLOCKS) + 0x11;	 \
+	.endif; \
+	.set	first_result, 1; \
+	.set	reduce, 1; \
+	vpxorq		AAD_HASH_IN, CIPHER_IN0, CIPHER_IN0; \
+.else;	\
+	.set	hashk, concat(HashKey_, HKEY_START);	\
+	.ifc ROUND, first; \
+		.set first_result, 1; \
+		.set reduce, 0; \
+		vpxorq		AAD_HASH_IN, CIPHER_IN0, CIPHER_IN0; \
+	.else; \
+		.ifc ROUND, mid; \
+		    .set first_result, 0; \
+		    .set reduce, 0; \
+		    vmovdqa64	    PREV_H, T1; \
+		    vmovdqa64	    PREV_L, T2; \
+		    vmovdqa64	    PREV_M1, T3; \
+		    vmovdqa64	    PREV_M2, T4; \
+		.else; \
+		    .set first_result, 0; \
+		    .set reduce, 1; \
+		    vmovdqa64	    PREV_H, T1; \
+		    vmovdqa64	    PREV_L, T2; \
+		    vmovdqa64	    PREV_M1, T3; \
+		    vmovdqa64	    PREV_M2, T4; \
+		.endif; \
+	.endif; \
+.endif; \
+.if NUM_BLOCKS < 4;	\
+	.if blocks_left == 1; \
+		.if first_result == 1;	\
+			vmovdqu64	hashk + OFFSET(KP), XWORD(T9); \
+			vpclmulqdq	$0x11, XWORD(T9), XWORD(CIPHER_IN0), XWORD(T1); \
+			vpclmulqdq	$0x00, XWORD(T9), XWORD(CIPHER_IN0), XWORD(T2); \
+			vpclmulqdq	$0x01, XWORD(T9), XWORD(CIPHER_IN0), XWORD(T3); \
+			vpclmulqdq	$0x10, XWORD(T9), XWORD(CIPHER_IN0), XWORD(T4); \
+		.else;	\
+			vmovdqu64	hashk + OFFSET(KP), XWORD(T9); \
+			vpclmulqdq	$0x11, XWORD(T9), XWORD(CIPHER_IN0), XWORD(T5); \
+			vpclmulqdq	$0x00, XWORD(T9), XWORD(CIPHER_IN0), XWORD(T6); \
+			vpclmulqdq	$0x01, XWORD(T9), XWORD(CIPHER_IN0), XWORD(T7); \
+			vpclmulqdq	$0x10, XWORD(T9), XWORD(CIPHER_IN0), XWORD(T8); \
+		.endif; \
+	.elseif blocks_left == 2; \
+		.if first_result == 1;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vpclmulqdq	$0x11, YWORD(T9), YWORD(CIPHER_IN0), YWORD(T1); \
+			vpclmulqdq	$0x00, YWORD(T9), YWORD(CIPHER_IN0), YWORD(T2); \
+			vpclmulqdq	$0x01, YWORD(T9), YWORD(CIPHER_IN0), YWORD(T3); \
+			vpclmulqdq	$0x10, YWORD(T9), YWORD(CIPHER_IN0), YWORD(T4); \
+		.else;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vpclmulqdq	$0x11, YWORD(T9), YWORD(CIPHER_IN0), YWORD(T5); \
+			vpclmulqdq	$0x00, YWORD(T9), YWORD(CIPHER_IN0), YWORD(T6); \
+			vpclmulqdq	$0x01, YWORD(T9), YWORD(CIPHER_IN0), YWORD(T7); \
+			vpclmulqdq	$0x10, YWORD(T9), YWORD(CIPHER_IN0), YWORD(T8); \
+		.endif; \
+	.elseif blocks_left == 3;	\
+		.if first_result == 1;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vinserti64x2	$2, 32 + hashk + OFFSET(KP), T9, T9; \
+			vpclmulqdq	$0x11, T9, CIPHER_IN0, T1; \
+			vpclmulqdq	$0x00, T9, CIPHER_IN0, T2; \
+			vpclmulqdq	$0x01, T9, CIPHER_IN0, T3; \
+			vpclmulqdq	$0x10, T9, CIPHER_IN0, T4; \
+		.else;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vinserti64x2	$2, 32 + hashk + OFFSET(KP), T9, T9; \
+			vpclmulqdq	$0x11, T9, CIPHER_IN0, T5; \
+			vpclmulqdq	$0x00, T9, CIPHER_IN0, T6; \
+			vpclmulqdq	$0x01, T9, CIPHER_IN0, T7; \
+			vpclmulqdq	$0x10, T9, CIPHER_IN0, T8; \
+		.endif; \
+	.endif; \
+	.if first_result != 1; \
+		 vpxorq		 T5, T1, T1; \
+		 vpxorq		 T6, T2, T2; \
+		 vpxorq		 T7, T3, T3; \
+		 vpxorq		 T8, T4, T4; \
+	.endif; \
+.elseif (NUM_BLOCKS >= 4) && (NUM_BLOCKS < 8); \
+	vmovdqu64	hashk + OFFSET(KP), T9; \
+	.if first_result == 1; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN0, T1; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN0, T2; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN0, T3; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN0, T4; \
+		.set first_result, 0; \
+	.else; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN0, T5; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN0, T6; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN0, T7; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN0, T8; \
+		vpxorq		T5, T1, T1; \
+		vpxorq		T6, T2, T2; \
+		vpxorq		T7, T3, T3; \
+		vpxorq		T8, T4, T4; \
+	.endif; \
+	.set hashk, hashk + 64; \
+	.set blocks_left, blocks_left - 4;	\
+	.set reg_idx, reg_idx + 1;	\
+	.if blocks_left > 0;	\
+	.if blocks_left == 1; \
+		.if first_result == 1;	\
+			vmovdqu64	hashk + OFFSET(KP), XWORD(T9); \
+			vpclmulqdq	$0x11, XWORD(T9), XWORD(CIPHER_IN1), XWORD(T1); \
+			vpclmulqdq	$0x00, XWORD(T9), XWORD(CIPHER_IN1), XWORD(T2); \
+			vpclmulqdq	$0x01, XWORD(T9), XWORD(CIPHER_IN1), XWORD(T3); \
+			vpclmulqdq	$0x10, XWORD(T9), XWORD(CIPHER_IN1), XWORD(T4); \
+		.else;	\
+			vmovdqu64	hashk + OFFSET(KP), XWORD(T9); \
+			vpclmulqdq	$0x11, XWORD(T9), XWORD(CIPHER_IN1), XWORD(T5); \
+			vpclmulqdq	$0x00, XWORD(T9), XWORD(CIPHER_IN1), XWORD(T6); \
+			vpclmulqdq	$0x01, XWORD(T9), XWORD(CIPHER_IN1), XWORD(T7); \
+			vpclmulqdq	$0x10, XWORD(T9), XWORD(CIPHER_IN1), XWORD(T8); \
+		.endif; \
+	.elseif blocks_left == 2; \
+		.if first_result == 1;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vpclmulqdq	$0x11, YWORD(T9), YWORD(CIPHER_IN1), YWORD(T1); \
+			vpclmulqdq	$0x00, YWORD(T9), YWORD(CIPHER_IN1), YWORD(T2); \
+			vpclmulqdq	$0x01, YWORD(T9), YWORD(CIPHER_IN1), YWORD(T3); \
+			vpclmulqdq	$0x10, YWORD(T9), YWORD(CIPHER_IN1), YWORD(T4); \
+		.else;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vpclmulqdq	$0x11, YWORD(T9), YWORD(CIPHER_IN1), YWORD(T5); \
+			vpclmulqdq	$0x00, YWORD(T9), YWORD(CIPHER_IN1), YWORD(T6); \
+			vpclmulqdq	$0x01, YWORD(T9), YWORD(CIPHER_IN1), YWORD(T7); \
+			vpclmulqdq	$0x10, YWORD(T9), YWORD(CIPHER_IN1), YWORD(T8); \
+		.endif; \
+	.elseif blocks_left == 3;  \
+		.if first_result == 1;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vinserti64x2	$2, 32 + hashk + OFFSET(KP), T9, T9; \
+			vpclmulqdq	$0x11, T9, CIPHER_IN1, T1; \
+			vpclmulqdq	$0x00, T9, CIPHER_IN1, T2; \
+			vpclmulqdq	$0x01, T9, CIPHER_IN1, T3; \
+			vpclmulqdq	$0x10, T9, CIPHER_IN1, T4; \
+		.else;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vinserti64x2	$2, 32 + hashk + OFFSET(KP), T9, T9; \
+			vpclmulqdq	$0x11, T9, CIPHER_IN1, T5; \
+			vpclmulqdq	$0x00, T9, CIPHER_IN1, T6; \
+			vpclmulqdq	$0x01, T9, CIPHER_IN1, T7; \
+			vpclmulqdq	$0x10, T9, CIPHER_IN1, T8; \
+		.endif; \
+	.endif; \
+	.if first_result != 1; \
+			vpxorq		T5, T1, T1; \
+			vpxorq		T6, T2, T2; \
+			vpxorq		T7, T3, T3; \
+			vpxorq		T8, T4, T4; \
+	.endif; \
+	.endif; \
+.elseif (NUM_BLOCKS >= 8) && (NUM_BLOCKS < 12); \
+	vmovdqu64	hashk + OFFSET(KP), T9; \
+	.if first_result == 1; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN0, T1; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN0, T2; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN0, T3; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN0, T4; \
+		.set first_result, 0; \
+	.else; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN0, T5; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN0, T6; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN0, T7; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN0, T8; \
+		vpxorq		T5, T1, T1; \
+		vpxorq		T6, T2, T2; \
+		vpxorq		T7, T3, T3; \
+		vpxorq		T8, T4, T4; \
+	.endif; \
+	.set hashk, hashk + 64; \
+	.set blocks_left, blocks_left - 4; \
+	.set reg_idx, reg_idx + 1;	\
+	vmovdqu64	hashk + OFFSET(KP), T9; \
+	.if first_result == 1; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN1, T1; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN1, T2; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN1, T3; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN1, T4; \
+		.set first_result, 0; \
+	.else; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN1, T5; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN1, T6; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN1, T7; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN1, T8; \
+		vpxorq		T5, T1, T1; \
+		vpxorq		T6, T2, T2; \
+		vpxorq		T7, T3, T3; \
+		vpxorq		T8, T4, T4; \
+	.endif; \
+	.set hashk, hashk + 64; \
+	.set blocks_left, blocks_left - 4; \
+	.set reg_idx, reg_idx + 1;	\
+	.if blocks_left > 0; \
+	.if blocks_left == 1; \
+		.if first_result == 1;	\
+			vmovdqu64	hashk + OFFSET(KP), XWORD(T9); \
+			vpclmulqdq	$0x11, XWORD(T9), XWORD(CIPHER_IN2), XWORD(T1); \
+			vpclmulqdq	$0x00, XWORD(T9), XWORD(CIPHER_IN2), XWORD(T2); \
+			vpclmulqdq	$0x01, XWORD(T9), XWORD(CIPHER_IN2), XWORD(T3); \
+			vpclmulqdq	$0x10, XWORD(T9), XWORD(CIPHER_IN2), XWORD(T4); \
+		.else;	\
+			vmovdqu64	hashk + OFFSET(KP), XWORD(T9); \
+			vpclmulqdq	$0x11, XWORD(T9), XWORD(CIPHER_IN2), XWORD(T5); \
+			vpclmulqdq	$0x00, XWORD(T9), XWORD(CIPHER_IN2), XWORD(T6); \
+			vpclmulqdq	$0x01, XWORD(T9), XWORD(CIPHER_IN2), XWORD(T7); \
+			vpclmulqdq	$0x10, XWORD(T9), XWORD(CIPHER_IN2), XWORD(T8); \
+		.endif; \
+	.elseif blocks_left == 2; \
+		.if first_result == 1;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vpclmulqdq	$0x11, YWORD(T9), YWORD(CIPHER_IN2), YWORD(T1); \
+			vpclmulqdq	$0x00, YWORD(T9), YWORD(CIPHER_IN2), YWORD(T2); \
+			vpclmulqdq	$0x01, YWORD(T9), YWORD(CIPHER_IN2), YWORD(T3); \
+			vpclmulqdq	$0x10, YWORD(T9), YWORD(CIPHER_IN2), YWORD(T4); \
+		.else;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vpclmulqdq	$0x11, YWORD(T9), YWORD(CIPHER_IN2), YWORD(T5); \
+			vpclmulqdq	$0x00, YWORD(T9), YWORD(CIPHER_IN2), YWORD(T6); \
+			vpclmulqdq	$0x01, YWORD(T9), YWORD(CIPHER_IN2), YWORD(T7); \
+			vpclmulqdq	$0x10, YWORD(T9), YWORD(CIPHER_IN2), YWORD(T8); \
+		.endif; \
+	.elseif blocks_left == 3;  \
+		.if first_result == 1;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vinserti64x2	$2, 32 + hashk + OFFSET(KP), T9, T9; \
+			vpclmulqdq	$0x11, T9, CIPHER_IN2, T1; \
+			vpclmulqdq	$0x00, T9, CIPHER_IN2, T2; \
+			vpclmulqdq	$0x01, T9, CIPHER_IN2, T3; \
+			vpclmulqdq	$0x10, T9, CIPHER_IN2, T4; \
+		.else;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vinserti64x2	$2, 32 + hashk + OFFSET(KP), T9, T9; \
+			vpclmulqdq	$0x11, T9, CIPHER_IN2, T5; \
+			vpclmulqdq	$0x00, T9, CIPHER_IN2, T6; \
+			vpclmulqdq	$0x01, T9, CIPHER_IN2, T7; \
+			vpclmulqdq	$0x10, T9, CIPHER_IN2, T8; \
+		.endif; \
+	.endif; \
+	.if first_result != 1; \
+		vpxorq		T5, T1, T1; \
+		vpxorq		T6, T2, T2; \
+		vpxorq		T7, T3, T3; \
+		vpxorq		T8, T4, T4; \
+	.endif; \
+	.endif; \
+.elseif (NUM_BLOCKS >= 12) && (NUM_BLOCKS < 16); \
+	vmovdqu64	hashk + OFFSET(KP), T9; \
+	.if first_result == 1; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN0, T1; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN0, T2; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN0, T3; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN0, T4; \
+		.set first_result, 0; \
+	.else; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN0, T5; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN0, T6; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN0, T7; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN0, T8; \
+		vpxorq		T5, T1, T1; \
+		vpxorq		T6, T2, T2; \
+		vpxorq		T7, T3, T3; \
+		vpxorq		T8, T4, T4; \
+	.endif; \
+	.set hashk, hashk + 64; \
+	.set blocks_left, blocks_left - 4; \
+	.set reg_idx, reg_idx + 1;	\
+	vmovdqu64	hashk + OFFSET(KP), T9; \
+	.if first_result == 1; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN1, T1; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN1, T2; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN1, T3; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN1, T4; \
+		.set first_result, 0; \
+	.else; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN1, T5; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN1, T6; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN1, T7; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN1, T8; \
+		vpxorq		T5, T1, T1; \
+		vpxorq		T6, T2, T2; \
+		vpxorq		T7, T3, T3; \
+		vpxorq		T8, T4, T4; \
+	.endif; \
+	.set hashk, hashk + 64; \
+	.set blocks_left, blocks_left - 4; \
+	.set reg_idx, reg_idx + 1;	\
+	vmovdqu64	hashk + OFFSET(KP), T9; \
+	.if first_result == 1; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN2, T1; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN2, T2; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN2, T3; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN2, T4; \
+		.set first_result, 0; \
+	.else; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN2, T5; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN2, T6; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN2, T7; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN2, T8; \
+		vpxorq		T5, T1, T1; \
+		vpxorq		T6, T2, T2; \
+		vpxorq		T7, T3, T3; \
+		vpxorq		T8, T4, T4; \
+	.endif; \
+	.set hashk, hashk + 64; \
+	.set blocks_left, blocks_left - 4; \
+	.set reg_idx, reg_idx + 1;	\
+	.if blocks_left > 0;	\
+	.if blocks_left == 1; \
+		.if first_result == 1;	\
+			vmovdqu64	hashk + OFFSET(KP), XWORD(T9); \
+			vpclmulqdq	$0x11, XWORD(T9), XWORD(CIPHER_IN3), XWORD(T1); \
+			vpclmulqdq	$0x00, XWORD(T9), XWORD(CIPHER_IN3), XWORD(T2); \
+			vpclmulqdq	$0x01, XWORD(T9), XWORD(CIPHER_IN3), XWORD(T3); \
+			vpclmulqdq	$0x10, XWORD(T9), XWORD(CIPHER_IN3), XWORD(T4); \
+		.else;	\
+			vmovdqu64	hashk + OFFSET(KP), XWORD(T9); \
+			vpclmulqdq	$0x11, XWORD(T9), XWORD(CIPHER_IN3), XWORD(T5); \
+			vpclmulqdq	$0x00, XWORD(T9), XWORD(CIPHER_IN3), XWORD(T6); \
+			vpclmulqdq	$0x01, XWORD(T9), XWORD(CIPHER_IN3), XWORD(T7); \
+			vpclmulqdq	$0x10, XWORD(T9), XWORD(CIPHER_IN3), XWORD(T8); \
+		.endif; \
+	.elseif blocks_left == 2; \
+		.if first_result == 1;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vpclmulqdq	$0x11, YWORD(T9), YWORD(CIPHER_IN3), YWORD(T1); \
+			vpclmulqdq	$0x00, YWORD(T9), YWORD(CIPHER_IN3), YWORD(T2); \
+			vpclmulqdq	$0x01, YWORD(T9), YWORD(CIPHER_IN3), YWORD(T3); \
+			vpclmulqdq	$0x10, YWORD(T9), YWORD(CIPHER_IN3), YWORD(T4); \
+		.else;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vpclmulqdq	$0x11, YWORD(T9), YWORD(CIPHER_IN3), YWORD(T5); \
+			vpclmulqdq	$0x00, YWORD(T9), YWORD(CIPHER_IN3), YWORD(T6); \
+			vpclmulqdq	$0x01, YWORD(T9), YWORD(CIPHER_IN3), YWORD(T7); \
+			vpclmulqdq	$0x10, YWORD(T9), YWORD(CIPHER_IN3), YWORD(T8); \
+		.endif; \
+	.elseif blocks_left == 3;  \
+		.if first_result == 1;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vinserti64x2	$2, 32 + hashk + OFFSET(KP), T9, T9; \
+			vpclmulqdq	$0x11, T9, CIPHER_IN3, T1; \
+			vpclmulqdq	$0x00, T9, CIPHER_IN3, T2; \
+			vpclmulqdq	$0x01, T9, CIPHER_IN3, T3; \
+			vpclmulqdq	$0x10, T9, CIPHER_IN3, T4; \
+		.else;	\
+			vmovdqu64	hashk + OFFSET(KP), YWORD(T9); \
+			vinserti64x2	$2, 32 + hashk + OFFSET(KP), T9, T9; \
+			vpclmulqdq	$0x11, T9, CIPHER_IN3, T5; \
+			vpclmulqdq	$0x00, T9, CIPHER_IN3, T6; \
+			vpclmulqdq	$0x01, T9, CIPHER_IN3, T7; \
+			vpclmulqdq	$0x10, T9, CIPHER_IN3, T8; \
+		.endif; \
+	.endif; \
+	.if first_result != 1; \
+			vpxorq		T5, T1, T1; \
+			vpxorq		T6, T2, T2; \
+			vpxorq		T7, T3, T3; \
+			vpxorq		T8, T4, T4; \
+	.endif; \
+	.endif; \
+.else;	\
+	vmovdqu64	hashk + OFFSET(KP), T9; \
+	.if first_result == 1; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN0, T1; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN0, T2; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN0, T3; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN0, T4; \
+		.set first_result, 0; \
+	.else; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN0, T5; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN0, T6; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN0, T7; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN0, T8; \
+		vpxorq		T5, T1, T1; \
+		vpxorq		T6, T2, T2; \
+		vpxorq		T7, T3, T3; \
+		vpxorq		T8, T4, T4; \
+	.endif; \
+	.set hashk, hashk + 64; \
+	.set blocks_left, blocks_left - 4; \
+	.set reg_idx, reg_idx + 1;     \
+	vmovdqu64	hashk + OFFSET(KP), T9; \
+	.if first_result == 1; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN1, T1; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN1, T2; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN1, T3; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN1, T4; \
+		.set first_result, 0; \
+	.else; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN1, T5; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN1, T6; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN1, T7; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN1, T8; \
+		vpxorq		T5, T1, T1; \
+		vpxorq		T6, T2, T2; \
+		vpxorq		T7, T3, T3; \
+		vpxorq		T8, T4, T4; \
+	.endif; \
+	.set hashk, hashk + 64; \
+	.set blocks_left, blocks_left - 4; \
+	.set reg_idx, reg_idx + 1;	\
+	vmovdqu64	hashk + OFFSET(KP), T9; \
+	.if first_result == 1; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN2, T1; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN2, T2; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN2, T3; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN2, T4; \
+		.set first_result, 0; \
+	.else; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN2, T5; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN2, T6; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN2, T7; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN2, T8; \
+		vpxorq		T5, T1, T1; \
+		vpxorq		T6, T2, T2; \
+		vpxorq		T7, T3, T3; \
+		vpxorq		T8, T4, T4; \
+	.endif; \
+	.set hashk, hashk + 64; \
+	.set blocks_left, blocks_left - 4; \
+	.set reg_idx, reg_idx + 1;	\
+	vmovdqu64	hashk + OFFSET(KP), T9; \
+	.if first_result == 1; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN3, T1; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN3, T2; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN3, T3; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN3, T4; \
+		.set first_result, 0; \
+	.else; \
+		vpclmulqdq	$0x11, T9, CIPHER_IN3, T5; \
+		vpclmulqdq	$0x00, T9, CIPHER_IN3, T6; \
+		vpclmulqdq	$0x01, T9, CIPHER_IN3, T7; \
+		vpclmulqdq	$0x10, T9, CIPHER_IN3, T8; \
+		vpxorq		T5, T1, T1; \
+		vpxorq		T6, T2, T2; \
+		vpxorq		T7, T3, T3; \
+		vpxorq		T8, T4, T4; \
+	.endif; \
+	.set hashk, hashk + 64; \
+	.set blocks_left, blocks_left - 4; \
+	.set reg_idx, reg_idx + 1;	\
+.endif; \
+.if reduce == 1; \
+	vpxorq		T4, T3, T3; \
+	vpsrldq		$8, T3, T7; \
+	vpslldq		$8, T3, T8; \
+	vpxorq		T7, T1, T1; \
+	vpxorq		T8, T2, T2; \
+	VHPXORI4x128(T1, T7); \
+	VHPXORI4x128(T2, T8); \
+	vmovdqa64	POLY2(%rip), XWORD(T9); \
+	VCLMUL_REDUCE(XWORD(GHASH), XWORD(T9), XWORD(T1), XWORD(T2), XWORD(T3), XWORD(T4)) \
+.else; \
+	vmovdqa64	T1, PREV_H; \
+	vmovdqa64	T2, PREV_L; \
+	vmovdqa64	T3, PREV_M1; \
+	vmovdqa64	T4, PREV_M2; \
+.endif;
+
+/*
+ * Calculates the hash of the data which will not be encrypted.
+ * Input: The input data (A_IN), that data's length (A_LEN), and the hash key (GDATA_KEY).
+ * Output: The hash of the data (AAD_HASH).
+ */
+#define CALC_AAD_HASH(A_IN, A_LEN, AAD_HASH, GDATA_KEY, ZT0, ZT1, ZT2, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZT13, ZT14, ZT15, ZT16, ZT17, T1, T2, T3, MASKREG, OFFSET) \
+	mov	A_IN, T1; \
+	mov	A_LEN, T2; \
+	or	T2, T2; \
+	jz	0f; \
+	vmovdqa64	SHUF_MASK(%rip), ZT13; \
+20:; \
+	cmp	$(48*16), T2; \
+	jl	21f; \
+	vmovdqu64	64*0(T1), ZT1; \
+	vmovdqu64	64*1(T1), ZT2; \
+	vmovdqu64	64*2(T1), ZT3; \
+	vmovdqu64	64*3(T1), ZT4; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	vpshufb ZT13, ZT4, ZT4; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, ZT4, 16, 1, multi_call, first, 48, ZT14, ZT15, ZT16, ZT17) \
+	vmovdqu64     0 + 256(T1), ZT1; \
+	vmovdqu64     64 + 256(T1), ZT2; \
+	vmovdqu64     128 + 256(T1), ZT3; \
+	vmovdqu64     192 + 256(T1), ZT4; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	vpshufb ZT13, ZT4, ZT4; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, ZT4, 16, 1, multi_call, mid, 32, ZT14, ZT15, ZT16, ZT17) \
+	vmovdqu64     0 + 512(T1), ZT1; \
+	vmovdqu64     64 + 512(T1), ZT2; \
+	vmovdqu64     128 + 512(T1), ZT3; \
+	vmovdqu64     192 + 512(T1), ZT4; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	vpshufb ZT13, ZT4, ZT4; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, ZT4, 16, 1, multi_call, last, 16, ZT14, ZT15, ZT16, ZT17) \
+	sub	$(48*16), T2; \
+	je	0f; \
+	add	$(48*16), T1; \
+	jmp	20b; \
+21:; \
+	cmp	$(32*16), T2; \
+	jl	22f; \
+	vmovdqu64	64*0(T1), ZT1; \
+	vmovdqu64	64*1(T1), ZT2; \
+	vmovdqu64	64*2(T1), ZT3; \
+	vmovdqu64	64*3(T1), ZT4; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	vpshufb ZT13, ZT4, ZT4; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, ZT4, 16, 1, multi_call, first, 32, ZT14, ZT15, ZT16, ZT17) \
+	vmovdqu64     0 + 256(T1), ZT1; \
+	vmovdqu64     64 + 256(T1), ZT2; \
+	vmovdqu64     128 + 256(T1), ZT3; \
+	vmovdqu64     192 + 256(T1), ZT4; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	vpshufb ZT13, ZT4, ZT4; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, ZT4, 16, 1, multi_call, last, 16, ZT14, ZT15, ZT16, ZT17) \
+	sub	$(32*16), T2; \
+	je	0f; \
+	add	$(32*16), T1; \
+	jmp	23f; \
+22:; \
+	cmp	$(16*16), T2; \
+	jl	23f; \
+	vmovdqu64	64*0(T1), ZT1; \
+	vmovdqu64	64*1(T1), ZT2; \
+	vmovdqu64	64*2(T1), ZT3; \
+	vmovdqu64	64*3(T1), ZT4; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	vpshufb ZT13, ZT4, ZT4; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, ZT4, 16, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	sub	$(16*16), T2; \
+	je	0f; \
+	add	$(16*16), T1; \
+23:; \
+	lea	byte64_len_to_mask_table(%rip), T3; \
+	lea	(T3, T2, 8), T3; \
+	add	$15, T2; \
+	and	$-16, T2; \
+	shr	$4, T2; \
+	cmp	$1, T2; \
+	je	1f; \
+	cmp	$2, T2; \
+	je	2f; \
+	cmp	$3, T2; \
+	je	3f; \
+	cmp	$4, T2; \
+	je	4f; \
+	cmp	$5, T2; \
+	je	5f; \
+	cmp	$6, T2; \
+	je	6f; \
+	cmp	$7, T2; \
+	je	7f; \
+	cmp	$8, T2; \
+	je	8f; \
+	cmp	$9, T2; \
+	je	9f; \
+	cmp	$10, T2; \
+	je	10f; \
+	cmp	$11, T2; \
+	je	11f; \
+	cmp	$12, T2; \
+	je	12f; \
+	cmp	$13, T2; \
+	je	13f; \
+	cmp	$14, T2; \
+	je	14f; \
+	cmp	$15, T2; \
+	je	15f; \
+16:; \
+	sub $(64*3*8), T3; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1; \
+	vmovdqu8  64*1(T1), ZT2; \
+	vmovdqu8  64*2(T1), ZT3; \
+	vmovdqu8  64*3(T1), ZT4{MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	vpshufb ZT13, ZT4, ZT4; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, ZT4, 16, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+15:; \
+	sub $(64*3*8), T3; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1; \
+	vmovdqu8  64*1(T1), ZT2; \
+	vmovdqu8  64*2(T1), ZT3; \
+	vmovdqu8  64*3(T1), ZT4{MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	vpshufb ZT13, ZT4, ZT4; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, ZT4, 15, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+14:; \
+	sub $(64*3*8), T3; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1; \
+	vmovdqu8  64*1(T1), ZT2; \
+	vmovdqu8  64*2(T1), ZT3; \
+	vmovdqu8  64*3(T1), ZT4{MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	vpshufb ZT13, ZT4, ZT4; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, ZT4, 14, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+13:; \
+	sub $(64*3*8), T3; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1; \
+	vmovdqu8  64*1(T1), ZT2; \
+	vmovdqu8  64*2(T1), ZT3; \
+	vmovdqu8  64*3(T1), ZT4{MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	vpshufb ZT13, ZT4, ZT4; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, ZT4, 13, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+12:; \
+	sub $(64*2*8), T3; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1; \
+	vmovdqu8  64*1(T1), ZT2; \
+	vmovdqu8  64*2(T1), ZT3{MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, no_zmm, 12, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+11:; \
+	sub $(64*2*8), T3; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1; \
+	vmovdqu8  64*1(T1), ZT2; \
+	vmovdqu8  64*2(T1), ZT3{MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, no_zmm, 11, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+10:; \
+	sub $(64*2*8), T3; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1; \
+	vmovdqu8  64*1(T1), ZT2; \
+	vmovdqu8  64*2(T1), ZT3{MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, no_zmm, 10, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+9:; \
+	sub $(64*2*8), T3; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1; \
+	vmovdqu8  64*1(T1), ZT2; \
+	vmovdqu8  64*2(T1), ZT3{MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	vpshufb ZT13, ZT3, ZT3; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZWORD(AAD_HASH), ZT1, ZT2, ZT3, no_zmm, 9, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+8:; \
+	sub $(64*8), T3; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1; \
+	vmovdqu8  64*1(T1), ZT2{MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZWORD(AAD_HASH), ZT1, ZT2, no_zmm, no_zmm, 8, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+7:; \
+	sub $(64*8), T3; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1; \
+	vmovdqu8  64*1(T1), ZT2{MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb ZT13, ZT2, ZT2; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZWORD(AAD_HASH), ZT1, ZT2, no_zmm, no_zmm, 7, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+6:; \
+	sub $(64*8), T3; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1; \
+	vmovdqu8  64*1(T1), YWORD(ZT2){MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb YWORD(ZT13), YWORD(ZT2), YWORD(ZT2); \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZWORD(AAD_HASH), ZT1, ZT2, no_zmm, no_zmm, 6, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+5:; \
+	sub $(64*8), T3; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1; \
+	vmovdqu8  64*1(T1), XWORD(ZT2){MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	vpshufb XWORD(ZT13), XWORD(ZT2), XWORD(ZT2); \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZWORD(AAD_HASH), ZT1, ZT2, no_zmm, no_zmm, 5, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+4:; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1{MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZWORD(AAD_HASH), ZT1, no_zmm, no_zmm, no_zmm, 4, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+3:; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), ZT1{MASKREG}{z}; \
+	vpshufb ZT13, ZT1, ZT1; \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZWORD(AAD_HASH), ZT1, no_zmm, no_zmm, no_zmm, 3, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+2:; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), YWORD(ZT1){MASKREG}{z}; \
+	vpshufb YWORD(ZT13), YWORD(ZT1), YWORD(ZT1); \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZWORD(AAD_HASH), ZT1, no_zmm, no_zmm, no_zmm, 2, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+	jmp	0f; \
+1:; \
+	kmovq	(T3), MASKREG; \
+	vmovdqu8  64*0(T1), XWORD(ZT1){MASKREG}{z}; \
+	vpshufb XWORD(ZT13), XWORD(ZT1), XWORD(ZT1); \
+	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZWORD(AAD_HASH), ZT1, no_zmm, no_zmm, no_zmm, 1, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
+0:;
diff --git a/arch/x86/crypto/ghash-clmulni-intel_avx512.S b/arch/x86/crypto/ghash-clmulni-intel_avx512.S
new file mode 100644
index 0000000..9cbc40f
--- /dev/null
+++ b/arch/x86/crypto/ghash-clmulni-intel_avx512.S
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright © 2020 Intel Corporation.
+ *
+ * Implement GHASH calculation with AVX512 instructions. (x86_64)
+ *
+ * This is GHASH calculation with AVX512 instructions. It requires
+ * the support of Intel(R) AVX512F and VPCLMULQDQ instructions.
+ */
+
+#include "avx512_vaes_common.S"
+
+/*
+ * void ghash_precomp_avx512(u8 *key_data);
+ */
+SYM_FUNC_START(ghash_precomp_avx512)
+        FUNC_SAVE_GHASH()
+
+        /* move original key to xmm6 */
+        vmovdqu HashKey_1(arg1), %xmm6
+
+        vpshufb SHUF_MASK(%rip), %xmm6, %xmm6
+
+        vmovdqa %xmm6, %xmm2
+        vpsllq  $1, %xmm6, %xmm6
+        vpsrlq  $63, %xmm2, %xmm2
+        vmovdqa %xmm2, %xmm1
+        vpslldq $8, %xmm2, %xmm2
+        vpsrldq $8, %xmm1, %xmm1
+        vpor %xmm2, %xmm6, %xmm6
+        vpshufd $36, %xmm1, %xmm2
+        vpcmpeqd TWOONE(%rip), %xmm2, %xmm2
+        vpand POLY(%rip), %xmm2, %xmm2
+        vpxor %xmm2, %xmm6, %xmm6
+        vmovdqu %xmm6, HashKey_1(arg1)
+
+        PRECOMPUTE(arg1, %xmm6, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm7, %xmm8)
+
+        FUNC_RESTORE_GHASH()
+
+        ret
+SYM_FUNC_END(ghash_precomp_avx512)
+
+/*
+ * void clmul_ghash_update_avx512
+ *      (char *dst,
+ *       const char *src,
+ *       unsigned int srclen,
+ *       u8 *key_data);
+ */
+SYM_FUNC_START(clmul_ghash_update_avx512)
+        FUNC_SAVE_GHASH()
+
+        /* Read current hash value from dst */
+        vmovdqa (arg1), %xmm0
+
+        /* Bswap current hash value */
+        vpshufb SHUF_MASK(%rip), %xmm0, %xmm0
+
+        CALC_AAD_HASH(arg2, arg3, %xmm0, arg4, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7, %zmm8, %zmm9, %zmm10, %zmm11, %zmm12, %zmm13, %zmm15, %zmm16, %zmm17, %zmm18, %zmm19, %r10, %r11, %r12, %k1, 0)
+
+        /* Bswap current hash value before storing */
+        vpshufb SHUF_MASK(%rip), %xmm0, %xmm0
+        vmovdqu %xmm0, (arg1)
+
+        FUNC_RESTORE_GHASH()
+
+        ret
+SYM_FUNC_END(clmul_ghash_update_avx512)
diff --git a/arch/x86/crypto/ghash-clmulni-intel_glue.c b/arch/x86/crypto/ghash-clmulni-intel_glue.c
index 1f1a95f..3a3e8ea 100644
--- a/arch/x86/crypto/ghash-clmulni-intel_glue.c
+++ b/arch/x86/crypto/ghash-clmulni-intel_glue.c
@@ -22,18 +22,39 @@
 
 #define GHASH_BLOCK_SIZE	16
 #define GHASH_DIGEST_SIZE	16
+#define GHASH_KEY_LEN		16
+
+static bool use_avx512;
+module_param(use_avx512, bool, 0644);
+MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");
 
 void clmul_ghash_mul(char *dst, const u128 *shash);
 
 void clmul_ghash_update(char *dst, const char *src, unsigned int srclen,
 			const u128 *shash);
 
+extern void ghash_precomp_avx512(u8 *key_data);
+#ifdef CONFIG_CRYPTO_GHASH_CLMUL_NI_AVX512
+extern void clmul_ghash_update_avx512(char *dst, const char *src, unsigned int srclen,
+				      u8 *shash);
+#else
+static void clmul_ghash_update_avx512(char *dst, const char *src, unsigned int srclen,
+			       u8 *shash)
+{}
+#endif
+
 struct ghash_async_ctx {
 	struct cryptd_ahash *cryptd_tfm;
 };
 
+/*
+ * The hkey array stores the hash key powers needed for the schoolbook
+ * multiply: (HashKey << 1 mod poly), (HashKey^2 << 1 mod poly), ...,
+ * (HashKey^48 << 1 mod poly)
+ */
 struct ghash_ctx {
 	u128 shash;
+	u8 hkey[GHASH_KEY_LEN * 48];
 };
 
 struct ghash_desc_ctx {
@@ -56,6 +77,15 @@ static int ghash_setkey(struct crypto_shash *tfm,
 	struct ghash_ctx *ctx = crypto_shash_ctx(tfm);
 	be128 *x = (be128 *)key;
 	u64 a, b;
+	int i;
+
+	if (IS_ENABLED(CONFIG_CRYPTO_GHASH_CLMUL_NI_AVX512) &&
+	    cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ) && use_avx512) {
+		for (i = 0; i < 16; i++)
+			ctx->hkey[(16 * 47) + i] = key[i];
+
+		ghash_precomp_avx512(ctx->hkey);
+	}
 
 	if (keylen != GHASH_BLOCK_SIZE)
 		return -EINVAL;
@@ -94,8 +124,13 @@ static int ghash_update(struct shash_desc *desc,
 		if (!dctx->bytes)
 			clmul_ghash_mul(dst, &ctx->shash);
 	}
-
-	clmul_ghash_update(dst, src, srclen, &ctx->shash);
+	if (IS_ENABLED(CONFIG_CRYPTO_GHASH_CLMUL_NI_AVX512) &&
+	    cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ) && use_avx512) {
+		/* Assembly code handles fragments in 16 byte multiples */
+		srclen = ALIGN_DOWN(srclen, 16);
+		clmul_ghash_update_avx512(dst, src, srclen, ctx->hkey);
+	} else
+		clmul_ghash_update(dst, src, srclen, &ctx->shash);
 	kernel_fpu_end();
 
 	if (srclen & 0xf) {
diff --git a/crypto/Kconfig b/crypto/Kconfig
index b090f14..70d1d35 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -637,6 +637,18 @@ config CRYPTO_CRCT10DIF_AVX512
 	depends on CRYPTO_CRCT10DIF_PCLMUL
 	depends on AS_VPCLMULQDQ
 
+# We default CRYPTO_GHASH_CLMUL_NI_AVX512 to Y but depend on CRYPTO_AVX512 in
+# order to have a single option (CRYPTO_AVX512) select multiple algorithms
+# when supported. Specifically, if the platform and/or toolchain does not
+# support VPCLMULQDQ, this algorithm should not be supported as part of
+# the set that CRYPTO_AVX512 selects.
+config CRYPTO_GHASH_CLMUL_NI_AVX512
+	bool
+	default y
+	depends on CRYPTO_AVX512
+	depends on CRYPTO_GHASH_CLMUL_NI_INTEL
+	depends on AS_VPCLMULQDQ
+
 config CRYPTO_CRC32C_SPARC64
 	tristate "CRC32c CRC algorithm (SPARC64)"
 	depends on SPARC64
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC V1 4/7] crypto: tcrypt - Add speed test for optimized GHASH computations
  2020-12-18 21:10 [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms Megha Dey
                   ` (2 preceding siblings ...)
  2020-12-18 21:11 ` [RFC V1 3/7] crypto: ghash - Optimized GHASH computations Megha Dey
@ 2020-12-18 21:11 ` Megha Dey
  2020-12-18 21:11 ` [RFC V1 5/7] crypto: aesni - AES CTR x86_64 "by16" AVX512 optimization Megha Dey
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Megha Dey @ 2020-12-18 21:11 UTC (permalink / raw)
  To: herbert, davem
  Cc: linux-crypto, linux-kernel, ravi.v.shankar, tim.c.chen,
	andi.kleen, dave.hansen, megha.dey, wajdi.k.feghali,
	greg.b.tucker, robert.a.kasten, rajendrakumar.chinnaiyan,
	tomasz.kantecki, ryan.d.saffores, ilya.albrekht, kyung.min.park,
	tony.luck, ira.weiny

From: Kyung Min Park <kyung.min.park@intel.com>

Add a speed test for the optimized GHASH computation with vectorized
instructions. Introduce a new test case (mode 428) to measure the
speed of this algorithm.

Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
Signed-off-by: Megha Dey <megha.dey@intel.com>
---
 crypto/tcrypt.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index a647bb2..6e2d74c6 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -2595,6 +2595,11 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb)
 				    generic_hash_speed_template, num_mb);
 		if (mode > 400 && mode < 500) break;
 		fallthrough;
+	case 428:
+		klen = 16;
+		test_ahash_speed("ghash", sec, generic_hash_speed_template);
+		if (mode > 400 && mode < 500) break;
+		fallthrough;
 	case 499:
 		break;
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC V1 5/7] crypto: aesni - AES CTR x86_64 "by16" AVX512 optimization
  2020-12-18 21:10 [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms Megha Dey
                   ` (3 preceding siblings ...)
  2020-12-18 21:11 ` [RFC V1 4/7] crypto: tcrypt - Add speed test for optimized " Megha Dey
@ 2020-12-18 21:11 ` Megha Dey
  2021-01-16 17:03   ` Ard Biesheuvel
  2020-12-18 21:11 ` [RFC V1 6/7] crypto: aesni - fix coding style for if/else block Megha Dey
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Megha Dey @ 2020-12-18 21:11 UTC (permalink / raw)
  To: herbert, davem
  Cc: linux-crypto, linux-kernel, ravi.v.shankar, tim.c.chen,
	andi.kleen, dave.hansen, megha.dey, wajdi.k.feghali,
	greg.b.tucker, robert.a.kasten, rajendrakumar.chinnaiyan,
	tomasz.kantecki, ryan.d.saffores, ilya.albrekht, kyung.min.park,
	tony.luck, ira.weiny

Introduce the "by16" implementation of the AES CTR mode using AVX512
optimizations. "by16" means that 16 independent blocks (each block
being 128 bits) can be ciphered simultaneously as opposed to the
current 8 blocks.
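
For reference, a minimal scalar C model of the data flow that the "by16"
scheme implements is sketched below. It is illustration only: the names
(aes_block_fn, ctr_by16_model, ctr128_inc) are made up for this sketch, the
AES rounds are abstracted behind a caller-supplied callback, and none of
this is the kernel implementation (which also writes the incremented
counter back as the next IV).

  #include <stddef.h>
  #include <stdint.h>

  /* Caller-supplied block cipher: encrypt one 16-byte block under 'key'. */
  typedef void (*aes_block_fn)(const void *key, const uint8_t in[16],
                               uint8_t out[16]);

  /* Big-endian increment of a 128-bit counter block. */
  static void ctr128_inc(uint8_t ctr[16])
  {
          int i;

          for (i = 15; i >= 0; i--)
                  if (++ctr[i])
                          break;
  }

  /*
   * One outer iteration produces up to 16 keystream blocks and XORs them
   * with the text; the assembly does the same work in parallel across four
   * ZMM registers (4 x 128-bit blocks each), using masked loads/stores for
   * a partial tail.
   */
  static void ctr_by16_model(aes_block_fn cipher, const void *key,
                             uint8_t ctr[16], const uint8_t *src,
                             uint8_t *dst, size_t len)
  {
          uint8_t ks[16];
          size_t i, n;
          int b;

          while (len) {
                  for (b = 0; b < 16 && len; b++) {
                          n = len < 16 ? len : 16;
                          cipher(key, ctr, ks);
                          ctr128_inc(ctr);
                          for (i = 0; i < n; i++)
                                  dst[i] = src[i] ^ ks[i];
                          src += n;
                          dst += n;
                          len -= n;
                  }
          }
  }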

The glue code in the AESNI module overrides the existing "by8" CTR mode
encryption/decryption routines with the "by16" ones when the following
criteria are met:
At compile time:
1. CONFIG_CRYPTO_AVX512 is enabled
2. The toolchain (assembler) supports VAES instructions
At runtime:
1. VAES and AVX512VL features are supported on the platform (currently
   only Icelake)
2. The aesni_intel.use_avx512 module parameter is set at boot time. For
   this algorithm, switching away from the AVX512 optimized version is not
   possible once set at boot time because of how the code is structured
   today. (This can be changed later if required; see the sketch below.)
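
A minimal sketch of that runtime gate, assuming the same pattern as the
GHASH glue earlier in this series (the helper name is made up, the module
parameter permissions are illustrative, kernel headers such as
linux/module.h and asm/cpufeature.h are assumed; the real check lives in
aesni-intel_glue.c):

  static bool use_avx512;
  module_param(use_avx512, bool, 0444);
  MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");

  /* Hypothetical helper: true when the "by16" routines may be installed. */
  static bool ctr_by16_usable(void)
  {
          return IS_ENABLED(CONFIG_CRYPTO_AES_CTR_AVX512) && use_avx512 &&
                 cpu_feature_enabled(X86_FEATURE_VAES) &&
                 cpu_feature_enabled(X86_FEATURE_AVX512VL);
  }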

The functions aes_ctr_enc_128_avx512_by16(), aes_ctr_enc_192_avx512_by16()
and aes_ctr_enc_256_avx512_by16() are adapted from Intel Optimized IPSEC
Cryptographic library.

On an Icelake desktop, with turbo disabled and all CPUs running at maximum
frequency, the "by16" CTR mode optimization shows better performance
across data and key sizes as measured by tcrypt.

The average performance improvement of the "by16" version over the "by8"
version is as follows:
For all key sizes (128/192/256 bits):
        data sizes < 128 bytes/block: negligible impact (~3% loss)
        data sizes > 128 bytes/block: an average improvement of 48% for
        both encryption and decryption.

A typical tcrypt run of AES CTR mode encryption/decryption with the
"by8" and "by16" optimizations on an Icelake desktop shows the following
results:

--------------------------------------------------------------
|  key   | bytes | cycles/op (lower is better)| percentage   |
| length |  per  |  encryption  |  decryption |  loss/gain   |
| (bits) | block |-------------------------------------------|
|        |       | by8  | by16  | by8  | by16 |  enc | dec   |
|------------------------------------------------------------|
|  128   |  16   | 156  | 168   | 164  | 168  | -7.7 |  -2.5 |
|  128   |  64   | 180  | 190   | 157  | 146  | -5.6 |   7.1 |
|  128   |  256  | 248  | 158   | 251  | 161  | 36.3 |  35.9 |
|  128   |  1024 | 633  | 316   | 642  | 319  | 50.1 |  50.4 |
|  128   |  1472 | 853  | 411   | 877  | 407  | 51.9 |  53.6 |
|  128   |  8192 | 4463 | 1959  | 4447 | 1940 | 56.2 |  56.4 |
|  192   |  16   | 136  | 145   | 149  | 166  | -6.7 | -11.5 |
|  192   |  64   | 159  | 154   | 157  | 160  |  3.2 |  -2   |
|  192   |  256  | 268  | 172   | 274  | 177  | 35.9 |  35.5 |
|  192   |  1024 | 710  | 358   | 720  | 355  | 49.6 |  50.7 |
|  192   |  1472 | 989  | 468   | 983  | 469  | 52.7 |  52.3 |
|  192   |  8192 | 6326 | 3551  | 6301 | 3567 | 43.9 |  43.4 |
|  256   |  16   | 153  | 165   | 139  | 156  | -7.9 | -12.3 |
|  256   |  64   | 158  | 152   | 174  | 161  |  3.8 |   7.5 |
|  256   |  256  | 283  | 176   | 287  | 202  | 37.9 |  29.7 |
|  256   |  1024 | 797  | 393   | 807  | 395  | 50.7 |  51.1 |
|  256   |  1472 | 1108 | 534   | 1107 | 527  | 51.9 |  52.4 |
|  256   |  8192 | 5763 | 2616  | 5773 | 2617 | 54.7 |  54.7 |
--------------------------------------------------------------

This work was inspired by the AES CTR mode optimization published
in Intel Optimized IPSEC Cryptographic library.
https://github.com/intel/intel-ipsec-mb/blob/master/lib/avx512/cntr_vaes_avx512.asm

Co-developed-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
Signed-off-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
Signed-off-by: Megha Dey <megha.dey@intel.com>
---
 arch/x86/crypto/Makefile                    |   1 +
 arch/x86/crypto/aes_ctrby16_avx512-x86_64.S | 856 ++++++++++++++++++++++++++++
 arch/x86/crypto/aesni-intel_glue.c          |  57 +-
 arch/x86/crypto/avx512_vaes_common.S        | 422 ++++++++++++++
 arch/x86/include/asm/disabled-features.h    |   8 +-
 crypto/Kconfig                              |  12 +
 6 files changed, 1354 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/crypto/aes_ctrby16_avx512-x86_64.S

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 0a86cfb..5fd9b35 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -53,6 +53,7 @@ chacha-x86_64-$(CONFIG_AS_AVX512) += chacha-avx512vl-x86_64.o
 obj-$(CONFIG_CRYPTO_AES_NI_INTEL) += aesni-intel.o
 aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o
 aesni-intel-$(CONFIG_64BIT) += aesni-intel_avx-x86_64.o aes_ctrby8_avx-x86_64.o
+aesni-intel-$(CONFIG_CRYPTO_AES_CTR_AVX512) += aes_ctrby16_avx512-x86_64.o
 
 obj-$(CONFIG_CRYPTO_SHA1_SSSE3) += sha1-ssse3.o
 sha1-ssse3-y := sha1_avx2_x86_64_asm.o sha1_ssse3_asm.o sha1_ssse3_glue.o
diff --git a/arch/x86/crypto/aes_ctrby16_avx512-x86_64.S b/arch/x86/crypto/aes_ctrby16_avx512-x86_64.S
new file mode 100644
index 0000000..7ccfcde
--- /dev/null
+++ b/arch/x86/crypto/aes_ctrby16_avx512-x86_64.S
@@ -0,0 +1,856 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright © 2020 Intel Corporation.
+ *
+ * Implement AES CTR mode by16 optimization with VAES instructions. (x86_64)
+ *
+ * This is AES128/192/256 CTR mode optimization implementation. It requires
+ * the support of Intel(R) AVX512VL and VAES instructions.
+ */
+
+#include "avx512_vaes_common.S"
+
+#define ZKEY0	%zmm17
+#define ZKEY1	%zmm18
+#define ZKEY2	%zmm19
+#define ZKEY3	%zmm20
+#define ZKEY4	%zmm21
+#define ZKEY5	%zmm22
+#define ZKEY6	%zmm23
+#define ZKEY7	%zmm24
+#define ZKEY8	%zmm25
+#define ZKEY9	%zmm26
+#define ZKEY10	%zmm27
+#define ZKEY11	%zmm28
+#define ZKEY12	%zmm29
+#define ZKEY13	%zmm30
+#define ZKEY14	%zmm31
+
+#define TMP0		%r10
+#define TMP1		%r11
+#define TMP2		%r12
+#define	TMP3		%rax
+#define DATA_OFFSET	%r13
+#define RBITS		%r14
+#define MASKREG		%k1
+#define SHUFREG		%zmm13
+#define ADD8REG		%zmm14
+
+#define CTR_BLOCKx	%xmm0
+#define CTR_BLOCK_1_4	%zmm1
+#define CTR_BLOCK_5_8	%zmm2
+#define CTR_BLOCK_9_12	%zmm3
+#define CTR_BLOCK_13_16	%zmm4
+
+#define ZTMP0		%zmm5
+#define ZTMP1		%zmm6
+#define ZTMP2		%zmm7
+#define ZTMP3		%zmm8
+#define ZTMP4		%zmm9
+#define ZTMP5		%zmm10
+#define ZTMP6		%zmm11
+#define ZTMP7		%zmm12
+
+#define	XTMP		%xmm15
+
+#define STACK_FRAME_SIZE_CTR	(5*8)	/* space for 5 GP registers */
+
+.text
+
+/* Save register content for the caller */
+#define FUNC_SAVE_CTR()				\
+	mov	%rsp, %rax;			\
+	sub	$STACK_FRAME_SIZE_CTR, %rsp;	\
+	and	$~63, %rsp;			\
+	mov	%r12, (%rsp);			\
+	mov	%r13, 0x8(%rsp);		\
+	mov	%rax, 0x18(%rsp);
+
+/* Restore register content for the caller */
+#define FUNC_RESTORE_CTR()			\
+	vzeroupper;			\
+	mov	(%rsp), %r12;		\
+	mov	0x8(%rsp), %r13;	\
+	mov	0x18(%rsp), %rsp;
+
+/*
+ * Maintain the bits from the output text when writing out the output blocks,
+ * in case there are some bits that do not require encryption
+ */
+#define PRESERVE_BITS(RBITS, LENGTH, CYPH_PLAIN_OUT, ZIN_OUT, ZTMP0, ZTMP1, ZTMP2, IA0, IA1, blocks_to_skip, FULL_PARTIAL, MASKREG, DATA_OFFSET, NUM_ARGS)	\
+/* offset = number of sets of 4 blocks to skip */			\
+.set offset, (((blocks_to_skip) / 4) * 64);				\
+\
+/* num_left_blocks = blocks in the last set, range 1-4 blocks	*/	\
+.set num_left_blocks,(((blocks_to_skip) & 3) + 1);			\
+\
+.if NUM_ARGS == 13;							\
+	/* Load output to get last partial byte */			\
+	.ifc FULL_PARTIAL, partial;					\
+		vmovdqu8	offset(CYPH_PLAIN_OUT, DATA_OFFSET), ZTMP0{MASKREG};	\
+	.else;								\
+		vmovdqu8	offset(CYPH_PLAIN_OUT, DATA_OFFSET), ZTMP0;		\
+	.endif;								\
+.else;									\
+	/* Load o/p to get last partial byte (up to the last 4 blocks) */\
+	ZMM_LOAD_MASKED_BLOCKS_0_16(num_left_blocks, CYPH_PLAIN_OUT, offset, ZTMP0, no_zmm, no_zmm, no_zmm, MASKREG)	\
+.endif;									\
+\
+	/* Save rcx in temporary GP register */				\
+	mov	%rcx, IA0;						\
+	mov	$0xff, DWORD(IA1);					\
+	mov	BYTE(RBITS), %cl;					\
+	/* e.g. 3 remaining bits=> mask = 00011111 */			\
+	shr	%cl, DWORD(IA1);					\
+	mov	IA0, %rcx;						\
+\
+	vmovq	IA1, XWORD(ZTMP1);					\
+\
+	/*
+	 * Get number of full bytes in last block.			\
+	 * Subtracting the bytes in the blocks to skip from the length of	\
+	 * the whole set of blocks gives us the number of bytes in the last	\
+	 * block, but the last block has a partial byte at the end, so	\
+	 * an extra byte needs to be subtracted.			\
+	 */								\
+	mov	LENGTH, IA1;						\
+	sub	$(blocks_to_skip * 16 + 1), IA1;			\
+	lea	shift_tab_16 + 16(%rip), IA0;				\
+	sub	IA1, IA0;						\
+	vmovdqu (IA0), XWORD(ZTMP2);					\
+	vpshufb XWORD(ZTMP2), XWORD(ZTMP1), XWORD(ZTMP1);		\
+.if num_left_blocks == 4;						\
+	vshufi64x2	$0x15, ZTMP1, ZTMP1, ZTMP1;			\
+.elseif num_left_blocks == 3;						\
+	vshufi64x2	$0x45, ZTMP1, ZTMP1, ZTMP1;			\
+.elseif num_left_blocks == 2;						\
+	vshufi64x2	$0x51, ZTMP1, ZTMP1, ZTMP1;			\
+.endif;	/* No need to shift if there is only one block */		\
+\
+	/*								\
+	 * At this point, ZTMP1 contains a mask with all 0s, but with	\
+	 * some 1s in the partial byte.					\
+	 * First, clear last bits (not to be ciphered) of last output	\
+	 * block. ZIN_OUT = ZIN_OUT AND NOT ZTMP1 (0x50 = andA!C)	\
+	 */								\
+	vpternlogq	$0x50, ZTMP1, ZTMP1, ZIN_OUT;			\
+\
+	/*								\
+	 * Then, set these last bits to the bits coming from output.	\
+	 * ZIN_OUT = ZIN_OUT OR (ZTMP0 AND ZTMP1) (0xF8 = orAandBC)	\
+	 */								\
+	vpternlogq	$0xF8, ZTMP1, ZTMP0, ZIN_OUT;
+
+/*
+ * The CTR (128 bits) needs to be incremented. Since there is no 128-bit add
+ * instruction, we increment the least significant 64 bits and, if an
+ * overflow is detected, increment the most significant 64 bits.
+ */
+#define INCR_CNTR_128(CTR, ZT, const, num)			\
+	vpaddq	const(%rip), XWORD(CTR), XTMP;			\
+	vptest	ddq_low_msk(%rip), XTMP;			\
+	jnz	64f;						\
+	vpaddq	ddq_high_add_1(%rip), XTMP, XTMP;		\
+	vpaddq	ddq_high_add_1(%rip), XWORD(CTR), XWORD(CTR);	\
+64:;	\
+	vinserti64x2	$num, XTMP, ZT, ZT;
+
+/* Increment 4, 128 bit counters stored in a ZMM register */
+#define INCR_CNTR_4_128(CTR, ZT)		\
+	vmovdqa64	XWORD(CTR), XWORD(ZT);	\
+	vshufi64x2	$0, ZT, ZT, ZT;		\
+	INCR_CNTR_128(CTR, ZT, ddq_add_1, 1)	\
+	INCR_CNTR_128(CTR, ZT, ddq_add_2, 2)	\
+	INCR_CNTR_128(CTR, ZT, ddq_add_3, 3)	\
+	vextracti32x4	$3, ZT, XWORD(CTR);
+
+#define up_count(CTR, NUM_blocks, num, ZTMP)		\
+.if NUM_blocks == num;					\
+	jmp	76f;					\
+.endif;							\
+.if NUM_blocks > num;					\
+	vpaddq	ddq_add_1(%rip), XWORD(CTR), XWORD(CTR);\
+	INCR_CNTR_4_128(CTR, ZTMP)			\
+.endif;							\
+76:;
+
+/* Increment 1 to 16 counters (1 to 4 ZMM registers, based on the number of blocks) */
+#define INCR_CNTR_NUM_BLOCKS(CNTR, ZTMP0, ZTMP1, ZTMP2, ZTMP3, NUM)	\
+	INCR_CNTR_4_128(CNTR, ZTMP0)	\
+	up_count(CNTR, NUM, 1, ZTMP1)	\
+	up_count(CNTR, NUM, 2, ZTMP2)	\
+	up_count(CNTR, NUM, 3, ZTMP3)
+
+#define UPDATE_COUNTERS(CTR, ZT1, ZT2, ZT3, ZT4, of_num, num)	\
+	vshufi64x2	$0, ZWORD(CTR), ZWORD(CTR), ZWORD(CTR); \
+	vmovq		XWORD(CTR), TMP3;			\
+	cmp		$~of_num, TMP3;				\
+	jb		77f;					\
+	INCR_CNTR_NUM_BLOCKS(CTR, ZT1, ZT2, ZT3, ZT4, num)	\
+	jmp		78f;					\
+77:;								\
+	vpaddd	ddq_add_0_3(%rip), ZWORD(CTR), ZT1;		\
+.if (num > 1);							\
+	vpaddd	ddq_add_4_7(%rip), ZWORD(CTR), ZT2;		\
+.endif;								\
+.if (num > 2);							\
+	vpaddd	ddq_add_8_11(%rip), ZWORD(CTR), ZT3;		\
+.endif;								\
+.if (num > 3);							\
+	vpaddd	ddq_add_12_15(%rip), ZWORD(CTR), ZT4;		\
+.endif;								\
+78:;
+
+/* Prepare the AES counter blocks */
+#define PREPARE_COUNTER_BLOCKS(CTR, ZT1, ZT2, ZT3, ZT4, num_initial_blocks)	\
+.if num_initial_blocks == 1;					\
+	vmovdqa64	XWORD(CTR), XWORD(ZT1);			\
+.elseif num_initial_blocks == 2;				\
+	vshufi64x2	$0, YWORD(CTR), YWORD(CTR), YWORD(ZT1); \
+	vmovq		XWORD(CTR), TMP3;			\
+	cmp		$~1, TMP3;				\
+	jb		50f;					\
+	vmovdqa64	XWORD(CTR), XWORD(ZT1);			\
+	vshufi64x2	$0, YWORD(ZT1), YWORD(ZT1), YWORD(ZT1); \
+	INCR_CNTR_128(CTR, ZT1, ddq_add_1, 1)			\
+	vextracti32x4	$1, YWORD(ZT1), XWORD(CTR);		\
+	jmp		55f;					\
+50:;								\
+	vpaddd	ddq_add_0_3(%rip), YWORD(ZT1), YWORD(ZT1);	\
+.elseif num_initial_blocks <= 4;				\
+	UPDATE_COUNTERS(CTR, ZT1, ZT2, ZT3, ZT4, 3, 1)		\
+.elseif num_initial_blocks <= 8;				\
+	UPDATE_COUNTERS(CTR, ZT1, ZT2, ZT3, ZT4, 7, 2)		\
+.elseif num_initial_blocks <= 12;				\
+	UPDATE_COUNTERS(CTR, ZT1, ZT2, ZT3, ZT4, 11, 3)		\
+.else;								\
+	UPDATE_COUNTERS(CTR, ZT1, ZT2, ZT3, ZT4, 15, 4)		\
+.endif;								\
+55:;
+
+/* Extract and Shuffle the updated counters for AES rounds */
+#define	EXTRACT_CNTR_VAL(ZT1, ZT2, ZT3, ZT4, SHUFREG, CTR, num_initial_blocks)	\
+.if num_initial_blocks == 1;					\
+	vpshufb		XWORD(SHUFREG), CTR, XWORD(ZT1);	\
+.elseif num_initial_blocks == 2;				\
+	vextracti32x4	$1, YWORD(ZT1), CTR;			\
+	vpshufb		YWORD(SHUFREG), YWORD(ZT1), YWORD(ZT1); \
+.elseif num_initial_blocks <= 4;				\
+	vextracti32x4	$(num_initial_blocks - 1), ZT1, CTR;	\
+	vpshufb		SHUFREG, ZT1, ZT1;			\
+.elseif num_initial_blocks == 5;				\
+	vmovdqa64	XWORD(ZT2), CTR;			\
+	vpshufb		SHUFREG, ZT1, ZT1;			\
+	vpshufb		XWORD(SHUFREG), XWORD(ZT2), XWORD(ZT2); \
+.elseif num_initial_blocks == 6;				\
+	vextracti32x4	$1, YWORD(ZT2), CTR;			\
+	vpshufb		SHUFREG, ZT1, ZT1;			\
+	vpshufb		YWORD(SHUFREG), YWORD(ZT2), YWORD(ZT2); \
+.elseif num_initial_blocks == 7;				\
+	vextracti32x4	$2, ZT2, CTR;				\
+	vpshufb		SHUFREG, ZT1, ZT1;			\
+	vpshufb		SHUFREG, ZT2, ZT2;			\
+.elseif num_initial_blocks == 8;				\
+	vextracti32x4	$3, ZT2, CTR;				\
+	vpshufb		SHUFREG, ZT1, ZT1;			\
+	vpshufb		SHUFREG, ZT2, ZT2;			\
+.elseif num_initial_blocks == 9;				\
+	vmovdqa64	XWORD(ZT3), CTR;			\
+	vpshufb		SHUFREG, ZT1, ZT1;			\
+	vpshufb		SHUFREG, ZT2, ZT2;			\
+	vpshufb		XWORD(SHUFREG), XWORD(ZT3), XWORD(ZT3); \
+.elseif num_initial_blocks == 10;				\
+	vextracti32x4	$1, YWORD(ZT3), CTR;			\
+	vpshufb		SHUFREG, ZT1, ZT1;			\
+	vpshufb		SHUFREG, ZT2, ZT2;			\
+	vpshufb		YWORD(SHUFREG), YWORD(ZT3), YWORD(ZT3); \
+.elseif num_initial_blocks == 11;				\
+	vextracti32x4	$2, ZT3, CTR;				\
+	vpshufb		SHUFREG, ZT1, ZT1;			\
+	vpshufb		SHUFREG, ZT2, ZT2;			\
+	vpshufb		SHUFREG, ZT3, ZT3;			\
+.elseif num_initial_blocks == 12;				\
+	vextracti32x4	$3, ZT3, CTR;				\
+	vpshufb		SHUFREG, ZT1, ZT1;			\
+	vpshufb		SHUFREG, ZT2, ZT2;			\
+	vpshufb		SHUFREG, ZT3, ZT3;			\
+.elseif num_initial_blocks == 13;				\
+	vmovdqa64	XWORD(ZT4), CTR;			\
+	vpshufb		SHUFREG, ZT1, ZT1;			\
+	vpshufb		SHUFREG, ZT2, ZT2;			\
+	vpshufb		SHUFREG, ZT3, ZT3;			\
+	vpshufb		XWORD(SHUFREG), XWORD(ZT4), XWORD(ZT4);	\
+.elseif num_initial_blocks == 14;				\
+	vextracti32x4	$1, YWORD(ZT4), CTR;			\
+	vpshufb		SHUFREG, ZT1, ZT1;			\
+	vpshufb		SHUFREG, ZT2, ZT2;			\
+	vpshufb		SHUFREG, ZT3, ZT3;			\
+	vpshufb		YWORD(SHUFREG), YWORD(ZT4), YWORD(ZT4);	\
+.elseif num_initial_blocks == 15;				\
+	vextracti32x4	$2, ZT4, CTR;				\
+	vpshufb		SHUFREG, ZT1, ZT1;			\
+	vpshufb		SHUFREG, ZT2, ZT2;			\
+	vpshufb		SHUFREG, ZT3, ZT3;			\
+	vpshufb		SHUFREG, ZT4, ZT4;			\
+.endif;
+
+/* AES rounds and XOR with plain/cipher text */
+#define AES_XOR_ROUNDS(ZT1, ZT2, ZT3, ZT4, ZKEY_0, ZKEY_1, ZKEY_2, ZKEY_3, ZKEY_4, ZKEY_5, ZKEY_6, ZKEY_7, ZKEY_8, ZKEY_9, ZKEY_10, ZKEY_11, ZKEY_12, ZKEY_13, ZKEY_14, ZT5, ZT6, ZT7, ZT8, num_initial_blocks, NROUNDS)	\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, ZT3, ZT4, ZKEY0, 0, ZT5, ZT6, ZT7, ZT8, num_initial_blocks, NROUNDS)	\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, ZT3, ZT4, ZKEY1, 1, ZT5, ZT6, ZT7, ZT8, num_initial_blocks, NROUNDS)	\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, ZT3, ZT4, ZKEY2, 2, ZT5, ZT6, ZT7, ZT8, num_initial_blocks, NROUNDS)	\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, ZT3, ZT4, ZKEY3, 3, ZT5, ZT6, ZT7, ZT8, num_initial_blocks, NROUNDS)	\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, ZT3, ZT4, ZKEY4, 4, ZT5, ZT6, ZT7, ZT8, num_initial_blocks, NROUNDS)	\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, ZT3, ZT4, ZKEY5, 5, ZT5, ZT6, ZT7, ZT8, num_initial_blocks, NROUNDS)	\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, ZT3, ZT4, ZKEY6, 6, ZT5, ZT6, ZT7, ZT8, num_initial_blocks, NROUNDS)	\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, ZT3, ZT4, ZKEY7, 7, ZT5, ZT6, ZT7, ZT8, num_initial_blocks, NROUNDS)	\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, ZT3, ZT4, ZKEY8, 8, ZT5, ZT6, ZT7, ZT8, num_initial_blocks, NROUNDS)	\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, ZT3, ZT4, ZKEY9, 9, ZT5, ZT6, ZT7, ZT8, num_initial_blocks, NROUNDS)	\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, ZT3, ZT4, ZKEY10, 10, ZT5, ZT6, ZT7, ZT8, num_initial_blocks, NROUNDS)	\
+.if NROUNDS == 9;	\
+	jmp 28f;	\
+.else;			\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, ZT3, ZT4, ZKEY11, 11, ZT5, ZT6, ZT7, ZT8, num_initial_blocks, NROUNDS)	\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, ZT3, ZT4, ZKEY12, 12, ZT5, ZT6, ZT7, ZT8, num_initial_blocks, NROUNDS)	\
+.if NROUNDS == 11;	\
+	jmp 28f;	\
+.else;			\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, ZT3, ZT4, ZKEY13, 13, ZT5, ZT6, ZT7, ZT8, num_initial_blocks, NROUNDS)	\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, ZT3, ZT4, ZKEY14, 14, ZT5, ZT6, ZT7, ZT8, num_initial_blocks, NROUNDS)	\
+.endif;			\
+.endif;			\
+28:;
+
+/* IV is updated to current counter + 1 and returned to the upper glue layers */
+#define UPDATE_IV(CTR)				\
+	vpaddd	ONE(%rip), CTR, CTR;		\
+	vptest	ddq_low_msk(%rip), CTR;		\
+	jnz 27f;				\
+	vpaddq	ddq_high_add_1(%rip), CTR, CTR;	\
+27:;						\
+	vpshufb SHUF_MASK(%rip), CTR, CTR;	\
+	vmovdqu CTR, (arg5);
+
+/*
+ * Macro with support for a partial final block. It may look similar to
+ * INITIAL_BLOCKS but its usage is different. It is not meant to cipher
+ * counter blocks for the main by16 loop; it just ciphers the given number
+ * of blocks. It is used for small packets (< 256 bytes). num_initial_blocks
+ * is expected to include the partial final block in the count.
+ */
+#define INITIAL_BLOCKS_PARTIAL(KEY, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, num_initial_blocks, CTR, ZT1, ZT2, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, IA0, IA1, MASKREG, SHUFREG, NROUNDS, RBITS)      \
+	/* get load/store mask */				\
+	lea	byte64_len_to_mask_table(%rip), IA0;		\
+	mov	LENGTH, IA1;					\
+.if num_initial_blocks > 12;					\
+	sub	 $192, IA1;					\
+.elseif num_initial_blocks > 8;					\
+	sub	$128, IA1;					\
+.elseif num_initial_blocks > 4;					\
+	sub	$64, IA1;					\
+.endif;								\
+	kmovq	(IA0, IA1, 8), MASKREG;				\
+\
+	ZMM_LOAD_MASKED_BLOCKS_0_16(num_initial_blocks, PLAIN_CYPH_IN, 1, ZT5, ZT6, ZT7, ZT8, MASKREG)	\
+	PREPARE_COUNTER_BLOCKS(CTR, ZT1, ZT2, ZT3, ZT4, num_initial_blocks)	\
+	EXTRACT_CNTR_VAL(ZT1, ZT2, ZT3, ZT4, SHUFREG, CTR, num_initial_blocks)	\
+	AES_XOR_ROUNDS(ZT1, ZT2, ZT3, ZT4, ZKEY0, ZKEY1, ZKEY2, ZKEY3, ZKEY4, ZKEY5, ZKEY6, ZKEY7, ZKEY8, ZKEY9, ZKEY10, ZKEY11, ZKEY12, ZKEY13, ZKEY14, ZT5, ZT6, ZT7, ZT8, num_initial_blocks, NROUNDS)     \
+	/* write cipher/plain text back to output */		\
+	ZMM_STORE_MASKED_BLOCKS_0_16(num_initial_blocks, CYPH_PLAIN_OUT, 1, ZT1, ZT2, ZT3, ZT4, MASKREG)	\
+	UPDATE_IV(XWORD(CTR))
+
+/* This macro is used to "warm up" the pipeline for the ENCRYPT_16_PARALLEL
+ * macro code. It is called only for data lengths of 256 bytes and above.
+ * The flow is as follows:
+ * - encrypt the initial num_initial_blocks blocks (can be 0)
+ * - encrypt the next 16 blocks
+ * - the 16th (last) block can be partial (lengths between 257 and 367 bytes)
+ * - partial block ciphering is handled within this macro
+ */
+#define INITIAL_BLOCKS(KEY, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, DATA_OFFSET, num_initial_blocks, CTR, CTR_1_4, CTR_5_8, CTR_9_12, CTR_13_16, ZT1, ZT2, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, IA0, IA1, MASKREG, SHUFREG, NROUNDS, RBITS)	\
+.if num_initial_blocks > 0;						\
+	/* load plain/cipher text */					\
+	ZMM_LOAD_BLOCKS_0_16(num_initial_blocks, PLAIN_CYPH_IN, 1, ZT5, ZT6, ZT7, ZT8, load_4_instead_of_3)	\
+	PREPARE_COUNTER_BLOCKS(CTR, ZT1, ZT2, ZT3, ZT4, num_initial_blocks)	\
+	EXTRACT_CNTR_VAL(ZT1, ZT2, ZT3, ZT4, SHUFREG, CTR, num_initial_blocks)	\
+	 AES_XOR_ROUNDS(ZT1, ZT2, ZT3, ZT4, ZKEY0, ZKEY1, ZKEY2, ZKEY3, ZKEY4, ZKEY5, ZKEY6, ZKEY7, ZKEY8, ZKEY9, ZKEY10, ZKEY11, ZKEY12, ZKEY13, ZKEY14, ZT5, ZT6, ZT7, ZT8, num_initial_blocks, NROUNDS)	\
+	/* write cipher/plain text back to output */			\
+	ZMM_STORE_BLOCKS_0_16(num_initial_blocks, CYPH_PLAIN_OUT, 1, ZT1, ZT2, ZT3, ZT4)	\
+	/* adjust data offset and length */				\
+	sub	$(num_initial_blocks * 16), LENGTH;			\
+	add	$(num_initial_blocks * 16), DATA_OFFSET;		\
+.endif;									\
+\
+	/* - cipher of num_initial_blocks is done			\
+	 * - prepare counter blocks for the next 16 blocks (ZT5-ZT8)	\
+	 * - shuffle the blocks for AES					\
+	 * - encrypt the next 16 blocks					\
+	 */								\
+\
+	 /* get text load/store mask (assume full mask by default) */	\
+	 mov	 $~0, IA0;						\
+.if num_initial_blocks > 0;						\
+	/* This macro is executed for length 256 and up, zero length	\
+	 * is checked in CNTR_ENC_DEC. We know there is a partial block	\
+	 * if: LENGTH - 16*num_initial_blocks < 256			\
+	 */								\
+	cmp	$256, LENGTH;						\
+	jge	56f;							\
+	mov	%rcx, IA1;						\
+	mov	$256, %ecx;						\
+	sub	LENGTH, %rcx;						\
+	shr	%cl, IA0;						\
+	mov	IA1, %rcx;						\
+56:;									\
+.endif;									\
+	kmovq	IA0, MASKREG;						\
+	/* load plain or cipher text */					\
+	vmovdqu8	(PLAIN_CYPH_IN, DATA_OFFSET, 1), ZT5;		\
+	vmovdqu8	64(PLAIN_CYPH_IN, DATA_OFFSET), ZT6;		\
+	vmovdqu8	128(PLAIN_CYPH_IN, DATA_OFFSET), ZT7;		\
+	vmovdqu8	192(PLAIN_CYPH_IN, DATA_OFFSET), ZT8{MASKREG}{z};	\
+\
+	/* prepare next counter blocks */				\
+	vshufi64x2	$0, ZWORD(CTR), ZWORD(CTR), ZWORD(CTR);		\
+.if num_initial_blocks > 0;						\
+	vmovq	XWORD(CTR), TMP3;					\
+	cmp	$~16, TMP3;						\
+	jb	58f;							\
+	vpaddq	ddq_add_1(%rip), XWORD(CTR), XWORD(CTR);		\
+	vptest	ddq_low_msk(%rip), XWORD(CTR);				\
+	jnz 57f;							\
+	vpaddq	ddq_high_add_1(%rip), XWORD(CTR), XWORD(CTR);		\
+57:;	\
+	INCR_CNTR_NUM_BLOCKS(CTR, CTR_1_4, CTR_5_8, CTR_9_12, CTR_13_16, 4)	\
+	jmp 60f;							\
+58:;									\
+	vpaddd	      ddq_add_1_4(%rip), ZWORD(CTR), CTR_1_4;		\
+	vpaddd	      ddq_add_5_8(%rip), ZWORD(CTR), CTR_5_8;		\
+	vpaddd	      ddq_add_9_12(%rip), ZWORD(CTR), CTR_9_12;		\
+	vpaddd	      ddq_add_13_16(%rip), ZWORD(CTR), CTR_13_16;	\
+.else;									\
+	vmovq	XWORD(CTR), TMP3;					\
+	cmp	$~15, TMP3;						\
+	jb	59f;							\
+	INCR_CNTR_NUM_BLOCKS(CTR, CTR_1_4, CTR_5_8, CTR_9_12, CTR_13_16, 4)	\
+	jmp 60f;							\
+59:;									\
+	vpaddd	      ddq_add_0_3(%rip), ZWORD(CTR), CTR_1_4;		\
+	vpaddd	      ddq_add_4_7(%rip), ZWORD(CTR), CTR_5_8;		\
+	vpaddd	      ddq_add_8_11(%rip), ZWORD(CTR), CTR_9_12;		\
+	vpaddd	      ddq_add_12_15(%rip), ZWORD(CTR), CTR_13_16;	\
+.endif;									\
+60:;									\
+\
+	vpshufb		SHUFREG, CTR_1_4, ZT1;				\
+	vpshufb		SHUFREG, CTR_5_8, ZT2;				\
+	vpshufb		SHUFREG, CTR_9_12, ZT3;				\
+	vpshufb		SHUFREG, CTR_13_16, ZT4;			\
+\
+	AES_XOR_ROUNDS(ZT1, ZT2, ZT3, ZT4, ZKEY0, ZKEY1, ZKEY2, ZKEY3, ZKEY4, ZKEY5, ZKEY6, ZKEY7, ZKEY8, ZKEY9, ZKEY10, ZKEY11, ZKEY12, ZKEY13, ZKEY14, ZT5, ZT6, ZT7, ZT8, 16, NROUNDS)	\
+\
+	/* write cipher/plain text back to output */			\
+	vmovdqu8	ZT1, (CYPH_PLAIN_OUT, DATA_OFFSET);		\
+	vmovdqu8	ZT2, 64(CYPH_PLAIN_OUT, DATA_OFFSET, 1);	\
+	vmovdqu8	ZT3, 128(CYPH_PLAIN_OUT, DATA_OFFSET, 1);	\
+	vmovdqu8	ZT4, 192(CYPH_PLAIN_OUT, DATA_OFFSET, 1){MASKREG};	\
+\
+	vextracti32x4 $3, CTR_13_16, XWORD(CTR);			\
+	UPDATE_IV(CTR)							\
+\
+	/* check if there is partial block */				\
+	cmp	$256, LENGTH;						\
+	jl	61f;							\
+	/* adjust offset and length */					\
+	add	$256, DATA_OFFSET;					\
+	sub	$256, LENGTH;						\
+	jmp	62f;							\
+61:;									\
+	/* zero the length (all encryption is complete) */		\
+	xor	LENGTH, LENGTH;						\
+62:;
+
+/*
+ * This macro ciphers payloads shorter than 256 bytes. The number of blocks in
+ * the message comes as an argument. Depending on the number of blocks, an
+ * optimized variant of INITIAL_BLOCKS_PARTIAL is invoked.
+ */
+#define CNTR_ENC_DEC_SMALL(KEY, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, NUM_BLOCKS, CTR, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, IA0, IA1, MASKREG, SHUFREG, NROUNDS, RBITS)  \
+	cmp	$8, NUM_BLOCKS;		\
+	je	38f;			\
+	jl	48f;			\
+	/* Initial blocks 9-16 */	\
+	cmp	$12, NUM_BLOCKS;	\
+	je	42f;			\
+	jl	49f;			\
+	/* Initial blocks 13-16 */	\
+	cmp	$16, NUM_BLOCKS;	\
+	je	46f;			\
+	cmp	$15, NUM_BLOCKS;	\
+	je	45f;			\
+	cmp	$14, NUM_BLOCKS;	\
+	je	44f;			\
+	cmp	$13, NUM_BLOCKS;	\
+	je	43f;			\
+49:;					\
+	cmp	$11, NUM_BLOCKS;	\
+	je	41f;			\
+	cmp	$10, NUM_BLOCKS;	\
+	je	40f;			\
+	cmp	$9, NUM_BLOCKS;		\
+	je	39f;			\
+48:;					\
+	cmp	$4, NUM_BLOCKS;		\
+	je	34f;			\
+	jl	47f;			\
+	/* Initial blocks 5-7 */	\
+	cmp	$7, NUM_BLOCKS;		\
+	je	37f;			\
+	cmp	$6, NUM_BLOCKS;		\
+	je	36f;			\
+	cmp	$5, NUM_BLOCKS;		\
+	je	35f;			\
+47:;					\
+	cmp	$3, NUM_BLOCKS;		\
+	je	33f;			\
+	cmp	$2, NUM_BLOCKS;		\
+	je	32f;			\
+	jmp	31f;			\
+46:;	\
+	INITIAL_BLOCKS_PARTIAL(KEY, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, 16, CTR, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, IA0, IA1, MASKREG, SHUFREG, NROUNDS, RBITS)      \
+	jmp	30f;	\
+45:;	\
+	INITIAL_BLOCKS_PARTIAL(KEY, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, 15, CTR, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, IA0, IA1, MASKREG, SHUFREG, NROUNDS, RBITS)      \
+	jmp	30f;	\
+44:;	\
+	INITIAL_BLOCKS_PARTIAL(KEY, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, 14, CTR, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, IA0, IA1, MASKREG, SHUFREG, NROUNDS, RBITS)      \
+	jmp	30f;	\
+43:;	\
+	INITIAL_BLOCKS_PARTIAL(KEY, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, 13, CTR, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, IA0, IA1, MASKREG, SHUFREG, NROUNDS, RBITS)      \
+	jmp	30f;	\
+42:;	\
+	INITIAL_BLOCKS_PARTIAL(KEY, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, 12, CTR, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, IA0, IA1, MASKREG, SHUFREG, NROUNDS, RBITS)      \
+	jmp	30f;	\
+41:;	\
+	INITIAL_BLOCKS_PARTIAL(KEY, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, 11, CTR, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, IA0, IA1, MASKREG, SHUFREG, NROUNDS, RBITS)      \
+	jmp	30f;	\
+40:;	\
+	INITIAL_BLOCKS_PARTIAL(KEY, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, 10, CTR, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, IA0, IA1, MASKREG, SHUFREG, NROUNDS, RBITS)      \
+	jmp	30f;	\
+39:;	\
+	INITIAL_BLOCKS_PARTIAL(KEY, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, 9, CTR, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, IA0, IA1, MASKREG, SHUFREG, NROUNDS, RBITS)	    \
+	jmp	30f;	\
+38:;	\
+	INITIAL_BLOCKS_PARTIAL(KEY, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, 8, CTR, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, IA0, IA1, MASKREG, SHUFREG, NROUNDS, RBITS)	    \
+	jmp	30f;	\
+37:;	\
+	INITIAL_BLOCKS_PARTIAL(KEY, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, 7, CTR, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, IA0, IA1, MASKREG, SHUFREG, NROUNDS, RBITS)	    \
+	jmp	30f;	\
+36:;	\
+	INITIAL_BLOCKS_PARTIAL(KEY, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, 6, CTR, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, IA0, IA1, MASKREG, SHUFREG, NROUNDS, RBITS)	    \
+	jmp	30f;	\
+35:;	\
+	INITIAL_BLOCKS_PARTIAL(KEY, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, 5, CTR, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, IA0, IA1, MASKREG, SHUFREG, NROUNDS, RBITS)	    \
+	jmp	30f;	\
+34:;	\
+	INITIAL_BLOCKS_PARTIAL(KEY, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, 4, CTR, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, IA0, IA1, MASKREG, SHUFREG, NROUNDS, RBITS)	    \
+	jmp	30f;	\
+33:;	\
+	INITIAL_BLOCKS_PARTIAL(KEY, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, 3, CTR, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, IA0, IA1, MASKREG, SHUFREG, NROUNDS, RBITS)	    \
+	jmp	30f;	\
+32:;	\
+	INITIAL_BLOCKS_PARTIAL(KEY, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, 2, CTR, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, IA0, IA1, MASKREG, SHUFREG, NROUNDS, RBITS)	    \
+	jmp	30f;	\
+31:;	\
+	INITIAL_BLOCKS_PARTIAL(KEY, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, 1, CTR, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, IA0, IA1, MASKREG, SHUFREG, NROUNDS, RBITS)	    \
+30:;
+
+/*
+ * This is the main CNTR macro. It operates on a single stream and
+ * encrypts 16 blocks at a time.
+ */
+#define ENCRYPT_16_PARALLEL(KEY, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, DATA_OFFSET, CTR_1_4, CTR_5_8, CTR_9_12, CTR_13_16, FULL_PARTIAL, IA0, IA1, LENGTH, ZT1, ZT2, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, MASKREG, SHUFREG, ADD8REG, NROUNDS, RBITS)	\
+	/* load/store mask (partial case) and load the text data */	\
+.ifc FULL_PARTIAL, full;	\
+	vmovdqu8	(PLAIN_CYPH_IN, DATA_OFFSET, 1), ZT5;		\
+	vmovdqu8	64(PLAIN_CYPH_IN, DATA_OFFSET, 1), ZT6;		\
+	vmovdqu8	128(PLAIN_CYPH_IN, DATA_OFFSET, 1), ZT7;	\
+	vmovdqu8	192(PLAIN_CYPH_IN, DATA_OFFSET, 1), ZT8;	\
+.else;									\
+	lea		byte64_len_to_mask_table(%rip), IA0;		\
+	mov		LENGTH, IA1;					\
+	sub		$192, IA1;					\
+	kmovq		(IA0, IA1, 8), MASKREG;				\
+	vmovdqu8	(PLAIN_CYPH_IN, DATA_OFFSET, 1), ZT5;		\
+	vmovdqu8	64(PLAIN_CYPH_IN, DATA_OFFSET, 1), ZT6;		\
+	vmovdqu8	128(PLAIN_CYPH_IN, DATA_OFFSET, 1), ZT7;	\
+	vmovdqu8	192(PLAIN_CYPH_IN, DATA_OFFSET, 1), ZT8{MASKREG}{z};	\
+.endif;									\
+	/*								\
+	 * populate counter blocks					\
+	 * CTR is shuffled outside the scope of this macro		\
+	 * it has to be kept in unshuffled form				\
+	 */								\
+	vpaddd		ADD8REG, CTR_1_4, CTR_1_4;			\
+	vpaddd		ADD8REG, CTR_5_8, CTR_5_8;			\
+	vpaddd		ADD8REG, CTR_9_12, CTR_9_12;			\
+	vpaddd		ADD8REG, CTR_13_16, CTR_13_16;			\
+	vpshufb		SHUFREG, CTR_1_4, ZT1;				\
+	vpshufb		SHUFREG, CTR_5_8, ZT2;				\
+	vpshufb		SHUFREG, CTR_9_12, ZT3;				\
+	vpshufb		SHUFREG, CTR_13_16, ZT4;			\
+	AES_XOR_ROUNDS(ZT1, ZT2, ZT3, ZT4, ZKEY0, ZKEY1, ZKEY2, ZKEY3, ZKEY4, ZKEY5, ZKEY6, ZKEY7, ZKEY8, ZKEY9, ZKEY10, ZKEY11, ZKEY12, ZKEY13, ZKEY14, ZT5, ZT6, ZT7, ZT8, 16, NROUNDS)     \
+	/*store the text data */					\
+	vmovdqu8	ZT1, (CYPH_PLAIN_OUT, DATA_OFFSET);		\
+	vmovdqu8	ZT2, 64(CYPH_PLAIN_OUT, DATA_OFFSET, 1);	\
+	vmovdqu8	ZT3, 128(CYPH_PLAIN_OUT, DATA_OFFSET, 1);	\
+.ifc FULL_PARTIAL, full;						\
+	vmovdqu8	ZT4, 192(CYPH_PLAIN_OUT, DATA_OFFSET, 1);	\
+.else;									\
+	vmovdqu8	ZT4, 192(CYPH_PLAIN_OUT, DATA_OFFSET, 1){MASKREG};	\
+.endif;
+
+/*
+ * CNTR_ENC_DEC encodes/decodes the given data. It requires the input data
+ * to be at least 1 byte long because of READ_SMALL_INPUT_DATA.
+ */
+#define CNTR_ENC_DEC(KEYS, DST, SRC, LEN, IV, NROUNDS)		\
+	or	LEN, LEN;					\
+	je	22f;						\
+/*								\
+ * Macro flow:							\
+ * - calculate the number of 16byte blocks in the message	\
+ * - process (number of 16byte blocks) mod 16			\
+ * - process 16x16 byte blocks at a time until all are done	\
+ */								\
+	xor	DATA_OFFSET, DATA_OFFSET;			\
+/* Prepare round keys */					\
+	vbroadcastf64x2 16*0(KEYS), ZKEY0;			\
+	vbroadcastf64x2 16*1(KEYS), ZKEY1;			\
+	vbroadcastf64x2 16*2(KEYS), ZKEY2;			\
+	vbroadcastf64x2 16*3(KEYS), ZKEY3;			\
+	vbroadcastf64x2 16*4(KEYS), ZKEY4;			\
+	vbroadcastf64x2 16*5(KEYS), ZKEY5;			\
+	vbroadcastf64x2 16*6(KEYS), ZKEY6;			\
+	vbroadcastf64x2 16*7(KEYS), ZKEY7;			\
+	vbroadcastf64x2 16*8(KEYS), ZKEY8;			\
+	vbroadcastf64x2 16*9(KEYS), ZKEY9;			\
+	vbroadcastf64x2 16*10(KEYS), ZKEY10;			\
+.if NROUNDS == 9;						\
+	jmp 23f;						\
+.else;								\
+	vbroadcastf64x2 16*11(KEYS), ZKEY11;			\
+	vbroadcastf64x2 16*12(KEYS), ZKEY12;			\
+.if NROUNDS == 11;						\
+	jmp 23f;						\
+.else;								\
+	vbroadcastf64x2 16*13(KEYS), ZKEY13;			\
+	vbroadcastf64x2 16*14(KEYS), ZKEY14;			\
+.endif;								\
+.endif;								\
+23:;								\
+	mov	$16, TMP2;					\
+	/* Set mask to read 16 IV bytes */			\
+	mov	mask_16_bytes(%rip), TMP0;			\
+	kmovq	TMP0, MASKREG;					\
+	vmovdqu8	(IV), CTR_BLOCKx{MASKREG};		\
+	vmovdqa64	SHUF_MASK(%rip), SHUFREG;		\
+	/* store IV as counter in LE format */			\
+	vpshufb XWORD(SHUFREG), CTR_BLOCKx, CTR_BLOCKx;		\
+	/* Determine how many blocks to process in INITIAL */	\
+	mov	LEN, TMP1;					\
+	shr	$4, TMP1;					\
+	and	$0xf, TMP1;					\
+	/*							\
+	 * Process one additional block in INITIAL_ macros if	\
+	 * there is a partial block.				\
+	 */							\
+	mov	LEN, TMP0;					\
+	and	$0xf, TMP0;					\
+	add	$0xf, TMP0;					\
+	shr	$4, TMP0;					\
+	add	TMP0, TMP1;					\
+	/* TMP1 can be in the range from 0 to 16 */		\
+\
+	/* Less than 256 bytes is handled by the small message	\
+	 * code, which can process up to 16 x 16-byte blocks	\
+	 */							\
+	cmp	$256, LEN;					\
+	jge	20f;						\
+	CNTR_ENC_DEC_SMALL(KEYS, DST, SRC, LEN, TMP1, CTR_BLOCKx, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, TMP0, TMP2, MASKREG, SHUFREG, NROUNDS, RBITS)	\
+	jmp	22f;						\
+20:;								\
+	and	$0xf, TMP1;					\
+	je	16f;						\
+	cmp	$15, TMP1;					\
+	je	15f;						\
+	cmp	$14, TMP1;					\
+	je	14f;						\
+	cmp	$13, TMP1;					\
+	je	13f;						\
+	cmp	$12, TMP1;					\
+	je	12f;						\
+	cmp	$11, TMP1;					\
+	je	11f;						\
+	cmp	$10, TMP1;					\
+	je	10f;						\
+	cmp	$9, TMP1;					\
+	je	9f;						\
+	cmp	$8, TMP1;					\
+	je	8f;						\
+	cmp	$7, TMP1;					\
+	je	7f;						\
+	cmp	$6, TMP1;					\
+	je	6f;						\
+	cmp	$5, TMP1;					\
+	je	5f;						\
+	cmp	$4, TMP1;					\
+	je	4f;						\
+	cmp	$3, TMP1;					\
+	je	3f;						\
+	cmp	$2, TMP1;					\
+	je	2f;						\
+	jmp	1f;						\
+\
+	and	$0xf, TMP1;					\
+	je	16f;						\
+	cmp	$8, TMP1;					\
+	je	8f;						\
+	jl	18f;						\
+	/* Initial blocks 9-15 */				\
+	cmp	$12, TMP1;					\
+	je	12f;						\
+	jl	19f;						\
+	/* Initial blocks 13-15 */				\
+	cmp	$15, TMP1;					\
+	je	15f;						\
+	cmp	$14, TMP1;					\
+	je	14f;						\
+	cmp	$13, TMP1;					\
+	je	13f;						\
+19:;								\
+	cmp	$11, TMP1;					\
+	je	11f;						\
+	cmp	$10, TMP1;					\
+	je	10f;						\
+	cmp	$9, TMP1;					\
+	je	9f;						\
+18:;								\
+	cmp	$4, TMP1;					\
+	je	4f;						\
+	jl	17f;						\
+	/* Initial blocks 5-7 */				\
+	cmp	$7, TMP1;					\
+	je	7f;						\
+	cmp	$6, TMP1;					\
+	je	6f;						\
+	cmp	$5, TMP1;					\
+	je	5f;						\
+17:;								\
+	cmp	$3, TMP1;					\
+	je	3f;						\
+	cmp	$2, TMP1;					\
+	je	2f;						\
+	jmp	1f;						\
+15:;								\
+	INITIAL_BLOCKS(KEYS, DST, SRC, LEN, DATA_OFFSET, 15, CTR_BLOCKx, CTR_BLOCK_1_4, CTR_BLOCK_5_8, CTR_BLOCK_9_12, CTR_BLOCK_13_16, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, TMP0, TMP1, MASKREG, SHUFREG, NROUNDS, RBITS)	\
+	jmp	21f;	\
+14:;	\
+	INITIAL_BLOCKS(KEYS, DST, SRC, LEN, DATA_OFFSET, 14, CTR_BLOCKx, CTR_BLOCK_1_4, CTR_BLOCK_5_8, CTR_BLOCK_9_12, CTR_BLOCK_13_16, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, TMP0, TMP1, MASKREG, SHUFREG, NROUNDS, RBITS)	\
+	jmp	21f;	\
+13:;	\
+	INITIAL_BLOCKS(KEYS, DST, SRC, LEN, DATA_OFFSET, 13, CTR_BLOCKx, CTR_BLOCK_1_4, CTR_BLOCK_5_8, CTR_BLOCK_9_12, CTR_BLOCK_13_16, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, TMP0, TMP1, MASKREG, SHUFREG, NROUNDS, RBITS)	\
+	jmp	21f;	\
+12:;	\
+	INITIAL_BLOCKS(KEYS, DST, SRC, LEN, DATA_OFFSET, 12, CTR_BLOCKx, CTR_BLOCK_1_4, CTR_BLOCK_5_8, CTR_BLOCK_9_12, CTR_BLOCK_13_16, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, TMP0, TMP1, MASKREG, SHUFREG, NROUNDS, RBITS)	\
+	jmp	21f;	\
+11:;	\
+	INITIAL_BLOCKS(KEYS, DST, SRC, LEN, DATA_OFFSET, 11, CTR_BLOCKx, CTR_BLOCK_1_4, CTR_BLOCK_5_8, CTR_BLOCK_9_12, CTR_BLOCK_13_16, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, TMP0, TMP1, MASKREG, SHUFREG, NROUNDS, RBITS)	\
+	jmp	21f;	\
+10:;	\
+	INITIAL_BLOCKS(KEYS, DST, SRC, LEN, DATA_OFFSET, 10, CTR_BLOCKx, CTR_BLOCK_1_4, CTR_BLOCK_5_8, CTR_BLOCK_9_12, CTR_BLOCK_13_16, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, TMP0, TMP1, MASKREG, SHUFREG, NROUNDS, RBITS)	\
+	jmp	21f;	\
+9:;	\
+	INITIAL_BLOCKS(KEYS, DST, SRC, LEN, DATA_OFFSET, 9, CTR_BLOCKx, CTR_BLOCK_1_4, CTR_BLOCK_5_8, CTR_BLOCK_9_12, CTR_BLOCK_13_16, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, TMP0, TMP1, MASKREG, SHUFREG, NROUNDS, RBITS)	\
+	jmp	21f;	\
+8:;	\
+	INITIAL_BLOCKS(KEYS, DST, SRC, LEN, DATA_OFFSET, 8, CTR_BLOCKx, CTR_BLOCK_1_4, CTR_BLOCK_5_8, CTR_BLOCK_9_12, CTR_BLOCK_13_16, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, TMP0, TMP1, MASKREG, SHUFREG, NROUNDS, RBITS)	\
+	jmp	21f;	\
+7:;	\
+	INITIAL_BLOCKS(KEYS, DST, SRC, LEN, DATA_OFFSET, 7, CTR_BLOCKx, CTR_BLOCK_1_4, CTR_BLOCK_5_8, CTR_BLOCK_9_12, CTR_BLOCK_13_16, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, TMP0, TMP1, MASKREG, SHUFREG, NROUNDS, RBITS)	\
+	jmp	21f;	\
+6:;	\
+	INITIAL_BLOCKS(KEYS, DST, SRC, LEN, DATA_OFFSET, 6, CTR_BLOCKx, CTR_BLOCK_1_4, CTR_BLOCK_5_8, CTR_BLOCK_9_12, CTR_BLOCK_13_16, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, TMP0, TMP1, MASKREG, SHUFREG, NROUNDS, RBITS)	\
+	jmp	21f;	\
+5:;	\
+	INITIAL_BLOCKS(KEYS, DST, SRC, LEN, DATA_OFFSET, 5, CTR_BLOCKx, CTR_BLOCK_1_4, CTR_BLOCK_5_8, CTR_BLOCK_9_12, CTR_BLOCK_13_16, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, TMP0, TMP1, MASKREG, SHUFREG, NROUNDS, RBITS)	\
+	jmp	21f;	\
+4:;	\
+	INITIAL_BLOCKS(KEYS, DST, SRC, LEN, DATA_OFFSET, 4, CTR_BLOCKx, CTR_BLOCK_1_4, CTR_BLOCK_5_8, CTR_BLOCK_9_12, CTR_BLOCK_13_16, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, TMP0, TMP1, MASKREG, SHUFREG, NROUNDS, RBITS)	\
+	jmp	21f;	\
+3:;	\
+	INITIAL_BLOCKS(KEYS, DST, SRC, LEN, DATA_OFFSET, 3, CTR_BLOCKx, CTR_BLOCK_1_4, CTR_BLOCK_5_8, CTR_BLOCK_9_12, CTR_BLOCK_13_16, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, TMP0, TMP1, MASKREG, SHUFREG, NROUNDS, RBITS)	\
+	jmp	21f;	\
+2:;	\
+	INITIAL_BLOCKS(KEYS, DST, SRC, LEN, DATA_OFFSET, 2, CTR_BLOCKx, CTR_BLOCK_1_4, CTR_BLOCK_5_8, CTR_BLOCK_9_12, CTR_BLOCK_13_16, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, TMP0, TMP1, MASKREG, SHUFREG, NROUNDS, RBITS)	\
+	jmp	21f;	\
+1:;	\
+	INITIAL_BLOCKS(KEYS, DST, SRC, LEN, DATA_OFFSET, 1, CTR_BLOCKx, CTR_BLOCK_1_4, CTR_BLOCK_5_8, CTR_BLOCK_9_12, CTR_BLOCK_13_16, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, TMP0, TMP1, MASKREG, SHUFREG, NROUNDS, RBITS)	\
+	jmp	21f;	\
+16:;	\
+	INITIAL_BLOCKS(KEYS, DST, SRC, LEN, DATA_OFFSET, 0, CTR_BLOCKx, CTR_BLOCK_1_4, CTR_BLOCK_5_8, CTR_BLOCK_9_12, CTR_BLOCK_13_16, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, TMP0, TMP1, MASKREG, SHUFREG, NROUNDS, RBITS)	\
+21:;	\
+	or		LEN, LEN;				\
+	je		22f;					\
+	vmovdqa64	ddq_add_16(%rip), ADD8REG;		\
+	/* Process 15 full blocks plus a partial block */	\
+	cmp		$256, LEN;				\
+	jl		24f;					\
+25:;								\
+	ENCRYPT_16_PARALLEL(KEYS, DST, SRC, DATA_OFFSET, CTR_BLOCK_1_4, CTR_BLOCK_5_8, CTR_BLOCK_9_12, CTR_BLOCK_13_16, full, TMP0, TMP1, LEN, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, MASKREG, SHUFREG, ADD8REG, NROUNDS, RBITS)		\
+	add	$256, DATA_OFFSET;				\
+	sub	$256, LEN;					\
+	vextracti32x4 $3, CTR_BLOCK_13_16, CTR_BLOCKx;		\
+	UPDATE_IV(CTR_BLOCKx)					\
+	cmp	$256, LEN;					\
+	jge	25b;						\
+26:;								\
+	/*							\
+	 * Test to see if we need a by 16 with partial block.	\
+	 * At this point bytes remaining should be either zero	\
+	 * or between 241-255.					\
+	 */							\
+	or	LEN, LEN;					\
+	je	22f;						\
+24:;								\
+	ENCRYPT_16_PARALLEL(KEYS, DST, SRC, DATA_OFFSET, CTR_BLOCK_1_4, CTR_BLOCK_5_8, CTR_BLOCK_9_12, CTR_BLOCK_13_16, partial, TMP0, TMP1, LEN, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, MASKREG, SHUFREG, ADD8REG, NROUNDS, RBITS)	\
+	vextracti32x4 $3, CTR_BLOCK_13_16, CTR_BLOCKx;		\
+	UPDATE_IV(CTR_BLOCKx)					\
+22:;
+
+#define AES_CNTR_ENC_AVX512_BY16(keys, out, in, len, iv, NROUNDS)	\
+	FUNC_SAVE_CTR()							\
+	/* call the aes main loop */					\
+	CNTR_ENC_DEC(keys, out, in, len, iv, NROUNDS)			\
+	FUNC_RESTORE_CTR()							\
+	ret;
+
+/* Routine to do AES128/192/256 CTR enc/decrypt "by16"
+ * void aes_ctr_enc_128_avx512_by16/ aes_ctr_enc_192_avx512_by16/
+ *	aes_ctr_enc_256_avx512_by16/
+ *		(void *keys,
+ *		 u8 *out,
+ *		 const u8 *in,
+ *		 unsigned int num_bytes,
+ *		 u8 *iv);
+ */
+SYM_FUNC_START(aes_ctr_enc_128_avx512_by16)
+	AES_CNTR_ENC_AVX512_BY16(arg1, arg2, arg3, arg4, arg5, 9)
+SYM_FUNC_END(aes_ctr_enc_128_avx512_by16)
+
+SYM_FUNC_START(aes_ctr_enc_192_avx512_by16)
+	AES_CNTR_ENC_AVX512_BY16(arg1, arg2, arg3, arg4, arg5, 11)
+SYM_FUNC_END(aes_ctr_enc_192_avx512_by16)
+
+SYM_FUNC_START(aes_ctr_enc_256_avx512_by16)
+	AES_CNTR_ENC_AVX512_BY16(arg1, arg2, arg3, arg4, arg5, 13)
+SYM_FUNC_END(aes_ctr_enc_256_avx512_by16)
diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index ad8a718..f45059e 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -46,6 +46,10 @@
 #define CRYPTO_AES_CTX_SIZE (sizeof(struct crypto_aes_ctx) + AESNI_ALIGN_EXTRA)
 #define XTS_AES_CTX_SIZE (sizeof(struct aesni_xts_ctx) + AESNI_ALIGN_EXTRA)
 
+static bool use_avx512;
+module_param(use_avx512, bool, 0644);
+MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");
+
 /* This data is stored at the end of the crypto_tfm struct.
  * It's a type of per "session" data storage location.
  * This needs to be 16 byte aligned.
@@ -191,6 +195,35 @@ asmlinkage void aes_ctr_enc_192_avx_by8(const u8 *in, u8 *iv,
 		void *keys, u8 *out, unsigned int num_bytes);
 asmlinkage void aes_ctr_enc_256_avx_by8(const u8 *in, u8 *iv,
 		void *keys, u8 *out, unsigned int num_bytes);
+
+#ifdef CONFIG_CRYPTO_AES_CTR_AVX512
+asmlinkage void aes_ctr_enc_128_avx512_by16(void *keys, u8 *out,
+					    const u8 *in,
+					    unsigned int num_bytes,
+					    u8 *iv);
+asmlinkage void aes_ctr_enc_192_avx512_by16(void *keys, u8 *out,
+					    const u8 *in,
+					    unsigned int num_bytes,
+					    u8 *iv);
+asmlinkage void aes_ctr_enc_256_avx512_by16(void *keys, u8 *out,
+					    const u8 *in,
+					    unsigned int num_bytes,
+					    u8 *iv);
+#else
+static inline void aes_ctr_enc_128_avx512_by16(void *keys, u8 *out,
+					       const u8 *in,
+					       unsigned int num_bytes,
+					       u8 *iv) {}
+static inline void aes_ctr_enc_192_avx512_by16(void *keys, u8 *out,
+					       const u8 *in,
+					       unsigned int num_bytes,
+					       u8 *iv) {}
+static inline void aes_ctr_enc_256_avx512_by16(void *keys, u8 *out,
+					       const u8 *in,
+					       unsigned int num_bytes,
+					       u8 *iv) {}
+#endif
+
 /*
  * asmlinkage void aesni_gcm_init_avx_gen2()
  * gcm_data *my_ctx_data, context data
@@ -487,6 +520,23 @@ static void aesni_ctr_enc_avx_tfm(struct crypto_aes_ctx *ctx, u8 *out,
 		aes_ctr_enc_256_avx_by8(in, iv, (void *)ctx, out, len);
 }
 
+static void aesni_ctr_enc_avx512_tfm(struct crypto_aes_ctx *ctx, u8 *out,
+				     const u8 *in, unsigned int len, u8 *iv)
+{
+	/*
+	 * Based on the key length, override with the by16 version
+	 * of CTR mode encryption/decryption for improved performance.
+	 * aes_set_key_common() ensures that the key length is one of
+	 * {128, 192, 256} bits.
+	 */
+	if (ctx->key_length == AES_KEYSIZE_128)
+		aes_ctr_enc_128_avx512_by16((void *)ctx, out, in, len, iv);
+	else if (ctx->key_length == AES_KEYSIZE_192)
+		aes_ctr_enc_192_avx512_by16((void *)ctx, out, in, len, iv);
+	else
+		aes_ctr_enc_256_avx512_by16((void *)ctx, out, in, len, iv);
+}
+
 static int ctr_crypt(struct skcipher_request *req)
 {
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
@@ -1076,7 +1126,12 @@ static int __init aesni_init(void)
 		aesni_gcm_tfm = &aesni_gcm_tfm_sse;
 	}
 	aesni_ctr_enc_tfm = aesni_ctr_enc;
-	if (boot_cpu_has(X86_FEATURE_AVX)) {
+	if (use_avx512 && IS_ENABLED(CONFIG_CRYPTO_AES_CTR_AVX512) &&
+	    cpu_feature_enabled(X86_FEATURE_VAES)) {
+		/* Ctr mode performance optimization using AVX512 */
+		aesni_ctr_enc_tfm = aesni_ctr_enc_avx512_tfm;
+		pr_info("AES CTR mode by16 optimization enabled\n");
+	} else if (boot_cpu_has(X86_FEATURE_AVX)) {
 		/* optimize performance of ctr mode encryption transform */
 		aesni_ctr_enc_tfm = aesni_ctr_enc_avx_tfm;
 		pr_info("AES CTR mode by8 optimization enabled\n");
diff --git a/arch/x86/crypto/avx512_vaes_common.S b/arch/x86/crypto/avx512_vaes_common.S
index f3ee898..a499e22 100644
--- a/arch/x86/crypto/avx512_vaes_common.S
+++ b/arch/x86/crypto/avx512_vaes_common.S
@@ -348,6 +348,96 @@ POLY2:
 byteswap_const:
 .octa	0x000102030405060708090A0B0C0D0E0F
 
+.align 16
+ddq_low_msk:
+.octa  0x0000000000000000FFFFFFFFFFFFFFFF
+
+.align 16
+ddq_high_add_1:
+.octa  0x00000000000000010000000000000000
+
+.align 16
+ddq_add_1:
+.octa  0x00000000000000000000000000000001
+
+.align 16
+ddq_add_2:
+.octa  0x00000000000000000000000000000002
+
+.align 16
+ddq_add_3:
+.octa  0x00000000000000000000000000000003
+
+.align 16
+ddq_add_4:
+.octa  0x00000000000000000000000000000004
+
+.align 64
+ddq_add_12_15:
+.octa  0x0000000000000000000000000000000c
+.octa  0x0000000000000000000000000000000d
+.octa  0x0000000000000000000000000000000e
+.octa  0x0000000000000000000000000000000f
+
+.align 64
+ddq_add_8_11:
+.octa  0x00000000000000000000000000000008
+.octa  0x00000000000000000000000000000009
+.octa  0x0000000000000000000000000000000a
+.octa  0x0000000000000000000000000000000b
+
+.align 64
+ddq_add_4_7:
+.octa  0x00000000000000000000000000000004
+.octa  0x00000000000000000000000000000005
+.octa  0x00000000000000000000000000000006
+.octa  0x00000000000000000000000000000007
+
+.align 64
+ddq_add_0_3:
+.octa  0x00000000000000000000000000000000
+.octa  0x00000000000000000000000000000001
+.octa  0x00000000000000000000000000000002
+.octa  0x00000000000000000000000000000003
+
+.align 64
+ddq_add_13_16:
+.octa  0x0000000000000000000000000000000d
+.octa  0x0000000000000000000000000000000e
+.octa  0x0000000000000000000000000000000f
+.octa  0x00000000000000000000000000000010
+
+.align 64
+ddq_add_9_12:
+.octa  0x00000000000000000000000000000009
+.octa  0x0000000000000000000000000000000a
+.octa  0x0000000000000000000000000000000b
+.octa  0x0000000000000000000000000000000c
+
+.align 64
+ddq_add_5_8:
+.octa  0x00000000000000000000000000000005
+.octa  0x00000000000000000000000000000006
+.octa  0x00000000000000000000000000000007
+.octa  0x00000000000000000000000000000008
+
+.align 64
+ddq_add_1_4:
+.octa  0x00000000000000000000000000000001
+.octa  0x00000000000000000000000000000002
+.octa  0x00000000000000000000000000000003
+.octa  0x00000000000000000000000000000004
+
+.align 64
+ddq_add_16:
+.octa  0x00000000000000000000000000000010
+.octa  0x00000000000000000000000000000010
+.octa  0x00000000000000000000000000000010
+.octa  0x00000000000000000000000000000010
+
+mask_16_bytes:
+.octa  0x000000000000ffff
+
 .text
 
 /* Save register content for the caller */
@@ -1209,3 +1299,335 @@ byteswap_const:
 	vpshufb XWORD(ZT13), XWORD(ZT1), XWORD(ZT1); \
 	GHASH_1_TO_16(GDATA_KEY, OFFSET, ZWORD(AAD_HASH), ZT0, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZWORD(AAD_HASH), ZT1, no_zmm, no_zmm, no_zmm, 1, 1, single_call, NULL, NULL, NULL, NULL, NULL, NULL) \
 0:;
+
+/*
+ * Generic macro to produce code that executes the OPCODE instruction
+ * on a selected number of AES blocks (16 bytes long) between 0 and 16.
+ * All three operands of the instruction come from registers.
+ */
+#define ZMM_OPCODE3_DSTR_SRC1R_SRC2R_BLOCKS_0_16(NUM_BLOCKS, OPCODE, DST0, DST1, DST2, DST3, SRC1_0, SRC1_1, SRC1_2, SRC1_3, SRC2_0, SRC2_1, SRC2_2, SRC2_3) \
+.set blocks_left,NUM_BLOCKS;					       \
+.if NUM_BLOCKS < 4;						       \
+       .if blocks_left == 1;					       \
+	       OPCODE	     XWORD(SRC2_0), XWORD(SRC1_0), XWORD(DST0);\
+       .elseif blocks_left == 2;				       \
+	       OPCODE	     YWORD(SRC2_0), YWORD(SRC1_0), YWORD(DST0);\
+       .elseif blocks_left == 3;				       \
+	       OPCODE	     SRC2_0, SRC1_0, DST0;		       \
+       .endif;							       \
+.elseif NUM_BLOCKS >= 4 && NUM_BLOCKS < 8;			       \
+       OPCODE  SRC2_0, SRC1_0, DST0;				       \
+       .set blocks_left, blocks_left - 4;			       \
+       .if blocks_left == 1;					       \
+	       OPCODE	     XWORD(SRC2_1), XWORD(SRC1_1), XWORD(DST1);\
+       .elseif blocks_left == 2;				       \
+	       OPCODE	     YWORD(SRC2_1), YWORD(SRC1_1), YWORD(DST1);\
+       .elseif blocks_left == 3;				       \
+	       OPCODE	     SRC2_1, SRC1_1, DST1;		       \
+       .endif;							       \
+.elseif NUM_BLOCKS >= 8 && NUM_BLOCKS < 12;			       \
+       OPCODE  SRC2_0, SRC1_0, DST0;				       \
+       .set blocks_left, blocks_left - 4;			       \
+       OPCODE  SRC2_1, SRC1_1, DST1;				       \
+       .set blocks_left, blocks_left - 4;			       \
+       .if blocks_left == 1;					       \
+	       OPCODE	     XWORD(SRC2_2), XWORD(SRC1_2), XWORD(DST2);\
+       .elseif blocks_left == 2;				       \
+	       OPCODE	     YWORD(SRC2_2), YWORD(SRC1_2), YWORD(DST2);\
+       .elseif blocks_left == 3;				       \
+	       OPCODE	     SRC2_2, SRC1_2, DST2;		       \
+       .endif;							       \
+.elseif NUM_BLOCKS >= 12 && NUM_BLOCKS < 16;			       \
+       OPCODE  SRC2_0, SRC1_0, DST0;				       \
+       .set blocks_left, blocks_left - 4;			       \
+       OPCODE  SRC2_1, SRC1_1, DST1;				       \
+       .set blocks_left, blocks_left - 4;			       \
+       OPCODE  SRC2_2, SRC1_2, DST2;				       \
+       .set blocks_left, blocks_left - 4;			       \
+       .if blocks_left == 1;					       \
+	       OPCODE	     XWORD(SRC2_3), XWORD(SRC1_3), XWORD(DST3);\
+       .elseif blocks_left == 2;				       \
+	       OPCODE	     YWORD(SRC2_3), YWORD(SRC1_3), YWORD(DST3);\
+       .elseif blocks_left == 3;				       \
+	       OPCODE	     SRC2_3, SRC1_3, DST3;		       \
+       .endif;							       \
+.else;								       \
+       OPCODE  SRC2_0, SRC1_0, DST0;				       \
+       .set blocks_left, blocks_left - 4;			       \
+       OPCODE  SRC2_1, SRC1_1, DST1;				       \
+       .set blocks_left, blocks_left - 4;			       \
+       OPCODE  SRC2_2, SRC1_2, DST2;				       \
+       .set blocks_left, blocks_left - 4;			       \
+       OPCODE  SRC2_3, SRC1_3, DST3;				       \
+       .set blocks_left, blocks_left - 4;			       \
+.endif;
+
+/*
+ * Handles AES encryption rounds, including the special cases of the first
+ * and last rounds. Optionally, it performs an XOR with data after the last
+ * AES round. Uses the NROUNDS parameter to check what needs to be done for
+ * the current round.
+ */
+#define ZMM_AESENC_ROUND_BLOCKS_0_16(L0B0_3, L0B4_7, L0B8_11, L0B12_15, KEY, ROUND, D0_3, D4_7, D8_11, D12_15, NUMBL, NROUNDS) \
+.if ROUND < 1;			       \
+       ZMM_OPCODE3_DSTR_SRC1R_SRC2R_BLOCKS_0_16(NUMBL, vpxorq, L0B0_3, L0B4_7, L0B8_11, L0B12_15, L0B0_3, L0B4_7, L0B8_11, L0B12_15, KEY, KEY, KEY, KEY)       \
+.endif;					       \
+.if (ROUND >= 1) && (ROUND <= NROUNDS);        \
+       ZMM_OPCODE3_DSTR_SRC1R_SRC2R_BLOCKS_0_16(NUMBL, vaesenc, L0B0_3, L0B4_7, L0B8_11, L0B12_15, L0B0_3, L0B4_7, L0B8_11, L0B12_15, KEY, KEY, KEY, KEY)      \
+.endif;					       \
+.if ROUND > NROUNDS;		       \
+       ZMM_OPCODE3_DSTR_SRC1R_SRC2R_BLOCKS_0_16(NUMBL, vaesenclast, L0B0_3, L0B4_7, L0B8_11, L0B12_15, L0B0_3, L0B4_7, L0B8_11, L0B12_15, KEY, KEY, KEY, KEY)  \
+       .ifnc D0_3, no_data;	       \
+       .ifnc D4_7, no_data;	       \
+       .ifnc D8_11, no_data;	       \
+       .ifnc D12_15, no_data;	       \
+	       ZMM_OPCODE3_DSTR_SRC1R_SRC2R_BLOCKS_0_16(NUMBL, vpxorq, L0B0_3, L0B4_7, L0B8_11, L0B12_15, L0B0_3, L0B4_7, L0B8_11, L0B12_15, D0_3, D4_7, D8_11, D12_15)        \
+       .endif;			       \
+       .endif;			       \
+       .endif;			       \
+       .endif;			       \
+.endif;
+
+/*
+ * Loads the specified number of AES blocks into ZMM registers, using a mask
+ * register for the last loaded register (xmm, ymm or zmm). Loads take place
+ * at 1-byte granularity.
+ */
+#define ZMM_LOAD_MASKED_BLOCKS_0_16(NUM_BLOCKS, INP, DATA_OFFSET, DST0, DST1, DST2, DST3, MASK)        \
+.set src_offset,0;						       \
+.set blocks_left, NUM_BLOCKS;					       \
+.if NUM_BLOCKS <= 4;						       \
+       .if blocks_left == 1;					       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), XWORD(DST0){MASK}{z};     \
+       .elseif blocks_left == 2;				       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), YWORD(DST0){MASK}{z};     \
+       .elseif (blocks_left == 3 || blocks_left == 4);		       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), DST0{MASK}{z};    \
+       .endif;							       \
+.elseif NUM_BLOCKS > 4 && NUM_BLOCKS <= 8;			       \
+       vmovdqu8        src_offset(INP, DATA_OFFSET), DST0;	       \
+       .set blocks_left, blocks_left - 4;			       \
+       .set src_offset, src_offset + 64;			       \
+       .if blocks_left == 1;					       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), XWORD(DST1){MASK}{z};     \
+       .elseif blocks_left == 2;				       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), YWORD(DST1){MASK}{z};     \
+       .elseif (blocks_left == 3 || blocks_left == 4);		       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), DST1{MASK}{z};    \
+       .endif;							       \
+.elseif NUM_BLOCKS > 8 && NUM_BLOCKS <= 12;			       \
+	vmovdqu8	src_offset(INP, DATA_OFFSET), DST0;	       \
+       .set blocks_left, blocks_left - 4;			       \
+       .set src_offset, src_offset + 64;			       \
+	vmovdqu8	src_offset(INP, DATA_OFFSET), DST1;	       \
+       .set blocks_left, blocks_left - 4;			       \
+       .set src_offset, src_offset + 64;			       \
+       .if blocks_left == 1;					       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), XWORD(DST2){MASK}{z};     \
+       .elseif blocks_left == 2;				       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), YWORD(DST2){MASK}{z};     \
+       .elseif (blocks_left == 3 || blocks_left == 4);		       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), DST2{MASK}{z};    \
+       .endif;							       \
+.else;								       \
+	vmovdqu8	src_offset(INP, DATA_OFFSET), DST0;	       \
+       .set blocks_left, blocks_left - 4;			       \
+       .set src_offset, src_offset + 64;			       \
+	vmovdqu8	src_offset(INP, DATA_OFFSET), DST1;	       \
+       .set blocks_left, blocks_left - 4;			       \
+       .set src_offset, src_offset + 64;			       \
+       vmovdqu8        src_offset(INP, DATA_OFFSET), DST2;	       \
+       .set blocks_left, blocks_left - 4;			       \
+       .set src_offset, src_offset + 64;			       \
+       .if blocks_left == 1;					       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), XWORD(DST3){MASK}{z};     \
+       .elseif blocks_left == 2;				       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), YWORD(DST3){MASK}{z};     \
+       .elseif (blocks_left == 3 || blocks_left == 4);		       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), DST3{MASK}{z};    \
+       .endif;							       \
+.endif;
+
+/*
+ * Stores the specified number of AES blocks from ZMM registers, using a mask
+ * register for the last stored register (xmm, ymm or zmm). Stores take place
+ * at 1-byte granularity.
+ */
+#define ZMM_STORE_MASKED_BLOCKS_0_16(NUM_BLOCKS, OUTP, DATA_OFFSET, SRC0, SRC1, SRC2, SRC3, MASK) \
+.set blocks_left, NUM_BLOCKS;					       \
+.set dst_offset, 0;						       \
+.if NUM_BLOCKS <= 4;						       \
+       .if blocks_left == 1;					       \
+	       vmovdqu8        XWORD(SRC0), dst_offset(OUTP, DATA_OFFSET){MASK};\
+       .elseif blocks_left == 2;				       \
+	       vmovdqu8        YWORD(SRC0), dst_offset(OUTP, DATA_OFFSET){MASK};\
+       .elseif (blocks_left == 3 || blocks_left == 4);		       \
+	       vmovdqu8        SRC0, dst_offset(OUTP, DATA_OFFSET){MASK};      \
+       .endif;							       \
+.elseif NUM_BLOCKS > 4 && NUM_BLOCKS <=8;			       \
+       vmovdqu8        SRC0, dst_offset(OUTP, DATA_OFFSET);	       \
+       .set blocks_left, blocks_left - 4;			       \
+       .set dst_offset, dst_offset + 64;			       \
+       .if blocks_left == 1;					       \
+	       vmovdqu8        XWORD(SRC1), dst_offset(OUTP, DATA_OFFSET){MASK};\
+       .elseif blocks_left == 2;				       \
+	       vmovdqu8        YWORD(SRC1), dst_offset(OUTP, DATA_OFFSET){MASK};\
+       .elseif (blocks_left == 3 || blocks_left == 4);		       \
+	       vmovdqu8        SRC1, dst_offset(OUTP, DATA_OFFSET){MASK};      \
+       .endif;							       \
+.elseif NUM_BLOCKS > 8 && NUM_BLOCKS <= 12;			       \
+       vmovdqu8        SRC0, dst_offset(OUTP, DATA_OFFSET);	       \
+       .set blocks_left, blocks_left - 4;			       \
+       .set dst_offset, dst_offset + 64;			       \
+       vmovdqu8        SRC1, dst_offset(OUTP, DATA_OFFSET);	       \
+       .set blocks_left, blocks_left - 4;			       \
+       .set dst_offset, dst_offset + 64;			       \
+       .if blocks_left == 1;					       \
+	       vmovdqu8        XWORD(SRC2), dst_offset(OUTP, DATA_OFFSET){MASK};\
+       .elseif blocks_left == 2;				       \
+	       vmovdqu8        YWORD(SRC2), dst_offset(OUTP, DATA_OFFSET){MASK};\
+       .elseif (blocks_left == 3 || blocks_left == 4);		       \
+	       vmovdqu8        SRC2, dst_offset(OUTP, DATA_OFFSET){MASK};      \
+       .endif;							       \
+.else;								       \
+       vmovdqu8        SRC0, dst_offset(OUTP, DATA_OFFSET);	       \
+       .set blocks_left, blocks_left - 4;			       \
+       .set dst_offset, dst_offset + 64;			       \
+       vmovdqu8        SRC1, dst_offset(OUTP, DATA_OFFSET);	       \
+       .set blocks_left, blocks_left - 4;			       \
+       .set dst_offset, dst_offset + 64;			       \
+       vmovdqu8        SRC2, dst_offset(OUTP, DATA_OFFSET);	       \
+       .set blocks_left, blocks_left - 4;			       \
+       .set dst_offset, dst_offset + 64;			       \
+       .if blocks_left == 1;					       \
+	       vmovdqu8        XWORD(SRC3), dst_offset(OUTP, DATA_OFFSET){MASK};\
+       .elseif blocks_left == 2;				       \
+	       vmovdqu8        YWORD(SRC3), dst_offset(OUTP, DATA_OFFSET){MASK};\
+       .elseif (blocks_left == 3 || blocks_left == 4);		       \
+	       vmovdqu8        SRC3, dst_offset(OUTP, DATA_OFFSET){MASK};      \
+       .endif;							       \
+.endif;
+
+/* Loads the specified number of AES blocks into ZMM registers */
+#define ZMM_LOAD_BLOCKS_0_16(NUM_BLOCKS, INP, DATA_OFFSET, DST0, DST1, DST2, DST3, FLAGS) \
+.set src_offset, 0;						       \
+.set blocks_left, NUM_BLOCKS % 4;				       \
+.if NUM_BLOCKS < 4;						       \
+       .if blocks_left == 1;					       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), XWORD(DST0);      \
+       .elseif blocks_left == 2;				       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), YWORD(DST0);      \
+       .elseif blocks_left == 3;				       \
+	       .ifc FLAGS, load_4_instead_of_3;			       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), DST0;     \
+	       .else;						       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), YWORD(DST0);      \
+	       vinserti64x2    $2, src_offset + 32(INP, DATA_OFFSET), DST0, DST0;      \
+	       .endif;						       \
+       .endif;							       \
+.elseif NUM_BLOCKS >= 4 && NUM_BLOCKS < 8;			       \
+       vmovdqu8        src_offset(INP, DATA_OFFSET), DST0;	       \
+       .set src_offset, src_offset + 64;			       \
+       .if blocks_left == 1;					       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), XWORD(DST1);      \
+       .elseif blocks_left == 2;				       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), YWORD(DST1);      \
+       .elseif blocks_left == 3;				       \
+	       .ifc FLAGS, load_4_instead_of_3;			       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), DST1;     \
+	       .else;						       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), YWORD(DST1);      \
+	       vinserti64x2    $2, src_offset + 32(INP, DATA_OFFSET), DST1, DST1;      \
+	       .endif;						       \
+       .endif;							       \
+.elseif NUM_BLOCKS >= 8 && NUM_BLOCKS < 12;			       \
+       vmovdqu8        src_offset(INP, DATA_OFFSET), DST0;	       \
+       .set src_offset, src_offset + 64;			       \
+       vmovdqu8        src_offset(INP, DATA_OFFSET), DST1;	       \
+       .set src_offset, src_offset + 64;			       \
+       .if blocks_left == 1;					       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), XWORD(DST2);      \
+       .elseif blocks_left == 2;				       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), YWORD(DST2);      \
+       .elseif blocks_left == 3;				       \
+	       .ifc FLAGS, load_4_instead_of_3;			       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), DST2;     \
+	       .else;						       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), YWORD(DST2);      \
+	       vinserti64x2    $2, src_offset + 32(INP, DATA_OFFSET), DST2, DST2; \
+	       .endif;						       \
+       .endif;							       \
+.else;								       \
+       vmovdqu8        src_offset(INP, DATA_OFFSET), DST0;	       \
+       .set src_offset, src_offset + 64;			       \
+       vmovdqu8        src_offset(INP, DATA_OFFSET), DST1;	       \
+       .set src_offset, src_offset + 64;			       \
+       vmovdqu8        src_offset(INP, DATA_OFFSET), DST2;	       \
+       .set src_offset, src_offset + 64;			       \
+       .if blocks_left == 1;					       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), XWORD(DST3);      \
+       .elseif blocks_left == 2;				       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), YWORD(DST3);      \
+       .elseif blocks_left == 3;				       \
+	       .ifc FLAGS, load_4_instead_of_3;			       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), DST3;     \
+	       .else;						       \
+	       vmovdqu8        src_offset(INP, DATA_OFFSET), YWORD(DST3);      \
+	       vinserti64x2    $2, src_offset + 32(INP, DATA_OFFSET), DST3, DST3; \
+	       .endif;						       \
+       .endif;							       \
+.endif;
+
+/* Stores the specified number of AES blocks from ZMM registers */
+#define ZMM_STORE_BLOCKS_0_16(NUM_BLOCKS, OUTP, DATA_OFFSET, SRC0, SRC1, SRC2, SRC3)   \
+.set dst_offset, 0;						       \
+.set blocks_left, NUM_BLOCKS % 4;				       \
+.if NUM_BLOCKS < 4;						       \
+       .if blocks_left == 1;					       \
+	       vmovdqu8        XWORD(SRC0), dst_offset(OUTP, DATA_OFFSET);     \
+       .elseif blocks_left == 2;				       \
+	       vmovdqu8        YWORD(SRC0), dst_offset(OUTP, DATA_OFFSET);     \
+       .elseif blocks_left == 3;				       \
+	       vmovdqu8        YWORD(SRC0), dst_offset(OUTP, DATA_OFFSET);     \
+	       vextracti32x4   $2, SRC0, dst_offset + 32(OUTP, DATA_OFFSET);   \
+       .endif;							       \
+.elseif NUM_BLOCKS >= 4 && NUM_BLOCKS < 8;			       \
+       vmovdqu8        SRC0, dst_offset(OUTP, DATA_OFFSET);	       \
+       .set dst_offset, dst_offset + 64;			       \
+       .if blocks_left == 1;					       \
+	       vmovdqu8        XWORD(SRC1), dst_offset(OUTP, DATA_OFFSET);     \
+       .elseif blocks_left == 2;				       \
+	       vmovdqu8        YWORD(SRC1), dst_offset(OUTP, DATA_OFFSET);     \
+       .elseif blocks_left == 3;				       \
+	       vmovdqu8        YWORD(SRC1), dst_offset(OUTP, DATA_OFFSET);     \
+	       vextracti32x4   $2, SRC1, dst_offset + 32(OUTP, DATA_OFFSET);   \
+       .endif;							       \
+.elseif NUM_BLOCKS >= 8 && NUM_BLOCKS < 12;			       \
+       vmovdqu8        SRC0, dst_offset(OUTP, DATA_OFFSET);	       \
+       .set dst_offset, dst_offset + 64;			       \
+       vmovdqu8        SRC1, dst_offset(OUTP, DATA_OFFSET);	       \
+       .set dst_offset, dst_offset + 64;			       \
+       .if blocks_left == 1;					       \
+	       vmovdqu8        XWORD(SRC2), dst_offset(OUTP, DATA_OFFSET);     \
+       .elseif blocks_left == 2;				       \
+	       vmovdqu8        YWORD(SRC2), dst_offset(OUTP, DATA_OFFSET);     \
+       .elseif blocks_left == 3;				       \
+	       vmovdqu8        YWORD(SRC2), dst_offset(OUTP, DATA_OFFSET);     \
+	       vextracti32x4   $2, SRC2, dst_offset + 32(OUTP, DATA_OFFSET);   \
+       .endif;							       \
+.else;								       \
+       vmovdqu8        SRC0, dst_offset(OUTP, DATA_OFFSET);	       \
+       .set dst_offset, dst_offset + 64;			       \
+       vmovdqu8        SRC1, dst_offset(OUTP, DATA_OFFSET);	       \
+       .set dst_offset, dst_offset + 64;			       \
+       vmovdqu8        SRC2, dst_offset(OUTP, DATA_OFFSET);	       \
+       .set dst_offset, dst_offset + 64;			       \
+       .if blocks_left == 1;					       \
+	       vmovdqu8        XWORD(SRC3), dst_offset(OUTP, DATA_OFFSET);     \
+       .elseif blocks_left == 2;				       \
+	       vmovdqu8        YWORD(SRC3), dst_offset(OUTP, DATA_OFFSET);     \
+       .elseif blocks_left == 3;				       \
+	       vmovdqu8        YWORD(SRC3), dst_offset(OUTP, DATA_OFFSET);     \
+	       vextracti32x4   $2, SRC3, dst_offset + 32(OUTP, DATA_OFFSET);   \
+       .endif;							       \
+.endif;
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 1192dea..251c652 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -62,6 +62,12 @@
 # define DISABLE_VPCLMULQDQ	(1 << (X86_FEATURE_VPCLMULQDQ & 31))
 #endif
 
+#if defined(CONFIG_AS_VAES_AVX512)
+# define DISABLE_VAES		0
+#else
+# define DISABLE_VAES		(1 << (X86_FEATURE_VAES & 31))
+#endif
+
 #ifdef CONFIG_IOMMU_SUPPORT
 # define DISABLE_ENQCMD	0
 #else
@@ -88,7 +94,7 @@
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
-			 DISABLE_ENQCMD|DISABLE_VPCLMULQDQ)
+			 DISABLE_ENQCMD|DISABLE_VPCLMULQDQ|DISABLE_VAES)
 #define DISABLED_MASK17	0
 #define DISABLED_MASK18	0
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
diff --git a/crypto/Kconfig b/crypto/Kconfig
index 70d1d35..3043849 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -649,6 +649,18 @@ config CRYPTO_GHASH_CLMUL_NI_AVX512
 	depends on CRYPTO_GHASH_CLMUL_NI_INTEL
 	depends on AS_VPCLMULQDQ
 
+# We default CRYPTO_AES_CTR_AVX512 to Y but depend on CRYPTO_AVX512 in
+# order to have a single option (CRYPTO_AVX512) select multiple algorithms
+# when supported. Specifically, if the platform and/or toolchain does not
+# support VAES, then this algorithm should not be part of the set of
+# algorithms that CRYPTO_AVX512 selects.
+config CRYPTO_AES_CTR_AVX512
+	bool
+	default y
+	depends on CRYPTO_AVX512
+	depends on CRYPTO_AES_NI_INTEL
+	depends on AS_VAES_AVX512
+
 config CRYPTO_CRC32C_SPARC64
 	tristate "CRC32c CRC algorithm (SPARC64)"
 	depends on SPARC64
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC V1 6/7] crypto: aesni - fix coding style for if/else block
  2020-12-18 21:10 [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms Megha Dey
                   ` (4 preceding siblings ...)
  2020-12-18 21:11 ` [RFC V1 5/7] crypto: aesni - AES CTR x86_64 "by16" AVX512 optimization Megha Dey
@ 2020-12-18 21:11 ` Megha Dey
  2020-12-18 21:11 ` [RFC V1 7/7] crypto: aesni - AVX512 version of AESNI-GCM using VPCLMULQDQ Megha Dey
  2020-12-21 23:20 ` [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms Eric Biggers
  7 siblings, 0 replies; 28+ messages in thread
From: Megha Dey @ 2020-12-18 21:11 UTC (permalink / raw)
  To: herbert, davem
  Cc: linux-crypto, linux-kernel, ravi.v.shankar, tim.c.chen,
	andi.kleen, dave.hansen, megha.dey, wajdi.k.feghali,
	greg.b.tucker, robert.a.kasten, rajendrakumar.chinnaiyan,
	tomasz.kantecki, ryan.d.saffores, ilya.albrekht, kyung.min.park,
	tony.luck, ira.weiny

The if/else block in aesni_init() does not follow the required coding
conventions. If other conditionals are added to the block, it becomes
very difficult to parse. Use the correct coding style instead.

Signed-off-by: Megha Dey <megha.dey@intel.com>
---
 arch/x86/crypto/aesni-intel_glue.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index f45059e..9e56cdf 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -1117,8 +1117,7 @@ static int __init aesni_init(void)
 	if (boot_cpu_has(X86_FEATURE_AVX2)) {
 		pr_info("AVX2 version of gcm_enc/dec engaged.\n");
 		aesni_gcm_tfm = &aesni_gcm_tfm_avx_gen4;
-	} else
-	if (boot_cpu_has(X86_FEATURE_AVX)) {
+	} else if (boot_cpu_has(X86_FEATURE_AVX)) {
 		pr_info("AVX version of gcm_enc/dec engaged.\n");
 		aesni_gcm_tfm = &aesni_gcm_tfm_avx_gen2;
 	} else {
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC V1 7/7] crypto: aesni - AVX512 version of AESNI-GCM using VPCLMULQDQ
  2020-12-18 21:10 [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms Megha Dey
                   ` (5 preceding siblings ...)
  2020-12-18 21:11 ` [RFC V1 6/7] crypto: aesni - fix coding style for if/else block Megha Dey
@ 2020-12-18 21:11 ` Megha Dey
  2021-01-16 17:16   ` Ard Biesheuvel
  2020-12-21 23:20 ` [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms Eric Biggers
  7 siblings, 1 reply; 28+ messages in thread
From: Megha Dey @ 2020-12-18 21:11 UTC (permalink / raw)
  To: herbert, davem
  Cc: linux-crypto, linux-kernel, ravi.v.shankar, tim.c.chen,
	andi.kleen, dave.hansen, megha.dey, wajdi.k.feghali,
	greg.b.tucker, robert.a.kasten, rajendrakumar.chinnaiyan,
	tomasz.kantecki, ryan.d.saffores, ilya.albrekht, kyung.min.park,
	tony.luck, ira.weiny

Introduce the AVX512 implementation that optimizes the AESNI-GCM encode
and decode routines using VPCLMULQDQ.

The glue code in the AESNI module overrides the existing AVX2 GCM mode
encryption/decryption routines with the AVX512 AES GCM mode ones when the
following criteria are met:
At compile time:
1. CONFIG_CRYPTO_AVX512 is enabled
2. The toolchain (assembler) supports VPCLMULQDQ instructions
At runtime:
1. VPCLMULQDQ and AVX512VL features are supported on a platform
   (currently only Icelake)
2. The aesni_intel.use_avx512 module parameter is set at boot time. For this
   algorithm, switching away from the AVX512 optimized version is not
   possible once it is set at boot time because of how the code is
   structured today. (This can be changed later if required; see the sketch
   below.)
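
For reference, a minimal sketch of the expected runtime gating in
aesni_init(), written in the same style as the existing CTR and AVX2 GCM
selection. The aesni_gcm_tfm_avx512 ops name below is only a placeholder
for illustration and is not necessarily the exact symbol used by this patch:

	if (use_avx512 && IS_ENABLED(CONFIG_CRYPTO_AES_GCM_AVX512) &&
	    cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ)) {
		/* GCM mode performance optimization using AVX512 */
		pr_info("AVX512 version of gcm_enc/dec engaged.\n");
		aesni_gcm_tfm = &aesni_gcm_tfm_avx512;	/* placeholder name */
	} else if (boot_cpu_has(X86_FEATURE_AVX2)) {
		pr_info("AVX2 version of gcm_enc/dec engaged.\n");
		aesni_gcm_tfm = &aesni_gcm_tfm_avx_gen4;
	}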

The functions aesni_gcm_init_avx_512, aesni_gcm_enc_update_avx_512,
aesni_gcm_dec_update_avx_512 and aesni_gcm_finalize_avx_512 are adapted
from the Intel Optimized IPSEC Cryptographic library.

On an Icelake desktop, with turbo disabled and all CPUs running at
maximum frequency, the AVX512 GCM mode optimization shows better
performance across data & key sizes as measured by tcrypt.

The average performance improvement of the AVX512 version over the AVX2
version is as follows:
For all key sizes (128/192/256 bits):
        data sizes < 128 bytes/block: negligible improvement (~7.5%)
        data sizes > 128 bytes/block: an average improvement of 40% for
        both encryption and decryption.

A typical run of tcrypt with AES GCM mode encryption/decryption using the
AVX2 and AVX512 optimizations on an Icelake desktop shows the following
results:

  ----------------------------------------------------------------------
  |   key  | bytes | cycles/op (lower is better)   | Percentage gain/  |
  | length |   per |   encryption  |  decryption   |      loss         |
  | (bits) | block |-------------------------------|-------------------|
  |        |       | avx2 | avx512 | avx2 | avx512 | Encrypt | Decrypt |
  |---------------------------------------------------------------------
  |  128   | 16    | 689  |  701   | 689  |  707   |  -1.7   |  -2.61  |
  |  128   | 64    | 731  |  660   | 771  |  649   |   9.7   |  15.82  |
  |  128   | 256   | 911  |  750   | 900  |  721   |  17.67  |  19.88  |
  |  128   | 512   | 1181 |  814   | 1161 |  782   |  31.07  |  32.64  |
  |  128   | 1024  | 1676 |  1052  | 1685 |  1030  |  37.23  |  38.87  |
  |  128   | 2048  | 2475 |  1447  | 2456 |  1419  |  41.53  |  42.22  |
  |  128   | 4096  | 3806 |  2154  | 3820 |  2119  |  43.41  |  44.53  |
  |  128   | 8192  | 9169 |  3806  | 6997 |  3718  |  58.49  |  46.86  |
  |  192   | 16    | 754  |  683   | 737  |  672   |   9.42  |   8.82  |
  |  192   | 64    | 735  |  686   | 715  |  640   |   6.66  |  10.49  |
  |  192   | 256   | 949  |  738   | 2435 |  729   |  22.23  |  70     |
  |  192   | 512   | 1235 |  854   | 1200 |  833   |  30.85  |  30.58  |
  |  192   | 1024  | 1777 |  1084  | 1763 |  1051  |  38.99  |  40.39  |
  |  192   | 2048  | 2574 |  1497  | 2592 |  1459  |  41.84  |  43.71  |
  |  192   | 4096  | 4086 |  2317  | 4091 |  2244  |  43.29  |  45.14  |
  |  192   | 8192  | 7481 |  4054  | 7505 |  3953  |  45.81  |  47.32  |
  |  256   | 16    | 755  |  682   | 720  |  683   |   9.68  |   5.14  |
  |  256   | 64    | 744  |  677   | 719  |  658   |   9     |   8.48  |
  |  256   | 256   | 962  |  758   | 948  |  749   |  21.21  |  21     |
  |  256   | 512   | 1297 |  862   | 1276 |  836   |  33.54  |  34.48  |
  |  256   | 1024  | 1831 |  1114  | 1819 |  1095  |  39.16  |  39.8   |
  |  256   | 2048  | 2767 |  1566  | 2715 |  1524  |  43.4   |  43.87  |
  |  256   | 4096  | 4378 |  2382  | 4368 |  2354  |  45.6   |  46.11  |
  |  256   | 8192  | 8075 |  4262  | 8080 |  4186  |  47.22  |  48.19  |
  ----------------------------------------------------------------------
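
As a worked example of how the gain columns above are derived: for 128-bit
keys at 4096 bytes per block, encryption takes 3806 cycles/op with AVX2 and
2154 cycles/op with AVX512, i.e. a gain of (3806 - 2154) / 3806 = ~43.4%,
which matches the 43.41 entry in the table.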

This work was inspired by the AES GCM mode optimization published in
Intel Optimized IPSEC Cryptographic library.
https://github.com/intel/intel-ipsec-mb/blob/master/lib/avx512/gcm_vaes_avx512.asm

Co-developed-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
Signed-off-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
Signed-off-by: Megha Dey <megha.dey@intel.com>
---
 arch/x86/crypto/Makefile                    |    1 +
 arch/x86/crypto/aesni-intel_avx512-x86_64.S | 1788 +++++++++++++++++++++++++++
 arch/x86/crypto/aesni-intel_glue.c          |   62 +-
 crypto/Kconfig                              |   12 +
 4 files changed, 1858 insertions(+), 5 deletions(-)
 create mode 100644 arch/x86/crypto/aesni-intel_avx512-x86_64.S

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 5fd9b35..320d4cc 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -54,6 +54,7 @@ obj-$(CONFIG_CRYPTO_AES_NI_INTEL) += aesni-intel.o
 aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o
 aesni-intel-$(CONFIG_64BIT) += aesni-intel_avx-x86_64.o aes_ctrby8_avx-x86_64.o
 aesni-intel-$(CONFIG_CRYPTO_AES_CTR_AVX512) += aes_ctrby16_avx512-x86_64.o
+aesni-intel-$(CONFIG_CRYPTO_AES_GCM_AVX512) += aesni-intel_avx512-x86_64.o
 
 obj-$(CONFIG_CRYPTO_SHA1_SSSE3) += sha1-ssse3.o
 sha1-ssse3-y := sha1_avx2_x86_64_asm.o sha1_ssse3_asm.o sha1_ssse3_glue.o
diff --git a/arch/x86/crypto/aesni-intel_avx512-x86_64.S b/arch/x86/crypto/aesni-intel_avx512-x86_64.S
new file mode 100644
index 0000000..270a9e4
--- /dev/null
+++ b/arch/x86/crypto/aesni-intel_avx512-x86_64.S
@@ -0,0 +1,1788 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright © 2020 Intel Corporation.
+ *
+ * Implement AES GCM mode optimization with VAES instructions. (x86_64)
+ *
+ * This is the AES128/192/256 GCM mode optimization implementation. It requires
+ * support for the Intel(R) AVX512F, VPCLMULQDQ and VAES instructions.
+ */
+
+#include "avx512_vaes_common.S"
+
+#define HashSubKey	(16*6)
+#define PBlockLen	(16*5)
+#define CurCount	(16*4)
+#define OrigIV		(16*3)
+#define PBlockEncKey	(16*2)
+#define InLen		((16*1)+8)
+#define AadLen		(16*1)
+#define AadHash		(16*0)
+#define big_loop_nblocks	48
+
+.text
+
+#define ENCRYPT_SINGLE_BLOCK(GDATA, XMM0, NROUNDS)	\
+	vpxorq (GDATA), XMM0, XMM0;			\
+.set i, 1;						\
+.rept 9;						\
+	vaesenc 16 * i(GDATA), XMM0, XMM0;		\
+	.set i, i+1;					\
+.endr;							\
+.if NROUNDS == 9;					\
+	vaesenclast 16 * 10(GDATA), XMM0, XMM0;		\
+.else;							\
+	vaesenc 16 * 10(GDATA), XMM0, XMM0;		\
+	vaesenc 16 * 11(GDATA), XMM0, XMM0;		\
+	.if NROUNDS == 11;				\
+		vaesenclast 16 * 12(GDATA), XMM0, XMM0;	\
+	.else;						\
+		vaesenc 16 * 12(GDATA), XMM0, XMM0;	\
+		vaesenc 16 * 13(GDATA), XMM0, XMM0;	\
+		vaesenclast 16 * 14(GDATA), XMM0, XMM0;	\
+	.endif;						\
+.endif;
+
+/* schoolbook multiply - 1st step */
+#define VCLMUL_STEP1(HS, HI, TMP, TH, TM, TL, HKEY)		\
+.ifc HKEY, NULL;						\
+	vmovdqu64	HashKey_4 + HashSubKey(HS), TMP;	\
+.else;								\
+	vmovdqa64	HKEY , TMP;				\
+.endif;								\
+	vpclmulqdq	$0x11, TMP, HI, TH;			\
+	vpclmulqdq	$0x00, TMP, HI, TL;			\
+	vpclmulqdq	$0x01, TMP, HI, TM;			\
+	vpclmulqdq	$0x10, TMP, HI, TMP;			\
+	vpxorq		TMP, TM, TM;
+
+/* Horizontal XOR - 2 x 128bits xored together */
+#define VHPXORI2x128(REG, TMP)					\
+	vextracti32x4	$1, REG, XWORD(TMP);			\
+	vpxorq		XWORD(TMP), XWORD(REG), XWORD(REG);
+
+/* schoolbook multiply - 2nd step */
+#define VCLMUL_STEP2(HS, HI, LO, TMP0, TMP1, TMP2, TH, TM, TL, HKEY, HXOR)	\
+.ifc HKEY, NULL;								\
+	vmovdqu64	HashKey_8 + HashSubKey(HS), TMP0;			\
+.else;										\
+	vmovdqa64	HKEY, TMP0;						\
+.endif;										\
+	vpclmulqdq	$0x10, TMP0, LO, TMP1;					\
+	vpclmulqdq	$0x11, TMP0, LO, TMP2;					\
+	vpxorq		TMP2, TH, TH;						\
+	vpclmulqdq	$0x00, TMP0, LO, TMP2;					\
+	vpxorq		TMP2, TL, TL;						\
+	vpclmulqdq	$0x01, TMP0, LO, TMP0;					\
+	vpternlogq	$0x96, TMP0, TMP1, TM;					\
+	/* finish multiplications */						\
+	vpsrldq		$8, TM, TMP2;						\
+	vpxorq		TMP2, TH, HI;						\
+	vpslldq		$8, TM, TMP2;						\
+	vpxorq		TMP2, TL, LO;						\
+	/* XOR 128 bits horizontally and compute [(X8*H1) + (X7*H2) + ... + ((X1+Y0)*H8)] */	\
+.ifc HXOR, NULL;								\
+	VHPXORI4x128(HI, TMP2)							\
+	VHPXORI4x128(LO, TMP1)							\
+.else;										\
+	.if HXOR == 4;								\
+		VHPXORI4x128(HI, TMP2)						\
+		VHPXORI4x128(LO, TMP1)						\
+	.elseif HXOR == 2;							\
+		VHPXORI2x128(HI, TMP2)						\
+		VHPXORI2x128(LO, TMP1)						\
+	.endif;									\
+	/* for HXOR == 1 there is nothing to be done */				\
+.endif;
+
+/* schoolbook multiply (1 to 8 blocks) - 1st step */
+#define VCLMUL_1_TO_8_STEP1(HS, HI, TMP1, TMP2, TH, TM, TL, NBLOCKS)	\
+	.if NBLOCKS == 8;						\
+		VCLMUL_STEP1(HS, HI, TMP1, TH, TM, TL, NULL)		\
+	.elseif NBLOCKS == 7;						\
+		vmovdqu64	HashKey_3 + HashSubKey(HS), TMP2;	\
+		vmovdqa64	mask_out_top_block(%rip), TMP1;		\
+		vpandq		TMP1, TMP2, TMP2;			\
+		vpandq		TMP1, HI, HI;				\
+		VCLMUL_STEP1(NULL, HI, TMP1, TH, TM, TL, TMP2)		\
+	.elseif NBLOCKS == 6;						\
+		vmovdqu64	HashKey_2 + HashSubKey(HS), YWORD(TMP2);\
+		VCLMUL_STEP1(NULL, YWORD(HI), YWORD(TMP1), YWORD(TH), YWORD(TM), YWORD(TL), YWORD(TMP2))	\
+	.elseif NBLOCKS == 5;						\
+		vmovdqu64	HashKey_1 + HashSubKey(HS), XWORD(TMP2);\
+		VCLMUL_STEP1(NULL, XWORD(HI), XWORD(TMP1), XWORD(TH), XWORD(TM), XWORD(TL), XWORD(TMP2))	\
+	.else;								\
+		vpxorq		TH, TH, TH;				\
+		vpxorq		TM, TM, TM;				\
+		vpxorq		TL, TL, TL;				\
+	.endif;
+
+/* schoolbook multiply (1 to 8 blocks) - 2nd step */
+#define VCLMUL_1_TO_8_STEP2(HS, HI, LO, TMP0, TMP1, TMP2, TH, TM, TL, NBLOCKS)		\
+	.if NBLOCKS == 8;								\
+		VCLMUL_STEP2(HS, HI, LO, TMP0, TMP1, TMP2, TH, TM, TL, NULL, NULL)	\
+	.elseif NBLOCKS == 7;								\
+		vmovdqu64	HashKey_7 + HashSubKey(HS), TMP2;			\
+		VCLMUL_STEP2(NULL, HI, LO, TMP0, TMP1, TMP2, TH, TM, TL, TMP2, 4)	\
+	.elseif NBLOCKS == 6;								\
+		vmovdqu64	HashKey_6 + HashSubKey(HS), TMP2;			\
+		VCLMUL_STEP2(NULL, HI, LO, TMP0, TMP1, TMP2, TH, TM, TL, TMP2, 4)	\
+	.elseif NBLOCKS == 5;								\
+		vmovdqu64	HashKey_5 + HashSubKey(HS), TMP2;			\
+		VCLMUL_STEP2(NULL, HI, LO, TMP0, TMP1, TMP2, TH, TM, TL, TMP2, 4)	\
+	.elseif NBLOCKS == 4;								\
+		vmovdqu64	HashKey_4 + HashSubKey(HS), TMP2;			\
+		VCLMUL_STEP2(NULL, HI, LO, TMP0, TMP1, TMP2, TH, TM, TL, TMP2, 4)	\
+	.elseif NBLOCKS == 3;								\
+		vmovdqu64	HashKey_3 + HashSubKey(HS), TMP2;			\
+		vmovdqa64	mask_out_top_block(%rip), TMP1;				\
+		vpandq		TMP1, TMP2, TMP2;					\
+		vpandq		TMP1, LO, LO;						\
+		VCLMUL_STEP2(NULL, HI, LO, TMP0, TMP1, TMP2, TH, TM, TL, TMP2, 4)	\
+	.elseif NBLOCKS == 2;								\
+		vmovdqu64	HashKey_2 + HashSubKey(HS), YWORD(TMP2);		\
+		VCLMUL_STEP2(NULL, YWORD(HI), YWORD(LO), YWORD(TMP0), YWORD(TMP1), YWORD(TMP2), YWORD(TH), YWORD(TM), YWORD(TL), YWORD(TMP2), 2)	\
+	.elseif NBLOCKS == 1;								\
+		vmovdqu64	HashKey_1 + HashSubKey(HS), XWORD(TMP2);		\
+		VCLMUL_STEP2(NULL, XWORD(HI), XWORD(LO), XWORD(TMP0), XWORD(TMP1), XWORD(TMP2), XWORD(TH), XWORD(TM), XWORD(TL), XWORD(TMP2), 1)	\
+	.else;										\
+		vpxorq		HI, HI, HI;						\
+		vpxorq		LO, LO, LO;						\
+	.endif;
+
+/* Initialize a gcm_context_data struct to prepare for encoding/decoding. */
+#define GCM_INIT(GDATA_CTX, IV, HASH_SUBKEY, A_IN, A_LEN, GPR1, GPR2, GPR3, MASKREG, AAD_HASH, CUR_COUNT, ZT0, ZT1, ZT2, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9)	  \
+	vpxorq           AAD_HASH, AAD_HASH, AAD_HASH;			\
+	CALC_AAD_HASH(A_IN, A_LEN, AAD_HASH, GDATA_CTX, ZT0, ZT1, ZT2, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, %zmm11, %zmm12, %zmm13, %zmm15, %zmm16, %zmm17, %zmm18, %zmm19, GPR1, GPR2, GPR3, MASKREG, 96)	\
+	mov		A_LEN, GPR1;				\
+	vmovdqu64	AAD_HASH, (GDATA_CTX);			\
+	mov		GPR1, 16(GDATA_CTX);			\
+	xor		GPR1, GPR1;				\
+	mov		GPR1, InLen(GDATA_CTX);			\
+	mov		GPR1, PBlockLen(GDATA_CTX);		\
+	vmovdqu8	ONEf(%rip), CUR_COUNT;			\
+	mov		IV, GPR2;				\
+	mov		$0xfff, GPR1;				\
+	kmovq		GPR1, MASKREG;				\
+	vmovdqu8	(GPR2), CUR_COUNT{MASKREG};		\
+	vmovdqu64	CUR_COUNT, OrigIV(GDATA_CTX);		\
+	vpshufb		SHUF_MASK(%rip), CUR_COUNT, CUR_COUNT;	\
+	vmovdqu		CUR_COUNT, CurCount(GDATA_CTX);
+
+/* Packs an xmm register with data when the input is less than or equal to 16 bytes */
+#define READ_SMALL_DATA_INPUT(OUTPUT, INPUT, LEN ,TMP1, MASK)	\
+	cmp		$16, LEN;				\
+	jge		49f;					\
+	lea		byte_len_to_mask_table(%rip), TMP1;	\
+	kmovw		(TMP1, LEN, 2), MASK;			\
+	vmovdqu8	(INPUT), OUTPUT{MASK}{z};		\
+	jmp		50f;					\
+49:;								\
+	vmovdqu8	(INPUT), OUTPUT;			\
+	mov		$0xffff, TMP1;				\
+	kmovq		TMP1, MASK;				\
+50:;
+
+/*
+ * Handles encryption/decryption and the tag for partial blocks between update calls.
+ * Requires the input data to be at least 1 byte long. The output is the cipher/plain
+ * text of the first partial block (CYPH_PLAIN_OUT), AAD_HASH and an updated GDATA_CTX.
+ */
+#define PARTIAL_BLOCK(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, PLAIN_CYPH_LEN, DATA_OFFSET, AAD_HASH, ENC_DEC, GPTMP0, GPTMP1, GPTMP2, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, ZTMP9, MASKREG)	\
+	mov		PBlockLen(GDATA_CTX), GPTMP0;			\
+	or		GPTMP0, GPTMP0;					\
+	je		48f;						\
+	READ_SMALL_DATA_INPUT(XWORD(ZTMP0), PLAIN_CYPH_IN, PLAIN_CYPH_LEN, GPTMP1, MASKREG)	\
+	vmovdqu64	PBlockEncKey(GDATA_CTX), XWORD(ZTMP1);		\
+	vmovdqu64	HashKey + HashSubKey(GDATA_CTX), XWORD(ZTMP2);	\
+	lea		SHIFT_MASK(%rip), GPTMP1;			\
+	add		GPTMP0, GPTMP1;					\
+	vmovdqu64	(GPTMP1), XWORD(ZTMP3);				\
+	vpshufb		XWORD(ZTMP3), XWORD(ZTMP1), XWORD(ZTMP1);	\
+	.ifc ENC_DEC, DEC;						\
+	vmovdqa64	XWORD(ZTMP0), XWORD(ZTMP4);			\
+	.endif;								\
+	vpxorq		XWORD(ZTMP0), XWORD(ZTMP1), XWORD(ZTMP1);	\
+	/* Determine if partial block is being filled & shift mask */	\
+	mov		PLAIN_CYPH_LEN, GPTMP2;				\
+	add		GPTMP0, GPTMP2;					\
+	sub		$16, GPTMP2;					\
+	jge		45f;						\
+	sub		GPTMP2, GPTMP1;					\
+45:;									\
+	/* get the mask to mask out bottom GPTMP0 bytes of XTMP1 */	\
+	vmovdqu64	(ALL_F - SHIFT_MASK)(GPTMP1), XWORD(ZTMP0);	\
+	vpand		XWORD(ZTMP0), XWORD(ZTMP1),  XWORD(ZTMP1);	\
+	.ifc ENC_DEC, DEC;						\
+	vpand		XWORD(ZTMP0), XWORD(ZTMP4), XWORD(ZTMP4);	\
+	vpshufb		SHUF_MASK(%rip), XWORD(ZTMP4), XWORD(ZTMP4);	\
+	vpshufb		XWORD(ZTMP3), XWORD(ZTMP4), XWORD(ZTMP4);	\
+	vpxorq		XWORD(ZTMP4), AAD_HASH, AAD_HASH;		\
+	.else;								\
+	vpshufb		SHUF_MASK(%rip), XWORD(ZTMP1), XWORD(ZTMP1);	\
+	vpshufb		XWORD(ZTMP3), XWORD(ZTMP1), XWORD(ZTMP1);	\
+	vpxorq		XWORD(ZTMP1), AAD_HASH, AAD_HASH;		\
+	.endif;								\
+	cmp		$0, GPTMP2;					\
+	jl		46f;						\
+	/* GHASH computation for the last <16-byte block */		\
+	GHASH_MUL(AAD_HASH, XWORD(ZTMP2), XWORD(ZTMP5), XWORD(ZTMP6), XWORD(ZTMP7), XWORD(ZTMP8), XWORD(ZTMP9)) \
+	movq		$0, PBlockLen(GDATA_CTX);			\
+	mov		GPTMP0, GPTMP1;					\
+	mov		$16, GPTMP0;					\
+	sub		GPTMP1, GPTMP0;					\
+	jmp		47f;						\
+46:;									\
+	add		PLAIN_CYPH_LEN, PBlockLen(GDATA_CTX);		\
+	mov		PLAIN_CYPH_LEN, GPTMP0;				\
+47:;									\
+	lea		byte_len_to_mask_table(%rip), GPTMP1;		\
+	kmovw		(GPTMP1, GPTMP0, 2), MASKREG;			\
+	vmovdqu64	AAD_HASH, (GDATA_CTX);				\
+	.ifc ENC_DEC, ENC;						\
+	/* shuffle XWORD(ZTMP1) back to output as ciphertext */	\
+	vpshufb		SHUF_MASK(%rip), XWORD(ZTMP1), XWORD(ZTMP1);	\
+	vpshufb		XWORD(ZTMP3), XWORD(ZTMP1), XWORD(ZTMP1);	\
+	.endif;								\
+	vmovdqu8	XWORD(ZTMP1), (CYPH_PLAIN_OUT, DATA_OFFSET, 1){MASKREG};	\
+	add		GPTMP0, DATA_OFFSET;				\
+48:;
+
+/* Encrypt/decrypt the initial 16 blocks */
+#define INITIAL_BLOCKS_16(IN, OUT, KP, DATA_OFFSET, GHASH, CTR, CTR_CHECK, ADDBE_4x4, ADDBE_1234, T0, T1, T2, T3, T4, T5, T6, T7, T8, SHUF_MASK, ENC_DEC, BLK_OFFSET, DATA_DISPL, NROUNDS) \
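+	/* 16 counter blocks: BE adds unless the low counter byte is about to wrap */	\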
+	cmp		$(256 - 16), BYTE(CTR_CHECK);		\
+	jae		37f;					\
+	vpaddd		ADDBE_1234, CTR, T5;			\
+	vpaddd		ADDBE_4x4, T5, T6;			\
+	vpaddd		ADDBE_4x4, T6, T7;			\
+	vpaddd		ADDBE_4x4, T7, T8;			\
+	jmp		38f;					\
+37:;								\
+	vpshufb		SHUF_MASK, CTR, CTR;			\
+	vmovdqa64	ddq_add_4444(%rip), T8;			\
+	vpaddd		ddq_add_1234(%rip), CTR, T5;		\
+	vpaddd		T8, T5, T6;				\
+	vpaddd		T8, T6, T7;				\
+	vpaddd		T8, T7, T8;				\
+	vpshufb		SHUF_MASK, T5, T5;			\
+	vpshufb		SHUF_MASK, T6, T6;			\
+	vpshufb		SHUF_MASK, T7, T7;			\
+	vpshufb		SHUF_MASK, T8, T8;			\
+38:;								\
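+	/* save the last counter block and bump the wrap-check byte */	\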
+	vshufi64x2	$0xff, T8, T8, CTR;			\
+	add		$16, BYTE(CTR_CHECK);			\
+	/* load 16 blocks of data */				\
+	vmovdqu8	DATA_DISPL(IN, DATA_OFFSET), T0;	\
+	vmovdqu8	64 + DATA_DISPL(DATA_OFFSET, IN), T1;	\
+	vmovdqu8	128 + DATA_DISPL(DATA_OFFSET, IN), T2;	\
+	vmovdqu8	192 + DATA_DISPL(DATA_OFFSET, IN), T3;	\
+	/* move to AES encryption rounds */			\
+	vbroadcastf64x2 (KP), T4;				\
+	vpxorq		T4, T5, T5;				\
+	vpxorq		T4, T6, T6;				\
+	vpxorq		T4, T7, T7;				\
+	vpxorq		T4, T8, T8;				\
+.set i, 1;							\
+.rept 9;							\
+	vbroadcastf64x2 16*i(KP), T4;				\
+	vaesenc		T4, T5, T5;				\
+	vaesenc		T4, T6, T6;				\
+	vaesenc		T4, T7, T7;				\
+	vaesenc		T4, T8, T8;				\
+	.set i, i+1;						\
+.endr;								\
+.if NROUNDS==9;							\
+	vbroadcastf64x2 16*i(KP), T4;				\
+.else;								\
+	.rept 2;						\
+		vbroadcastf64x2 16*i(KP), T4;			\
+		vaesenc		T4, T5, T5;			\
+		vaesenc		T4, T6, T6;			\
+		vaesenc		T4, T7, T7;			\
+		vaesenc		T4, T8, T8;			\
+		.set i, i+1;					\
+	.endr;							\
+	.if NROUNDS==11;					\
+		vbroadcastf64x2 16*i(KP), T4;			\
+	.else;							\
+		.rept 2;					\
+			vbroadcastf64x2 16*i(KP), T4;		\
+			vaesenc		T4, T5, T5;		\
+			vaesenc		T4, T6, T6;		\
+			vaesenc		T4, T7, T7;		\
+			vaesenc		T4, T8, T8;		\
+		.set i, i+1;					\
+		.endr;						\
+		vbroadcastf64x2 16*i(KP), T4;			\
+	.endif;							\
+.endif;								\
+	vaesenclast	T4, T5, T5;				\
+	vaesenclast	T4, T6, T6;				\
+	vaesenclast	T4, T7, T7;				\
+	vaesenclast	T4, T8, T8;				\
+	vpxorq		T0, T5, T5;				\
+	vpxorq		T1, T6, T6;				\
+	vpxorq		T2, T7, T7;				\
+	vpxorq		T3, T8, T8;				\
+	vmovdqu8	T5, DATA_DISPL(OUT, DATA_OFFSET);	\
+	vmovdqu8	T6, 64 + DATA_DISPL(DATA_OFFSET, OUT);	\
+	vmovdqu8	T7, 128 + DATA_DISPL(DATA_OFFSET, OUT);	\
+	vmovdqu8	T8, 192 + DATA_DISPL(DATA_OFFSET, OUT);	\
+.ifc  ENC_DEC, DEC;						\
+	vpshufb		SHUF_MASK, T0, T5;			\
+	vpshufb		SHUF_MASK, T1, T6;			\
+	vpshufb		SHUF_MASK, T2, T7;			\
+	vpshufb		SHUF_MASK, T3, T8;			\
+.else;								\
+	vpshufb		SHUF_MASK, T5, T5;			\
+	vpshufb		SHUF_MASK, T6, T6;			\
+	vpshufb		SHUF_MASK, T7, T7;			\
+	vpshufb		SHUF_MASK, T8, T8;			\
+.endif;								\
+.ifnc GHASH, no_ghash;						\
+	/* xor cipher block0 with GHASH for next GHASH round */	\
+	vpxorq		GHASH, T5, T5;				\
+.endif;								\
+	vmovdqa64	T5, BLK_OFFSET(%rsp);			\
+	vmovdqa64	T6, 64 + BLK_OFFSET(%rsp);		\
+	vmovdqa64	T7, 128 + BLK_OFFSET(%rsp);		\
+	vmovdqa64	T8, 192 + BLK_OFFSET(%rsp);
+
+/*
+ * Main GCM macro stitching cipher with GHASH
+ * - operates on single stream
+ * - encrypts 16 blocks at a time
+ * - ghash the 16 previously encrypted ciphertext blocks
+ * - no partial block or multi_call handling here
+ */
+#define GHASH_16_ENCRYPT_16_PARALLEL(GDATA, GCTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, DATA_OFFSET, CTR_BE, CTR_CHECK, HASHKEY_OFFSET, AESOUT_BLK_OFFSET, GHASHIN_BLK_OFFSET, SHFMSK, ZT1, ZT2, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZT13, ZT14, ZT15, ZT16, ZT17, ZT18, ZT19, ZT20, ZT21, ZT22, ZT23, ADDBE_4x4, ADDBE_1234, TO_REDUCE_L, TO_REDUCE_H, TO_REDUCE_M, DO_REDUCTION, ENC_DEC, DATA_DISPL, GHASH_IN, NROUNDS)   \
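+	/* counter blocks: BE adds unless the low counter byte is about to wrap */	\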
+	cmp		$240, BYTE(CTR_CHECK);		\
+	jae		28f;				\
+	vpaddd		ADDBE_1234, CTR_BE, ZT1;	\
+	vpaddd		ADDBE_4x4, ZT1, ZT2;		\
+	vpaddd		ADDBE_4x4, ZT2, ZT3;		\
+	vpaddd		ADDBE_4x4, ZT3, ZT4;		\
+	jmp		29f;				\
+28:;							\
+	vpshufb		SHFMSK, CTR_BE, CTR_BE;		\
+	vmovdqa64	ddq_add_4444(%rip), ZT4;	\
+	vpaddd		ddq_add_1234(%rip), CTR_BE, ZT1;\
+	vpaddd		ZT4, ZT1, ZT2;			\
+	vpaddd		ZT4, ZT2, ZT3;			\
+	vpaddd		ZT4, ZT3, ZT4;			\
+	vpshufb		SHFMSK, ZT1, ZT1;		\
+	vpshufb		SHFMSK, ZT2, ZT2;		\
+	vpshufb		SHFMSK, ZT3, ZT3;		\
+	vpshufb		SHFMSK, ZT4, ZT4;		\
+29:;							\
+	vbroadcastf64x2 (GDATA), ZT17;			\
+.ifnc GHASH_IN,no_ghash_in;				\
+	vpxorq		GHASHIN_BLK_OFFSET(%rsp), GHASH_IN, ZT21;	\
+.else;							\
+	vmovdqa64	GHASHIN_BLK_OFFSET(%rsp), ZT21;	\
+.endif;							\
+	vmovdqu64	HASHKEY_OFFSET(GCTX), ZT19;	\
+	/*						\
+	 * Save counter for the next round, increment	\
+	 * counter overflow check register.		\
+	 */						\
+	vshufi64x2	$0xff, ZT4, ZT4, CTR_BE;	\
+	add		$16, BYTE(CTR_CHECK);		\
+	vbroadcastf64x2 16*1(GDATA), ZT18;		\
+	vmovdqu64	HASHKEY_OFFSET + 64(GCTX), ZT20;\
+	vmovdqa64	GHASHIN_BLK_OFFSET + 64(%rsp), ZT22;	\
+	vpxorq		ZT17, ZT1, ZT1;			\
+	vpxorq		ZT17, ZT2, ZT2;			\
+	vpxorq		ZT17, ZT3, ZT3;			\
+	vpxorq		ZT17, ZT4, ZT4;			\
+	vbroadcastf64x2 16*2(GDATA), ZT17;		\
+	/* GHASH 4 blocks (15 to 12) */			\
+	vpclmulqdq	$0x11, ZT19, ZT21, ZT5;		\
+	vpclmulqdq	$0x00, ZT19, ZT21, ZT6;		\
+	vpclmulqdq	$0x01, ZT19, ZT21, ZT7;		\
+	vpclmulqdq	$0x10, ZT19, ZT21, ZT8;		\
+	vmovdqu64	HASHKEY_OFFSET + 64*2(GCTX), ZT19;	\
+	vmovdqa64	GHASHIN_BLK_OFFSET + 64*2(%rsp), ZT21;	\
+	/* AES round 1 */				\
+	vaesenc		ZT18, ZT1, ZT1;			\
+	vaesenc		ZT18, ZT2, ZT2;			\
+	vaesenc		ZT18, ZT3, ZT3;			\
+	vaesenc		ZT18, ZT4, ZT4;			\
+	vbroadcastf64x2 16*3(GDATA), ZT18;		\
+	/* GHASH 4 blocks (11 to 8) */			\
+	vpclmulqdq	$0x10, ZT20, ZT22, ZT11;	\
+	vpclmulqdq	$0x01, ZT20, ZT22, ZT12;	\
+	vpclmulqdq	$0x11, ZT20, ZT22, ZT9;		\
+	vpclmulqdq	$0x00, ZT20, ZT22, ZT10;	\
+	vmovdqu64	HASHKEY_OFFSET + 64*3(GCTX), ZT20;	\
+	vmovdqa64	GHASHIN_BLK_OFFSET + 64*3(%rsp), ZT22;	\
+	/* AES round 2 */				\
+	vaesenc		ZT17, ZT1, ZT1;			\
+	vaesenc		ZT17, ZT2, ZT2;			\
+	vaesenc		ZT17, ZT3, ZT3;			\
+	vaesenc		ZT17, ZT4, ZT4;			\
+	vbroadcastf64x2 16*4(GDATA), ZT17;		\
+	/* GHASH 4 blocks (7 to 4) */			\
+	vpclmulqdq	$0x10, ZT19, ZT21, ZT15;	\
+	vpclmulqdq	$0x01, ZT19, ZT21, ZT16;	\
+	vpclmulqdq	$0x11, ZT19, ZT21, ZT13;	\
+	vpclmulqdq	$0x00, ZT19, ZT21, ZT14;	\
+	/* AES round 3 */				\
+	vaesenc		ZT18, ZT1, ZT1;			\
+	vaesenc		ZT18, ZT2, ZT2;			\
+	vaesenc		ZT18, ZT3, ZT3;			\
+	vaesenc		ZT18, ZT4, ZT4;			\
+	vbroadcastf64x2 16*5(GDATA), ZT18;		\
+	/* Gather (XOR) GHASH for 12 blocks */		\
+	vpternlogq	$0x96, ZT13, ZT9, ZT5;		\
+	vpternlogq	$0x96, ZT14, ZT10, ZT6;		\
+	vpternlogq	$0x96, ZT16, ZT12, ZT8;		\
+	vpternlogq	$0x96, ZT15, ZT11, ZT7;		\
+	/* AES round 4 */				\
+	vaesenc		ZT17, ZT1, ZT1;			\
+	vaesenc		ZT17, ZT2, ZT2;			\
+	vaesenc		ZT17, ZT3, ZT3;			\
+	vaesenc		ZT17, ZT4, ZT4;			\
+	vbroadcastf64x2 16*6(GDATA), ZT17;		\
+	/* Load plain/cipher text */			\
+	vmovdqu8	DATA_DISPL(DATA_OFFSET, PLAIN_CYPH_IN), ZT13;	\
+	vmovdqu8	64*1 + DATA_DISPL(DATA_OFFSET, PLAIN_CYPH_IN), ZT14;	\
+	vmovdqu8	64*2 + DATA_DISPL(DATA_OFFSET, PLAIN_CYPH_IN), ZT15;	\
+	vmovdqu8	64*3 + DATA_DISPL(DATA_OFFSET, PLAIN_CYPH_IN), ZT16;	\
+	/* AES round 5 */				\
+	vaesenc		ZT18, ZT1, ZT1;			\
+	vaesenc		ZT18, ZT2, ZT2;			\
+	vaesenc		ZT18, ZT3, ZT3;			\
+	vaesenc		ZT18, ZT4, ZT4;			\
+	vbroadcastf64x2 16*7(GDATA), ZT18;		\
+	/* GHASH 4 blocks (3 to 0) */			\
+	vpclmulqdq	$0x10, ZT20, ZT22, ZT11;	\
+	vpclmulqdq	$0x01, ZT20, ZT22, ZT12;	\
+	vpclmulqdq	$0x11, ZT20, ZT22, ZT9;		\
+	vpclmulqdq	$0x00, ZT20, ZT22, ZT10;	\
+	/* AES round 6 */				\
+	vaesenc		ZT17, ZT1, ZT1;			\
+	vaesenc		ZT17, ZT2, ZT2;			\
+	vaesenc		ZT17, ZT3, ZT3;			\
+	vaesenc		ZT17, ZT4, ZT4;			\
+	vbroadcastf64x2 16*8(GDATA), ZT17;		\
+	/* gather GHASH into TO_REDUCE_L (low), TO_REDUCE_H (high) and TO_REDUCE_M (mid) */	\
+	.ifc DO_REDUCTION, first_time;			\
+		vpternlogq	$0x96, ZT12, ZT8, ZT7;	\
+		vpxorq		ZT11, ZT7, TO_REDUCE_M; \
+		vpxorq		ZT9, ZT5, TO_REDUCE_H;	\
+		vpxorq		ZT10, ZT6, TO_REDUCE_L; \
+	.endif;						\
+	.ifc DO_REDUCTION, no_reduction;		\
+		vpternlogq	$0x96, ZT12, ZT8, ZT7;	\
+		vpternlogq	$0x96, ZT11, ZT7, TO_REDUCE_M;	\
+		vpternlogq	$0x96, ZT9, ZT5, TO_REDUCE_H;	\
+		vpternlogq	$0x96, ZT10, ZT6, TO_REDUCE_L;	\
+	.endif;						\
+	.ifc DO_REDUCTION, final_reduction;		\
+		/*					\
+		 * phase 1: add mid products together,	\
+		 * load polynomial constant for reduction	\
+		 */					\
+		vpternlogq	$0x96, ZT12, ZT8, ZT7;	\
+		vpternlogq	$0x96, ZT11, TO_REDUCE_M, ZT7;	\
+		vpsrldq		$8, ZT7, ZT11;		\
+		vpslldq		$8, ZT7, ZT7;		\
+		vmovdqa64	POLY2(%rip), XWORD(ZT12);	\
+	.endif;						\
+	/* AES round 7 */				\
+	vaesenc		ZT18, ZT1, ZT1;			\
+	vaesenc		ZT18, ZT2, ZT2;			\
+	vaesenc		ZT18, ZT3, ZT3;			\
+	vaesenc		ZT18, ZT4, ZT4;			\
+	vbroadcastf64x2 16*9(GDATA), ZT18;		\
+	/* Add mid product to high and low */		\
+	.ifc DO_REDUCTION, final_reduction;		\
+		vpternlogq	$0x96, ZT11, ZT9, ZT5;	\
+		vpxorq		TO_REDUCE_H, ZT5, ZT5;	\
+		vpternlogq	$0x96, ZT7, ZT10, ZT6;	\
+		vpxorq		TO_REDUCE_L, ZT6, ZT6;	\
+	.endif;						\
+	/* AES round 8 */				\
+	vaesenc		ZT17, ZT1, ZT1;			\
+	vaesenc		ZT17, ZT2, ZT2;			\
+	vaesenc		ZT17, ZT3, ZT3;			\
+	vaesenc		ZT17, ZT4, ZT4;			\
+	vbroadcastf64x2 16*10(GDATA), ZT17;		\
+	/* horizontal xor of low and high 4x128 */	\
+	.ifc DO_REDUCTION, final_reduction;		\
+		VHPXORI4x128(ZT5, ZT9)			\
+		VHPXORI4x128(ZT6, ZT10)			\
+	.endif;						\
+	/* AES round 9 */				\
+	vaesenc		ZT18, ZT1, ZT1;			\
+	vaesenc		ZT18, ZT2, ZT2;			\
+	vaesenc		ZT18, ZT3, ZT3;			\
+	vaesenc		ZT18, ZT4, ZT4;			\
+	.if NROUNDS >= 11;				\
+		vbroadcastf64x2 16*11(GDATA), ZT18;	\
+	.endif;						\
+	/* First phase of reduction */			\
+	.ifc DO_REDUCTION, final_reduction;		\
+		vpclmulqdq	$0x01, XWORD(ZT6), XWORD(ZT12), XWORD(ZT10);	\
+		vpslldq		$8, XWORD(ZT10), XWORD(ZT10);		\
+		vpxorq		XWORD(ZT10), XWORD(ZT6), XWORD(ZT10);	\
+	.endif;						\
+	/* AES128 done. Continue for AES192 & AES256 */	\
+	.if NROUNDS >= 11;				\
+		vaesenc		ZT17, ZT1, ZT1;		\
+		vaesenc		ZT17, ZT2, ZT2;		\
+		vaesenc		ZT17, ZT3, ZT3;		\
+		vaesenc		ZT17, ZT4, ZT4;		\
+		vbroadcastf64x2 16*12(GDATA), ZT17;	\
+		vaesenc		ZT18, ZT1, ZT1;		\
+		vaesenc		ZT18, ZT2, ZT2;		\
+		vaesenc		ZT18, ZT3, ZT3;		\
+		vaesenc		ZT18, ZT4, ZT4;		\
+		.if NROUNDS == 13;			\
+			vbroadcastf64x2 16*13(GDATA), ZT18;	\
+			vaesenc		ZT17, ZT1, ZT1;	\
+			vaesenc		ZT17, ZT2, ZT2;	\
+			vaesenc		ZT17, ZT3, ZT3; \
+			vaesenc		ZT17, ZT4, ZT4; \
+			vbroadcastf64x2 16*14(GDATA), ZT17;	\
+			vaesenc		ZT18, ZT1, ZT1;		\
+			vaesenc		ZT18, ZT2, ZT2;		\
+			vaesenc		ZT18, ZT3, ZT3;		\
+			vaesenc		ZT18, ZT4, ZT4;		\
+		.endif;						\
+	.endif;							\
+	/* second phase of the reduction */			\
+	.ifc DO_REDUCTION, final_reduction;					\
+		vpclmulqdq	$0, XWORD(ZT10), XWORD(ZT12), XWORD(ZT9);	\
+		vpsrldq		$4, XWORD(ZT9), XWORD(ZT9);			\
+		vpclmulqdq	$0x10, XWORD(ZT10), XWORD(ZT12), XWORD(ZT11);	\
+		vpslldq		$4, XWORD(ZT11), XWORD(ZT11);			\
+		vpternlogq	$0x96, XWORD(ZT9), XWORD(ZT11), XWORD(ZT5);	\
+	.endif;									\
+	/* Last AES round */			\
+	vaesenclast	ZT17, ZT1, ZT1;		\
+	vaesenclast	ZT17, ZT2, ZT2;		\
+	vaesenclast	ZT17, ZT3, ZT3;		\
+	vaesenclast	ZT17, ZT4, ZT4;		\
+	/* XOR against plain/cipher text */	\
+	vpxorq		ZT13, ZT1, ZT1;		\
+	vpxorq		ZT14, ZT2, ZT2;		\
+	vpxorq		ZT15, ZT3, ZT3;		\
+	vpxorq		ZT16, ZT4, ZT4;		\
+	/* Store cipher/plain text */		\
+	vmovdqu8	ZT1, DATA_DISPL(DATA_OFFSET, CYPH_PLAIN_OUT);		\
+	vmovdqu8	ZT2, 64*1 + DATA_DISPL(DATA_OFFSET, CYPH_PLAIN_OUT);	\
+	vmovdqu8	ZT3, 64*2 + DATA_DISPL(DATA_OFFSET, CYPH_PLAIN_OUT);	\
+	vmovdqu8	ZT4, 64*3 + DATA_DISPL(DATA_OFFSET, CYPH_PLAIN_OUT);	\
+	/* Shuffle cipher text blocks for GHASH computation */	\
+	.ifc ENC_DEC, ENC;				\
+		vpshufb		SHFMSK, ZT1, ZT1;	\
+		vpshufb		SHFMSK, ZT2, ZT2;	\
+		vpshufb		SHFMSK, ZT3, ZT3;	\
+		vpshufb		SHFMSK, ZT4, ZT4;	\
+	.else;						\
+		vpshufb		SHFMSK, ZT13, ZT1;	\
+		vpshufb		SHFMSK, ZT14, ZT2;	\
+		vpshufb		SHFMSK, ZT15, ZT3;	\
+		vpshufb		SHFMSK, ZT16, ZT4;	\
+	.endif;						\
+	/* Store shuffled cipher text for ghashing */	\
+	vmovdqa64 ZT1, 0*64 + AESOUT_BLK_OFFSET(%rsp);	\
+	vmovdqa64 ZT2, 1*64 + AESOUT_BLK_OFFSET(%rsp);	\
+	vmovdqa64 ZT3, 2*64 + AESOUT_BLK_OFFSET(%rsp);	\
+	vmovdqa64 ZT4, 3*64 + AESOUT_BLK_OFFSET(%rsp);
+
+/* Encrypt the initial N x 16 blocks */
+#define INITIAL_BLOCKS_Nx16(IN, OUT, KP, CTX, DATA_OFFSET, GHASH, CTR, CTR_CHECK, T0, T1, T2, T3, T4, T5, T6, T7, T8, T9, T10, T11, T12, T13, T14, T15, T16, T17, T18, T19, T20, T21, T22, GH, GL, GM, ADDBE_4x4, ADDBE_1234, SHUF_MASK, ENC_DEC, NBLOCKS, DEPTH_BLK, NROUNDS) \
+	/* set up CTR_CHECK */				\
+	vmovd		XWORD(CTR), DWORD(CTR_CHECK);	\
+	and		$255, DWORD(CTR_CHECK);		\
+	/* In LE format after init, convert to BE */	\
+	vshufi64x2	$0, CTR, CTR, CTR;		\
+	vpshufb		SHUF_MASK, CTR, CTR;		\
+	/* first 16 blocks - just cipher */		\
+	INITIAL_BLOCKS_16(IN, OUT, KP, DATA_OFFSET, GHASH, CTR, CTR_CHECK, ADDBE_4x4, ADDBE_1234, T0, T1, T2, T3, T4, T5, T6, T7, T8, SHUF_MASK, ENC_DEC, STACK_LOCAL_OFFSET, 0, NROUNDS)	\
+	INITIAL_BLOCKS_16(IN, OUT, KP, DATA_OFFSET, no_ghash, CTR, CTR_CHECK, ADDBE_4x4, ADDBE_1234, T0, T1, T2, T3, T4, T5, T6, T7, T8, SHUF_MASK, ENC_DEC, STACK_LOCAL_OFFSET + 256, 256, NROUNDS)	\
+	/* GHASH + AES follows */			\
+	GHASH_16_ENCRYPT_16_PARALLEL(KP, CTX, OUT, IN, DATA_OFFSET, CTR, CTR_CHECK, HashSubKey, STACK_LOCAL_OFFSET + 512, STACK_LOCAL_OFFSET, SHUF_MASK, T0, T1, T2, T3, T4, T5, T6, T7, T8, T9, T10, T11, T12, T13, T14, T15, T16, T17, T18, T19, T20, T21, T22, ADDBE_4x4, ADDBE_1234, GL, GH, GM, first_time, ENC_DEC, 512, no_ghash_in, NROUNDS)	\
+	add		$(48 * 16), DATA_OFFSET;
+
+/* Encrypt & ghash multiples of 16 blocks */
+#define GHASH_ENCRYPT_Nx16_PARALLEL(IN, OUT, GDATA_KEY, GCTX, DATA_OFFSET, CTR_BE, SHFMSK, ZT0, ZT1, ZT2, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZT13, ZT14, ZT15, ZT16, ZT17, ZT18, ZT19, ZT20, ZT21, ZT22, GTH, GTL, GTM, ADDBE_4x4, ADDBE_1234, GHASH, ENC_DEC, NUM_BLOCKS, DEPTH_BLK, CTR_CHECK, NROUNDS)	\
+	GHASH_16_ENCRYPT_16_PARALLEL(GDATA_KEY, GCTX, OUT, IN, DATA_OFFSET, CTR_BE, CTR_CHECK, HashSubKey + HashKey_32, STACK_LOCAL_OFFSET, STACK_LOCAL_OFFSET + (16 * 16), SHFMSK, ZT0, ZT1, ZT2, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZT13, ZT14, ZT15, ZT16, ZT17, ZT18, ZT19, ZT20, ZT21, ZT22, ADDBE_4x4, ADDBE_1234, GTL, GTH, GTM, no_reduction, ENC_DEC, 0, no_ghash_in, NROUNDS)	\
+	GHASH_16_ENCRYPT_16_PARALLEL(GDATA_KEY, GCTX, OUT, IN, DATA_OFFSET, CTR_BE, CTR_CHECK, HashSubKey + HashKey_16, STACK_LOCAL_OFFSET + 256, STACK_LOCAL_OFFSET + (16 * 16) + 256, SHFMSK, ZT0, ZT1, ZT2, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZT13, ZT14, ZT15, ZT16, ZT17, ZT18, ZT19, ZT20, ZT21, ZT22, ADDBE_4x4, ADDBE_1234, GTL, GTH, GTM, final_reduction, ENC_DEC, 256, no_ghash_in, NROUNDS)	\
+	vmovdqa64	ZT4, GHASH;	\
+	GHASH_16_ENCRYPT_16_PARALLEL(GDATA_KEY, GCTX, OUT, IN, DATA_OFFSET, CTR_BE, CTR_CHECK, HashSubKey + HashKey_48, STACK_LOCAL_OFFSET + 512, STACK_LOCAL_OFFSET, SHFMSK, ZT0, ZT1, ZT2, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZT13, ZT14, ZT15, ZT16, ZT17, ZT18, ZT19, ZT20, ZT21, ZT22, ADDBE_4x4, ADDBE_1234, GTL, GTH, GTM, first_time, ENC_DEC, 512, GHASH, NROUNDS)	\
+	add	$(NUM_BLOCKS * 16), DATA_OFFSET;
+
+/* GHASH the last 32 blocks of cipher text accumulated on the stack */
+#define GHASH_LAST_Nx16(KP, GHASH, T1, T2, T3, T4, T5, T6, T7, T8, T9, T10, T11, T12, T13, T14, T15, T16, GH, GL,GM, LOOP_BLK, DEPTH_BLK)	     \
+	/* load cipher blocks and ghash keys */		\
+	vmovdqa64	(((LOOP_BLK - DEPTH_BLK) * 16) + STACK_LOCAL_OFFSET)(%rsp), T13;	\
+	vmovdqa64	(((LOOP_BLK - DEPTH_BLK) * 16) + 64 + STACK_LOCAL_OFFSET)(%rsp), T14;	\
+	vmovdqu64	HashKey_32 + HashSubKey(KP), T15;	\
+	vmovdqu64	HashKey_32 + 64 + HashSubKey(KP), T16;	\
+	/* ghash blocks 0-3 */				\
+	vpclmulqdq	$0x11, T15, T13, T1;		\
+	vpclmulqdq	$0x00, T15, T13, T2;		\
+	vpclmulqdq	$0x01, T15, T13, T3;		\
+	vpclmulqdq	$0x10, T15, T13, T4;		\
+	/* ghash blocks 4-7 */				\
+	vpclmulqdq	$0x11, T16, T14, T5;		\
+	vpclmulqdq	$0x00, T16, T14, T6;		\
+	vpclmulqdq	$0x01, T16, T14, T7;		\
+	vpclmulqdq	$0x10, T16, T14, T8;		\
+	vpternlogq	$0x96, GH, T5, T1;		\
+	vpternlogq	$0x96, GL, T6, T2;		\
+	vpternlogq	$0x96, GM, T7, T3;		\
+	vpxorq		T8, T4, T4;			\
+	\
+.set i, 0;						\
+.rept 3;						\
+	/* Remaining blocks; load next 8 cipher blocks and corresponding ghash keys */			\
+	vmovdqa64	(((LOOP_BLK - DEPTH_BLK) * 16) + STACK_LOCAL_OFFSET + 128 + i*128)(%rsp), T13;		\
+	vmovdqa64	(((LOOP_BLK - DEPTH_BLK) * 16) + 64 + STACK_LOCAL_OFFSET + 128 + i*128)(%rsp), T14;	\
+	vmovdqu64	HashKey_32 + 128 + i*128 + HashSubKey(KP), T15;	\
+	vmovdqu64	HashKey_32 + 64 + 128 + i*128 + HashSubKey(KP), T16;	\
+	/* ghash blocks 0-3 */				\
+	vpclmulqdq	$0x11, T15, T13, T5;		\
+	vpclmulqdq	$0x00, T15, T13, T6;		\
+	vpclmulqdq	$0x01, T15, T13, T7;		\
+	vpclmulqdq	$0x10, T15, T13, T8;		\
+	/* ghash blocks 4-7 */				\
+	vpclmulqdq	$0x11, T16, T14, T9;		\
+	vpclmulqdq	$0x00, T16, T14, T10;		\
+	vpclmulqdq	$0x01, T16, T14, T11;		\
+	vpclmulqdq	$0x10, T16, T14, T12;		\
+	/* update sums */				\
+	vpternlogq	$0x96, T9, T5, T1;		\
+	vpternlogq	$0x96, T10, T6, T2;		\
+	vpternlogq	$0x96, T11, T7, T3;		\
+	vpternlogq	$0x96, T12, T8, T4;		\
+	.set		i, i+1;				\
+.endr;							\
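+	/* fold the combined middle product into the high (T1) and low (T2) sums */	\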
+	vpxorq		T4, T3, T3;			\
+	vpsrldq		$8, T3, T7;			\
+	vpslldq		$8, T3, T8;			\
+	vpxorq		T7, T1, T1;			\
+	vpxorq		T8, T2, T2;			\
+	\
+	/* add TH and TL 128-bit words horizontally */	\
+	VHPXORI4x128(T1, T11)				\
+	VHPXORI4x128(T2, T12)				\
+	\
+	/* Reduction */					\
+	vmovdqa64	POLY2(%rip), T15;		\
+	VCLMUL_REDUCE(GHASH, T15, T1, T2, T3, T4);
+
+/*
+ * INITIAL_BLOCKS_PARTIAL macro with support for a partial final block.
+ * It may look similar to INITIAL_BLOCKS but its usage is different:
+ * - it first encrypts/decrypts the blocks and then GHASHes them
+ * - it handles small packets and leftover data chunks (< 256 bytes)
+ * - it handles remaining data chunks below 256 bytes (multi buffer code)
+ * num_initial_blocks is expected to include the partial final block
+ * in the count.
+ */
+#define INITIAL_BLOCKS_PARTIAL(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, DATA_OFFSET, num_initial_blocks, CTR, HASH_IN_OUT, ENC_DEC, ZT0, ZT1, ZT2, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, ZT13, ZT14, ZT15, ZT16, ZT17, ZT18, ZT19, ZT20, ZT21, ZT22, IA0, IA1, MASKREG, SHUFMASK, NROUNDS) \
+	/* Copy ghash to temp reg */					\
+	vmovdqa64	HASH_IN_OUT, XWORD(ZT2);			\
+	/* prepare AES counter blocks */				\
+.if num_initial_blocks == 1;						\
+	vpaddd		ONE(%rip), CTR, XWORD(ZT3);			\
+.elseif num_initial_blocks == 2;					\
+	vshufi64x2	$0, YWORD(CTR), YWORD(CTR), YWORD(ZT3);		\
+	vpaddd		ddq_add_1234(%rip), YWORD(ZT3), YWORD(ZT3);	\
+.else;									\
+	vshufi64x2	$0, ZWORD(CTR), ZWORD(CTR), ZWORD(CTR);		\
+	vpaddd		ddq_add_1234(%rip), ZWORD(CTR), ZT3;		\
+.if num_initial_blocks > 4;						\
+	vpaddd		ddq_add_5678(%rip), ZWORD(CTR), ZT4;		\
+.endif;									\
+.if num_initial_blocks > 8;						\
+	vpaddd		ddq_add_8888(%rip), ZT3, ZT8;			\
+.endif;									\
+.if num_initial_blocks > 12;						\
+	vpaddd		ddq_add_8888(%rip), ZT4, ZT9;			\
+.endif;									\
+.endif;									\
+	/* Get load/store mask */					\
+	lea		byte64_len_to_mask_table(%rip), IA0;		\
+	mov		LENGTH, IA1;					\
+.if num_initial_blocks > 12;						\
+	sub		$(3 * 64), IA1;					\
+.elseif num_initial_blocks > 8;						\
+	sub		$(2 * 64), IA1;					\
+.elseif num_initial_blocks > 4;						\
+	sub		$64, IA1;					\
+.endif;									\
+	kmovq		(IA0, IA1, 8), MASKREG;				\
+	/* Extract new counter value. Shuffle counters for AES rounds */\
+.if num_initial_blocks <= 4;						\
+	vextracti32x4	$(num_initial_blocks - 1), ZT3, CTR;		\
+.elseif num_initial_blocks <= 8;					\
+	vextracti32x4	$(num_initial_blocks - 5), ZT4, CTR;		\
+.elseif num_initial_blocks <= 12;					\
+	vextracti32x4	$(num_initial_blocks - 9), ZT8, CTR;		\
+.else;									\
+	vextracti32x4	$(num_initial_blocks - 13), ZT9, CTR;		\
+.endif;									\
+	ZMM_OPCODE3_DSTR_SRC1R_SRC2R_BLOCKS_0_16(num_initial_blocks, vpshufb, ZT3, ZT4, ZT8, ZT9, ZT3, ZT4, ZT8, ZT9, SHUFMASK, SHUFMASK, SHUFMASK, SHUFMASK)	\
+	/* Load plain/cipher text */					\
+	ZMM_LOAD_MASKED_BLOCKS_0_16(num_initial_blocks, PLAIN_CYPH_IN, DATA_OFFSET, ZT5, ZT6, ZT10, ZT11, MASKREG)	\
+	/* AES rounds and XOR with plain/cipher text */			\
+.set i, 0;								\
+.rept 11;								\
+	vbroadcastf64x2 16*i(GDATA_KEY), ZT1;				\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT3, ZT4, ZT8, ZT9, ZT1, i, ZT5, ZT6, ZT10, ZT11, num_initial_blocks, NROUNDS)	\
+	.set i, i+1;							\
+.endr;									\
+.if NROUNDS > 9;							\
+.rept 2;								\
+	vbroadcastf64x2 16*i(GDATA_KEY), ZT1;				\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT3, ZT4, ZT8, ZT9, ZT1, i, ZT5, ZT6, ZT10, ZT11, num_initial_blocks, NROUNDS)	\
+	.set i, i+1;							\
+.endr;									\
+.endif;									\
+.if NROUNDS > 11;							\
+.rept 2;								\
+	vbroadcastf64x2 16*i(GDATA_KEY), ZT1;				\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT3, ZT4, ZT8, ZT9, ZT1, i, ZT5, ZT6, ZT10, ZT11, num_initial_blocks, NROUNDS)	\
+	.set i, i+1;							\
+.endr;									\
+.endif;									\
+/*
+ * Retrieve the last cipher counter block (partially XOR'ed with text).	\
+ * This is needed for the partial block case.				\
+ */									\
+.if num_initial_blocks <= 4;						\
+	vextracti32x4	$(num_initial_blocks - 1), ZT3, XWORD(ZT1);	\
+.elseif num_initial_blocks <= 8;					\
+	vextracti32x4	$(num_initial_blocks - 5), ZT4, XWORD(ZT1);	\
+.elseif num_initial_blocks <= 12;					\
+	vextracti32x4	$(num_initial_blocks - 9), ZT8, XWORD(ZT1);	\
+.else;									\
+	vextracti32x4	$(num_initial_blocks - 13), ZT9, XWORD(ZT1);	\
+.endif;									\
+	/* Write cipher/plain text back to output */			\
+	ZMM_STORE_MASKED_BLOCKS_0_16(num_initial_blocks, CYPH_PLAIN_OUT,DATA_OFFSET, ZT3, ZT4, ZT8, ZT9, MASKREG)	\
+	/* Zero bytes outside the mask before hashing */		\
+.if num_initial_blocks <= 4;						\
+	vmovdqu8	ZT3, ZT3{MASKREG}{z};				\
+.elseif num_initial_blocks <= 8;					\
+	vmovdqu8	ZT4, ZT4{MASKREG}{z};				\
+.elseif num_initial_blocks <= 12;					\
+	vmovdqu8	ZT8, ZT8{MASKREG}{z};				\
+.else;									\
+	vmovdqu8	ZT9, ZT9{MASKREG}{z};				\
+.endif;									\
+/* Shuffle the cipher text blocks for hashing part */			\
+.ifc  ENC_DEC, DEC;							\
+	ZMM_OPCODE3_DSTR_SRC1R_SRC2R_BLOCKS_0_16(num_initial_blocks, vpshufb,	\
+			ZT5, ZT6, ZT10, ZT11,				\
+			ZT5, ZT6, ZT10, ZT11,				\
+			SHUFMASK, SHUFMASK, SHUFMASK, SHUFMASK)		\
+.else;									\
+	 ZMM_OPCODE3_DSTR_SRC1R_SRC2R_BLOCKS_0_16(num_initial_blocks, vpshufb,	\
+			ZT5, ZT6, ZT10, ZT11,				\
+			ZT3, ZT4, ZT8, ZT9,				\
+			SHUFMASK, SHUFMASK, SHUFMASK, SHUFMASK)		\
+.endif;									\
+/* Extract the last block for partial cases */				\
+.if num_initial_blocks <= 4;						\
+	vextracti32x4	$(num_initial_blocks - 1), ZT5, XWORD(ZT7);	\
+.elseif num_initial_blocks <= 8;					\
+	vextracti32x4	$(num_initial_blocks - 5), ZT6, XWORD(ZT7);	\
+.elseif num_initial_blocks <= 12;					\
+	vextracti32x4	$(num_initial_blocks - 9), ZT10, XWORD(ZT7);	\
+.else;									\
+	vextracti32x4	$(num_initial_blocks - 13), ZT11, XWORD(ZT7);	\
+.endif;									\
+/* Hash all but the last block of data */				\
+.if num_initial_blocks > 1;						\
+	add	$(16 * (num_initial_blocks - 1)), DATA_OFFSET;		\
+	sub	$(16 * (num_initial_blocks - 1)), LENGTH;		\
+.endif;									\
+.if num_initial_blocks < 16;						\
+	cmp	$16, LENGTH;						\
+	jl	25f;							\
+	/* Handle a full-length final block; encrypt & hash all blocks */	\
+	sub	$16, LENGTH;						\
+	add	$16, DATA_OFFSET;					\
+	mov	LENGTH, PBlockLen(GDATA_CTX);				\
+	/* Hash all of the data */					\
+	GHASH_1_TO_16(GDATA_CTX, 96, HASH_IN_OUT, ZT12, ZT13, ZT14, ZT15, ZT16, ZT17, ZT18, ZT19, ZT20, ZT2, ZT5, ZT6, ZT10, ZT11, num_initial_blocks, 1, single_call, null, null, null, null, null, null)	\
+	jmp	26f;							\
+.endif;									\
+25:;									\
+	/* Handle ghash for a <16B final block */			\
+	mov	LENGTH, PBlockLen(GDATA_CTX);				\
+	vmovdqu64	XWORD(ZT1), PBlockEncKey(GDATA_CTX);		\
+.if num_initial_blocks > 1;						\
+	GHASH_1_TO_16(GDATA_CTX, 96, HASH_IN_OUT, ZT12, ZT13, ZT14, ZT15, ZT16, ZT17, ZT18, ZT19, ZT20, ZT2, ZT5, ZT6, ZT10, ZT11, num_initial_blocks - 1, 0, single_call, null, null, null, null, null, null)	\
+.else;									\
+	vpxorq		XWORD(ZT7), XWORD(ZT2), HASH_IN_OUT;	\
+	jmp		27f;						\
+.endif;									\
+/* After GHASH reduction */						\
+26:;									\
+.if num_initial_blocks > 1;						\
+	.if num_initial_blocks != 16;					\
+		or	LENGTH, LENGTH;					\
+		je	27f;						\
+	.endif;								\
+	vpxorq	    XWORD(ZT7), HASH_IN_OUT, HASH_IN_OUT;		\
+	/* Final hash is now in HASH_IN_OUT */				\
+.endif;									\
+27:;
+
+/* Cipher and ghash of payloads shorter than 256 bytes */
+#define GCM_ENC_DEC_SMALL(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, PLAIN_CYPH_LEN, ENC_DEC, DATA_OFFSET, LENGTH, NUM_BLOCKS, CTR, HASH_IN_OUT, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, ZTMP9, ZTMP10, ZTMP11, ZTMP12, ZTMP13, ZTMP14, ZTMP15, ZTMP16, ZTMP17, ZTMP18, ZTMP19, ZTMP20, ZTMP21, ZTMP22, IA0, IA1, MASKREG, SHUFMASK, NROUNDS)	\
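+	/* dispatch on NUM_BLOCKS (1..16) to the matching INITIAL_BLOCKS_PARTIAL */	\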
+	cmp	$8, NUM_BLOCKS;	\
+	je	58f;		\
+	jl	68f;		\
+	cmp	$12, NUM_BLOCKS;\
+	je	62f;		\
+	jl	67f;		\
+	cmp	$16, NUM_BLOCKS;\
+	je	66f;		\
+	cmp	$15, NUM_BLOCKS;\
+	je	65f;		\
+	cmp	$14, NUM_BLOCKS;\
+	je	64f;		\
+	jmp	63f;		\
+67:;				\
+	cmp	$11, NUM_BLOCKS;\
+	je	61f;		\
+	cmp	$10, NUM_BLOCKS;\
+	je	60f; 		\
+	jmp	59f;		\
+68:;				\
+	cmp	$4, NUM_BLOCKS;	\
+	je	54f;		\
+	jl	69f;		\
+	cmp	$7, NUM_BLOCKS;	\
+	je	57f;		\
+	cmp	$6, NUM_BLOCKS;	\
+	je	56f;		\
+	jmp	55f;		\
+69:;				\
+	cmp	$3, NUM_BLOCKS;	\
+	je	53f;		\
+	cmp	$2, NUM_BLOCKS;	\
+	je	52f;		\
+51:;				\
+	INITIAL_BLOCKS_PARTIAL(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, DATA_OFFSET, 1, CTR, HASH_IN_OUT, ENC_DEC, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, ZTMP9, ZTMP10, ZTMP11, ZTMP12, ZTMP13, ZTMP14, ZTMP15, ZTMP16, ZTMP17, ZTMP18, ZTMP19, ZTMP20, ZTMP21, ZTMP22, IA0, IA1, MASKREG, SHUFMASK, NROUNDS)	\
+	jmp	70f;		\
+52:;				\
+	INITIAL_BLOCKS_PARTIAL(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, DATA_OFFSET, 2, CTR, HASH_IN_OUT, ENC_DEC, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, ZTMP9, ZTMP10, ZTMP11, ZTMP12, ZTMP13, ZTMP14, ZTMP15, ZTMP16, ZTMP17, ZTMP18, ZTMP19, ZTMP20, ZTMP21, ZTMP22, IA0, IA1, MASKREG, SHUFMASK, NROUNDS)	\
+	jmp	70f;		\
+53:;				\
+	INITIAL_BLOCKS_PARTIAL(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, DATA_OFFSET, 3, CTR, HASH_IN_OUT, ENC_DEC, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, ZTMP9, ZTMP10, ZTMP11, ZTMP12, ZTMP13, ZTMP14, ZTMP15, ZTMP16, ZTMP17, ZTMP18, ZTMP19, ZTMP20, ZTMP21, ZTMP22, IA0, IA1, MASKREG, SHUFMASK, NROUNDS)	\
+	jmp	70f;		\
+54:;				\
+	INITIAL_BLOCKS_PARTIAL(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, DATA_OFFSET, 4, CTR, HASH_IN_OUT, ENC_DEC, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, ZTMP9, ZTMP10, ZTMP11, ZTMP12, ZTMP13, ZTMP14, ZTMP15, ZTMP16, ZTMP17, ZTMP18, ZTMP19, ZTMP20, ZTMP21, ZTMP22, IA0, IA1, MASKREG, SHUFMASK, NROUNDS)	\
+	jmp	70f;		\
+55:;				\
+	INITIAL_BLOCKS_PARTIAL(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, DATA_OFFSET, 5, CTR, HASH_IN_OUT, ENC_DEC, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, ZTMP9, ZTMP10, ZTMP11, ZTMP12, ZTMP13, ZTMP14, ZTMP15, ZTMP16, ZTMP17, ZTMP18, ZTMP19, ZTMP20, ZTMP21, ZTMP22, IA0, IA1, MASKREG, SHUFMASK, NROUNDS)	\
+	jmp	70f;		\
+56:;				\
+	INITIAL_BLOCKS_PARTIAL(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, DATA_OFFSET, 6, CTR, HASH_IN_OUT, ENC_DEC, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, ZTMP9, ZTMP10, ZTMP11, ZTMP12, ZTMP13, ZTMP14, ZTMP15, ZTMP16, ZTMP17, ZTMP18, ZTMP19, ZTMP20, ZTMP21, ZTMP22, IA0, IA1, MASKREG, SHUFMASK, NROUNDS)	\
+	jmp	70f;		\
+57:;				\
+	INITIAL_BLOCKS_PARTIAL(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, DATA_OFFSET, 7, CTR, HASH_IN_OUT, ENC_DEC, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, ZTMP9, ZTMP10, ZTMP11, ZTMP12, ZTMP13, ZTMP14, ZTMP15, ZTMP16, ZTMP17, ZTMP18, ZTMP19, ZTMP20, ZTMP21, ZTMP22, IA0, IA1, MASKREG, SHUFMASK, NROUNDS)	\
+	jmp	70f;		\
+58:;				\
+	INITIAL_BLOCKS_PARTIAL(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, DATA_OFFSET, 8, CTR, HASH_IN_OUT, ENC_DEC, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, ZTMP9, ZTMP10, ZTMP11, ZTMP12, ZTMP13, ZTMP14, ZTMP15, ZTMP16, ZTMP17, ZTMP18, ZTMP19, ZTMP20, ZTMP21, ZTMP22, IA0, IA1, MASKREG, SHUFMASK, NROUNDS)	\
+	jmp	70f;		\
+59:;				\
+	INITIAL_BLOCKS_PARTIAL(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, DATA_OFFSET, 9, CTR, HASH_IN_OUT, ENC_DEC, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, ZTMP9, ZTMP10, ZTMP11, ZTMP12, ZTMP13, ZTMP14, ZTMP15, ZTMP16, ZTMP17, ZTMP18, ZTMP19, ZTMP20, ZTMP21, ZTMP22, IA0, IA1, MASKREG, SHUFMASK, NROUNDS)	\
+	jmp	70f;		\
+60:;				\
+	INITIAL_BLOCKS_PARTIAL(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, DATA_OFFSET, 10, CTR, HASH_IN_OUT, ENC_DEC, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, ZTMP9, ZTMP10, ZTMP11, ZTMP12, ZTMP13, ZTMP14, ZTMP15, ZTMP16, ZTMP17, ZTMP18, ZTMP19, ZTMP20, ZTMP21, ZTMP22, IA0, IA1, MASKREG, SHUFMASK, NROUNDS)	\
+	jmp	70f;		\
+61:;				\
+	INITIAL_BLOCKS_PARTIAL(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, DATA_OFFSET, 11, CTR, HASH_IN_OUT, ENC_DEC, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, ZTMP9, ZTMP10, ZTMP11, ZTMP12, ZTMP13, ZTMP14, ZTMP15, ZTMP16, ZTMP17, ZTMP18, ZTMP19, ZTMP20, ZTMP21, ZTMP22, IA0, IA1, MASKREG, SHUFMASK, NROUNDS)	\
+	jmp	70f;		\
+62:;				\
+	INITIAL_BLOCKS_PARTIAL(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, DATA_OFFSET, 12, CTR, HASH_IN_OUT, ENC_DEC, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, ZTMP9, ZTMP10, ZTMP11, ZTMP12, ZTMP13, ZTMP14, ZTMP15, ZTMP16, ZTMP17, ZTMP18, ZTMP19, ZTMP20, ZTMP21, ZTMP22, IA0, IA1, MASKREG, SHUFMASK, NROUNDS)	\
+	jmp	70f;		\
+63:;				\
+	INITIAL_BLOCKS_PARTIAL(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, DATA_OFFSET, 13, CTR, HASH_IN_OUT, ENC_DEC, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, ZTMP9, ZTMP10, ZTMP11, ZTMP12, ZTMP13, ZTMP14, ZTMP15, ZTMP16, ZTMP17, ZTMP18, ZTMP19, ZTMP20, ZTMP21, ZTMP22, IA0, IA1, MASKREG, SHUFMASK, NROUNDS) \
+	jmp	70f;		\
+64:;				\
+	INITIAL_BLOCKS_PARTIAL(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, DATA_OFFSET, 14, CTR, HASH_IN_OUT, ENC_DEC, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, ZTMP9, ZTMP10, ZTMP11, ZTMP12, ZTMP13, ZTMP14, ZTMP15, ZTMP16, ZTMP17, ZTMP18, ZTMP19, ZTMP20, ZTMP21, ZTMP22, IA0, IA1, MASKREG, SHUFMASK, NROUNDS) \
+	jmp	70f;		\
+65:;				\
+	INITIAL_BLOCKS_PARTIAL(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, DATA_OFFSET, 15, CTR, HASH_IN_OUT, ENC_DEC, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, ZTMP9, ZTMP10, ZTMP11, ZTMP12, ZTMP13, ZTMP14, ZTMP15, ZTMP16, ZTMP17, ZTMP18, ZTMP19, ZTMP20, ZTMP21, ZTMP22, IA0, IA1, MASKREG, SHUFMASK, NROUNDS) \
+	jmp	70f;		\
+66:;				\
+	INITIAL_BLOCKS_PARTIAL(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, DATA_OFFSET, 16, CTR, HASH_IN_OUT, ENC_DEC, ZTMP0, ZTMP1, ZTMP2, ZTMP3, ZTMP4, ZTMP5, ZTMP6, ZTMP7, ZTMP8, ZTMP9, ZTMP10, ZTMP11, ZTMP12, ZTMP13, ZTMP14, ZTMP15, ZTMP16, ZTMP17, ZTMP18, ZTMP19, ZTMP20, ZTMP21, ZTMP22, IA0, IA1, MASKREG, SHUFMASK, NROUNDS) \
+70:;
+
+/*
+ * This macro is used to "warm up" the pipeline for the
+ * GHASH_8_ENCRYPT_8_PARALLEL macro code. It is called only for data lengths
+ * of 128 bytes and above.
+ * The flow is as follows:
+ * - encrypt the initial num_initial_blocks blocks (can be 0)
+ * - encrypt the next 8 blocks and stitch with GHASH of the first
+ *   num_initial_blocks
+ * - the last (8th) block can be partial (lengths between 129 and 239 bytes)
+ * - partial block ciphering is handled within this macro
+ * - top bytes of such a block are cleared for the subsequent GHASH calculations
+ * - PBlockEncKey needs to be set up
+ * - top bytes of the block need to include the encrypted counter block so that,
+ *   when handling the partial block case, the text is read and XOR'ed against
+ *   it. This needs to be in un-shuffled format.
+ */
+#define INITIAL_BLOCKS(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, LENGTH, DATA_OFFSET, num_initial_blocks, CTR, AAD_HASH, ZT1, ZT2, ZT3, ZT4, ZT5, ZT6, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, IA0, IA1, ENC_DEC, MASKREG, SHUFMASK, PARTIAL_PRESENT, NROUNDS)	\
+.set partial_block_possible, 1;							\
+.ifc PARTIAL_PRESENT, no_partial_block;						\
+	.set partial_block_possible, 0;						\
+.endif;										\
+.if num_initial_blocks > 0;							\
+	/* Prepare AES counter blocks */					\
+	.if num_initial_blocks == 1;						\
+		vpaddd		ONE(%rip), CTR, XWORD(ZT3);			\
+	.elseif num_initial_blocks == 2;					\
+		vshufi64x2	$0, YWORD(CTR), YWORD(CTR), YWORD(ZT3);		\
+		vpaddd		ddq_add_1234(%rip), YWORD(ZT3), YWORD(ZT3);	\
+	.else;									\
+		vshufi64x2	$0, ZWORD(CTR), ZWORD(CTR), ZWORD(CTR);		\
+		vpaddd		ddq_add_1234(%rip), ZWORD(CTR), ZT3;		\
+		vpaddd		ddq_add_5678(%rip), ZWORD(CTR), ZT4;		\
+	.endif;									\
+	/* Extract new counter value; shuffle counters for AES rounds */	\
+	.if num_initial_blocks <= 4;						\
+		vextracti32x4   $(num_initial_blocks - 1), ZT3, CTR;		\
+	.else;									\
+		vextracti32x4   $(num_initial_blocks - 5), ZT4, CTR;		\
+	.endif;									\
+	ZMM_OPCODE3_DSTR_SRC1R_SRC2R_BLOCKS_0_16(num_initial_blocks, vpshufb, ZT3, ZT4, no_zmm, no_zmm, ZT3, ZT4, no_zmm, no_zmm, SHUFMASK, SHUFMASK, SHUFMASK, SHUFMASK)	\
+	/* load plain/cipher text */						\
+	ZMM_LOAD_BLOCKS_0_16(num_initial_blocks, PLAIN_CYPH_IN, DATA_OFFSET, ZT5, ZT6, no_zmm, no_zmm, NULL)	\
+	/* AES rounds and XOR with plain/cipher text */				\
+.set i, 0;									\
+.rept 11;									\
+	vbroadcastf64x2 16*i(GDATA_KEY), ZT1;					\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT3, ZT4, no_zmm, no_zmm, ZT1, i, ZT5, ZT6, no_zmm, no_zmm, num_initial_blocks, NROUNDS)	\
+	.set i, i+1;								\
+.endr;										\
+.if NROUNDS > 9;								\
+.rept 2;									\
+	vbroadcastf64x2 16*i(GDATA_KEY), ZT1;					\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT3, ZT4, no_zmm, no_zmm, ZT1, i, ZT5, ZT6, no_zmm, no_zmm, num_initial_blocks, NROUNDS)	\
+	.set i, i+1;								\
+.endr;										\
+.endif;										\
+.if NROUNDS > 11;								\
+.rept 2;									\
+	vbroadcastf64x2 16*i(GDATA_KEY), ZT1;					\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT3, ZT4, no_zmm, no_zmm, ZT1, i, ZT5, ZT6, no_zmm, no_zmm, num_initial_blocks, NROUNDS)	\
+	.set i, i+1;								\
+.endr;										\
+.endif;										\
+	/* Write cipher/plain text back to output */				\
+	ZMM_STORE_BLOCKS_0_16(num_initial_blocks, CYPH_PLAIN_OUT, DATA_OFFSET, ZT3, ZT4, no_zmm, no_zmm)	\
+	/* Shuffle the cipher text blocks for hashing part */			\
+	.ifc ENC_DEC, DEC;							\
+	ZMM_OPCODE3_DSTR_SRC1R_SRC2R_BLOCKS_0_16(num_initial_blocks, vpshufb, ZT5, ZT6, no_zmm, no_zmm, ZT5, ZT6, no_zmm, no_zmm, SHUFMASK, SHUFMASK, SHUFMASK, SHUFMASK) \
+	.else;									\
+	ZMM_OPCODE3_DSTR_SRC1R_SRC2R_BLOCKS_0_16(num_initial_blocks, vpshufb, ZT5, ZT6, no_zmm, no_zmm, ZT3, ZT4, no_zmm, no_zmm, SHUFMASK, SHUFMASK, SHUFMASK, SHUFMASK) \
+	.endif;									\
+	/* Adjust data offset and length */					\
+	sub		$(num_initial_blocks * 16), LENGTH;			\
+	add		$(num_initial_blocks * 16), DATA_OFFSET;		\
+.endif;										\
+	/*									\
+	 * Cipher of num_initial_blocks is done.				\
+	 * Prepare counter blocks for the next 8 blocks (ZT3 & ZT4):		\
+	 *   - save the last block in CTR					\
+	 *   - shuffle the blocks for AES					\
+	 *   - stitch encryption of the new blocks with GHASHing of the	\
+	 *     previous blocks							\
+	 */									\
+	vshufi64x2	$0, ZWORD(CTR), ZWORD(CTR), ZWORD(CTR);			\
+	vpaddd	    	ddq_add_1234(%rip), ZWORD(CTR), ZT3;			\
+	vpaddd	    	ddq_add_5678(%rip), ZWORD(CTR), ZT4;			\
+	vextracti32x4	$3, ZT4, CTR;						\
+	vpshufb		SHUFMASK, ZT3, ZT3;					\
+	vpshufb		SHUFMASK, ZT4, ZT4;					\
+.if partial_block_possible != 0;						\
+	/* get text load/store mask (assume full mask by default) */		\
+	mov	$0xffffffffffffffff, IA0;					\
+	.if num_initial_blocks > 0;						\
+		cmp	$128, LENGTH;						\
+		jge	22f;							\
+		mov	%rcx, IA1;						\
+		mov	$128, %rcx;						\
+		sub	LENGTH, %rcx;						\
+		shr	cl, IA0;						\
+		mov	IA1, %rcx;						\
+22:;										\
+	.endif;									\
+	kmovq	IA0, MASKREG;							\
+	/* load plain or cipher text */						\
+	ZMM_LOAD_MASKED_BLOCKS_0_16(8, PLAIN_CYPH_IN, DATA_OFFSET, ZT1, ZT2, no_zmm, no_zmm, MASKREG)			\
+.else;										\
+	ZMM_LOAD_BLOCKS_0_16(8, PLAIN_CYPH_IN, DATA_OFFSET, ZT1, ZT2, no_zmm, no_zmm, NULL)				\
+.endif;										\
+.set aes_round, 0;								\
+	vbroadcastf64x2 (aes_round * 16)(GDATA_KEY), ZT8;			\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT3, ZT4, no_zmm, no_zmm, ZT8, aes_round, ZT1, ZT2, no_zmm, no_zmm, 8, NROUNDS)	\
+.set aes_round, aes_round + 1;							\
+/* GHASH blocks 4-7 */			\
+.if num_initial_blocks > 0;							\
+	vpxorq	AAD_HASH, ZT5, ZT5;						\
+	VCLMUL_1_TO_8_STEP1(GDATA_CTX, ZT6, ZT8, ZT9, ZT10, ZT11, ZT12, num_initial_blocks);				\
+.endif;										\
+/* 1/3 of AES rounds */		\
+.rept ((NROUNDS + 1) / 3);							\
+	vbroadcastf64x2 (aes_round * 16)(GDATA_KEY), ZT8;			\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT3, ZT4, no_zmm, no_zmm, ZT8, aes_round, ZT1, ZT2, no_zmm, no_zmm, 8, NROUNDS)	\
+.set aes_round, aes_round + 1;							\
+.endr;										\
+/* GHASH blocks 0-3 and gather */	\
+.if num_initial_blocks > 0;							\
+	VCLMUL_1_TO_8_STEP2(GDATA_CTX, ZT6, ZT5, ZT7, ZT8, ZT9, ZT10, ZT11, ZT12, num_initial_blocks);			\
+.endif;										\
+/* 2/3 of AES rounds */			\
+.rept ((NROUNDS + 1) / 3);							\
+	vbroadcastf64x2		(aes_round * 16)(GDATA_KEY), ZT8;		\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT3, ZT4, no_zmm, no_zmm, ZT8, aes_round, ZT1, ZT2, no_zmm, no_zmm, 8, NROUNDS);	\
+	.set aes_round, aes_round + 1;						\
+.endr;										\
+.if num_initial_blocks > 0;							\
+	vmovdqu64	POLY2(%rip), XWORD(ZT8);				\
+	VCLMUL_REDUCE(XWORD(AAD_HASH), XWORD(ZT8), XWORD(ZT6), XWORD(ZT5), XWORD(ZT7), XWORD(ZT9))			\
+.endif;										\
+/* 3/3 of AES rounds */			\
+.rept (((NROUNDS + 1) / 3) + 2);						\
+.if aes_round < (NROUNDS + 2);							\
+	vbroadcastf64x2		(aes_round * 16)(GDATA_KEY), ZT8;		\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT3, ZT4, no_zmm, no_zmm, ZT8, aes_round, ZT1, ZT2, no_zmm, no_zmm, 8, NROUNDS)	\
+.set aes_round, aes_round + 1;							\
+.endif;										\
+.endr;										\
+.if partial_block_possible != 0;						\
+	/* write cipher/plain text back to output */				\
+	ZMM_STORE_MASKED_BLOCKS_0_16(8, CYPH_PLAIN_OUT, DATA_OFFSET, ZT3, ZT4, no_zmm, no_zmm, MASKREG)			\
+	/* Check if there is a partial block */					\
+	cmp		$128, LENGTH;						\
+	jl		23f;							\
+	/* Adjust offset and length */						\
+	add		$128, DATA_OFFSET;					\
+	sub		$128, LENGTH;						\
+	jmp		24f;							\
+23:;										\
+	/* partial block case							\
+	 * - save the partial block in unshuffled format			\
+	 * - ZT4 is partially XOR'ed with data and top bytes contain		\
+	 *   encrypted counter block only					\
+	 * - save the number of bytes processed in the partial block		\
+	 * - adjust offset and zero the length					\
+	 * - clear top bytes of partial block for subsequent GHASH calculations	\
+	 */									\
+	vextracti32x4	$3, ZT4, PBlockEncKey(GDATA_CTX);			\
+	add		LENGTH, DATA_OFFSET;					\
+	sub		$(128 - 16), LENGTH;					\
+	mov		LENGTH, PBlockLen(GDATA_CTX);				\
+	xor		LENGTH, LENGTH;						\
+	vmovdqu8	ZT4, ZT4{MASKREG}{z};					\
+24:;										\
+.else;										\
+	ZMM_STORE_BLOCKS_0_16(8, CYPH_PLAIN_OUT, DATA_OFFSET, ZT3, ZT4, no_zmm, no_zmm)					\
+	add		$128, DATA_OFFSET;					\
+	sub		$128, LENGTH;						\
+.endif;										\
+	/* Shuffle AES result for GHASH */					\
+.ifc  ENC_DEC, DEC;								\
+	vpshufb		SHUFMASK, ZT1, ZT1;					\
+	vpshufb		SHUFMASK, ZT2, ZT2;					\
+.else;										\
+	vpshufb		SHUFMASK, ZT3, ZT1;					\
+	vpshufb		SHUFMASK, ZT4, ZT2;					\
+.endif;										\
+	/* Current hash value in AAD_HASH */					\
+	vpxorq		AAD_HASH, ZT1, ZT1;
+
+/*
+ * Main GCM macro stitching cipher with GHASH
+ * - operates on single stream
+ * - encrypts 8 blocks at a time
+ * - ghash the 8 previously encrypted ciphertext blocks
+ * For the partial block case, AES_PARTIAL_BLOCK contains the encrypted
+ * counter block on output.
+ */
+#define GHASH_8_ENCRYPT_8_PARALLEL(GDATA, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, DATA_OFFSET, CTR1, CTR2, GHASHIN_AESOUT_B03, GHASHIN_AESOUT_B47, AES_PARTIAL_BLOCK, loop_idx, ENC_DEC, FULL_PARTIAL, IA0, IA1, LENGTH, GH4KEY, GH8KEY, SHFMSK, ZT1, ZT2, ZT3, ZT4, ZT5, ZT10, ZT11, ZT12, ZT13, ZT14, ZT15, ZT16, ZT17, MASKREG, DO_REDUCTION, TO_REDUCE_L, TO_REDUCE_H, TO_REDUCE_M, NROUNDS)	\
+.ifc loop_idx, in_order;						\
+	vpshufb		SHFMSK, CTR1, ZT1;				\
+	vpshufb		SHFMSK, CTR2, ZT2;				\
+.else;									\
+	vmovdqa64	CTR1, ZT1;					\
+	vmovdqa64	CTR2, ZT2;					\
+.endif;									\
+	/* stitch AES rounds with GHASH */				\
+	/* AES round 0 */						\
+	vbroadcastf64x2 16*0(GDATA), ZT3;				\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, no_zmm, no_zmm, ZT3, 0, ZT4, ZT5, no_zmm, no_zmm, 8, NROUNDS)	\
+	/* GHASH 4 blocks */						\
+	vpclmulqdq	$0x11, GH4KEY, GHASHIN_AESOUT_B47, ZT10;	\
+	vpclmulqdq	$0x00, GH4KEY, GHASHIN_AESOUT_B47, ZT11;	\
+	vpclmulqdq	$0x01, GH4KEY, GHASHIN_AESOUT_B47, ZT12;	\
+	vpclmulqdq	$0x10, GH4KEY, GHASHIN_AESOUT_B47, ZT13;	\
+	vbroadcastf64x2 16*1(GDATA), ZT3;				\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, no_zmm, no_zmm, ZT3, 1, ZT4, ZT5, no_zmm, no_zmm, 8, NROUNDS)	\
+	vbroadcastf64x2 16*2(GDATA), ZT3;				\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, no_zmm, no_zmm, ZT3, 2, ZT4, ZT5, no_zmm, no_zmm, 8, NROUNDS)	\
+	vbroadcastf64x2 16*3(GDATA), ZT3;				\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, no_zmm, no_zmm, ZT3, 3, ZT4, ZT5, no_zmm, no_zmm, 8, NROUNDS)	\
+	vpclmulqdq	$0x10, GH8KEY, GHASHIN_AESOUT_B03, ZT16;	\
+	vpclmulqdq	$0x01, GH8KEY, GHASHIN_AESOUT_B03, ZT17;	\
+	vpclmulqdq	$0x11, GH8KEY, GHASHIN_AESOUT_B03, ZT14;	\
+	vpclmulqdq	$0x00, GH8KEY, GHASHIN_AESOUT_B03, ZT15;	\
+	vbroadcastf64x2 16*4(GDATA), ZT3;				\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, no_zmm, no_zmm, ZT3, 4, ZT4, ZT5, no_zmm, no_zmm, 8, NROUNDS)	\
+	vbroadcastf64x2 16*5(GDATA), ZT3;				\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, no_zmm, no_zmm, ZT3, 5, ZT4, ZT5, no_zmm, no_zmm, 8, NROUNDS)	\
+	vbroadcastf64x2 16*6(GDATA), ZT3;				\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, no_zmm, no_zmm, ZT3, 6, ZT4, ZT5, no_zmm, no_zmm, 8, NROUNDS)	\
+.ifc DO_REDUCTION, no_reduction;					\
+	vpternlogq	$0x96, ZT16, ZT13, ZT12;			\
+	vpternlogq	$0x96, ZT17, ZT12, TO_REDUCE_M;			\
+	vpternlogq	$0x96, ZT14, ZT10, TO_REDUCE_H;			\
+	vpternlogq	$0x96, ZT15, ZT11, TO_REDUCE_L;			\
+.endif;									\
+.ifc DO_REDUCTION, do_reduction;					\
+	vpternlogq	$0x96, ZT16, ZT13, ZT12;			\
+	vpxorq		ZT17, ZT12, ZT12;				\
+	vpsrldq		$8, ZT12, ZT16;					\
+	vpslldq		$8, ZT12, ZT12;					\
+.endif;									\
+.ifc DO_REDUCTION, final_reduction;					\
+	vpternlogq	$0x96, ZT16, ZT13, ZT12;			\
+	vpternlogq	$0x96, ZT17, TO_REDUCE_M, ZT12;			\
+	vpsrldq		$8, ZT12, ZT16;					\
+	vpslldq		$8, ZT12, ZT12;					\
+.endif;									\
+	vbroadcastf64x2 16*7(GDATA), ZT3;				\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, no_zmm, no_zmm, ZT3, 7, ZT4, ZT5, no_zmm, no_zmm, 8, NROUNDS)	\
+	vbroadcastf64x2 16*8(GDATA), ZT3;				\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, no_zmm, no_zmm, ZT3, 8, ZT4, ZT5, no_zmm, no_zmm, 8, NROUNDS)	\
+.ifc DO_REDUCTION, final_reduction;					\
+	vpternlogq	$0x96, ZT16, ZT14, ZT10;			\
+	vpxorq		TO_REDUCE_H, ZT10, ZT10;			\
+	vpternlogq	$0x96, ZT12, ZT15, ZT11;			\
+	vpxorq		TO_REDUCE_L, ZT11, ZT11;			\
+.endif;									\
+.ifc DO_REDUCTION, do_reduction;					\
+	vpternlogq	$0x96, ZT16, ZT14, ZT10;			\
+	vpternlogq	$0x96, ZT12, ZT15, ZT11;			\
+.endif;									\
+.ifnc DO_REDUCTION, no_reduction;					\
+	VHPXORI4x128(ZT14, ZT10);					\
+	VHPXORI4x128(ZT15, ZT11);					\
+.endif;									\
+.if 9 < (NROUNDS + 1);							\
+.if NROUNDS == 9;							\
+	vbroadcastf64x2 16*9(GDATA), ZT3;				\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, no_zmm, no_zmm, ZT3, 9, ZT4, ZT5, no_zmm, no_zmm, 8, NROUNDS)	\
+.else;									\
+	vbroadcastf64x2 16*9(GDATA), ZT3;				\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, no_zmm, no_zmm, ZT3, 9, ZT4, ZT5, no_zmm, no_zmm, 8, NROUNDS)	\
+	vbroadcastf64x2 16*10(GDATA), ZT3;				\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, no_zmm, no_zmm, ZT3, 10, ZT4, ZT5, no_zmm, no_zmm, 8, NROUNDS)	\
+.endif;									\
+.endif;									\
+.ifnc DO_REDUCTION, no_reduction;					\
+	vmovdqu64	POLY2(%rip), XWORD(ZT17);			\
+	vpclmulqdq	$0x01, XWORD(ZT11), XWORD(ZT17), XWORD(ZT15);	\
+	vpslldq		$8, XWORD(ZT15), XWORD(ZT15);			\
+	vpxorq		XWORD(ZT15), XWORD(ZT11), XWORD(ZT15);		\
+.endif;									\
+.if 11 < (NROUNDS + 1);							\
+.if NROUNDS == 11;							\
+	vbroadcastf64x2 16*11(GDATA), ZT3;				\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, no_zmm, no_zmm, ZT3, 11, ZT4, ZT5, no_zmm, no_zmm, 8, NROUNDS)	\
+.else;									\
+	vbroadcastf64x2 16*11(GDATA), ZT3;				\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, no_zmm, no_zmm, ZT3, 11, ZT4, ZT5, no_zmm, no_zmm, 8, NROUNDS)	\
+	vbroadcastf64x2 16*12(GDATA), ZT3;				\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, no_zmm, no_zmm, ZT3, 12, ZT4, ZT5, no_zmm, no_zmm, 8, NROUNDS)	\
+.endif;									\
+.endif;									\
+.ifnc DO_REDUCTION, no_reduction;					\
+	vpclmulqdq	$0x00, XWORD(ZT15), XWORD(ZT17), XWORD(ZT16);	\
+	vpsrldq		$4, XWORD(ZT16), XWORD(ZT16);			\
+	vpclmulqdq	$0x10, XWORD(ZT15), XWORD(ZT17), XWORD(ZT13);	\
+	vpslldq		$4, XWORD(ZT13), XWORD(ZT13);			\
+	vpternlogq	$0x96, XWORD(ZT10), XWORD(ZT16), XWORD(ZT13);	\
+.endif;									\
+.if 13 < (NROUNDS + 1);							\
+	vbroadcastf64x2 16*13(GDATA), ZT3;				\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, no_zmm, no_zmm, ZT3, 13, ZT4, ZT5, no_zmm, no_zmm, 8, NROUNDS)	\
+.endif;									\
+/* load/store mask (partial case) and load the text data */		\
+.ifc FULL_PARTIAL, full;						\
+	vmovdqu8	(PLAIN_CYPH_IN, DATA_OFFSET), ZT4;		\
+	vmovdqu8	64(PLAIN_CYPH_IN, DATA_OFFSET), ZT5;		\
+.else;									\
+	lea		byte64_len_to_mask_table(%rip), IA0;		\
+	mov		LENGTH, IA1;					\
+	sub		$64, IA1;					\
+	kmovq		(IA0, IA1, 8), MASKREG;				\
+	vmovdqu8	(PLAIN_CYPH_IN, DATA_OFFSET), ZT4;		\
+	vmovdqu8	64(PLAIN_CYPH_IN, DATA_OFFSET), ZT5{MASKREG}{z};\
+.endif;									\
+.if NROUNDS == 9;							\
+	vbroadcastf64x2 16*10(GDATA), ZT3;				\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, no_zmm, no_zmm, ZT3, 10, ZT4, ZT5, no_zmm, no_zmm, 8, NROUNDS)	\
+.elseif NROUNDS == 11;							\
+	vbroadcastf64x2 16*12(GDATA), ZT3;				\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, no_zmm, no_zmm, ZT3, 12, ZT4, ZT5, no_zmm, no_zmm, 8, NROUNDS)	\
+.elseif NROUNDS == 13;							\
+	vbroadcastf64x2 16*14(GDATA), ZT3;				\
+	ZMM_AESENC_ROUND_BLOCKS_0_16(ZT1, ZT2, no_zmm, no_zmm, ZT3, 14, ZT4, ZT5, no_zmm, no_zmm, 8, NROUNDS)	\
+.endif;									\
+/* store the cipher/plain text data */					\
+.ifc FULL_PARTIAL, full;						\
+	vmovdqu8	ZT1, (CYPH_PLAIN_OUT, DATA_OFFSET);		\
+	vmovdqu8	ZT2, 64(CYPH_PLAIN_OUT, DATA_OFFSET);		\
+.else;									\
+	vmovdqu8	ZT1, (CYPH_PLAIN_OUT, DATA_OFFSET);		\
+	vmovdqu8	ZT2, 64(CYPH_PLAIN_OUT, DATA_OFFSET){MASKREG};	\
+.endif;									\
+/* prep cipher text blocks for the next ghash round */			\
+.ifnc FULL_PARTIAL, full;						\
+	vpxorq		ZT5, ZT2, ZT3;					\
+	vextracti32x4	$3, ZT3, AES_PARTIAL_BLOCK;			\
+/* for GHASH computation, clear the top bytes of the partial block */	\
+.ifc ENC_DEC, ENC;							\
+	vmovdqu8	ZT2, ZT2{MASKREG}{z};				\
+.else;									\
+	vmovdqu8	ZT5, ZT5{MASKREG}{z};				\
+.endif;									\
+.endif;									\
+/* shuffle cipher text blocks for GHASH computation */			\
+.ifc ENC_DEC, ENC;							\
+	vpshufb		SHFMSK, ZT1, GHASHIN_AESOUT_B03;		\
+	vpshufb		SHFMSK, ZT2, GHASHIN_AESOUT_B47;		\
+.else;									\
+	vpshufb		SHFMSK, ZT4, GHASHIN_AESOUT_B03;		\
+	vpshufb		SHFMSK, ZT5, GHASHIN_AESOUT_B47;		\
+.endif;									\
+.ifc DO_REDUCTION, do_reduction;					\
+	/* XOR current GHASH value (ZT13) into block 0 */		\
+	vpxorq		ZT13, GHASHIN_AESOUT_B03, GHASHIN_AESOUT_B03;	\
+.endif;									\
+.ifc DO_REDUCTION, final_reduction;					\
+	/* Return GHASH value (ZT13) in TO_REDUCE_L */			\
+	vmovdqa64	ZT13, TO_REDUCE_L;				\
+.endif;
+
+/*
+ * GHASH the last 7 ciphertext blocks.
+ * - it uses the same GHASH macros as GHASH_LAST_8 but with a twist
+ * - it loads the GHASH keys for each of the data blocks, so that:
+ *   - blocks 0, 1, 2 and 3 use GHASH keys 7, 6, 5 and 4 respectively
+ *   - blocks 4, 5 and 6 use GHASH keys 3, 2 and 1 respectively
+ * - the code ensures that the unused block 7 and its GHASH key are zeroed
+ *   (the clmul product is zero this way and does not affect the result)
+ */
+#define GHASH_LAST_7(HASHSUBKEY, BL47, BL03, ZTH, ZTM, ZTL, ZT01, ZT02, ZT03, ZT04, AAD_HASH, MASKREG, IA0, GH, GL,GM)	\
+	vmovdqa64	POLY2(%rip), XWORD(ZT04);							\
+	VCLMUL_1_TO_8_STEP1(HASHSUBKEY, BL47, ZT01, ZT02, ZTH, ZTM, ZTL, 7)				\
+	vpxorq		GH, ZTH, ZTH;									\
+	vpxorq		GL, ZTL, ZTL;									\
+	vpxorq		GM, ZTM, ZTM;									\
+	VCLMUL_1_TO_8_STEP2(HASHSUBKEY, BL47, BL03, ZT01, ZT02, ZT03, ZTH, ZTM, ZTL, 7)			\
+	VCLMUL_REDUCE(AAD_HASH, XWORD(ZT04), XWORD(BL47), XWORD(BL03), XWORD(ZT01), XWORD(ZT02))	\
+
+/* GHASH the last 8 ciphertext blocks. */
+#define GHASH_LAST_8(HASHSUBKEY, BL47, BL03, ZTH, ZTM, ZTL, ZT01, ZT02, ZT03, AAD_HASH, GH, GL,GM)	\
+	VCLMUL_STEP1(HASHSUBKEY, BL47, ZT01, ZTH, ZTM, ZTL, NULL)					\
+	vpxorq		GH, ZTH, ZTH;									\
+	vpxorq		GL, ZTL, ZTL;									\
+	vpxorq		GM, ZTM, ZTM;									\
+	VCLMUL_STEP2(HASHSUBKEY, BL47, BL03, ZT01, ZT02, ZT03, ZTH, ZTM, ZTL, NULL, NULL)		\
+	vmovdqa64	POLY2(%rip), XWORD(ZT03);							\
+	VCLMUL_REDUCE(AAD_HASH, XWORD(ZT03), XWORD(BL47), XWORD(BL03), XWORD(ZT01), XWORD(ZT02))	\
+
+/*
+ * Encodes/decodes the given data. Assumes that the passed gcm_context_data
+ * struct has been initialized by GCM_INIT.
+ * Requires the input data to be at least 1 byte long (because of
+ * READ_SMALL_DATA_INPUT).
+ * Clobbers rax, r10-r15, zmm0-zmm31 and k1.
+ * Macro flow:
+ * calculate the number of 16-byte blocks in the message
+ * process (number of 16-byte blocks) mod 8
+ * process 8 x 16-byte blocks at a time until all are done
+ */
+#define GCM_ENC_DEC(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, PLAIN_CYPH_LEN, ENC_DEC, NROUNDS)	 \
+	or		PLAIN_CYPH_LEN, PLAIN_CYPH_LEN;		\
+	je		21f;					\
+	xor		%r11, %r11;	  			\
+	add		PLAIN_CYPH_LEN, InLen(GDATA_CTX);	\
+	vmovdqu64	AadHash(GDATA_CTX), %xmm14;		\
+	/*							\
+	 * Used for the update flow - if there was a previous 	\
+	 * partial block fill the remaining bytes here.		\
+	 */							\
+	PARTIAL_BLOCK(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, PLAIN_CYPH_LEN, %r11, %xmm14, ENC_DEC, %r10, %r12, %r13, %zmm0, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7, %zmm10, %zmm11, %zmm12, %zmm13, %k1)	\
+	/* lift counter block from GCM_INIT to here */		\
+	vmovdqu64	CurCount(GDATA_CTX), %xmm9;		\
+	/* Save the amount of data left to process in %r13 */	\
+	mov	PLAIN_CYPH_LEN, %r13;				\
+	sub	%r11, %r13;					\
+	je	21f;						\
+	vmovdqa64	SHUF_MASK(%rip), %zmm29;		\
+	vmovdqa64	ddq_addbe_4444(%rip), %zmm27;		\
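+	/* big loop: big_loop_nblocks (48) blocks per iteration */	\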
+	cmp		$(big_loop_nblocks * 16), %r13;		\
+	jl		12f;					\
+	vmovdqa64	ddq_addbe_1234(%rip), %zmm28;		\
+	INITIAL_BLOCKS_Nx16(PLAIN_CYPH_IN, CYPH_PLAIN_OUT, GDATA_KEY, GDATA_CTX, %r11, %zmm14, %zmm9, %r15, %zmm0, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7, %zmm10, %zmm11, %zmm12, %zmm13, %zmm15, %zmm16, %zmm17, %zmm19, %zmm20, %zmm21, %zmm30, %zmm31, %zmm1, %zmm2, %zmm8, %zmm22, %zmm23, %zmm24 , %zmm25, %zmm26, %zmm27, %zmm28, %zmm29, ENC_DEC, 48, 32, NROUNDS)	 \
+	sub		 $(big_loop_nblocks * 16), %r13;	\
+	cmp		$(big_loop_nblocks * 16), %r13;		\
+	jl		11f;					\
+10:;								\
+	GHASH_ENCRYPT_Nx16_PARALLEL(PLAIN_CYPH_IN, CYPH_PLAIN_OUT, GDATA_KEY, GDATA_CTX, %r11, %zmm9, %zmm29, %zmm0, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7, %zmm10, %zmm11, %zmm12, %zmm13, %zmm15, %zmm16, %zmm17, %zmm19, %zmm20, %zmm21, %zmm30, %zmm31, %zmm1, %zmm2, %zmm8, %zmm22, %zmm23, %zmm24, %zmm25, %zmm26, %zmm27, %zmm28, %zmm14, ENC_DEC, 48, 32, %r15, NROUNDS)	   \
+	sub		$(big_loop_nblocks * 16), %r13;		\
+	cmp		$(big_loop_nblocks * 16), %r13;		\
+	jge		10b;					\
+11:;								\
+	vpshufb		%xmm29, %xmm9, %xmm9;			\
+	vmovdqa64	%xmm9, XWORD(%zmm28);			\
+	GHASH_LAST_Nx16(GDATA_CTX, %zmm14, %zmm0, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7, %zmm10, %zmm11, %zmm12, %zmm13, %zmm15, %zmm16, %zmm17, %zmm19, %zmm20, %zmm21, %zmm24, %zmm25, %zmm26, 48, 32)					\
+	or		%r13, %r13;				\
+	jz		20f;					\
+12:;								\
+	/*							\
+	 * Less than 256 bytes will be handled by the small	\
+	 * message code, which can process up to 16 blocks	\
+	 * (16 bytes each).					\
+	 */							\
+	cmp		$256, %r13;				\
+	jge		13f;					\
+	/*							\
+	 * Determine how many blocks to process; process one	\
+	 * additional block if there is a partial block		\
+	 */							\
+	mov		%r13, %r12;				\
+	add		$15, %r12;				\
+	shr		$4, %r12;				\
+	GCM_ENC_DEC_SMALL(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, PLAIN_CYPH_LEN, ENC_DEC, %r11, %r13, %r12, %xmm9, %xmm14, %zmm0, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7, %zmm10, %zmm11, %zmm12, %zmm13, %zmm15, %zmm16, %zmm17, %zmm19, %zmm20, %zmm21, %zmm30, %zmm31, %zmm1, %zmm2, %zmm8, %zmm22, %zmm23, %r10, %r15, %k1, %zmm29, NROUNDS)		\
+	vmovdqa64	%xmm9, %xmm28;				\
+	jmp	20f;						\
+13:;								\
+	mov		%r13, %r12;				\
+	and		$0xff, %r12;				\
+	add		$15, %r12;				\
+	shr		$4, %r12;				\
+	/*							\
+	 * Don't allow 8 INITIAL blocks since this will		\
+	 * be handled by the x8 partial loop.			\
+	 */							\
+	and		$7, %r12;				\
+	je		8f;					\
+	cmp		$1, %r12;				\
+	je		1f;					\
+	cmp		$2, %r12;				\
+	je		2f;					\
+	cmp		$3, %r12;				\
+	je		3f;					\
+	cmp		$4, %r12;				\
+	je		4f;					\
+	cmp		$5, %r12;				\
+	je		5f;					\
+	cmp		$6, %r12;				\
+	je		6f;					\
+7:;								\
+	INITIAL_BLOCKS(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, %r13, %r11, 7, %xmm9, %zmm14, %zmm0, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7, %zmm10, %zmm11, %zmm12, %zmm13, %zmm15, %zmm16, %r10, %r12, ENC_DEC, %k1, %zmm29, no_partial_block, NROUNDS)	\
+	jmp	9f;						\
+6:;								\
+	INITIAL_BLOCKS(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, %r13, %r11, 6, %xmm9, %zmm14, %zmm0, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7, %zmm10, %zmm11, %zmm12, %zmm13, %zmm15, %zmm16, %r10, %r12, ENC_DEC, %k1, %zmm29, no_partial_block, NROUNDS)	\
+	jmp	9f;						\
+5:;								\
+	INITIAL_BLOCKS(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, %r13, %r11, 5, %xmm9, %zmm14, %zmm0, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7, %zmm10, %zmm11, %zmm12, %zmm13, %zmm15, %zmm16, %r10, %r12, ENC_DEC, %k1, %zmm29, no_partial_block, NROUNDS)	\
+	jmp	9f;						\
+4:;								\
+	INITIAL_BLOCKS(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, %r13, %r11, 4, %xmm9, %zmm14, %zmm0, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7, %zmm10, %zmm11, %zmm12, %zmm13, %zmm15, %zmm16, %r10, %r12, ENC_DEC, %k1, %zmm29, no_partial_block, NROUNDS)	\
+	jmp	9f;						\
+3:;								\
+	INITIAL_BLOCKS(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, %r13, %r11, 3, %xmm9, %zmm14, %zmm0, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7, %zmm10, %zmm11, %zmm12, %zmm13, %zmm15, %zmm16, %r10, %r12, ENC_DEC, %k1, %zmm29, no_partial_block, NROUNDS)	\
+	jmp	9f;						\
+2:;								\
+	INITIAL_BLOCKS(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, %r13, %r11, 2, %xmm9, %zmm14, %zmm0, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7, %zmm10, %zmm11, %zmm12, %zmm13, %zmm15, %zmm16, %r10, %r12, ENC_DEC, %k1, %zmm29, no_partial_block, NROUNDS)	\
+	jmp	9f;						\
+1:;								\
+	INITIAL_BLOCKS(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, %r13, %r11, 1, %xmm9, %zmm14, %zmm0, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7, %zmm10, %zmm11, %zmm12, %zmm13, %zmm15, %zmm16, %r10, %r12, ENC_DEC, %k1, %zmm29, no_partial_block, NROUNDS)	\
+	jmp	9f;						\
+8:;								\
+	INITIAL_BLOCKS(GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, %r13, %r11, 0, %xmm9, %zmm14, %zmm0, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7, %zmm10, %zmm11, %zmm12, %zmm13, %zmm15, %zmm16, %r10, %r12, ENC_DEC, %k1, %zmm29, no_partial_block, NROUNDS)	\
+9:;									\
+	vmovdqa64	%xmm9, XWORD(%zmm28);				\
+	/*								\
+	 * Move cipher blocks from initial blocks to input of by8 macro	\
+	 * and for GHASH_LAST_8/7					\
+	 */								\
+	vmovdqa64	%zmm0, %zmm1;					\
+	vmovdqa64	%zmm3, %zmm2;					\
+	/*								\
+	 * The entire message cannot get processed in INITIAL_BLOCKS	\
+	 * - GCM_ENC_DEC_SMALL handles up to 16 blocks			\
+	 * - INITIAL_BLOCKS processes up to 15 blocks			\
+	 * - no need to check for zero length at this stage		\
+	 * In order to have only one reduction at the end, the start	\
+	 * HASH KEY pointer needs to be determined based on length and	\
+	 * call type. Note that 8 blocks are already ciphered in	\
+	 * INITIAL_BLOCKS and subtracted from LENGTH(%r13)		\
+	 */								\
+	lea		128(%r13), %r12;				\
+	add		$15, %r12;					\
+	and		$0x3f0, %r12;					\
+	/* if partial block then change hash key start by one */	\
+	mov		%r13, %r10;					\
+	and		$15, %r10;					\
+	add		$15, %r10;					\
+	and		$16, %r10;					\
+	sub		%r10, %r12;					\
+	lea		(HashKey + 16 + HashSubKey)(GDATA_CTX), %rax;	\
+	sub		%r12, %rax;					\
+	/*								\
+	 * %rax points at the first hash key to start GHASH which	\
+	 * needs to be updated as the message is processed		\
+	 */								\
+	vmovdqa64	ddq_addbe_8888(%rip), %zmm27;			\
+	vmovdqa64	ddq_add_8888(%rip), %zmm19;			\
+	vpxorq		%zmm24, %zmm24, %zmm24;				\
+	vpxorq		%zmm25, %zmm25, %zmm25;				\
+	vpxorq		%zmm26, %zmm26, %zmm26;				\
+	/* prepare counter 8 blocks */					\
+	vshufi64x2	$0, %zmm9, %zmm9, %zmm9;			\
+	vpaddd		ddq_add_5678(%rip), %zmm9, %zmm18;		\
+	vpaddd		ddq_add_1234(%rip), %zmm9, %zmm9;		\
+	vpshufb		%zmm29, %zmm9, %zmm9;				\
+	vpshufb		%zmm29, %zmm18, %zmm18;				\
+	/* Process 7 full blocks plus a partial block */		\
+	cmp		$128, %r13;					\
+	jl		17f;						\
+14:;									\
+	/*								\
+	 * in_order vs. out_order is an optimization to increment the	\
+	 * counter without shuffling it back into little endian.	\
+	 * %r15 keeps track of when we need to increment in_order so	\
+	 * that the carry is handled correctly.				\
+	 */								\
+	vmovq		XWORD(%zmm28), %r15;				\
+15:;									\
+	and		$255, WORD(%r15);				\
+	add		$8, WORD(%r15);					\
+	vmovdqu64	64(%rax), %zmm31;				\
+	vmovdqu64	(%rax), %zmm30;					\
+	GHASH_8_ENCRYPT_8_PARALLEL(GDATA_KEY, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, %r11, %zmm9, %zmm18, %zmm1, %zmm2, %xmm8, out_order, ENC_DEC, full, %r10, %r12, %r13, %zmm31, %zmm30, %zmm29, %zmm0, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7, %zmm10, %zmm11, %zmm12, %zmm13, %zmm15, %zmm16, %zmm17, %k1, no_reduction, %zmm25, %zmm24, %zmm26, NROUNDS)      \
+	add		$128, %rax;					\
+	add		$128, %r11;					\
+	sub		$128, %r13;					\
+	jz		18f;						\
+	cmp		$248, WORD(%r15);				\
+	jae		16f;						\
+	vpaddd		%zmm27, %zmm9, %zmm9;				\
+	vpaddd		%zmm27, %zmm18, %zmm18;				\
+	cmp		$128, %r13;					\
+	jl		17f;						\
+	jmp		15b;						\
+16:;									\
+	vpshufb		%zmm29, %zmm9, %zmm9;				\
+	vpshufb		%zmm29, %zmm18, %zmm18;				\
+	vpaddd		%zmm19, %zmm9, %zmm9;				\
+	vpaddd		%zmm19, %zmm18, %zmm18;				\
+	vpshufb		%zmm29, %zmm9, %zmm9;				\
+	vpshufb		%zmm29, %zmm18, %zmm18;				\
+	cmp		$128, %r13;					\
+	jge		15b;						\
+17:;									\
+	/*								\
+	 * Test to see if we need a by-8 pass with a partial block. At	\
+	 * this point, the remaining bytes are either 0 or 113-127.	\
+	 * 'in_order' shuffle needed to align key for partial block xor.\
+	 * 'out_order' is faster because it avoids extra shuffles.	\
+	 * counter blocks prepared for the next 8 blocks in BE format	\
+	 * - we can go ahead with out_order scenario			\
+	 */								\
+	vmovdqu64	64(%rax), %zmm31;				\
+	vmovdqu64	(%rax), %zmm30;					\
+	GHASH_8_ENCRYPT_8_PARALLEL(GDATA_KEY, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, %r11, %zmm9, %zmm18, %zmm1, %zmm2, %xmm8, out_order, ENC_DEC, partial, %r10, %r12, %r13, %zmm31, %zmm30, %zmm29, %zmm0, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7, %zmm10, %zmm11, %zmm12, %zmm13, %zmm15, %zmm16, %zmm17, %k1, no_reduction, %zmm25, %zmm24, %zmm26, NROUNDS)	\
+	add		$128, %rax;					\
+	add		$112, %r11;					\
+	sub		$112, %r13;					\
+	mov		%r13, PBlockLen(GDATA_CTX);			\
+	vmovdqu64	%xmm8, PBlockEncKey(GDATA_CTX);			\
+18:;									\
+	/* Extract the last counter block in LE format */		\
+	vextracti32x4	$3, %zmm18, XWORD(%zmm28);			\
+	vpshufb		XWORD(%zmm29), XWORD(%zmm28), XWORD(%zmm28);	\
+	/*								\
+	 * GHASH last cipher text blocks in xmm1-xmm8			\
+	 * if the 8th block is partial, then skip that block		\
+	 */								\
+	cmpq		$0, PBlockLen(GDATA_CTX);			\
+	jz		19f;						\
+	/* Save 8th partial block: GHASH_LAST_7 will clobber %zmm2 */	\
+	vextracti32x4	$3, %zmm2, XWORD(%zmm11);			\
+	GHASH_LAST_7(GDATA_CTX, %zmm2, %zmm1, %zmm0, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7, %zmm10, %xmm14, %k1, %r10, %zmm24, %zmm25, %zmm26)	\
+	/* XOR the partial word into the hash */			\
+	vpxorq		%xmm11, %xmm14, %xmm14;				\
+	jmp		20f;						\
+19:;									\
+	GHASH_LAST_8(GDATA_CTX, %zmm2, %zmm1, %zmm0, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7, %xmm14, %zmm24, %zmm25, %zmm26)		\
+20:;									\
+	vmovdqu64	XWORD(%zmm28), CurCount(GDATA_CTX);		\
+21:;									\
+	vmovdqu64	%xmm14, (GDATA_CTX);				\
+
+# Store up to 15 bytes from SIMD register SRC to memory at DST; the set bits of SIZE select 8/4/2/1 byte stores
+#define simd_store_avx_15(DST, SRC, SIZE, TMP, IDX)			\
+	xor	IDX, IDX;						\
+	test	$8, SIZE;						\
+	jz	44f;							\
+	vmovq	SRC, (DST, IDX, 1);					\
+	vpsrldq $8, SRC, SRC;						\
+	add	$8, IDX;						\
+44:;									\
+	vmovq	SRC, TMP;						\
+	test	$4, SIZE;						\
+	jz	43f;							\
+	mov	DWORD(TMP), (DST, IDX, 1);				\
+	shr	$32, TMP;						\
+	add	$4, IDX;						\
+43:;									\
+	test	$2, SIZE;						\
+	jz	42f;							\
+	mov	WORD(TMP), (DST, IDX, 1);				\
+	shr	$16, TMP;						\
+	add	$2, IDX;						\
+42:;									\
+	test	$1, SIZE;						\
+	jz 41f;								\
+	mov	BYTE(TMP), (DST, IDX, 1);				\
+41:;
+
+/*
+ * Finishes encryption/decryption of the last partial block and computes the authentication tag once GCM_UPDATE is done.
+ * Clobbers rax, r10-r12, and xmm0-xmm2, xmm5-xmm6, xmm9-xmm11, xmm13-xmm15
+ */
+#define GCM_COMPLETE(GDATA_KEY, GDATA_CTX, AUTH_TAG, AUTH_TAG_LEN, NROUNDS) \
+	vmovdqu HashKey + HashSubKey(GDATA_CTX), %xmm13;		\
+	vmovdqu OrigIV(GDATA_CTX), %xmm9;				\
+	ENCRYPT_SINGLE_BLOCK(GDATA_KEY, %xmm9, NROUNDS)			\
+	vmovdqu (GDATA_CTX), %xmm14;					\
+	/* Encrypt the final partial block */				\
+	mov PBlockLen(GDATA_CTX), %r12;					\
+	cmp $0, %r12;							\
+	je 36f;								\
+	/* GHASH computation for the last 16 byte block */		\
+	GHASH_MUL(%xmm14, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6)	\
+	vmovdqu %xmm14, (GDATA_CTX);					\
+36:;									\
+	mov AadLen(GDATA_CTX), %r12;					\
+	mov InLen(GDATA_CTX), %rax;					\
+	shl $3, %r12;							\
+	vmovd %r12d, %xmm15;						\
+	shl $3, %rax;							\
+	vmovq %rax, %xmm1;						\
+	vpslldq $8, %xmm15, %xmm15;					\
+	vpxor %xmm1, %xmm15, %xmm15;					\
+	vpxor %xmm15, %xmm14, %xmm14;					\
+	GHASH_MUL(%xmm14, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6)	\
+	vpshufb SHUF_MASK(%rip), %xmm14, %xmm14;			\
+	vpxor %xmm14, %xmm9, %xmm9;					\
+31:;									\
+	mov AUTH_TAG, %r10;						\
+	mov AUTH_TAG_LEN, %r11;						\
+	cmp $16, %r11;							\
+	je 34f;								\
+	cmp $12, %r11;							\
+	je 33f;								\
+	cmp $8, %r11;							\
+	je 32f;								\
+	simd_store_avx_15(%r10, %xmm9, %r11, %r12, %rax)		\
+	jmp 35f;							\
+32:;									\
+	vmovq %xmm9, %rax;						\
+	mov %rax, (%r10);						\
+	jmp 35f;							\
+33:;									\
+	vmovq %xmm9, %rax;						\
+	mov %rax, (%r10);						\
+	vpsrldq $8, %xmm9, %xmm9;					\
+	vmovd %xmm9, %eax;						\
+	mov %eax, 8(%r10);						\
+	jmp 35f;							\
+34:;									\
+	vmovdqu %xmm9, (%r10);						\
+35:;
+
+################################################################################################
+# void	aesni_gcm_init_avx_512
+#	 (gcm_data     *my_ctx_data,
+#	  gcm_context_data *data,
+#	  u8	  *iv, /* Pre-counter block j0: 4 byte salt
+#			(from Security Association) concatenated with 8 byte
+#			Initialisation Vector (from IPSec ESP Payload)
+#			concatenated with 0x00000001. 16-byte aligned pointer. */
+#	  u8	 *hash_subkey	/* Hash sub key input. Data starts on a 16-byte boundary. */
+#	  const   u8 *aad,	/* Additional Authentication Data (AAD)*/
+#	  u64	  aad_len)	/* Length of AAD in bytes. With RFC4106 this is 8 or 12 Bytes */
+################################################################################################
+SYM_FUNC_START(aesni_gcm_init_avx_512)
+	FUNC_SAVE_GHASH()
+
+	# memcpy(data.hash_keys, hash_subkey, 16 * 48)
+	pushq %rdi
+	pushq %rsi
+	pushq %rcx
+	lea HashSubKey(%rsi), %rdi
+	mov %rcx, %rsi
+	mov $16*48, %rcx
+	rep movsb
+	popq %rcx
+	popq %rsi
+	popq %rdi
+
+	GCM_INIT(arg2, arg3, arg4, arg5, arg6, %r10, %r11, %r12, %k1, %xmm14, %xmm2, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7, %zmm8, %zmm9, %zmm10)
+
+	FUNC_RESTORE_GHASH()
+	ret
+SYM_FUNC_END(aesni_gcm_init_avx_512)
+
+###############################################################################
+# void	aesni_gcm_enc_update_avx_512(
+#	 gcm_data	 *my_ctx_data,	   /* aligned to 16 Bytes */
+#	 gcm_context_data *data,
+#	 u8	 *out, /* Ciphertext output. Encrypt in-place is allowed.  */
+#	 const	 u8 *in, /* Plaintext input */
+#	 u64	 plaintext_len) /* Length of data in Bytes for encryption. */
+###############################################################################
+SYM_FUNC_START(aesni_gcm_enc_update_avx_512)
+	FUNC_SAVE_GHASH()
+
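+	# key_length (in bytes) is at offset 480 (2 * 15 * 16) in struct crypto_aes_ctx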
+	mov	2 * 15 * 16(arg1),%eax
+	cmp	$32, %eax
+	je	key_256_enc_update_2
+	cmp	$16, %eax
+	je	key_128_enc_update
+	# must be 192
+	GCM_ENC_DEC(arg1, arg2, arg3, arg4, arg5, ENC, 11)
+	FUNC_RESTORE_GHASH()
+	ret
+key_128_enc_update:
+	GCM_ENC_DEC(arg1, arg2, arg3, arg4, arg5, ENC, 9)
+	FUNC_RESTORE_GHASH()
+	ret
+key_256_enc_update_2:
+	GCM_ENC_DEC(arg1, arg2, arg3, arg4, arg5, ENC, 13)
+	FUNC_RESTORE_GHASH()
+	ret
+SYM_FUNC_END(aesni_gcm_enc_update_avx_512)
+
+###################################################################################
+# void	aesni_gcm_dec_update_avx_512(
+#	 gcm_data	 *my_ctx_data,	   /* aligned to 16 Bytes */
+#	 gcm_context_data *data,
+#	 u8	 *out,		/* Plaintext output. Decrypt in-place is allowed */
+#	 const	 u8 *in, 	/* Ciphertext input */
+#	 u64	 plaintext_len) /* Length of data in Bytes for decryption */
+###################################################################################
+SYM_FUNC_START(aesni_gcm_dec_update_avx_512)
+	FUNC_SAVE_GHASH()
+
+	mov	2 * 15 * 16(arg1),%eax
+	cmp	$32, %eax
+	je	key_256_dec_update
+	cmp	$16, %eax
+	je	key_128_dec_update
+	# must be 192
+	GCM_ENC_DEC(arg1, arg2, arg3, arg4, arg5, DEC, 11)
+	FUNC_RESTORE_GHASH()
+	ret
+key_128_dec_update:
+	GCM_ENC_DEC(arg1, arg2, arg3, arg4, arg5, DEC, 9)
+	FUNC_RESTORE_GHASH()
+	ret
+key_256_dec_update:
+	GCM_ENC_DEC(arg1, arg2, arg3, arg4, arg5, DEC, 13)
+	FUNC_RESTORE_GHASH()
+	ret
+SYM_FUNC_END(aesni_gcm_dec_update_avx_512)
+
+###############################################################################
+# void	aesni_gcm_finalize_avx_512(
+#	 gcm_data	 *my_ctx_data,	   /* aligned to 16 Bytes */
+#	 gcm_context_data *data,
+#	 u8	 *auth_tag,	/* Authenticated Tag output. */
+#	 u64	 auth_tag_len)	/* Authenticated Tag Length in bytes. */
+###############################################################################
+SYM_FUNC_START(aesni_gcm_finalize_avx_512)
+	FUNC_SAVE_GHASH()
+
+	mov	2 * 15 * 16(arg1),%eax
+	cmp	$32, %eax
+	je	key_256_complete
+	cmp	$16, %eax
+	je	key_128_complete
+	# must be 192
+	GCM_COMPLETE(arg1, arg2, arg3, arg4, 11)
+	FUNC_RESTORE_GHASH()
+	ret
+key_256_complete:
+	GCM_COMPLETE(arg1, arg2, arg3, arg4, 13)
+	FUNC_RESTORE_GHASH()
+	ret
+key_128_complete:
+	GCM_COMPLETE(arg1, arg2, arg3, arg4, 9)
+	FUNC_RESTORE_GHASH()
+	ret
+SYM_FUNC_END(aesni_gcm_finalize_avx_512)
+
+###############################################################################
+# void aes_gcm_precomp_avx_512(
+#	struct crypto_aes_ctx *ctx,	/* Context struct containing the key */
+#	u8 *hash_subkey);		/* Output buffer */
+###############################################################################
+SYM_FUNC_START(aes_gcm_precomp_avx_512)
+	FUNC_SAVE_GHASH()
+	vpxor	%xmm6, %xmm6, %xmm6
+	mov	2 * 15 * 16(arg1),%eax
+	cmp	$32, %eax
+	je	key_256_precomp
+	cmp	$16, %eax
+	je	key_128_precomp
+	ENCRYPT_SINGLE_BLOCK(%rdi, %xmm6, 11)
+	jmp	key_precomp
+key_128_precomp:
+	ENCRYPT_SINGLE_BLOCK(%rdi, %xmm6, 9)
+	jmp	key_precomp
+key_256_precomp:
+	ENCRYPT_SINGLE_BLOCK(%rdi, %xmm6, 13)
+key_precomp:
+	vpshufb SHUF_MASK(%rip), %xmm6, %xmm6
+	vmovdqa %xmm6, %xmm2
+	vpsllq	$1, %xmm6, %xmm6
+	vpsrlq	$63, %xmm2, %xmm2
+	vmovdqa %xmm2, %xmm1
+	vpslldq $8, %xmm2, %xmm2
+	vpsrldq $8, %xmm1, %xmm1
+	vpor	%xmm2, %xmm6, %xmm6
+
+	vpshufd  $0x24, %xmm1, %xmm2
+	vpcmpeqd TWOONE(%rip), %xmm2, %xmm2
+	vpand	 POLY(%rip), %xmm2, %xmm2
+	vpxor	 %xmm2, %xmm6, %xmm6
+
+	vmovdqu  %xmm6, HashKey(%rsi)
+
+	PRECOMPUTE(%rsi, %xmm6, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm7, %xmm8)
+
+	FUNC_RESTORE_GHASH()
+	ret
+
+SYM_FUNC_END(aes_gcm_precomp_avx_512)
diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index 9e56cdf..8fc5bac 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -55,13 +55,16 @@ MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");
  * This needs to be 16 byte aligned.
  */
 struct aesni_rfc4106_gcm_ctx {
-	u8 hash_subkey[16] AESNI_ALIGN_ATTR;
+	/* AVX512 optimized algorithms use 48 hash keys to conduct
+	 * multiple PCLMULQDQ operations in parallel
+	 */
+	u8 hash_subkey[16 * 48] AESNI_ALIGN_ATTR;
 	struct crypto_aes_ctx aes_key_expanded AESNI_ALIGN_ATTR;
 	u8 nonce[4];
 };
 
 struct generic_gcmaes_ctx {
-	u8 hash_subkey[16] AESNI_ALIGN_ATTR;
+	u8 hash_subkey[16 * 48] AESNI_ALIGN_ATTR;
 	struct crypto_aes_ctx aes_key_expanded AESNI_ALIGN_ATTR;
 };
 
@@ -82,7 +85,7 @@ struct gcm_context_data {
 	u8 current_counter[GCM_BLOCK_LEN];
 	u64 partial_block_len;
 	u64 unused;
-	u8 hash_keys[GCM_BLOCK_LEN * 16];
+	u8 hash_keys[48 * 16];
 };
 
 asmlinkage int aesni_set_key(struct crypto_aes_ctx *ctx, const u8 *in_key,
@@ -266,6 +269,47 @@ static const struct aesni_gcm_tfm_s aesni_gcm_tfm_avx_gen2 = {
 	.finalize = &aesni_gcm_finalize_avx_gen2,
 };
 
+#ifdef CONFIG_CRYPTO_AES_GCM_AVX512
+/*
+ * asmlinkage void aesni_gcm_init_avx_512()
+ * gcm_data *my_ctx_data, context data
+ * u8 *hash_subkey,  the Hash sub key input. Data starts on a 16-byte boundary.
+ */
+asmlinkage void aesni_gcm_init_avx_512(void *my_ctx_data,
+				       struct gcm_context_data *gdata,
+				       u8 *iv,
+				       u8 *hash_subkey,
+				       const u8 *aad,
+				       unsigned long aad_len);
+asmlinkage void aesni_gcm_enc_update_avx_512(void *ctx,
+					     struct gcm_context_data *gdata,
+					     u8 *out,
+					     const u8 *in,
+					     unsigned long plaintext_len);
+asmlinkage void aesni_gcm_dec_update_avx_512(void *ctx,
+					     struct gcm_context_data *gdata,
+					     u8 *out,
+					     const u8 *in,
+					     unsigned long ciphertext_len);
+asmlinkage void aesni_gcm_finalize_avx_512(void *ctx,
+					   struct gcm_context_data *gdata,
+					   u8 *auth_tag,
+					   unsigned long auth_tag_len);
+
+asmlinkage void aes_gcm_precomp_avx_512(struct crypto_aes_ctx *ctx, u8 *hash_subkey);
+
+static const struct aesni_gcm_tfm_s aesni_gcm_tfm_avx_512 = {
+	.init = &aesni_gcm_init_avx_512,
+	.enc_update = &aesni_gcm_enc_update_avx_512,
+	.dec_update = &aesni_gcm_dec_update_avx_512,
+	.finalize = &aesni_gcm_finalize_avx_512,
+};
+#else
+static void aes_gcm_precomp_avx_512(struct crypto_aes_ctx *ctx, u8 *hash_subkey)
+{}
+static const struct aesni_gcm_tfm_s aesni_gcm_tfm_avx_512 = {};
+#endif
+
 /*
  * asmlinkage void aesni_gcm_init_avx_gen4()
  * gcm_data *my_ctx_data, context data
@@ -669,7 +713,11 @@ rfc4106_set_hash_subkey(u8 *hash_subkey, const u8 *key, unsigned int key_len)
 	/* We want to cipher all zeros to create the hash sub key. */
 	memset(hash_subkey, 0, RFC4106_HASH_SUBKEY_SIZE);
 
-	aes_encrypt(&ctx, hash_subkey, hash_subkey);
+	if (IS_ENABLED(CONFIG_CRYPTO_AES_GCM_AVX512) && use_avx512 &&
+	    cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ))
+		aes_gcm_precomp_avx_512(&ctx, hash_subkey);
+	else
+		aes_encrypt(&ctx, hash_subkey, hash_subkey);
 
 	memzero_explicit(&ctx, sizeof(ctx));
 	return 0;
@@ -1114,7 +1162,11 @@ static int __init aesni_init(void)
 	if (!x86_match_cpu(aesni_cpu_id))
 		return -ENODEV;
 #ifdef CONFIG_X86_64
-	if (boot_cpu_has(X86_FEATURE_AVX2)) {
+	if (use_avx512 && IS_ENABLED(CONFIG_CRYPTO_AES_GCM_AVX512) &&
+	    cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ)) {
+		pr_info("AVX512 version of gcm_enc/dec engaged.\n");
+		aesni_gcm_tfm = &aesni_gcm_tfm_avx_512;
+	} else if (boot_cpu_has(X86_FEATURE_AVX2)) {
 		pr_info("AVX2 version of gcm_enc/dec engaged.\n");
 		aesni_gcm_tfm = &aesni_gcm_tfm_avx_gen4;
 	} else if (boot_cpu_has(X86_FEATURE_AVX)) {
diff --git a/crypto/Kconfig b/crypto/Kconfig
index 3043849..8c8a68d 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -661,6 +661,18 @@ config CRYPTO_AES_CTR_AVX512
 	depends on CRYPTO_AES_NI_INTEL
 	depends on AS_VAES_AVX512
 
+# CRYPTO_AES_GCM_AVX512 defaults to Y but depends on CRYPTO_AVX512 so that
+# a single option (CRYPTO_AVX512) can select multiple algorithms when they
+# are supported. Specifically, if the platform and/or toolchain does not
+# support VPCLMULQDQ, this algorithm must not be part of the set that
+# CRYPTO_AVX512 selects.
+config CRYPTO_AES_GCM_AVX512
+	bool
+	default y
+	depends on CRYPTO_AVX512
+	depends on CRYPTO_AES_NI_INTEL
+	depends on AS_VPCLMULQDQ
+
 config CRYPTO_CRC32C_SPARC64
 	tristate "CRC32c CRC algorithm (SPARC64)"
 	depends on SPARC64
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC V1 3/7] crypto: ghash - Optimized GHASH computations
  2020-12-18 21:11 ` [RFC V1 3/7] crypto: ghash - Optimized GHASH computations Megha Dey
@ 2020-12-19 17:03   ` Ard Biesheuvel
  2021-01-16  0:14     ` Dey, Megha
  0 siblings, 1 reply; 28+ messages in thread
From: Ard Biesheuvel @ 2020-12-19 17:03 UTC (permalink / raw)
  To: Megha Dey
  Cc: Herbert Xu, David S. Miller, Linux Crypto Mailing List,
	Linux Kernel Mailing List, ravi.v.shankar, tim.c.chen,
	andi.kleen, dave.hansen, wajdi.k.feghali, greg.b.tucker,
	robert.a.kasten, rajendrakumar.chinnaiyan, tomasz.kantecki,
	ryan.d.saffores, ilya.albrekht, kyung.min.park, Tony Luck,
	ira.weiny

On Fri, 18 Dec 2020 at 22:07, Megha Dey <megha.dey@intel.com> wrote:
>
> From: Kyung Min Park <kyung.min.park@intel.com>
>
> Optimize GHASH computations with the 512 bit wide VPCLMULQDQ instructions.
> The new instruction allows to work on 4 x 16 byte blocks at the time.
> For best parallelism and deeper out of order execution, the main loop of
> the code works on 16 x 16 byte blocks at the time and performs reduction
> every 48 x 16 byte blocks. Such approach needs 48 precomputed GHASH subkeys
> and the precompute operation has been optimized as well to leverage 512 bit
> registers, parallel carry less multiply and reduction.
>
> VPCLMULQDQ instruction is used to accelerate the most time-consuming
> part of GHASH, carry-less multiplication. VPCLMULQDQ instruction
> with AVX-512F adds EVEX encoded 512 bit version of PCLMULQDQ instruction.
>
> The glue code in ghash_clmulni_intel module overrides existing PCLMULQDQ
> version with the VPCLMULQDQ version when the following criteria are met:
> At compile time:
> 1. CONFIG_CRYPTO_AVX512 is enabled
> 2. toolchain(assembler) supports VPCLMULQDQ instructions
> At runtime:
> 1. VPCLMULQDQ and AVX512VL features are supported on a platform (currently
>    only Icelake)
> 2. If compiled as built-in module, ghash_clmulni_intel.use_avx512 is set at
>    boot time or /sys/module/ghash_clmulni_intel/parameters/use_avx512 is set
>    to 1 after boot.
>    If compiled as loadable module, use_avx512 module parameter must be set:
>    modprobe ghash_clmulni_intel use_avx512=1
>
> With new implementation, tcrypt ghash speed test shows about 4x to 10x
> speedup improvement for GHASH calculation compared to the original
> implementation with PCLMULQDQ when the bytes per update size is 256 Bytes
> or above. Detailed results for a variety of block sizes and update
> sizes are in the table below. The test was performed on Icelake based
> platform with constant frequency set for CPU.
>
> The average performance improvement of the AVX512 version over the current
> implementation is as follows:
> For bytes per update >= 1KB, we see the average improvement of 882%(~8.8x).
> For bytes per update < 1KB, we see the average improvement of 370%(~3.7x).
>
> A typical run of tcrypt with GHASH calculation with PCLMULQDQ instruction
> and VPCLMULQDQ instruction shows the following results.
>
> ---------------------------------------------------------------------------
> |            |            |         cycles/operation         |            |
> |            |            |       (the lower the better)     |            |
> |    byte    |   bytes    |----------------------------------| percentage |
> |   blocks   | per update |   GHASH test   |   GHASH test    | loss/gain  |
> |            |            | with PCLMULQDQ | with VPCLMULQDQ |            |
> |------------|------------|----------------|-----------------|------------|
> |      16    |     16     |       144      |        233      |   -38.0    |
> |      64    |     16     |       535      |        709      |   -24.5    |
> |      64    |     64     |       210      |        146      |    43.8    |
> |     256    |     16     |      1808      |       1911      |    -5.4    |
> |     256    |     64     |       865      |        581      |    48.9    |
> |     256    |    256     |       682      |        170      |   301.0    |
> |    1024    |     16     |      6746      |       6935      |    -2.7    |
> |    1024    |    256     |      2829      |        714      |   296.0    |
> |    1024    |   1024     |      2543      |        341      |   645.0    |
> |    2048    |     16     |     13219      |      13403      |    -1.3    |
> |    2048    |    256     |      5435      |       1408      |   286.0    |
> |    2048    |   1024     |      5218      |        685      |   661.0    |
> |    2048    |   2048     |      5061      |        565      |   796.0    |
> |    4096    |     16     |     40793      |      27615      |    47.8    |
> |    4096    |    256     |     10662      |       2689      |   297.0    |
> |    4096    |   1024     |     10196      |       1333      |   665.0    |
> |    4096    |   4096     |     10049      |       1011      |   894.0    |
> |    8192    |     16     |     51672      |      54599      |    -5.3    |
> |    8192    |    256     |     21228      |       5284      |   301.0    |
> |    8192    |   1024     |     20306      |       2556      |   694.0    |
> |    8192    |   4096     |     20076      |       2044      |   882.0    |
> |    8192    |   8192     |     20071      |       2017      |   895.0    |
> ---------------------------------------------------------------------------
>
> This work was inspired by the AES GCM mode optimization published
> in Intel Optimized IPSEC Cryptographic library.
> https://github.com/intel/intel-ipsec-mb/lib/avx512/gcm_vaes_avx512.asm
>
> Co-developed-by: Greg Tucker <greg.b.tucker@intel.com>
> Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
> Co-developed-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
> Signed-off-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
> Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
> Co-developed-by: Megha Dey <megha.dey@intel.com>
> Signed-off-by: Megha Dey <megha.dey@intel.com>

Hello Megha,

What is the purpose of this separate GHASH module? GHASH is only used
in combination with AES-CTR to produce GCM, and this series already
contains a GCM driver.

Do cores exist that implement PCLMULQDQ but not AES-NI?

If not, I think we should be able to drop this patch (and remove the
existing PCLMULQDQ GHASH driver as well)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms
  2020-12-18 21:10 [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms Megha Dey
                   ` (6 preceding siblings ...)
  2020-12-18 21:11 ` [RFC V1 7/7] crypto: aesni - AVX512 version of AESNI-GCM using VPCLMULQDQ Megha Dey
@ 2020-12-21 23:20 ` Eric Biggers
  2020-12-28 19:10   ` Dey, Megha
  7 siblings, 1 reply; 28+ messages in thread
From: Eric Biggers @ 2020-12-21 23:20 UTC (permalink / raw)
  To: Megha Dey
  Cc: herbert, davem, linux-crypto, linux-kernel, ravi.v.shankar,
	tim.c.chen, andi.kleen, dave.hansen, wajdi.k.feghali,
	greg.b.tucker, robert.a.kasten, rajendrakumar.chinnaiyan,
	tomasz.kantecki, ryan.d.saffores, ilya.albrekht, kyung.min.park,
	tony.luck, ira.weiny, x86

On Fri, Dec 18, 2020 at 01:10:57PM -0800, Megha Dey wrote:
> Optimize crypto algorithms using VPCLMULQDQ and VAES AVX512 instructions
> (first implemented on Intel's Icelake client and Xeon CPUs).
> 
> These algorithms take advantage of the AVX512 registers to keep the CPU
> busy and increase memory bandwidth utilization. They provide substantial
> (2-10x) improvements over existing crypto algorithms when update data size
> is greater than 128 bytes and do not have any significant impact when used
> on small amounts of data.
> 
> However, these algorithms may also incur a frequency penalty and cause
> collateral damage to other workloads running on the same core(co-scheduled
> threads). These frequency drops are also known as bin drops where 1 bin
> drop is around 100MHz. With the SpecCPU and ffmpeg benchmark, a 0-1 bin
> drop(0-100MHz) is observed on Icelake desktop and 0-2 bin drops (0-200Mhz)
> are observed on the Icelake server.
> 

Do these new algorithms all pass the self-tests, including the fuzz tests that
are enabled when CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y?

- Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms
  2020-12-21 23:20 ` [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms Eric Biggers
@ 2020-12-28 19:10   ` Dey, Megha
  2021-01-16 16:52     ` Ard Biesheuvel
  0 siblings, 1 reply; 28+ messages in thread
From: Dey, Megha @ 2020-12-28 19:10 UTC (permalink / raw)
  To: Eric Biggers
  Cc: herbert, davem, linux-crypto, linux-kernel, ravi.v.shankar,
	tim.c.chen, andi.kleen, dave.hansen, wajdi.k.feghali,
	greg.b.tucker, robert.a.kasten, rajendrakumar.chinnaiyan,
	tomasz.kantecki, ryan.d.saffores, ilya.albrekht, kyung.min.park,
	tony.luck, ira.weiny, x86

Hi Eric,

On 12/21/2020 3:20 PM, Eric Biggers wrote:
> On Fri, Dec 18, 2020 at 01:10:57PM -0800, Megha Dey wrote:
>> Optimize crypto algorithms using VPCLMULQDQ and VAES AVX512 instructions
>> (first implemented on Intel's Icelake client and Xeon CPUs).
>>
>> These algorithms take advantage of the AVX512 registers to keep the CPU
>> busy and increase memory bandwidth utilization. They provide substantial
>> (2-10x) improvements over existing crypto algorithms when update data size
>> is greater than 128 bytes and do not have any significant impact when used
>> on small amounts of data.
>>
>> However, these algorithms may also incur a frequency penalty and cause
>> collateral damage to other workloads running on the same core(co-scheduled
>> threads). These frequency drops are also known as bin drops where 1 bin
>> drop is around 100MHz. With the SpecCPU and ffmpeg benchmark, a 0-1 bin
>> drop(0-100MHz) is observed on Icelake desktop and 0-2 bin drops (0-200Mhz)
>> are observed on the Icelake server.
>>
> Do these new algorithms all pass the self-tests, including the fuzz tests that
> are enabled when CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y?

I had tested these algorithms with CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=n and
tcrypt, not with CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y (I wasn't aware this
existed, my bad).
I see a couple of errors after enabling it and am working on fixing those.

Megha

>
> - Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC V1 3/7] crypto: ghash - Optimized GHASH computations
  2020-12-19 17:03   ` Ard Biesheuvel
@ 2021-01-16  0:14     ` Dey, Megha
  2021-01-16  0:20       ` Dave Hansen
  2021-01-16  1:43       ` Eric Biggers
  0 siblings, 2 replies; 28+ messages in thread
From: Dey, Megha @ 2021-01-16  0:14 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Herbert Xu, David S. Miller, Linux Crypto Mailing List,
	Linux Kernel Mailing List, ravi.v.shankar, tim.c.chen,
	andi.kleen, dave.hansen, wajdi.k.feghali, greg.b.tucker,
	robert.a.kasten, rajendrakumar.chinnaiyan, tomasz.kantecki,
	ryan.d.saffores, ilya.albrekht, kyung.min.park, Tony Luck,
	ira.weiny

Hi Ard,

On 12/19/2020 9:03 AM, Ard Biesheuvel wrote:
> On Fri, 18 Dec 2020 at 22:07, Megha Dey <megha.dey@intel.com> wrote:
>> From: Kyung Min Park <kyung.min.park@intel.com>
>>
>> Optimize GHASH computations with the 512 bit wide VPCLMULQDQ instructions.
>> The new instruction allows to work on 4 x 16 byte blocks at the time.
>> For best parallelism and deeper out of order execution, the main loop of
>> the code works on 16 x 16 byte blocks at the time and performs reduction
>> every 48 x 16 byte blocks. Such approach needs 48 precomputed GHASH subkeys
>> and the precompute operation has been optimized as well to leverage 512 bit
>> registers, parallel carry less multiply and reduction.
>>
>> VPCLMULQDQ instruction is used to accelerate the most time-consuming
>> part of GHASH, carry-less multiplication. VPCLMULQDQ instruction
>> with AVX-512F adds EVEX encoded 512 bit version of PCLMULQDQ instruction.
>>
>> The glue code in ghash_clmulni_intel module overrides existing PCLMULQDQ
>> version with the VPCLMULQDQ version when the following criteria are met:
>> At compile time:
>> 1. CONFIG_CRYPTO_AVX512 is enabled
>> 2. toolchain(assembler) supports VPCLMULQDQ instructions
>> At runtime:
>> 1. VPCLMULQDQ and AVX512VL features are supported on a platform (currently
>>     only Icelake)
>> 2. If compiled as built-in module, ghash_clmulni_intel.use_avx512 is set at
>>     boot time or /sys/module/ghash_clmulni_intel/parameters/use_avx512 is set
>>     to 1 after boot.
>>     If compiled as loadable module, use_avx512 module parameter must be set:
>>     modprobe ghash_clmulni_intel use_avx512=1
>>
>> With new implementation, tcrypt ghash speed test shows about 4x to 10x
>> speedup improvement for GHASH calculation compared to the original
>> implementation with PCLMULQDQ when the bytes per update size is 256 Bytes
>> or above. Detailed results for a variety of block sizes and update
>> sizes are in the table below. The test was performed on Icelake based
>> platform with constant frequency set for CPU.
>>
>> The average performance improvement of the AVX512 version over the current
>> implementation is as follows:
>> For bytes per update >= 1KB, we see the average improvement of 882%(~8.8x).
>> For bytes per update < 1KB, we see the average improvement of 370%(~3.7x).
>>
>> A typical run of tcrypt with GHASH calculation with PCLMULQDQ instruction
>> and VPCLMULQDQ instruction shows the following results.
>>
>> ---------------------------------------------------------------------------
>> |            |            |         cycles/operation         |            |
>> |            |            |       (the lower the better)     |            |
>> |    byte    |   bytes    |----------------------------------| percentage |
>> |   blocks   | per update |   GHASH test   |   GHASH test    | loss/gain  |
>> |            |            | with PCLMULQDQ | with VPCLMULQDQ |            |
>> |------------|------------|----------------|-----------------|------------|
>> |      16    |     16     |       144      |        233      |   -38.0    |
>> |      64    |     16     |       535      |        709      |   -24.5    |
>> |      64    |     64     |       210      |        146      |    43.8    |
>> |     256    |     16     |      1808      |       1911      |    -5.4    |
>> |     256    |     64     |       865      |        581      |    48.9    |
>> |     256    |    256     |       682      |        170      |   301.0    |
>> |    1024    |     16     |      6746      |       6935      |    -2.7    |
>> |    1024    |    256     |      2829      |        714      |   296.0    |
>> |    1024    |   1024     |      2543      |        341      |   645.0    |
>> |    2048    |     16     |     13219      |      13403      |    -1.3    |
>> |    2048    |    256     |      5435      |       1408      |   286.0    |
>> |    2048    |   1024     |      5218      |        685      |   661.0    |
>> |    2048    |   2048     |      5061      |        565      |   796.0    |
>> |    4096    |     16     |     40793      |      27615      |    47.8    |
>> |    4096    |    256     |     10662      |       2689      |   297.0    |
>> |    4096    |   1024     |     10196      |       1333      |   665.0    |
>> |    4096    |   4096     |     10049      |       1011      |   894.0    |
>> |    8192    |     16     |     51672      |      54599      |    -5.3    |
>> |    8192    |    256     |     21228      |       5284      |   301.0    |
>> |    8192    |   1024     |     20306      |       2556      |   694.0    |
>> |    8192    |   4096     |     20076      |       2044      |   882.0    |
>> |    8192    |   8192     |     20071      |       2017      |   895.0    |
>> ---------------------------------------------------------------------------
>>
>> This work was inspired by the AES GCM mode optimization published
>> in Intel Optimized IPSEC Cryptographic library.
>> https://github.com/intel/intel-ipsec-mb/lib/avx512/gcm_vaes_avx512.asm
>>
>> Co-developed-by: Greg Tucker <greg.b.tucker@intel.com>
>> Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
>> Co-developed-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
>> Signed-off-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
>> Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
>> Co-developed-by: Megha Dey <megha.dey@intel.com>
>> Signed-off-by: Megha Dey <megha.dey@intel.com>
> Hello Megha,
>
> What is the purpose of this separate GHASH module? GHASH is only used
> in combination with AES-CTR to produce GCM, and this series already
> contains a GCM driver.
>
> Do cores exist that implement PCLMULQDQ but not AES-NI?
>
> If not, I think we should be able to drop this patch (and remove the
> existing PCLMULQDQ GHASH driver as well)

AFAIK, dm-verity (authenticated but not encrypted file system) is one 
use case for authentication only.

Though I am not sure whether GHASH or SHA is specifically used for this.

Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.

Megha


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC V1 3/7] crypto: ghash - Optimized GHASH computations
  2021-01-16  0:14     ` Dey, Megha
@ 2021-01-16  0:20       ` Dave Hansen
  2021-01-16  2:04         ` Eric Biggers
  2021-01-16  1:43       ` Eric Biggers
  1 sibling, 1 reply; 28+ messages in thread
From: Dave Hansen @ 2021-01-16  0:20 UTC (permalink / raw)
  To: Dey, Megha, Ard Biesheuvel
  Cc: Herbert Xu, David S. Miller, Linux Crypto Mailing List,
	Linux Kernel Mailing List, ravi.v.shankar, tim.c.chen,
	andi.kleen, wajdi.k.feghali, greg.b.tucker, robert.a.kasten,
	rajendrakumar.chinnaiyan, tomasz.kantecki, ryan.d.saffores,
	ilya.albrekht, kyung.min.park, Tony Luck, ira.weiny

On 1/15/21 4:14 PM, Dey, Megha wrote:
> Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.

That's true, but it's also possible that a hypervisor could enumerate
support for PCLMULQDQ and not AES-NI.  In general, we've tried to
implement x86 CPU features independently, even if they never show up in
a real CPU independently.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC V1 3/7] crypto: ghash - Optimized GHASH computations
  2021-01-16  0:14     ` Dey, Megha
  2021-01-16  0:20       ` Dave Hansen
@ 2021-01-16  1:43       ` Eric Biggers
  2021-01-16  5:07         ` Dey, Megha
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Biggers @ 2021-01-16  1:43 UTC (permalink / raw)
  To: Dey, Megha
  Cc: Ard Biesheuvel, Herbert Xu, David S. Miller,
	Linux Crypto Mailing List, Linux Kernel Mailing List,
	ravi.v.shankar, tim.c.chen, andi.kleen, dave.hansen,
	wajdi.k.feghali, greg.b.tucker, robert.a.kasten,
	rajendrakumar.chinnaiyan, tomasz.kantecki, ryan.d.saffores,
	ilya.albrekht, kyung.min.park, Tony Luck, ira.weiny

On Fri, Jan 15, 2021 at 04:14:40PM -0800, Dey, Megha wrote:
> > Hello Megha,
> > 
> > What is the purpose of this separate GHASH module? GHASH is only used
> > in combination with AES-CTR to produce GCM, and this series already
> > contains a GCM driver.
> > 
> > Do cores exist that implement PCLMULQDQ but not AES-NI?
> > 
> > If not, I think we should be able to drop this patch (and remove the
> > existing PCLMULQDQ GHASH driver as well)
> 
> AFAIK, dm-verity (authenticated but not encrypted file system) is one use
> case for authentication only.
> 
> Although I am not sure if GHASH is specifically used for this or SHA?
> 
> Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.
> 

dm-verity only uses unkeyed hash algorithms.  So no, it doesn't use GHASH.

- Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC V1 3/7] crypto: ghash - Optimized GHASH computations
  2021-01-16  0:20       ` Dave Hansen
@ 2021-01-16  2:04         ` Eric Biggers
  2021-01-16  5:13           ` Dave Hansen
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Biggers @ 2021-01-16  2:04 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dey, Megha, Ard Biesheuvel, Herbert Xu, David S. Miller,
	Linux Crypto Mailing List, Linux Kernel Mailing List,
	ravi.v.shankar, tim.c.chen, andi.kleen, wajdi.k.feghali,
	greg.b.tucker, robert.a.kasten, rajendrakumar.chinnaiyan,
	tomasz.kantecki, ryan.d.saffores, ilya.albrekht, kyung.min.park,
	Tony Luck, ira.weiny

On Fri, Jan 15, 2021 at 04:20:44PM -0800, Dave Hansen wrote:
> On 1/15/21 4:14 PM, Dey, Megha wrote:
> > Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.
> 
> That's true, bit it's also possible that a hypervisor could enumerate
> support for PCLMULQDQ and not AES-NI.  In general, we've tried to
> implement x86 CPU features independently, even if they never show up in
> a real CPU independently.

We only add optimized implementations of crypto algorithms if they are actually
useful, though.  If they would never be used in practice, that's not useful.

- Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC V1 3/7] crypto: ghash - Optimized GHASH computations
  2021-01-16  1:43       ` Eric Biggers
@ 2021-01-16  5:07         ` Dey, Megha
  0 siblings, 0 replies; 28+ messages in thread
From: Dey, Megha @ 2021-01-16  5:07 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Ard Biesheuvel, Herbert Xu, David S. Miller,
	Linux Crypto Mailing List, Linux Kernel Mailing List,
	ravi.v.shankar, tim.c.chen, andi.kleen, dave.hansen,
	wajdi.k.feghali, greg.b.tucker, robert.a.kasten,
	rajendrakumar.chinnaiyan, tomasz.kantecki, ryan.d.saffores,
	ilya.albrekht, kyung.min.park, Tony Luck, ira.weiny


On 1/15/2021 5:43 PM, Eric Biggers wrote:
> On Fri, Jan 15, 2021 at 04:14:40PM -0800, Dey, Megha wrote:
>>> Hello Megha,
>>>
>>> What is the purpose of this separate GHASH module? GHASH is only used
>>> in combination with AES-CTR to produce GCM, and this series already
>>> contains a GCM driver.
>>>
>>> Do cores exist that implement PCLMULQDQ but not AES-NI?
>>>
>>> If not, I think we should be able to drop this patch (and remove the
>>> existing PCLMULQDQ GHASH driver as well)
>> AFAIK, dm-verity (authenticated but not encrypted file system) is one use
>> case for authentication only.
>>
>> Although I am not sure if GHASH is specifically used for this or SHA?
>>
>> Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.
>>
> dm-verity only uses unkeyed hash algorithms.  So no, it doesn't use GHASH.

Hmm, I see. If that is the case, I am not aware of any other use case 
apart from GCM.

I see that the existing GHASH module has been in the kernel since 2009. I am
not sure if there was a use case then which is no longer valid now.

There may be out-of-tree kernel modules that still use it, but again that is
only speculation.

So, in the next version should I remove the existing GHASH module? (And 
of course remove this patch as well?)

-Megha

>
> - Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC V1 3/7] crypto: ghash - Optimized GHASH computations
  2021-01-16  2:04         ` Eric Biggers
@ 2021-01-16  5:13           ` Dave Hansen
  2021-01-16 16:48             ` Ard Biesheuvel
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Hansen @ 2021-01-16  5:13 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Dey, Megha, Ard Biesheuvel, Herbert Xu, David S. Miller,
	Linux Crypto Mailing List, Linux Kernel Mailing List,
	ravi.v.shankar, tim.c.chen, andi.kleen, wajdi.k.feghali,
	greg.b.tucker, robert.a.kasten, rajendrakumar.chinnaiyan,
	tomasz.kantecki, ryan.d.saffores, ilya.albrekht, kyung.min.park,
	Tony Luck, ira.weiny

On 1/15/21 6:04 PM, Eric Biggers wrote:
> On Fri, Jan 15, 2021 at 04:20:44PM -0800, Dave Hansen wrote:
>> On 1/15/21 4:14 PM, Dey, Megha wrote:
>>> Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.
>> That's true, bit it's also possible that a hypervisor could enumerate
>> support for PCLMULQDQ and not AES-NI.  In general, we've tried to
>> implement x86 CPU features independently, even if they never show up in
>> a real CPU independently.
> We only add optimized implementations of crypto algorithms if they are actually
> useful, though.  If they would never be used in practice, that's not useful.

Yes, totally agree.  If it's not of practical use, it doesn't get merged.

I just wanted to share what we do for other related but independent CPU
features.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC V1 3/7] crypto: ghash - Optimized GHASH computations
  2021-01-16  5:13           ` Dave Hansen
@ 2021-01-16 16:48             ` Ard Biesheuvel
  0 siblings, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2021-01-16 16:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Eric Biggers, Dey, Megha, Herbert Xu, David S. Miller,
	Linux Crypto Mailing List, Linux Kernel Mailing List,
	ravi.v.shankar, tim.c.chen, andi.kleen, wajdi.k.feghali,
	greg.b.tucker, robert.a.kasten, rajendrakumar.chinnaiyan,
	tomasz.kantecki, ryan.d.saffores, ilya.albrekht, kyung.min.park,
	Tony Luck, ira.weiny

On Sat, 16 Jan 2021 at 06:13, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 1/15/21 6:04 PM, Eric Biggers wrote:
> > On Fri, Jan 15, 2021 at 04:20:44PM -0800, Dave Hansen wrote:
> >> On 1/15/21 4:14 PM, Dey, Megha wrote:
> >>> Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.
> >> That's true, bit it's also possible that a hypervisor could enumerate
> >> support for PCLMULQDQ and not AES-NI.  In general, we've tried to
> >> implement x86 CPU features independently, even if they never show up in
> >> a real CPU independently.
> > We only add optimized implementations of crypto algorithms if they are actually
> > useful, though.  If they would never be used in practice, that's not useful.
>
> Yes, totally agree.  If it's not of practical use, it doesn't get merged.
>
> I just wanted to share what we do for other related but independent CPU
> features.

Thanks for the insight.

The issue with the current GHASH driver is that it uses infrastructure
that we may decide to remove (the async cryptd helper [0]). So adding
more dependencies on that without any proven benefit should obviously
be avoided at this time as well.

[0] https://lore.kernel.org/linux-arm-kernel/20201218170106.23280-1-ardb@kernel.org/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms
  2020-12-28 19:10   ` Dey, Megha
@ 2021-01-16 16:52     ` Ard Biesheuvel
  2021-01-16 18:35       ` Dey, Megha
  0 siblings, 1 reply; 28+ messages in thread
From: Ard Biesheuvel @ 2021-01-16 16:52 UTC (permalink / raw)
  To: Dey, Megha
  Cc: Eric Biggers, Herbert Xu, David S. Miller,
	Linux Crypto Mailing List, Linux Kernel Mailing List,
	ravi.v.shankar, tim.c.chen, andi.kleen, Dave Hansen,
	wajdi.k.feghali, greg.b.tucker, robert.a.kasten,
	rajendrakumar.chinnaiyan, tomasz.kantecki, ryan.d.saffores,
	ilya.albrekht, kyung.min.park, Tony Luck, ira.weiny, X86 ML

On Mon, 28 Dec 2020 at 20:11, Dey, Megha <megha.dey@intel.com> wrote:
>
> Hi Eric,
>
> On 12/21/2020 3:20 PM, Eric Biggers wrote:
> > On Fri, Dec 18, 2020 at 01:10:57PM -0800, Megha Dey wrote:
> >> Optimize crypto algorithms using VPCLMULQDQ and VAES AVX512 instructions
> >> (first implemented on Intel's Icelake client and Xeon CPUs).
> >>
> >> These algorithms take advantage of the AVX512 registers to keep the CPU
> >> busy and increase memory bandwidth utilization. They provide substantial
> >> (2-10x) improvements over existing crypto algorithms when update data size
> >> is greater than 128 bytes and do not have any significant impact when used
> >> on small amounts of data.
> >>
> >> However, these algorithms may also incur a frequency penalty and cause
> >> collateral damage to other workloads running on the same core(co-scheduled
> >> threads). These frequency drops are also known as bin drops where 1 bin
> >> drop is around 100MHz. With the SpecCPU and ffmpeg benchmark, a 0-1 bin
> >> drop(0-100MHz) is observed on Icelake desktop and 0-2 bin drops (0-200Mhz)
> >> are observed on the Icelake server.
> >>
> > Do these new algorithms all pass the self-tests, including the fuzz tests that
> > are enabled when CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y?
>
> I had tested these algorithms with CRYPTO_MANAGER_DISABLE_TESTS=n and
> tcrypt, not with
> CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y (I wasn't aware this existed, my bad).
> I see a couple of errors after enabling it and am working on fixing those.
>

Hello Megha,

I think the GHASH changes can be dropped (as discussed in the other
thread), given the lack of a use case. The existing GHASH driver could
also be removed in the future, but I don't think it needs to be part
of this series.

Could you please rebase this onto the latest AES-NI changes that are
in Herbert's tree? (as well as the ones I sent out today) They address
some issues with indirect calls and excessive disabling of preemption,
and your GCM and CTR changes are definitely going to be affected by
this as well.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC V1 1/7] x86: Probe assembler capabilities for VAES and VPLCMULQDQ support
  2020-12-18 21:10 ` [RFC V1 1/7] x86: Probe assembler capabilities for VAES and VPLCMULQDQ support Megha Dey
@ 2021-01-16 16:54   ` Ard Biesheuvel
  2021-01-20 22:38     ` Dey, Megha
  0 siblings, 1 reply; 28+ messages in thread
From: Ard Biesheuvel @ 2021-01-16 16:54 UTC (permalink / raw)
  To: Megha Dey
  Cc: Herbert Xu, David S. Miller, Linux Crypto Mailing List,
	Linux Kernel Mailing List, ravi.v.shankar, tim.c.chen,
	andi.kleen, Dave Hansen, wajdi.k.feghali, greg.b.tucker,
	robert.a.kasten, rajendrakumar.chinnaiyan, tomasz.kantecki,
	ryan.d.saffores, ilya.albrekht, kyung.min.park, Tony Luck,
	ira.weiny, X86 ML

On Fri, 18 Dec 2020 at 22:07, Megha Dey <megha.dey@intel.com> wrote:
>
> This is a preparatory patch to introduce the optimized crypto algorithms
> using AVX512 instructions which would require VAES and VPLCMULQDQ support.
>
> Check for VAES and VPCLMULQDQ assembler support using AVX512 registers.
>
> Cc: x86@kernel.org
> Signed-off-by: Megha Dey <megha.dey@intel.com>
> ---
>  arch/x86/Kconfig.assembler | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
> index 26b8c08..9ea0bc8 100644
> --- a/arch/x86/Kconfig.assembler
> +++ b/arch/x86/Kconfig.assembler
> @@ -1,6 +1,16 @@
>  # SPDX-License-Identifier: GPL-2.0
>  # Copyright (C) 2020 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
>
> +config AS_VAES_AVX512
> +       def_bool $(as-instr,vaesenc %zmm0$(comma)%zmm1$(comma)%zmm1) && 64BIT

Is the '&& 64BIT' necessary here, but not below?

In any case, better to use a separate 'depends on' line, for legibility

> +       help
> +         Supported by binutils >= 2.30 and LLVM integrated assembler
> +
> +config AS_VPCLMULQDQ
> +       def_bool $(as-instr,vpclmulqdq \$0$(comma)%zmm2$(comma)%zmm6$(comma)%zmm4)
> +       help
> +         Supported by binutils >= 2.30 and LLVM integrated assembler
> +
>  config AS_AVX512
>         def_bool $(as-instr,vpmovm2b %k1$(comma)%zmm5)
>         help
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC V1 2/7] crypto: crct10dif - Accelerated CRC T10 DIF with vectorized instruction
  2020-12-18 21:10 ` [RFC V1 2/7] crypto: crct10dif - Accelerated CRC T10 DIF with vectorized instruction Megha Dey
@ 2021-01-16 17:00   ` Ard Biesheuvel
  2021-01-20 22:46     ` Dey, Megha
  0 siblings, 1 reply; 28+ messages in thread
From: Ard Biesheuvel @ 2021-01-16 17:00 UTC (permalink / raw)
  To: Megha Dey
  Cc: Herbert Xu, David S. Miller, Linux Crypto Mailing List,
	Linux Kernel Mailing List, ravi.v.shankar, tim.c.chen,
	andi.kleen, Dave Hansen, wajdi.k.feghali, greg.b.tucker,
	robert.a.kasten, rajendrakumar.chinnaiyan, tomasz.kantecki,
	ryan.d.saffores, ilya.albrekht, kyung.min.park, Tony Luck,
	ira.weiny

On Fri, 18 Dec 2020 at 22:07, Megha Dey <megha.dey@intel.com> wrote:
>
> From: Kyung Min Park <kyung.min.park@intel.com>
>
> Update the crc_pcl function that calculates T10 Data Integrity Field
> CRC16 (CRC T10 DIF) using VPCLMULQDQ instruction. VPCLMULQDQ instruction
> with AVX-512F adds EVEX encoded 512 bit version of PCLMULQDQ instruction.
> The advantage comes from packing multiples of 4 * 128 bit data into AVX512
> reducing instruction latency.
>
> The glue code in crct10diff module overrides the existing PCLMULQDQ version
> with the VPCLMULQDQ version when the following criteria are met:
> At compile time:
> 1. CONFIG_CRYPTO_AVX512 is enabled
> 2. toolchain(assembler) supports VPCLMULQDQ instructions
> At runtime:
> 1. VPCLMULQDQ and AVX512VL features are supported on a platform (currently
>    only Icelake)
> 2. If compiled as built-in module, crct10dif_pclmul.use_avx512 is set at
>    boot time or /sys/module/crct10dif_pclmul/parameters/use_avx512 is set
>    to 1 after boot.
>    If compiled as loadable module, use_avx512 module parameter must be set:
>    modprobe crct10dif_pclmul use_avx512=1
>
> A typical run of tcrypt with CRC T10 DIF calculation with PCLMULQDQ
> instruction and VPCLMULQDQ instruction shows the following results:
> For bytes per update >= 1KB, we see the average improvement of 46%(~1.4x)
> For bytes per update < 1KB, we see the average improvement of 13%.
> Test was performed on an Icelake based platform with constant frequency
> set for CPU.
>
> Detailed results for a variety of block sizes and update sizes are in
> the table below.
>
> ---------------------------------------------------------------------------
> |            |            |         cycles/operation         |            |
> |            |            |       (the lower the better)     |            |
> |    byte    |   bytes    |----------------------------------| percentage |
> |   blocks   | per update |   CRC T10 DIF  |  CRC T10 DIF    | loss/gain  |
> |            |            | with PCLMULQDQ | with VPCLMULQDQ |            |
> |------------|------------|----------------|-----------------|------------|
> |      16    |     16     |        77      |        106      |   -27.0    |
> |      64    |     16     |       411      |        390      |     5.4    |
> |      64    |     64     |        71      |         85      |   -16.0    |
> |     256    |     16     |      1224      |       1308      |    -6.4    |
> |     256    |     64     |       393      |        407      |    -3.4    |
> |     256    |    256     |        93      |         86      |     8.1    |
> |    1024    |     16     |      4564      |       5020      |    -9.0    |
> |    1024    |    256     |       486      |        475      |     2.3    |
> |    1024    |   1024     |       221      |        148      |    49.3    |
> |    2048    |     16     |      8945      |       9851      |    -9.1    |
> |    2048    |    256     |       982      |        951      |     3.3    |
> |    2048    |   1024     |       500      |        369      |    35.5    |
> |    2048    |   2048     |       413      |        265      |    55.8    |
> |    4096    |     16     |     17885      |      19351      |    -7.5    |
> |    4096    |    256     |      1828      |       1713      |     6.7    |
> |    4096    |   1024     |       968      |        805      |    20.0    |
> |    4096    |   4096     |       739      |        475      |    55.6    |
> |    8192    |     16     |     48339      |      41556      |    16.3    |
> |    8192    |    256     |      3494      |       3342      |     4.5    |
> |    8192    |   1024     |      1959      |       1462      |    34.0    |
> |    8192    |   4096     |      1561      |       1036      |    50.7    |
> |    8192    |   8192     |      1540      |       1004      |    53.4    |
> ---------------------------------------------------------------------------
>
> This work was inspired by the CRC T10 DIF AVX512 optimization published
> in Intel Intelligent Storage Acceleration Library.
> https://github.com/intel/isa-l/blob/master/crc/crc16_t10dif_by16_10.asm
>
> Co-developed-by: Greg Tucker <greg.b.tucker@intel.com>
> Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
> Co-developed-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
> Signed-off-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
> Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
> Signed-off-by: Megha Dey <megha.dey@intel.com>
> ---
>  arch/x86/crypto/Makefile                  |   1 +
>  arch/x86/crypto/crct10dif-avx512-asm_64.S | 482 ++++++++++++++++++++++++++++++
>  arch/x86/crypto/crct10dif-pclmul_glue.c   |  24 +-
>  arch/x86/include/asm/disabled-features.h  |   8 +-
>  crypto/Kconfig                            |  23 ++
>  5 files changed, 535 insertions(+), 3 deletions(-)
>  create mode 100644 arch/x86/crypto/crct10dif-avx512-asm_64.S
>
...
> diff --git a/arch/x86/crypto/crct10dif-pclmul_glue.c b/arch/x86/crypto/crct10dif-pclmul_glue.c
> index 71291d5a..26a6350 100644
> --- a/arch/x86/crypto/crct10dif-pclmul_glue.c
> +++ b/arch/x86/crypto/crct10dif-pclmul_glue.c
> @@ -35,6 +35,16 @@
>  #include <asm/simd.h>
>
>  asmlinkage u16 crc_t10dif_pcl(u16 init_crc, const u8 *buf, size_t len);
> +#ifdef CONFIG_CRYPTO_CRCT10DIF_AVX512
> +asmlinkage u16 crct10dif_pcl_avx512(u16 init_crc, const u8 *buf, size_t len);
> +#else
> +static u16 crct10dif_pcl_avx512(u16 init_crc, const u8 *buf, size_t len)
> +{ return 0; }
> +#endif
> +

Please drop the alternative definition. If you code the references
correctly, the alternative is never called.
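
For clarity, a minimal sketch of what that looks like (no stub at all; the
unconditional prototype is enough because IS_ENABLED() turns the guarded
call into dead code that the compiler discards when the option is off):

	/*
	 * No #else stub needed: with CONFIG_CRYPTO_CRCT10DIF_AVX512 unset,
	 * the IS_ENABLED() check at the call site is constant false, so no
	 * reference to crct10dif_pcl_avx512() survives compilation.
	 */
	asmlinkage u16 crct10dif_pcl_avx512(u16 init_crc, const u8 *buf,
					    size_t len);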

> +static bool use_avx512;
> +module_param(use_avx512, bool, 0644);
> +MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");
>
>  struct chksum_desc_ctx {
>         __u16 crc;
> @@ -56,7 +66,12 @@ static int chksum_update(struct shash_desc *desc, const u8 *data,
>
>         if (length >= 16 && crypto_simd_usable()) {
>                 kernel_fpu_begin();
> -               ctx->crc = crc_t10dif_pcl(ctx->crc, data, length);
> +               if (IS_ENABLED(CONFIG_CRYPTO_CRCT10DIF_AVX512) &&
> +                   cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ) &&
> +                   use_avx512)
> +                       ctx->crc = crct10dif_pcl_avx512(ctx->crc, data, length);
> +               else
> +                       ctx->crc = crc_t10dif_pcl(ctx->crc, data, length);

Please use a static call or static key here, and initialize its value
in the init code.
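
For reference, a minimal sketch of the static-key variant (the key name is
illustrative and the init function is assumed to be the module's existing
init path, so treat the symbol names as placeholders):

	#include <linux/jump_label.h>

	static DEFINE_STATIC_KEY_FALSE(crct10dif_use_avx512);

	static int __init crct10dif_intel_mod_init(void)
	{
		/* Decide once, at module load time */
		if (IS_ENABLED(CONFIG_CRYPTO_CRCT10DIF_AVX512) && use_avx512 &&
		    cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ))
			static_branch_enable(&crct10dif_use_avx512);
		/* rest of the existing init (CPU match, shash registration) */
		return crypto_register_shash(&alg);
	}

	/* ... and in chksum_update() / __chksum_finup(): */
		kernel_fpu_begin();
		if (static_branch_likely(&crct10dif_use_avx512))
			ctx->crc = crct10dif_pcl_avx512(ctx->crc, data, length);
		else
			ctx->crc = crc_t10dif_pcl(ctx->crc, data, length);
		kernel_fpu_end();

Note that this fixes the choice at module load, so flipping use_avx512 via
sysfs afterwards would no longer switch implementations.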

>                 kernel_fpu_end();
>         } else
>                 ctx->crc = crc_t10dif_generic(ctx->crc, data, length);
> @@ -75,7 +90,12 @@ static int __chksum_finup(__u16 crc, const u8 *data, unsigned int len, u8 *out)
>  {
>         if (len >= 16 && crypto_simd_usable()) {
>                 kernel_fpu_begin();
> -               *(__u16 *)out = crc_t10dif_pcl(crc, data, len);
> +               if (IS_ENABLED(CONFIG_CRYPTO_CRCT10DIF_AVX512) &&
> +                   cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ) &&
> +                   use_avx512)
> +                       *(__u16 *)out = crct10dif_pcl_avx512(crc, data, len);
> +               else
> +                       *(__u16 *)out = crc_t10dif_pcl(crc, data, len);

Same here.

>                 kernel_fpu_end();
>         } else
>                 *(__u16 *)out = crc_t10dif_generic(crc, data, len);
> diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
> index 5861d34..1192dea 100644
> --- a/arch/x86/include/asm/disabled-features.h
> +++ b/arch/x86/include/asm/disabled-features.h
> @@ -56,6 +56,12 @@
>  # define DISABLE_PTI           (1 << (X86_FEATURE_PTI & 31))
>  #endif
>
> +#if defined(CONFIG_AS_VPCLMULQDQ)
> +# define DISABLE_VPCLMULQDQ    0
> +#else
> +# define DISABLE_VPCLMULQDQ    (1 << (X86_FEATURE_VPCLMULQDQ & 31))
> +#endif
> +
>  #ifdef CONFIG_IOMMU_SUPPORT
>  # define DISABLE_ENQCMD        0
>  #else
> @@ -82,7 +88,7 @@
>  #define DISABLED_MASK14        0
>  #define DISABLED_MASK15        0
>  #define DISABLED_MASK16        (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
> -                        DISABLE_ENQCMD)
> +                        DISABLE_ENQCMD|DISABLE_VPCLMULQDQ)
>  #define DISABLED_MASK17        0
>  #define DISABLED_MASK18        0
>  #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
> diff --git a/crypto/Kconfig b/crypto/Kconfig
> index a367fcf..b090f14 100644
> --- a/crypto/Kconfig
> +++ b/crypto/Kconfig
> @@ -613,6 +613,29 @@ config CRYPTO_CRC32C_VPMSUM
>           (vpmsum) instructions, introduced in POWER8. Enable on POWER8
>           and newer processors for improved performance.
>
> +config CRYPTO_AVX512
> +       bool "AVX512 hardware acceleration for crypto algorithms"
> +       depends on X86
> +       depends on 64BIT
> +       help
> +         This option will compile in AVX512 hardware accelerated crypto
> +         algorithms. These optimized algorithms provide substantial(2-10x)
> +         improvements over existing crypto algorithms for large data size.
> +         However, it may also incur a frequency penalty (aka. "bin drops")
> +         and cause collateral damage to other workloads running on the
> +         same core.
> +
> +# We default CRYPTO_CRCT10DIF_AVX512 to Y but depend on CRYPTO_AVX512 in
> +# order to have a singular option (CRYPTO_AVX512) select multiple algorithms
> +# when supported. Specifically, if the platform and/or toolset does not
> +# support VPCLMULQDQ, then this algorithm should not be supported as part of
> +# the set that CRYPTO_AVX512 selects.
> +config CRYPTO_CRCT10DIF_AVX512
> +       bool
> +       default y
> +       depends on CRYPTO_AVX512
> +       depends on CRYPTO_CRCT10DIF_PCLMUL
> +       depends on AS_VPCLMULQDQ
>
>  config CRYPTO_CRC32C_SPARC64
>         tristate "CRC32c CRC algorithm (SPARC64)"
> --
> 2.7.4
>


* Re: [RFC V1 5/7] crypto: aesni - AES CTR x86_64 "by16" AVX512 optimization
  2020-12-18 21:11 ` [RFC V1 5/7] crypto: aesni - AES CTR x86_64 "by16" AVX512 optimization Megha Dey
@ 2021-01-16 17:03   ` Ard Biesheuvel
  2021-01-20 22:46     ` Dey, Megha
  0 siblings, 1 reply; 28+ messages in thread
From: Ard Biesheuvel @ 2021-01-16 17:03 UTC (permalink / raw)
  To: Megha Dey
  Cc: Herbert Xu, David S. Miller, Linux Crypto Mailing List,
	Linux Kernel Mailing List, ravi.v.shankar, tim.c.chen,
	andi.kleen, Dave Hansen, wajdi.k.feghali, greg.b.tucker,
	robert.a.kasten, rajendrakumar.chinnaiyan, tomasz.kantecki,
	ryan.d.saffores, ilya.albrekht, kyung.min.park, Tony Luck,
	ira.weiny

On Fri, 18 Dec 2020 at 22:08, Megha Dey <megha.dey@intel.com> wrote:
>
> Introduce the "by16" implementation of the AES CTR mode using AVX512
> optimizations. "by16" means that 16 independent blocks (each block
> being 128 bits) can be ciphered simultaneously as opposed to the
> current 8 blocks.
>
> The glue code in AESNI module overrides the existing "by8" CTR mode
> encryption/decryption routines with the "by16" ones when the following
> criteria are met:
> At compile time:
> 1. CONFIG_CRYPTO_AVX512 is enabled
> 2. toolchain(assembler) supports VAES instructions
> At runtime:
> 1. VAES and AVX512VL features are supported on platform (currently
>    only Icelake)
> 2. aesni_intel.use_avx512 module parameter is set at boot time. For this
>    algorithm, switching from AVX512 optimized version is not possible once
>    set at boot time because of how the code is structured today.(Can be
>    changed later if required)
>
> The functions aes_ctr_enc_128_avx512_by16(), aes_ctr_enc_192_avx512_by16()
> and aes_ctr_enc_256_avx512_by16() are adapted from Intel Optimized IPSEC
> Cryptographic library.
>
> On a Icelake desktop, with turbo disabled and all CPUs running at maximum
> frequency, the "by16" CTR mode optimization shows better performance
> across data & key sizes as measured by tcrypt.
>
> The average performance improvement of the "by16" version over the "by8"
> version is as follows:
> For all key sizes(128/192/256 bits),
>         data sizes < 128 bytes/block, negligible improvement(~3% loss)
>         data sizes > 128 bytes/block, there is an average improvement of
> 48% for both encryption and decryption.
>
> A typical run of tcrypt with AES CTR mode encryption/decryption of the
> "by8" and "by16" optimization on a Icelake desktop shows the following
> results:
>
> --------------------------------------------------------------
> |  key   | bytes | cycles/op (lower is better)| percentage   |
> | length |  per  |  encryption  |  decryption |  loss/gain   |
> | (bits) | block |-------------------------------------------|
> |        |       | by8  | by16  | by8  | by16 |  enc | dec   |
> |------------------------------------------------------------|
> |  128   |  16   | 156  | 168   | 164  | 168  | -7.7 |  -2.5 |
> |  128   |  64   | 180  | 190   | 157  | 146  | -5.6 |   7.1 |
> |  128   |  256  | 248  | 158   | 251  | 161  | 36.3 |  35.9 |
> |  128   |  1024 | 633  | 316   | 642  | 319  | 50.1 |  50.4 |
> |  128   |  1472 | 853  | 411   | 877  | 407  | 51.9 |  53.6 |
> |  128   |  8192 | 4463 | 1959  | 4447 | 1940 | 56.2 |  56.4 |
> |  192   |  16   | 136  | 145   | 149  | 166  | -6.7 | -11.5 |
> |  192   |  64   | 159  | 154   | 157  | 160  |  3.2 |  -2   |
> |  192   |  256  | 268  | 172   | 274  | 177  | 35.9 |  35.5 |
> |  192   |  1024 | 710  | 358   | 720  | 355  | 49.6 |  50.7 |
> |  192   |  1472 | 989  | 468   | 983  | 469  | 52.7 |  52.3 |
> |  192   |  8192 | 6326 | 3551  | 6301 | 3567 | 43.9 |  43.4 |
> |  256   |  16   | 153  | 165   | 139  | 156  | -7.9 | -12.3 |
> |  256   |  64   | 158  | 152   | 174  | 161  |  3.8 |   7.5 |
> |  256   |  256  | 283  | 176   | 287  | 202  | 37.9 |  29.7 |
> |  256   |  1024 | 797  | 393   | 807  | 395  | 50.7 |  51.1 |
> |  256   |  1472 | 1108 | 534   | 1107 | 527  | 51.9 |  52.4 |
> |  256   |  8192 | 5763 | 2616  | 5773 | 2617 | 54.7 |  54.7 |
> --------------------------------------------------------------
>
> This work was inspired by the AES CTR mode optimization published
> in Intel Optimized IPSEC Cryptographic library.
> https://github.com/intel/intel-ipsec-mb/blob/master/lib/avx512/cntr_vaes_avx512.asm
>
> Co-developed-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
> Signed-off-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
> Signed-off-by: Megha Dey <megha.dey@intel.com>
> ---
>  arch/x86/crypto/Makefile                    |   1 +
>  arch/x86/crypto/aes_ctrby16_avx512-x86_64.S | 856 ++++++++++++++++++++++++++++
>  arch/x86/crypto/aesni-intel_glue.c          |  57 +-
>  arch/x86/crypto/avx512_vaes_common.S        | 422 ++++++++++++++
>  arch/x86/include/asm/disabled-features.h    |   8 +-
>  crypto/Kconfig                              |  12 +
>  6 files changed, 1354 insertions(+), 2 deletions(-)
>  create mode 100644 arch/x86/crypto/aes_ctrby16_avx512-x86_64.S
>
...
> diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
> index ad8a718..f45059e 100644
> --- a/arch/x86/crypto/aesni-intel_glue.c
> +++ b/arch/x86/crypto/aesni-intel_glue.c
> @@ -46,6 +46,10 @@
>  #define CRYPTO_AES_CTX_SIZE (sizeof(struct crypto_aes_ctx) + AESNI_ALIGN_EXTRA)
>  #define XTS_AES_CTX_SIZE (sizeof(struct aesni_xts_ctx) + AESNI_ALIGN_EXTRA)
>
> +static bool use_avx512;
> +module_param(use_avx512, bool, 0644);
> +MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");
> +
>  /* This data is stored at the end of the crypto_tfm struct.
>   * It's a type of per "session" data storage location.
>   * This needs to be 16 byte aligned.
> @@ -191,6 +195,35 @@ asmlinkage void aes_ctr_enc_192_avx_by8(const u8 *in, u8 *iv,
>                 void *keys, u8 *out, unsigned int num_bytes);
>  asmlinkage void aes_ctr_enc_256_avx_by8(const u8 *in, u8 *iv,
>                 void *keys, u8 *out, unsigned int num_bytes);
> +
> +#ifdef CONFIG_CRYPTO_AES_CTR_AVX512
> +asmlinkage void aes_ctr_enc_128_avx512_by16(void *keys, u8 *out,
> +                                           const u8 *in,
> +                                           unsigned int num_bytes,
> +                                           u8 *iv);
> +asmlinkage void aes_ctr_enc_192_avx512_by16(void *keys, u8 *out,
> +                                           const u8 *in,
> +                                           unsigned int num_bytes,
> +                                           u8 *iv);
> +asmlinkage void aes_ctr_enc_256_avx512_by16(void *keys, u8 *out,
> +                                           const u8 *in,
> +                                           unsigned int num_bytes,
> +                                           u8 *iv);
> +#else
> +static inline void aes_ctr_enc_128_avx512_by16(void *keys, u8 *out,
> +                                              const u8 *in,
> +                                              unsigned int num_bytes,
> +                                              u8 *iv) {}
> +static inline void aes_ctr_enc_192_avx512_by16(void *keys, u8 *out,
> +                                              const u8 *in,
> +                                              unsigned int num_bytes,
> +                                              u8 *iv) {}
> +static inline void aes_ctr_enc_256_avx512_by16(void *keys, u8 *out,
> +                                              const u8 *in,
> +                                              unsigned int num_bytes,
> +                                              u8 *iv) {}
> +#endif
> +

Please drop these alternatives.

>  /*
>   * asmlinkage void aesni_gcm_init_avx_gen2()
>   * gcm_data *my_ctx_data, context data
> @@ -487,6 +520,23 @@ static void aesni_ctr_enc_avx_tfm(struct crypto_aes_ctx *ctx, u8 *out,
>                 aes_ctr_enc_256_avx_by8(in, iv, (void *)ctx, out, len);
>  }
>
> +static void aesni_ctr_enc_avx512_tfm(struct crypto_aes_ctx *ctx, u8 *out,
> +                                    const u8 *in, unsigned int len, u8 *iv)
> +{
> +       /*
> +        * based on key length, override with the by16 version
> +        * of ctr mode encryption/decryption for improved performance.
> +        * aes_set_key_common() ensures that key length is one of
> +        * {128,192,256}
> +        */
> +       if (ctx->key_length == AES_KEYSIZE_128)
> +               aes_ctr_enc_128_avx512_by16((void *)ctx, out, in, len, iv);
> +       else if (ctx->key_length == AES_KEYSIZE_192)
> +               aes_ctr_enc_192_avx512_by16((void *)ctx, out, in, len, iv);
> +       else
> +               aes_ctr_enc_256_avx512_by16((void *)ctx, out, in, len, iv);
> +}
> +
>  static int ctr_crypt(struct skcipher_request *req)
>  {
>         struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> @@ -1076,7 +1126,12 @@ static int __init aesni_init(void)
>                 aesni_gcm_tfm = &aesni_gcm_tfm_sse;
>         }
>         aesni_ctr_enc_tfm = aesni_ctr_enc;
> -       if (boot_cpu_has(X86_FEATURE_AVX)) {
> +       if (use_avx512 && IS_ENABLED(CONFIG_CRYPTO_AES_CTR_AVX512) &&
> +           cpu_feature_enabled(X86_FEATURE_VAES)) {
> +               /* Ctr mode performance optimization using AVX512 */
> +               aesni_ctr_enc_tfm = aesni_ctr_enc_avx512_tfm;
> +               pr_info("AES CTR mode by16 optimization enabled\n");

This will need to be changed to a static_call_update() once my
outstanding patch is merged.
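
For reference, a rough sketch of the static_call() form (names assumed; the
final shape will depend on the AES-NI rework already queued in cryptodev):

	DEFINE_STATIC_CALL(aesni_ctr_enc_tfm, aesni_ctr_enc);

	static int __init aesni_init(void)
	{
		/* ... existing feature checks and registration ... */
		if (use_avx512 && IS_ENABLED(CONFIG_CRYPTO_AES_CTR_AVX512) &&
		    cpu_feature_enabled(X86_FEATURE_VAES)) {
			static_call_update(aesni_ctr_enc_tfm,
					   aesni_ctr_enc_avx512_tfm);
			pr_info("AES CTR mode by16 optimization enabled\n");
		} else if (boot_cpu_has(X86_FEATURE_AVX)) {
			static_call_update(aesni_ctr_enc_tfm,
					   aesni_ctr_enc_avx_tfm);
			pr_info("AES CTR mode by8 optimization enabled\n");
		}
		/* ... */
	}

	/* callers invoke the trampoline rather than an indirect pointer */
	static_call(aesni_ctr_enc_tfm)(ctx, dst, src, len, iv);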

> +       } else if (boot_cpu_has(X86_FEATURE_AVX)) {
>                 /* optimize performance of ctr mode encryption transform */
>                 aesni_ctr_enc_tfm = aesni_ctr_enc_avx_tfm;
>                 pr_info("AES CTR mode by8 optimization enabled\n");


* Re: [RFC V1 7/7] crypto: aesni - AVX512 version of AESNI-GCM using VPCLMULQDQ
  2020-12-18 21:11 ` [RFC V1 7/7] crypto: aesni - AVX512 version of AESNI-GCM using VPCLMULQDQ Megha Dey
@ 2021-01-16 17:16   ` Ard Biesheuvel
  2021-01-20 22:48     ` Dey, Megha
  0 siblings, 1 reply; 28+ messages in thread
From: Ard Biesheuvel @ 2021-01-16 17:16 UTC (permalink / raw)
  To: Megha Dey
  Cc: Herbert Xu, David S. Miller, Linux Crypto Mailing List,
	Linux Kernel Mailing List, ravi.v.shankar, tim.c.chen,
	andi.kleen, Dave Hansen, wajdi.k.feghali, greg.b.tucker,
	robert.a.kasten, rajendrakumar.chinnaiyan, tomasz.kantecki,
	ryan.d.saffores, ilya.albrekht, kyung.min.park, Tony Luck,
	ira.weiny

On Fri, 18 Dec 2020 at 22:08, Megha Dey <megha.dey@intel.com> wrote:
>
> Introduce the AVX512 implementation that optimizes the AESNI-GCM encode
> and decode routines using VPCLMULQDQ.
>
> The glue code in AESNI module overrides the existing AVX2 GCM mode
> encryption/decryption routines with the AVX512 AES GCM mode ones when the
> following criteria are met:
> At compile time:
> 1. CONFIG_CRYPTO_AVX512 is enabled
> 2. toolchain(assembler) supports VPCLMULQDQ instructions
> At runtime:
> 1. VPCLMULQDQ and AVX512VL features are supported on a platform
>    (currently only Icelake)
> 2. aesni_intel.use_avx512 module parameter is set at boot time. For this
>    algorithm, switching from AVX512 optimized version is not possible
>    once set at boot time because of how the code is structured today.(Can
>    be changed later if required)
>
> The functions aesni_gcm_init_avx_512, aesni_gcm_enc_update_avx_512,
> aesni_gcm_dec_update_avx_512 and aesni_gcm_finalize_avx_512 are adapted
> from the Intel Optimized IPSEC Cryptographic library.
>
> On a Icelake desktop, with turbo disabled and all CPUs running at
> maximum frequency, the AVX512 GCM mode optimization shows better
> performance across data & key sizes as measured by tcrypt.
>
> The average performance improvement of the AVX512 version over the AVX2
> version is as follows:
> For all key sizes(128/192/256 bits),
>         data sizes < 128 bytes/block, negligible improvement (~7.5%)
>         data sizes > 128 bytes/block, there is an average improvement of
>         40% for both encryption and decryption.
>
> A typical run of tcrypt with AES GCM mode encryption/decryption of the
> AVX2 and AVX512 optimization on a Icelake desktop shows the following
> results:
>
>   ----------------------------------------------------------------------
>   |   key  | bytes | cycles/op (lower is better)   | Percentage gain/  |
>   | length |   per |   encryption  |  decryption   |      loss         |
>   | (bits) | block |-------------------------------|-------------------|
>   |        |       | avx2 | avx512 | avx2 | avx512 | Encrypt | Decrypt |
>   |---------------------------------------------------------------------
>   |  128   | 16    | 689  |  701   | 689  |  707   |  -1.7   |  -2.61  |
>   |  128   | 64    | 731  |  660   | 771  |  649   |   9.7   |  15.82  |
>   |  128   | 256   | 911  |  750   | 900  |  721   |  17.67  |  19.88  |
>   |  128   | 512   | 1181 |  814   | 1161 |  782   |  31.07  |  32.64  |
>   |  128   | 1024  | 1676 |  1052  | 1685 |  1030  |  37.23  |  38.87  |
>   |  128   | 2048  | 2475 |  1447  | 2456 |  1419  |  41.53  |  42.22  |
>   |  128   | 4096  | 3806 |  2154  | 3820 |  2119  |  43.41  |  44.53  |
>   |  128   | 8192  | 9169 |  3806  | 6997 |  3718  |  58.49  |  46.86  |
>   |  192   | 16    | 754  |  683   | 737  |  672   |   9.42  |   8.82  |
>   |  192   | 64    | 735  |  686   | 715  |  640   |   6.66  |  10.49  |
>   |  192   | 256   | 949  |  738   | 2435 |  729   |  22.23  |  70     |
>   |  192   | 512   | 1235 |  854   | 1200 |  833   |  30.85  |  30.58  |
>   |  192   | 1024  | 1777 |  1084  | 1763 |  1051  |  38.99  |  40.39  |
>   |  192   | 2048  | 2574 |  1497  | 2592 |  1459  |  41.84  |  43.71  |
>   |  192   | 4096  | 4086 |  2317  | 4091 |  2244  |  43.29  |  45.14  |
>   |  192   | 8192  | 7481 |  4054  | 7505 |  3953  |  45.81  |  47.32  |
>   |  256   | 16    | 755  |  682   | 720  |  683   |   9.68  |   5.14  |
>   |  256   | 64    | 744  |  677   | 719  |  658   |   9     |   8.48  |
>   |  256   | 256   | 962  |  758   | 948  |  749   |  21.21  |  21     |
>   |  256   | 512   | 1297 |  862   | 1276 |  836   |  33.54  |  34.48  |
>   |  256   | 1024  | 1831 |  1114  | 1819 |  1095  |  39.16  |  39.8   |
>   |  256   | 2048  | 2767 |  1566  | 2715 |  1524  |  43.4   |  43.87  |
>   |  256   | 4096  | 4378 |  2382  | 4368 |  2354  |  45.6   |  46.11  |
>   |  256   | 8192  | 8075 |  4262  | 8080 |  4186  |  47.22  |  48.19  |
>   ----------------------------------------------------------------------
>
> This work was inspired by the AES GCM mode optimization published in
> Intel Optimized IPSEC Cryptographic library.
> https://github.com/intel/intel-ipsec-mb/blob/master/lib/avx512/gcm_vaes_avx512.asm
>
> Co-developed-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
> Signed-off-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
> Signed-off-by: Megha Dey <megha.dey@intel.com>
> ---
>  arch/x86/crypto/Makefile                    |    1 +
>  arch/x86/crypto/aesni-intel_avx512-x86_64.S | 1788 +++++++++++++++++++++++++++
>  arch/x86/crypto/aesni-intel_glue.c          |   62 +-
>  crypto/Kconfig                              |   12 +
>  4 files changed, 1858 insertions(+), 5 deletions(-)
>  create mode 100644 arch/x86/crypto/aesni-intel_avx512-x86_64.S
>
...
> diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
> index 9e56cdf..8fc5bac 100644
> --- a/arch/x86/crypto/aesni-intel_glue.c
> +++ b/arch/x86/crypto/aesni-intel_glue.c
> @@ -55,13 +55,16 @@ MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");
>   * This needs to be 16 byte aligned.
>   */
>  struct aesni_rfc4106_gcm_ctx {
> -       u8 hash_subkey[16] AESNI_ALIGN_ATTR;
> +       /* AVX512 optimized algorithms use 48 hash keys to conduct
> +        * multiple PCLMULQDQ operations in parallel
> +        */
> +       u8 hash_subkey[16 * 48] AESNI_ALIGN_ATTR;
>         struct crypto_aes_ctx aes_key_expanded AESNI_ALIGN_ATTR;
>         u8 nonce[4];
>  };
>
>  struct generic_gcmaes_ctx {
> -       u8 hash_subkey[16] AESNI_ALIGN_ATTR;
> +       u8 hash_subkey[16 * 48] AESNI_ALIGN_ATTR;
>         struct crypto_aes_ctx aes_key_expanded AESNI_ALIGN_ATTR;
>  };
>
> @@ -82,7 +85,7 @@ struct gcm_context_data {
>         u8 current_counter[GCM_BLOCK_LEN];
>         u64 partial_block_len;
>         u64 unused;
> -       u8 hash_keys[GCM_BLOCK_LEN * 16];
> +       u8 hash_keys[48 * 16];
>  };
>

This structure gets allocated on the stack, and gets inflated
significantly by this change, even though the code is not enabled by
default, and not even supported for most users.

Is it really necessary for this to be per-request data? If these are
precomputed powers of H, they can be moved into the TFM context
structure instead, which lives on the heap (and can be shared by all
concurrent users of the TFM)
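
One possible shape for that, purely as a sketch with assumed names (the extra
key-table pointer is not part of the posted asm interface):

	struct generic_gcmaes_ctx {
		/* H^1..H^48, precomputed once at setkey time */
		u8 hash_key_table[48 * 16] AESNI_ALIGN_ATTR;
		struct crypto_aes_ctx aes_key_expanded AESNI_ALIGN_ATTR;
	};

	/*
	 * struct gcm_context_data then keeps only per-request running state
	 * (counters, partial block, lengths) and stays small on the stack,
	 * while the asm takes the heap-resident key table by pointer:
	 */
	asmlinkage void aesni_gcm_enc_update_avx_512(void *ctx,
						     struct gcm_context_data *gdata,
						     const u8 *hash_key_table,
						     u8 *out, const u8 *in,
						     unsigned long plaintext_len);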

>  asmlinkage int aesni_set_key(struct crypto_aes_ctx *ctx, const u8 *in_key,
> @@ -266,6 +269,47 @@ static const struct aesni_gcm_tfm_s aesni_gcm_tfm_avx_gen2 = {
>         .finalize = &aesni_gcm_finalize_avx_gen2,
>  };
>
> +#ifdef CONFIG_CRYPTO_AES_GCM_AVX512
> +/*
> + * asmlinkage void aesni_gcm_init_avx_512()
> + * gcm_data *my_ctx_data, context data
> + * u8 *hash_subkey,  the Hash sub key input. Data starts on a 16-byte boundary.
> + */
> +asmlinkage void aesni_gcm_init_avx_512(void *my_ctx_data,
> +                                      struct gcm_context_data *gdata,
> +                                      u8 *iv,
> +                                      u8 *hash_subkey,
> +                                      const u8 *aad,
> +                                      unsigned long aad_len);
> +asmlinkage void aesni_gcm_enc_update_avx_512(void *ctx,
> +                                            struct gcm_context_data *gdata,
> +                                            u8 *out,
> +                                            const u8 *in,
> +                                            unsigned long plaintext_len);
> +asmlinkage void aesni_gcm_dec_update_avx_512(void *ctx,
> +                                            struct gcm_context_data *gdata,
> +                                            u8 *out,
> +                                            const u8 *in,
> +                                            unsigned long ciphertext_len);
> +asmlinkage void aesni_gcm_finalize_avx_512(void *ctx,
> +                                          struct gcm_context_data *gdata,
> +                                          u8 *auth_tag,
> +                                          unsigned long auth_tag_len);
> +
> +asmlinkage void aes_gcm_precomp_avx_512(struct crypto_aes_ctx *ctx, u8 *hash_subkey);
> +
> +static const struct aesni_gcm_tfm_s aesni_gcm_tfm_avx_512 = {
> +       .init = &aesni_gcm_init_avx_512,
> +       .enc_update = &aesni_gcm_enc_update_avx_512,
> +       .dec_update = &aesni_gcm_dec_update_avx_512,
> +       .finalize = &aesni_gcm_finalize_avx_512,
> +};
> +#else
> +static void aes_gcm_precomp_avx_512(struct crypto_aes_ctx *ctx, u8 *hash_subkey)
> +{}
> +static const struct aesni_gcm_tfm_s aesni_gcm_tfm_avx_512 = {};
> +#endif
> +

Please drop the alternative dummy definitions.

>  /*
>   * asmlinkage void aesni_gcm_init_avx_gen4()
>   * gcm_data *my_ctx_data, context data
> @@ -669,7 +713,11 @@ rfc4106_set_hash_subkey(u8 *hash_subkey, const u8 *key, unsigned int key_len)
>         /* We want to cipher all zeros to create the hash sub key. */
>         memset(hash_subkey, 0, RFC4106_HASH_SUBKEY_SIZE);
>
> -       aes_encrypt(&ctx, hash_subkey, hash_subkey);
> +       if (IS_ENABLED(CONFIG_CRYPTO_AES_GCM_AVX512) && use_avx512 &&
> +           cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ))
> +               aes_gcm_precomp_avx_512(&ctx, hash_subkey);
> +       else
> +               aes_encrypt(&ctx, hash_subkey, hash_subkey);
>

I suppose this answers my question about the subkeys. Please find a
way to move these out of struct gcm_context_data so they don't need to
be copied to the stack for each request.


>         memzero_explicit(&ctx, sizeof(ctx));
>         return 0;
> @@ -1114,7 +1162,11 @@ static int __init aesni_init(void)
>         if (!x86_match_cpu(aesni_cpu_id))
>                 return -ENODEV;
>  #ifdef CONFIG_X86_64
> -       if (boot_cpu_has(X86_FEATURE_AVX2)) {
> +       if (use_avx512 && IS_ENABLED(CONFIG_CRYPTO_AES_GCM_AVX512) &&
> +           cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ)) {
> +               pr_info("AVX512 version of gcm_enc/dec engaged.\n");
> +               aesni_gcm_tfm = &aesni_gcm_tfm_avx_512;

This was changed in the cryptodev tree to use static keys.

> +       } else if (boot_cpu_has(X86_FEATURE_AVX2)) {
>                 pr_info("AVX2 version of gcm_enc/dec engaged.\n");
>                 aesni_gcm_tfm = &aesni_gcm_tfm_avx_gen4;
>         } else if (boot_cpu_has(X86_FEATURE_AVX)) {
> diff --git a/crypto/Kconfig b/crypto/Kconfig
> index 3043849..8c8a68d 100644
> --- a/crypto/Kconfig
> +++ b/crypto/Kconfig
> @@ -661,6 +661,18 @@ config CRYPTO_AES_CTR_AVX512
>         depends on CRYPTO_AES_NI_INTEL
>         depends on AS_VAES_AVX512
>
> +# We default CRYPTO_AES_GCM_AVX512 to Y but depend on CRYPTO_AVX512 in
> +# order to have a singular option (CRYPTO_AVX512) select multiple algorithms
> +# when supported. Specifically, if the platform and/or toolset does not
> +# support VPCLMULQDQ, then this algorithm should not be supported as part of
> +# the set that CRYPTO_AVX512 selects.
> +config CRYPTO_AES_GCM_AVX512
> +       bool
> +       default y
> +       depends on CRYPTO_AVX512
> +       depends on CRYPTO_AES_NI_INTEL
> +       depends on AS_VPCLMULQDQ
> +
>  config CRYPTO_CRC32C_SPARC64
>         tristate "CRC32c CRC algorithm (SPARC64)"
>         depends on SPARC64
> --
> 2.7.4
>


* Re: [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms
  2021-01-16 16:52     ` Ard Biesheuvel
@ 2021-01-16 18:35       ` Dey, Megha
  0 siblings, 0 replies; 28+ messages in thread
From: Dey, Megha @ 2021-01-16 18:35 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Eric Biggers, Herbert Xu, David S. Miller,
	Linux Crypto Mailing List, Linux Kernel Mailing List,
	ravi.v.shankar, tim.c.chen, andi.kleen, Dave Hansen,
	wajdi.k.feghali, greg.b.tucker, robert.a.kasten,
	rajendrakumar.chinnaiyan, tomasz.kantecki, ryan.d.saffores,
	ilya.albrekht, kyung.min.park, Tony Luck, ira.weiny, X86 ML

Hi Ard,

On 1/16/2021 8:52 AM, Ard Biesheuvel wrote:
> On Mon, 28 Dec 2020 at 20:11, Dey, Megha <megha.dey@intel.com> wrote:
>> Hi Eric,
>>
>> On 12/21/2020 3:20 PM, Eric Biggers wrote:
>>> On Fri, Dec 18, 2020 at 01:10:57PM -0800, Megha Dey wrote:
>>>> Optimize crypto algorithms using VPCLMULQDQ and VAES AVX512 instructions
>>>> (first implemented on Intel's Icelake client and Xeon CPUs).
>>>>
>>>> These algorithms take advantage of the AVX512 registers to keep the CPU
>>>> busy and increase memory bandwidth utilization. They provide substantial
>>>> (2-10x) improvements over existing crypto algorithms when update data size
>>>> is greater than 128 bytes and do not have any significant impact when used
>>>> on small amounts of data.
>>>>
>>>> However, these algorithms may also incur a frequency penalty and cause
>>>> collateral damage to other workloads running on the same core(co-scheduled
>>>> threads). These frequency drops are also known as bin drops where 1 bin
>>>> drop is around 100MHz. With the SpecCPU and ffmpeg benchmark, a 0-1 bin
>>>> drop(0-100MHz) is observed on Icelake desktop and 0-2 bin drops (0-200Mhz)
>>>> are observed on the Icelake server.
>>>>
>>> Do these new algorithms all pass the self-tests, including the fuzz tests that
>>> are enabled when CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y?
>> I had tested these algorithms with CRYPTO_MANAGER_DISABLE_TESTS=n and
>> tcrypt, not with
>> CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y (I wasn't aware this existed, my bad).
>> I see a couple of errors after enabling it and am working on fixing those.
>>
> Hello Megha,
>
> I think the GHASH changes can be dropped (as discussed in the other
> thread), given the lack of a use case. The existing GHASH driver could
> also be removed in the future, but I don't think it needs to be part
> of this series.
Ok, I will remove the GHASH patch from the next series.
>
> Could you please rebase this onto the latest AES-NI changes that are
> in Herbert's tree? (as well as the ones I sent out today) They address
> some issues with indirect calls and excessive disabling of preemption,
> and your GCM and CTR changes are definitely going to be affected by
> this as well.
Yeah sure, will do, thanks for the heads-up!


* Re: [RFC V1 1/7] x86: Probe assembler capabilities for VAES and VPLCMULQDQ support
  2021-01-16 16:54   ` Ard Biesheuvel
@ 2021-01-20 22:38     ` Dey, Megha
  0 siblings, 0 replies; 28+ messages in thread
From: Dey, Megha @ 2021-01-20 22:38 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Herbert Xu, David S. Miller, Linux Crypto Mailing List,
	Linux Kernel Mailing List, ravi.v.shankar, tim.c.chen,
	andi.kleen, Dave Hansen, wajdi.k.feghali, greg.b.tucker,
	robert.a.kasten, rajendrakumar.chinnaiyan, tomasz.kantecki,
	ryan.d.saffores, ilya.albrekht, kyung.min.park, Tony Luck,
	ira.weiny, X86 ML

Hi Ard,

On 1/16/2021 8:54 AM, Ard Biesheuvel wrote:
> On Fri, 18 Dec 2020 at 22:07, Megha Dey <megha.dey@intel.com> wrote:
>> This is a preparatory patch to introduce the optimized crypto algorithms
>> using AVX512 instructions which would require VAES and VPLCMULQDQ support.
>>
>> Check for VAES and VPCLMULQDQ assembler support using AVX512 registers.
>>
>> Cc: x86@kernel.org
>> Signed-off-by: Megha Dey <megha.dey@intel.com>
>> ---
>>   arch/x86/Kconfig.assembler | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
>> index 26b8c08..9ea0bc8 100644
>> --- a/arch/x86/Kconfig.assembler
>> +++ b/arch/x86/Kconfig.assembler
>> @@ -1,6 +1,16 @@
>>   # SPDX-License-Identifier: GPL-2.0
>>   # Copyright (C) 2020 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
>>
>> +config AS_VAES_AVX512
>> +       def_bool $(as-instr,vaesenc %zmm0$(comma)%zmm1$(comma)%zmm1) && 64BIT
> Is the '&& 64BIT' necessary here, but not below?
>
> In any case, better to use a separate 'depends on' line, for legibility

Yeah, I think the '&& 64BIT' is not required. I will remove it in the
next version.

-Megha

>
>> +       help
>> +         Supported by binutils >= 2.30 and LLVM integrated assembler
>> +
>> +config AS_VPCLMULQDQ
>> +       def_bool $(as-instr,vpclmulqdq \$0$(comma)%zmm2$(comma)%zmm6$(comma)%zmm4)
>> +       help
>> +         Supported by binutils >= 2.30 and LLVM integrated assembler
>> +
>>   config AS_AVX512
>>          def_bool $(as-instr,vpmovm2b %k1$(comma)%zmm5)
>>          help
>> --
>> 2.7.4
>>


* Re: [RFC V1 2/7] crypto: crct10dif - Accelerated CRC T10 DIF with vectorized instruction
  2021-01-16 17:00   ` Ard Biesheuvel
@ 2021-01-20 22:46     ` Dey, Megha
  0 siblings, 0 replies; 28+ messages in thread
From: Dey, Megha @ 2021-01-20 22:46 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Herbert Xu, David S. Miller, Linux Crypto Mailing List,
	Linux Kernel Mailing List, ravi.v.shankar, tim.c.chen,
	andi.kleen, Dave Hansen, wajdi.k.feghali, greg.b.tucker,
	robert.a.kasten, rajendrakumar.chinnaiyan, tomasz.kantecki,
	ryan.d.saffores, ilya.albrekht, kyung.min.park, Tony Luck,
	ira.weiny

Hi Ard,

On 1/16/2021 9:00 AM, Ard Biesheuvel wrote:
> On Fri, 18 Dec 2020 at 22:07, Megha Dey <megha.dey@intel.com> wrote:
>> From: Kyung Min Park <kyung.min.park@intel.com>
>>
>> Update the crc_pcl function that calculates T10 Data Integrity Field
>> CRC16 (CRC T10 DIF) using VPCLMULQDQ instruction. VPCLMULQDQ instruction
>> with AVX-512F adds EVEX encoded 512 bit version of PCLMULQDQ instruction.
>> The advantage comes from packing multiples of 4 * 128 bit data into AVX512
>> reducing instruction latency.
>>
>> The glue code in crct10diff module overrides the existing PCLMULQDQ version
>> with the VPCLMULQDQ version when the following criteria are met:
>> At compile time:
>> 1. CONFIG_CRYPTO_AVX512 is enabled
>> 2. toolchain(assembler) supports VPCLMULQDQ instructions
>> At runtime:
>> 1. VPCLMULQDQ and AVX512VL features are supported on a platform (currently
>>     only Icelake)
>> 2. If compiled as built-in module, crct10dif_pclmul.use_avx512 is set at
>>     boot time or /sys/module/crct10dif_pclmul/parameters/use_avx512 is set
>>     to 1 after boot.
>>     If compiled as loadable module, use_avx512 module parameter must be set:
>>     modprobe crct10dif_pclmul use_avx512=1
>>
>> A typical run of tcrypt with CRC T10 DIF calculation with PCLMULQDQ
>> instruction and VPCLMULQDQ instruction shows the following results:
>> For bytes per update >= 1KB, we see the average improvement of 46%(~1.4x)
>> For bytes per update < 1KB, we see the average improvement of 13%.
>> Test was performed on an Icelake based platform with constant frequency
>> set for CPU.
>>
>> Detailed results for a variety of block sizes and update sizes are in
>> the table below.
>>
>> ---------------------------------------------------------------------------
>> |            |            |         cycles/operation         |            |
>> |            |            |       (the lower the better)     |            |
>> |    byte    |   bytes    |----------------------------------| percentage |
>> |   blocks   | per update |   CRC T10 DIF  |  CRC T10 DIF    | loss/gain  |
>> |            |            | with PCLMULQDQ | with VPCLMULQDQ |            |
>> |------------|------------|----------------|-----------------|------------|
>> |      16    |     16     |        77      |        106      |   -27.0    |
>> |      64    |     16     |       411      |        390      |     5.4    |
>> |      64    |     64     |        71      |         85      |   -16.0    |
>> |     256    |     16     |      1224      |       1308      |    -6.4    |
>> |     256    |     64     |       393      |        407      |    -3.4    |
>> |     256    |    256     |        93      |         86      |     8.1    |
>> |    1024    |     16     |      4564      |       5020      |    -9.0    |
>> |    1024    |    256     |       486      |        475      |     2.3    |
>> |    1024    |   1024     |       221      |        148      |    49.3    |
>> |    2048    |     16     |      8945      |       9851      |    -9.1    |
>> |    2048    |    256     |       982      |        951      |     3.3    |
>> |    2048    |   1024     |       500      |        369      |    35.5    |
>> |    2048    |   2048     |       413      |        265      |    55.8    |
>> |    4096    |     16     |     17885      |      19351      |    -7.5    |
>> |    4096    |    256     |      1828      |       1713      |     6.7    |
>> |    4096    |   1024     |       968      |        805      |    20.0    |
>> |    4096    |   4096     |       739      |        475      |    55.6    |
>> |    8192    |     16     |     48339      |      41556      |    16.3    |
>> |    8192    |    256     |      3494      |       3342      |     4.5    |
>> |    8192    |   1024     |      1959      |       1462      |    34.0    |
>> |    8192    |   4096     |      1561      |       1036      |    50.7    |
>> |    8192    |   8192     |      1540      |       1004      |    53.4    |
>> ---------------------------------------------------------------------------
>>
>> This work was inspired by the CRC T10 DIF AVX512 optimization published
>> in Intel Intelligent Storage Acceleration Library.
>> https://github.com/intel/isa-l/blob/master/crc/crc16_t10dif_by16_10.asm
>>
>> Co-developed-by: Greg Tucker <greg.b.tucker@intel.com>
>> Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
>> Co-developed-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
>> Signed-off-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
>> Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
>> Signed-off-by: Megha Dey <megha.dey@intel.com>
>> ---
>>   arch/x86/crypto/Makefile                  |   1 +
>>   arch/x86/crypto/crct10dif-avx512-asm_64.S | 482 ++++++++++++++++++++++++++++++
>>   arch/x86/crypto/crct10dif-pclmul_glue.c   |  24 +-
>>   arch/x86/include/asm/disabled-features.h  |   8 +-
>>   crypto/Kconfig                            |  23 ++
>>   5 files changed, 535 insertions(+), 3 deletions(-)
>>   create mode 100644 arch/x86/crypto/crct10dif-avx512-asm_64.S
>>
> ...
>> diff --git a/arch/x86/crypto/crct10dif-pclmul_glue.c b/arch/x86/crypto/crct10dif-pclmul_glue.c
>> index 71291d5a..26a6350 100644
>> --- a/arch/x86/crypto/crct10dif-pclmul_glue.c
>> +++ b/arch/x86/crypto/crct10dif-pclmul_glue.c
>> @@ -35,6 +35,16 @@
>>   #include <asm/simd.h>
>>
>>   asmlinkage u16 crc_t10dif_pcl(u16 init_crc, const u8 *buf, size_t len);
>> +#ifdef CONFIG_CRYPTO_CRCT10DIF_AVX512
>> +asmlinkage u16 crct10dif_pcl_avx512(u16 init_crc, const u8 *buf, size_t len);
>> +#else
>> +static u16 crct10dif_pcl_avx512(u16 init_crc, const u8 *buf, size_t len)
>> +{ return 0; }
>> +#endif
>> +
> Please drop the alternative definition. If you code the references
> correctly, the alternative is never called.
ok.
>
>> +static bool use_avx512;
>> +module_param(use_avx512, bool, 0644);
>> +MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");
>>
>>   struct chksum_desc_ctx {
>>          __u16 crc;
>> @@ -56,7 +66,12 @@ static int chksum_update(struct shash_desc *desc, const u8 *data,
>>
>>          if (length >= 16 && crypto_simd_usable()) {
>>                  kernel_fpu_begin();
>> -               ctx->crc = crc_t10dif_pcl(ctx->crc, data, length);
>> +               if (IS_ENABLED(CONFIG_CRYPTO_CRCT10DIF_AVX512) &&
>> +                   cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ) &&
>> +                   use_avx512)
>> +                       ctx->crc = crct10dif_pcl_avx512(ctx->crc, data, length);
>> +               else
>> +                       ctx->crc = crc_t10dif_pcl(ctx->crc, data, length);
> Please use a static call or static key here, and initialize its value
> in the init code.
Yeah, I'll make the change.
>
>>                  kernel_fpu_end();
>>          } else
>>                  ctx->crc = crc_t10dif_generic(ctx->crc, data, length);
>> @@ -75,7 +90,12 @@ static int __chksum_finup(__u16 crc, const u8 *data, unsigned int len, u8 *out)
>>   {
>>          if (len >= 16 && crypto_simd_usable()) {
>>                  kernel_fpu_begin();
>> -               *(__u16 *)out = crc_t10dif_pcl(crc, data, len);
>> +               if (IS_ENABLED(CONFIG_CRYPTO_CRCT10DIF_AVX512) &&
>> +                   cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ) &&
>> +                   use_avx512)
>> +                       *(__u16 *)out = crct10dif_pcl_avx512(crc, data, len);
>> +               else
>> +                       *(__u16 *)out = crc_t10dif_pcl(crc, data, len);
> Same here.

will do

-Megha

>
>>                  kernel_fpu_end();
>>          } else
>>                  *(__u16 *)out = crc_t10dif_generic(crc, data, len);
>> diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
>> index 5861d34..1192dea 100644
>> --- a/arch/x86/include/asm/disabled-features.h
>> +++ b/arch/x86/include/asm/disabled-features.h
>> @@ -56,6 +56,12 @@
>>   # define DISABLE_PTI           (1 << (X86_FEATURE_PTI & 31))
>>   #endif
>>
>> +#if defined(CONFIG_AS_VPCLMULQDQ)
>> +# define DISABLE_VPCLMULQDQ    0
>> +#else
>> +# define DISABLE_VPCLMULQDQ    (1 << (X86_FEATURE_VPCLMULQDQ & 31))
>> +#endif
>> +
>>   #ifdef CONFIG_IOMMU_SUPPORT
>>   # define DISABLE_ENQCMD        0
>>   #else
>> @@ -82,7 +88,7 @@
>>   #define DISABLED_MASK14        0
>>   #define DISABLED_MASK15        0
>>   #define DISABLED_MASK16        (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
>> -                        DISABLE_ENQCMD)
>> +                        DISABLE_ENQCMD|DISABLE_VPCLMULQDQ)
>>   #define DISABLED_MASK17        0
>>   #define DISABLED_MASK18        0
>>   #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
>> diff --git a/crypto/Kconfig b/crypto/Kconfig
>> index a367fcf..b090f14 100644
>> --- a/crypto/Kconfig
>> +++ b/crypto/Kconfig
>> @@ -613,6 +613,29 @@ config CRYPTO_CRC32C_VPMSUM
>>            (vpmsum) instructions, introduced in POWER8. Enable on POWER8
>>            and newer processors for improved performance.
>>
>> +config CRYPTO_AVX512
>> +       bool "AVX512 hardware acceleration for crypto algorithms"
>> +       depends on X86
>> +       depends on 64BIT
>> +       help
>> +         This option will compile in AVX512 hardware accelerated crypto
>> +         algorithms. These optimized algorithms provide substantial(2-10x)
>> +         improvements over existing crypto algorithms for large data size.
>> +         However, it may also incur a frequency penalty (aka. "bin drops")
>> +         and cause collateral damage to other workloads running on the
>> +         same core.
>> +
>> +# We default CRYPTO_CRCT10DIF_AVX512 to Y but depend on CRYPTO_AVX512 in
>> +# order to have a singular option (CRYPTO_AVX512) select multiple algorithms
>> +# when supported. Specifically, if the platform and/or toolset does not
>> +# support VPCLMULQDQ, then this algorithm should not be supported as part of
>> +# the set that CRYPTO_AVX512 selects.
>> +config CRYPTO_CRCT10DIF_AVX512
>> +       bool
>> +       default y
>> +       depends on CRYPTO_AVX512
>> +       depends on CRYPTO_CRCT10DIF_PCLMUL
>> +       depends on AS_VPCLMULQDQ
>>
>>   config CRYPTO_CRC32C_SPARC64
>>          tristate "CRC32c CRC algorithm (SPARC64)"
>> --
>> 2.7.4
>>


* Re: [RFC V1 5/7] crypto: aesni - AES CTR x86_64 "by16" AVX512 optimization
  2021-01-16 17:03   ` Ard Biesheuvel
@ 2021-01-20 22:46     ` Dey, Megha
  0 siblings, 0 replies; 28+ messages in thread
From: Dey, Megha @ 2021-01-20 22:46 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Herbert Xu, David S. Miller, Linux Crypto Mailing List,
	Linux Kernel Mailing List, ravi.v.shankar, tim.c.chen,
	andi.kleen, Dave Hansen, wajdi.k.feghali, greg.b.tucker,
	robert.a.kasten, rajendrakumar.chinnaiyan, tomasz.kantecki,
	ryan.d.saffores, ilya.albrekht, kyung.min.park, Tony Luck,
	ira.weiny

Hi Ard,

On 1/16/2021 9:03 AM, Ard Biesheuvel wrote:
> On Fri, 18 Dec 2020 at 22:08, Megha Dey <megha.dey@intel.com> wrote:
>> Introduce the "by16" implementation of the AES CTR mode using AVX512
>> optimizations. "by16" means that 16 independent blocks (each block
>> being 128 bits) can be ciphered simultaneously as opposed to the
>> current 8 blocks.
>>
>> The glue code in AESNI module overrides the existing "by8" CTR mode
>> encryption/decryption routines with the "by16" ones when the following
>> criteria are met:
>> At compile time:
>> 1. CONFIG_CRYPTO_AVX512 is enabled
>> 2. toolchain(assembler) supports VAES instructions
>> At runtime:
>> 1. VAES and AVX512VL features are supported on platform (currently
>>     only Icelake)
>> 2. aesni_intel.use_avx512 module parameter is set at boot time. For this
>>     algorithm, switching from AVX512 optimized version is not possible once
>>     set at boot time because of how the code is structured today.(Can be
>>     changed later if required)
>>
>> The functions aes_ctr_enc_128_avx512_by16(), aes_ctr_enc_192_avx512_by16()
>> and aes_ctr_enc_256_avx512_by16() are adapted from Intel Optimized IPSEC
>> Cryptographic library.
>>
>> On a Icelake desktop, with turbo disabled and all CPUs running at maximum
>> frequency, the "by16" CTR mode optimization shows better performance
>> across data & key sizes as measured by tcrypt.
>>
>> The average performance improvement of the "by16" version over the "by8"
>> version is as follows:
>> For all key sizes(128/192/256 bits),
>>          data sizes < 128 bytes/block, negligible improvement(~3% loss)
>>          data sizes > 128 bytes/block, there is an average improvement of
>> 48% for both encryption and decryption.
>>
>> A typical run of tcrypt with AES CTR mode encryption/decryption of the
>> "by8" and "by16" optimization on a Icelake desktop shows the following
>> results:
>>
>> --------------------------------------------------------------
>> |  key   | bytes | cycles/op (lower is better)| percentage   |
>> | length |  per  |  encryption  |  decryption |  loss/gain   |
>> | (bits) | block |-------------------------------------------|
>> |        |       | by8  | by16  | by8  | by16 |  enc | dec   |
>> |------------------------------------------------------------|
>> |  128   |  16   | 156  | 168   | 164  | 168  | -7.7 |  -2.5 |
>> |  128   |  64   | 180  | 190   | 157  | 146  | -5.6 |   7.1 |
>> |  128   |  256  | 248  | 158   | 251  | 161  | 36.3 |  35.9 |
>> |  128   |  1024 | 633  | 316   | 642  | 319  | 50.1 |  50.4 |
>> |  128   |  1472 | 853  | 411   | 877  | 407  | 51.9 |  53.6 |
>> |  128   |  8192 | 4463 | 1959  | 4447 | 1940 | 56.2 |  56.4 |
>> |  192   |  16   | 136  | 145   | 149  | 166  | -6.7 | -11.5 |
>> |  192   |  64   | 159  | 154   | 157  | 160  |  3.2 |  -2   |
>> |  192   |  256  | 268  | 172   | 274  | 177  | 35.9 |  35.5 |
>> |  192   |  1024 | 710  | 358   | 720  | 355  | 49.6 |  50.7 |
>> |  192   |  1472 | 989  | 468   | 983  | 469  | 52.7 |  52.3 |
>> |  192   |  8192 | 6326 | 3551  | 6301 | 3567 | 43.9 |  43.4 |
>> |  256   |  16   | 153  | 165   | 139  | 156  | -7.9 | -12.3 |
>> |  256   |  64   | 158  | 152   | 174  | 161  |  3.8 |   7.5 |
>> |  256   |  256  | 283  | 176   | 287  | 202  | 37.9 |  29.7 |
>> |  256   |  1024 | 797  | 393   | 807  | 395  | 50.7 |  51.1 |
>> |  256   |  1472 | 1108 | 534   | 1107 | 527  | 51.9 |  52.4 |
>> |  256   |  8192 | 5763 | 2616  | 5773 | 2617 | 54.7 |  54.7 |
>> --------------------------------------------------------------
>>
>> This work was inspired by the AES CTR mode optimization published
>> in Intel Optimized IPSEC Cryptographic library.
>> https://github.com/intel/intel-ipsec-mb/blob/master/lib/avx512/cntr_vaes_avx512.asm
>>
>> Co-developed-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
>> Signed-off-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
>> Signed-off-by: Megha Dey <megha.dey@intel.com>
>> ---
>>   arch/x86/crypto/Makefile                    |   1 +
>>   arch/x86/crypto/aes_ctrby16_avx512-x86_64.S | 856 ++++++++++++++++++++++++++++
>>   arch/x86/crypto/aesni-intel_glue.c          |  57 +-
>>   arch/x86/crypto/avx512_vaes_common.S        | 422 ++++++++++++++
>>   arch/x86/include/asm/disabled-features.h    |   8 +-
>>   crypto/Kconfig                              |  12 +
>>   6 files changed, 1354 insertions(+), 2 deletions(-)
>>   create mode 100644 arch/x86/crypto/aes_ctrby16_avx512-x86_64.S
>>
> ...
>> diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
>> index ad8a718..f45059e 100644
>> --- a/arch/x86/crypto/aesni-intel_glue.c
>> +++ b/arch/x86/crypto/aesni-intel_glue.c
>> @@ -46,6 +46,10 @@
>>   #define CRYPTO_AES_CTX_SIZE (sizeof(struct crypto_aes_ctx) + AESNI_ALIGN_EXTRA)
>>   #define XTS_AES_CTX_SIZE (sizeof(struct aesni_xts_ctx) + AESNI_ALIGN_EXTRA)
>>
>> +static bool use_avx512;
>> +module_param(use_avx512, bool, 0644);
>> +MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");
>> +
>>   /* This data is stored at the end of the crypto_tfm struct.
>>    * It's a type of per "session" data storage location.
>>    * This needs to be 16 byte aligned.
>> @@ -191,6 +195,35 @@ asmlinkage void aes_ctr_enc_192_avx_by8(const u8 *in, u8 *iv,
>>                  void *keys, u8 *out, unsigned int num_bytes);
>>   asmlinkage void aes_ctr_enc_256_avx_by8(const u8 *in, u8 *iv,
>>                  void *keys, u8 *out, unsigned int num_bytes);
>> +
>> +#ifdef CONFIG_CRYPTO_AES_CTR_AVX512
>> +asmlinkage void aes_ctr_enc_128_avx512_by16(void *keys, u8 *out,
>> +                                           const u8 *in,
>> +                                           unsigned int num_bytes,
>> +                                           u8 *iv);
>> +asmlinkage void aes_ctr_enc_192_avx512_by16(void *keys, u8 *out,
>> +                                           const u8 *in,
>> +                                           unsigned int num_bytes,
>> +                                           u8 *iv);
>> +asmlinkage void aes_ctr_enc_256_avx512_by16(void *keys, u8 *out,
>> +                                           const u8 *in,
>> +                                           unsigned int num_bytes,
>> +                                           u8 *iv);
>> +#else
>> +static inline void aes_ctr_enc_128_avx512_by16(void *keys, u8 *out,
>> +                                              const u8 *in,
>> +                                              unsigned int num_bytes,
>> +                                              u8 *iv) {}
>> +static inline void aes_ctr_enc_192_avx512_by16(void *keys, u8 *out,
>> +                                              const u8 *in,
>> +                                              unsigned int num_bytes,
>> +                                              u8 *iv) {}
>> +static inline void aes_ctr_enc_256_avx512_by16(void *keys, u8 *out,
>> +                                              const u8 *in,
>> +                                              unsigned int num_bytes,
>> +                                              u8 *iv) {}
>> +#endif
>> +
> Please drop these alternatives.
ok
>
>>   /*
>>    * asmlinkage void aesni_gcm_init_avx_gen2()
>>    * gcm_data *my_ctx_data, context data
>> @@ -487,6 +520,23 @@ static void aesni_ctr_enc_avx_tfm(struct crypto_aes_ctx *ctx, u8 *out,
>>                  aes_ctr_enc_256_avx_by8(in, iv, (void *)ctx, out, len);
>>   }
>>
>> +static void aesni_ctr_enc_avx512_tfm(struct crypto_aes_ctx *ctx, u8 *out,
>> +                                    const u8 *in, unsigned int len, u8 *iv)
>> +{
>> +       /*
>> +        * based on key length, override with the by16 version
>> +        * of ctr mode encryption/decryption for improved performance.
>> +        * aes_set_key_common() ensures that key length is one of
>> +        * {128,192,256}
>> +        */
>> +       if (ctx->key_length == AES_KEYSIZE_128)
>> +               aes_ctr_enc_128_avx512_by16((void *)ctx, out, in, len, iv);
>> +       else if (ctx->key_length == AES_KEYSIZE_192)
>> +               aes_ctr_enc_192_avx512_by16((void *)ctx, out, in, len, iv);
>> +       else
>> +               aes_ctr_enc_256_avx512_by16((void *)ctx, out, in, len, iv);
>> +}
>> +
>>   static int ctr_crypt(struct skcipher_request *req)
>>   {
>>          struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
>> @@ -1076,7 +1126,12 @@ static int __init aesni_init(void)
>>                  aesni_gcm_tfm = &aesni_gcm_tfm_sse;
>>          }
>>          aesni_ctr_enc_tfm = aesni_ctr_enc;
>> -       if (boot_cpu_has(X86_FEATURE_AVX)) {
>> +       if (use_avx512 && IS_ENABLED(CONFIG_CRYPTO_AES_CTR_AVX512) &&
>> +           cpu_feature_enabled(X86_FEATURE_VAES)) {
>> +               /* Ctr mode performance optimization using AVX512 */
>> +               aesni_ctr_enc_tfm = aesni_ctr_enc_avx512_tfm;
>> +               pr_info("AES CTR mode by16 optimization enabled\n");
> This will need to be changed to a static_call_update() once my
> outstanding patch is merged.
yeah will do!
>
>> +       } else if (boot_cpu_has(X86_FEATURE_AVX)) {
>>                  /* optimize performance of ctr mode encryption transform */
>>                  aesni_ctr_enc_tfm = aesni_ctr_enc_avx_tfm;
>>                  pr_info("AES CTR mode by8 optimization enabled\n");


* Re: [RFC V1 7/7] crypto: aesni - AVX512 version of AESNI-GCM using VPCLMULQDQ
  2021-01-16 17:16   ` Ard Biesheuvel
@ 2021-01-20 22:48     ` Dey, Megha
  0 siblings, 0 replies; 28+ messages in thread
From: Dey, Megha @ 2021-01-20 22:48 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Herbert Xu, David S. Miller, Linux Crypto Mailing List,
	Linux Kernel Mailing List, ravi.v.shankar, tim.c.chen,
	andi.kleen, Dave Hansen, wajdi.k.feghali, greg.b.tucker,
	robert.a.kasten, rajendrakumar.chinnaiyan, tomasz.kantecki,
	ryan.d.saffores, ilya.albrekht, kyung.min.park, Tony Luck,
	ira.weiny

Hi Ard,

On 1/16/2021 9:16 AM, Ard Biesheuvel wrote:
> On Fri, 18 Dec 2020 at 22:08, Megha Dey <megha.dey@intel.com> wrote:
>> Introduce the AVX512 implementation that optimizes the AESNI-GCM encode
>> and decode routines using VPCLMULQDQ.
>>
>> The glue code in AESNI module overrides the existing AVX2 GCM mode
>> encryption/decryption routines with the AVX512 AES GCM mode ones when the
>> following criteria are met:
>> At compile time:
>> 1. CONFIG_CRYPTO_AVX512 is enabled
>> 2. toolchain(assembler) supports VPCLMULQDQ instructions
>> At runtime:
>> 1. VPCLMULQDQ and AVX512VL features are supported on a platform
>>     (currently only Icelake)
>> 2. aesni_intel.use_avx512 module parameter is set at boot time. For this
>>     algorithm, switching from AVX512 optimized version is not possible
>>     once set at boot time because of how the code is structured today.(Can
>>     be changed later if required)
>>
>> The functions aesni_gcm_init_avx_512, aesni_gcm_enc_update_avx_512,
>> aesni_gcm_dec_update_avx_512 and aesni_gcm_finalize_avx_512 are adapted
>> from the Intel Optimized IPSEC Cryptographic library.
>>
>> On a Icelake desktop, with turbo disabled and all CPUs running at
>> maximum frequency, the AVX512 GCM mode optimization shows better
>> performance across data & key sizes as measured by tcrypt.
>>
>> The average performance improvement of the AVX512 version over the AVX2
>> version is as follows:
>> For all key sizes(128/192/256 bits),
>>          data sizes < 128 bytes/block, negligible improvement (~7.5%)
>>          data sizes > 128 bytes/block, there is an average improvement of
>>          40% for both encryption and decryption.
>>
>> A typical run of tcrypt with AES GCM mode encryption/decryption of the
>> AVX2 and AVX512 optimization on a Icelake desktop shows the following
>> results:
>>
>>    ----------------------------------------------------------------------
>>    |   key  | bytes | cycles/op (lower is better)   | Percentage gain/  |
>>    | length |   per |   encryption  |  decryption   |      loss         |
>>    | (bits) | block |-------------------------------|-------------------|
>>    |        |       | avx2 | avx512 | avx2 | avx512 | Encrypt | Decrypt |
>>    |---------------------------------------------------------------------
>>    |  128   | 16    | 689  |  701   | 689  |  707   |  -1.7   |  -2.61  |
>>    |  128   | 64    | 731  |  660   | 771  |  649   |   9.7   |  15.82  |
>>    |  128   | 256   | 911  |  750   | 900  |  721   |  17.67  |  19.88  |
>>    |  128   | 512   | 1181 |  814   | 1161 |  782   |  31.07  |  32.64  |
>>    |  128   | 1024  | 1676 |  1052  | 1685 |  1030  |  37.23  |  38.87  |
>>    |  128   | 2048  | 2475 |  1447  | 2456 |  1419  |  41.53  |  42.22  |
>>    |  128   | 4096  | 3806 |  2154  | 3820 |  2119  |  43.41  |  44.53  |
>>    |  128   | 8192  | 9169 |  3806  | 6997 |  3718  |  58.49  |  46.86  |
>>    |  192   | 16    | 754  |  683   | 737  |  672   |   9.42  |   8.82  |
>>    |  192   | 64    | 735  |  686   | 715  |  640   |   6.66  |  10.49  |
>>    |  192   | 256   | 949  |  738   | 2435 |  729   |  22.23  |  70     |
>>    |  192   | 512   | 1235 |  854   | 1200 |  833   |  30.85  |  30.58  |
>>    |  192   | 1024  | 1777 |  1084  | 1763 |  1051  |  38.99  |  40.39  |
>>    |  192   | 2048  | 2574 |  1497  | 2592 |  1459  |  41.84  |  43.71  |
>>    |  192   | 4096  | 4086 |  2317  | 4091 |  2244  |  43.29  |  45.14  |
>>    |  192   | 8192  | 7481 |  4054  | 7505 |  3953  |  45.81  |  47.32  |
>>    |  256   | 16    | 755  |  682   | 720  |  683   |   9.68  |   5.14  |
>>    |  256   | 64    | 744  |  677   | 719  |  658   |   9     |   8.48  |
>>    |  256   | 256   | 962  |  758   | 948  |  749   |  21.21  |  21     |
>>    |  256   | 512   | 1297 |  862   | 1276 |  836   |  33.54  |  34.48  |
>>    |  256   | 1024  | 1831 |  1114  | 1819 |  1095  |  39.16  |  39.8   |
>>    |  256   | 2048  | 2767 |  1566  | 2715 |  1524  |  43.4   |  43.87  |
>>    |  256   | 4096  | 4378 |  2382  | 4368 |  2354  |  45.6   |  46.11  |
>>    |  256   | 8192  | 8075 |  4262  | 8080 |  4186  |  47.22  |  48.19  |
>>    ----------------------------------------------------------------------
>>
>> This work was inspired by the AES GCM mode optimization published in
>> Intel Optimized IPSEC Cryptographic library.
>> https://github.com/intel/intel-ipsec-mb/blob/master/lib/avx512/gcm_vaes_avx512.asm
>>
>> Co-developed-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
>> Signed-off-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
>> Signed-off-by: Megha Dey <megha.dey@intel.com>
>> ---
>>   arch/x86/crypto/Makefile                    |    1 +
>>   arch/x86/crypto/aesni-intel_avx512-x86_64.S | 1788 +++++++++++++++++++++++++++
>>   arch/x86/crypto/aesni-intel_glue.c          |   62 +-
>>   crypto/Kconfig                              |   12 +
>>   4 files changed, 1858 insertions(+), 5 deletions(-)
>>   create mode 100644 arch/x86/crypto/aesni-intel_avx512-x86_64.S
>>
> ...
>> diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
>> index 9e56cdf..8fc5bac 100644
>> --- a/arch/x86/crypto/aesni-intel_glue.c
>> +++ b/arch/x86/crypto/aesni-intel_glue.c
>> @@ -55,13 +55,16 @@ MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");
>>    * This needs to be 16 byte aligned.
>>    */
>>   struct aesni_rfc4106_gcm_ctx {
>> -       u8 hash_subkey[16] AESNI_ALIGN_ATTR;
>> +       /* AVX512 optimized algorithms use 48 hash keys to conduct
>> +        * multiple PCLMULQDQ operations in parallel
>> +        */
>> +       u8 hash_subkey[16 * 48] AESNI_ALIGN_ATTR;
>>          struct crypto_aes_ctx aes_key_expanded AESNI_ALIGN_ATTR;
>>          u8 nonce[4];
>>   };
>>
>>   struct generic_gcmaes_ctx {
>> -       u8 hash_subkey[16] AESNI_ALIGN_ATTR;
>> +       u8 hash_subkey[16 * 48] AESNI_ALIGN_ATTR;
>>          struct crypto_aes_ctx aes_key_expanded AESNI_ALIGN_ATTR;
>>   };
>>
>> @@ -82,7 +85,7 @@ struct gcm_context_data {
>>          u8 current_counter[GCM_BLOCK_LEN];
>>          u64 partial_block_len;
>>          u64 unused;
>> -       u8 hash_keys[GCM_BLOCK_LEN * 16];
>> +       u8 hash_keys[48 * 16];
>>   };
>>
> This structure gets allocated on the stack, and gets inflated
> significantly by this change, even though the code is not enabled by
> default, and not even supported for most users.
Hmm yeah, this makes sense. I will look into it.
>
> Is it really necessary for this to be per-request data? If these are
> precomputed powers of H, they can be moved into the TFM context
> structure instead, which lives on the heap (and can be shared by all
> concurrent users of the TFM)
Yeah, this is per-request data.
>
>>   asmlinkage int aesni_set_key(struct crypto_aes_ctx *ctx, const u8 *in_key,
>> @@ -266,6 +269,47 @@ static const struct aesni_gcm_tfm_s aesni_gcm_tfm_avx_gen2 = {
>>          .finalize = &aesni_gcm_finalize_avx_gen2,
>>   };
>>
>> +#ifdef CONFIG_CRYPTO_AES_GCM_AVX512
>> +/*
>> + * asmlinkage void aesni_gcm_init_avx_512()
>> + * gcm_data *my_ctx_data, context data
>> + * u8 *hash_subkey,  the Hash sub key input. Data starts on a 16-byte boundary.
>> + */
>> +asmlinkage void aesni_gcm_init_avx_512(void *my_ctx_data,
>> +                                      struct gcm_context_data *gdata,
>> +                                      u8 *iv,
>> +                                      u8 *hash_subkey,
>> +                                      const u8 *aad,
>> +                                      unsigned long aad_len);
>> +asmlinkage void aesni_gcm_enc_update_avx_512(void *ctx,
>> +                                            struct gcm_context_data *gdata,
>> +                                            u8 *out,
>> +                                            const u8 *in,
>> +                                            unsigned long plaintext_len);
>> +asmlinkage void aesni_gcm_dec_update_avx_512(void *ctx,
>> +                                            struct gcm_context_data *gdata,
>> +                                            u8 *out,
>> +                                            const u8 *in,
>> +                                            unsigned long ciphertext_len);
>> +asmlinkage void aesni_gcm_finalize_avx_512(void *ctx,
>> +                                          struct gcm_context_data *gdata,
>> +                                          u8 *auth_tag,
>> +                                          unsigned long auth_tag_len);
>> +
>> +asmlinkage void aes_gcm_precomp_avx_512(struct crypto_aes_ctx *ctx, u8 *hash_subkey);
>> +
>> +static const struct aesni_gcm_tfm_s aesni_gcm_tfm_avx_512 = {
>> +       .init = &aesni_gcm_init_avx_512,
>> +       .enc_update = &aesni_gcm_enc_update_avx_512,
>> +       .dec_update = &aesni_gcm_dec_update_avx_512,
>> +       .finalize = &aesni_gcm_finalize_avx_512,
>> +};
>> +#else
>> +static void aes_gcm_precomp_avx_512(struct crypto_aes_ctx *ctx, u8 *hash_subkey)
>> +{}
>> +static const struct aesni_gcm_tfm_s aesni_gcm_tfm_avx_512 = {};
>> +#endif
>> +
> Please drop the alternative dummy definitions.
ok
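
For the record, a minimal sketch of the alternative, as I read the comment
above (an assumption about the direction, not the final shape of the patch):
declare the AVX512 helper unconditionally and let IS_ENABLED() do the work at
the call site, so no dummy definitions are needed under an #else.

/* A declaration costs nothing even when the .S file is not built. */
asmlinkage void aes_gcm_precomp_avx_512(struct crypto_aes_ctx *ctx,
                                        u8 *hash_subkey);

/* e.g. in rfc4106_set_hash_subkey(): */
        if (IS_ENABLED(CONFIG_CRYPTO_AES_GCM_AVX512) && use_avx512 &&
            cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ))
                /*
                 * IS_ENABLED() folds to 0 when the option is off, so the
                 * branch is discarded at compile time and the symbol is
                 * never referenced -- no link error, no stub needed.
                 */
                aes_gcm_precomp_avx_512(&ctx, hash_subkey);
        else
                aes_encrypt(&ctx, hash_subkey, hash_subkey);

The empty aesni_gcm_tfm_avx_512 stub becomes moot anyway once the dispatch
moves to static keys, as discussed further down.
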
>
>>   /*
>>    * asmlinkage void aesni_gcm_init_avx_gen4()
>>    * gcm_data *my_ctx_data, context data
>> @@ -669,7 +713,11 @@ rfc4106_set_hash_subkey(u8 *hash_subkey, const u8 *key, unsigned int key_len)
>>          /* We want to cipher all zeros to create the hash sub key. */
>>          memset(hash_subkey, 0, RFC4106_HASH_SUBKEY_SIZE);
>>
>> -       aes_encrypt(&ctx, hash_subkey, hash_subkey);
>> +       if (IS_ENABLED(CONFIG_CRYPTO_AES_GCM_AVX512) && use_avx512 &&
>> +           cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ))
>> +               aes_gcm_precomp_avx_512(&ctx, hash_subkey);
>> +       else
>> +               aes_encrypt(&ctx, hash_subkey, hash_subkey);
>>
> I suppose this answers my question about the subkeys. Please find a
> way to move these out of struct gcm_context_data so they don't need to
> be copied to the stack for each request.
Hmm yeah. I will move this allocation to the heap instead.
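
To make the direction concrete, a rough sketch of the split being discussed
(field sizes are taken from the hunks above; whether the asm can consume the
key powers straight from the per-TFM buffer is an assumption that only the
.S code can confirm):

/* Per-TFM context: heap-allocated, shared by all requests on this key. */
struct generic_gcmaes_ctx {
        /* 48 precomputed GHASH key powers, filled once at setkey() time */
        u8 hash_subkey[16 * 48] AESNI_ALIGN_ATTR;
        struct crypto_aes_ctx aes_key_expanded AESNI_ALIGN_ATTR;
};

/* Per-request data: lives on the stack, so keep it at its original size. */
struct gcm_context_data {
        /* ... leading fields unchanged ... */
        u8 current_counter[GCM_BLOCK_LEN];
        u64 partial_block_len;
        u64 unused;
        u8 hash_keys[GCM_BLOCK_LEN * 16];       /* back to 16 blocks */
};

Since aesni_gcm_init_avx_512() already takes a hash_subkey pointer, handing it
the per-TFM buffer directly looks plausible, but that is a guess based on the
prototype alone.
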
>
>
>>          memzero_explicit(&ctx, sizeof(ctx));
>>          return 0;
>> @@ -1114,7 +1162,11 @@ static int __init aesni_init(void)
>>          if (!x86_match_cpu(aesni_cpu_id))
>>                  return -ENODEV;
>>   #ifdef CONFIG_X86_64
>> -       if (boot_cpu_has(X86_FEATURE_AVX2)) {
>> +       if (use_avx512 && IS_ENABLED(CONFIG_CRYPTO_AES_GCM_AVX512) &&
>> +           cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ)) {
>> +               pr_info("AVX512 version of gcm_enc/dec engaged.\n");
>> +               aesni_gcm_tfm = &aesni_gcm_tfm_avx_512;
> This was changed in the cryptodev tree to use static keys.

yep, will make the necessary changes.


-Megha
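
For readers following along: the cryptodev change referred to here replaces
the aesni_gcm_tfm function-pointer dispatch with static branches. Below is a
minimal sketch of how the AVX512 selection might slot into that scheme; the
gcm_use_avx512 key and the surrounding fragment are assumptions, not code
that exists anywhere yet.

#include <linux/jump_label.h>

static DEFINE_STATIC_KEY_FALSE(gcm_use_avx512);

/* in aesni_init(), instead of assigning aesni_gcm_tfm: */
        if (IS_ENABLED(CONFIG_CRYPTO_AES_GCM_AVX512) && use_avx512 &&
            cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ)) {
                pr_info("AVX512 version of gcm_enc/dec engaged.\n");
                static_branch_enable(&gcm_use_avx512);
        }

/* and on the GCM encrypt fast path, with ctx/data/out/in/len named as in
 * the existing gcmaes glue code: */
        if (static_branch_likely(&gcm_use_avx512))
                aesni_gcm_enc_update_avx_512(ctx, data, out, in, len);
        else
                aesni_gcm_enc_update_avx_gen4(ctx, data, out, in, len);

The same pattern would make the aesni_gcm_tfm_avx_512 structure added in this
patch unnecessary.
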

>
>> +       } else if (boot_cpu_has(X86_FEATURE_AVX2)) {
>>                  pr_info("AVX2 version of gcm_enc/dec engaged.\n");
>>                  aesni_gcm_tfm = &aesni_gcm_tfm_avx_gen4;
>>          } else if (boot_cpu_has(X86_FEATURE_AVX)) {
>> diff --git a/crypto/Kconfig b/crypto/Kconfig
>> index 3043849..8c8a68d 100644
>> --- a/crypto/Kconfig
>> +++ b/crypto/Kconfig
>> @@ -661,6 +661,18 @@ config CRYPTO_AES_CTR_AVX512
>>          depends on CRYPTO_AES_NI_INTEL
>>          depends on AS_VAES_AVX512
>>
>> +# We default CRYPTO_AES_GCM_AVX512 to Y but depend on CRYPTO_AVX512 in
>> +# order to have a singular option (CRYPTO_AVX512) select multiple algorithms
>> +# when supported. Specifically, if the platform and/or toolset does not
>> +# support VPCLMULQDQ, then this algorithm should not be enabled as part of
>> +# the set that CRYPTO_AVX512 selects.
>> +config CRYPTO_AES_GCM_AVX512
>> +       bool
>> +       default y
>> +       depends on CRYPTO_AVX512
>> +       depends on CRYPTO_AES_NI_INTEL
>> +       depends on AS_VPCLMULQDQ
>> +
>>   config CRYPTO_CRC32C_SPARC64
>>          tristate "CRC32c CRC algorithm (SPARC64)"
>>          depends on SPARC64
>> --
>> 2.7.4
>>

Thread overview: 28+ messages
2020-12-18 21:10 [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms Megha Dey
2020-12-18 21:10 ` [RFC V1 1/7] x86: Probe assembler capabilities for VAES and VPLCMULQDQ support Megha Dey
2021-01-16 16:54   ` Ard Biesheuvel
2021-01-20 22:38     ` Dey, Megha
2020-12-18 21:10 ` [RFC V1 2/7] crypto: crct10dif - Accelerated CRC T10 DIF with vectorized instruction Megha Dey
2021-01-16 17:00   ` Ard Biesheuvel
2021-01-20 22:46     ` Dey, Megha
2020-12-18 21:11 ` [RFC V1 3/7] crypto: ghash - Optimized GHASH computations Megha Dey
2020-12-19 17:03   ` Ard Biesheuvel
2021-01-16  0:14     ` Dey, Megha
2021-01-16  0:20       ` Dave Hansen
2021-01-16  2:04         ` Eric Biggers
2021-01-16  5:13           ` Dave Hansen
2021-01-16 16:48             ` Ard Biesheuvel
2021-01-16  1:43       ` Eric Biggers
2021-01-16  5:07         ` Dey, Megha
2020-12-18 21:11 ` [RFC V1 4/7] crypto: tcrypt - Add speed test for optimized " Megha Dey
2020-12-18 21:11 ` [RFC V1 5/7] crypto: aesni - AES CTR x86_64 "by16" AVX512 optimization Megha Dey
2021-01-16 17:03   ` Ard Biesheuvel
2021-01-20 22:46     ` Dey, Megha
2020-12-18 21:11 ` [RFC V1 6/7] crypto: aesni - fix coding style for if/else block Megha Dey
2020-12-18 21:11 ` [RFC V1 7/7] crypto: aesni - AVX512 version of AESNI-GCM using VPCLMULQDQ Megha Dey
2021-01-16 17:16   ` Ard Biesheuvel
2021-01-20 22:48     ` Dey, Megha
2020-12-21 23:20 ` [RFC V1 0/7] Introduce AVX512 optimized crypto algorithms Eric Biggers
2020-12-28 19:10   ` Dey, Megha
2021-01-16 16:52     ` Ard Biesheuvel
2021-01-16 18:35       ` Dey, Megha
