* [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs
@ 2024-03-26  8:02 Eric Biggers
  2024-03-26  8:02 ` [PATCH 1/6] x86: add kconfig symbols for assembler VAES and VPCLMULQDQ support Eric Biggers
                   ` (7 more replies)
  0 siblings, 8 replies; 19+ messages in thread
From: Eric Biggers @ 2024-03-26  8:02 UTC (permalink / raw)
  To: linux-crypto, x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Chang S . Bae

This patchset adds new AES-XTS implementations that accelerate disk and
file encryption on modern x86_64 CPUs.

The largest improvements are seen on CPUs that support the VAES
extension: Intel Ice Lake (2019) and later, and AMD Zen 3 (2020) and
later.  However, an implementation using plain AESNI + AVX is also
added; it provides a small boost on older CPUs as well.

To try to handle the mess that is x86 SIMD, the code for all the new
AES-XTS implementations is generated from a single assembly macro.  This
avoids, for example, having to maintain entirely separate source code
for each vector length (xmm, ymm, zmm).

To avoid downclocking effects, zmm registers aren't used on certain
Intel CPU models such as Ice Lake.  These CPU models default to an
implementation using ymm registers instead.

This patchset increases the throughput of AES-256-XTS decryption by the
following amounts on the following CPUs:
                            
                          | 4096-byte messages | 512-byte messages |
    ----------------------+--------------------+-------------------+
    Intel Skylake         |        1%          |       11%         |
    Intel Ice Lake        |        92%         |       59%         |
    Intel Sapphire Rapids |       115%         |       78%         |
    AMD Zen 1             |        25%         |       20%         |
    AMD Zen 2             |        26%         |       20%         |
    AMD Zen 3             |        82%         |       40%         |
    AMD Zen 4             |       118%         |       48%         |

(The results for encryption are very similar to decryption.  I just tend
to measure decryption because decryption performance is more important.)

There's no separate kconfig option for the new AES-XTS implementations,
as they are included in the existing option CONFIG_CRYPTO_AES_NI_INTEL.

To make testing easier, all four new AES-XTS implementations are
registered separately with the crypto API.  They are prioritized
appropriately so that the best one for the CPU is used by default.
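
As an aside, here is an illustrative sketch (not part of this patchset;
the function name is made up) of how in-kernel users of the generic
"xts(aes)" transform automatically end up with whichever of these
implementations has the highest priority on the running CPU:

  #include <crypto/skcipher.h>
  #include <linux/err.h>
  #include <linux/printk.h>

  static void xts_print_default_driver(void)
  {
          struct crypto_skcipher *tfm;

          tfm = crypto_alloc_skcipher("xts(aes)", 0, 0);
          if (IS_ERR(tfm))
                  return;

          /* e.g. "xts-aes-vaes-avx10_512" on CPUs where that variant wins */
          pr_info("xts(aes) resolved to %s\n",
                  crypto_tfm_alg_driver_name(crypto_skcipher_tfm(tfm)));
          crypto_free_skcipher(tfm);
  }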

Open questions:

- Is the policy that I implemented for preferring ymm registers to zmm
  registers the right one?  arch/x86/crypto/poly1305_glue.c thinks that
  only Skylake has the bad downclocking.  My current proposal is a bit
  more conservative; it also excludes Ice Lake and Tiger Lake.  Those
  CPUs supposedly still have some downclocking, though not as much.

- Should the policy on the use of zmm registers be in a centralized
  place?  It probably doesn't make sense to have random different
  policies for different crypto algorithms (AES, Poly1305, ARIA, etc.).

- Are there any other known issues with using AVX512 in kernel mode?  It
  seems to work, and technically it's not new because Poly1305 and ARIA
  already use AVX512, including the mask registers and zmm registers up
  to 31.  So if there was a major issue, like the new registers not
  being properly saved and restored, it probably would have already been
  found.  But AES-XTS support would introduce a wider use of it.

Eric Biggers (6):
  x86: add kconfig symbols for assembler VAES and VPCLMULQDQ support
  crypto: x86/aes-xts - add AES-XTS assembly macro for modern CPUs
  crypto: x86/aes-xts - wire up AESNI + AVX implementation
  crypto: x86/aes-xts - wire up VAES + AVX2 implementation
  crypto: x86/aes-xts - wire up VAES + AVX10/256 implementation
  crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation

 arch/x86/Kconfig.assembler           |  10 +
 arch/x86/crypto/Makefile             |   3 +-
 arch/x86/crypto/aes-xts-avx-x86_64.S | 796 +++++++++++++++++++++++++++
 arch/x86/crypto/aesni-intel_glue.c   | 263 ++++++++-
 4 files changed, 1070 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/crypto/aes-xts-avx-x86_64.S


base-commit: 4cece764965020c22cff7665b18a012006359095
-- 
2.44.0



* [PATCH 1/6] x86: add kconfig symbols for assembler VAES and VPCLMULQDQ support
  2024-03-26  8:02 [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs Eric Biggers
@ 2024-03-26  8:02 ` Eric Biggers
  2024-03-26  8:10   ` Ingo Molnar
  2024-03-26  8:03 ` [PATCH 2/6] crypto: x86/aes-xts - add AES-XTS assembly macro for modern CPUs Eric Biggers
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: Eric Biggers @ 2024-03-26  8:02 UTC (permalink / raw)
  To: linux-crypto, x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Chang S . Bae

From: Eric Biggers <ebiggers@google.com>

Add config symbols AS_VAES and AS_VPCLMULQDQ that expose whether the
assembler supports the vector AES and carryless multiplication
cryptographic extensions.
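
For context, later patches gate both the assembly and the C glue on
these symbols, so kernels built with an assembler that lacks VAES or
VPCLMULQDQ support simply omit those code paths.  For example, patch 4
wraps its algorithm definition in:

  #if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ)
  DEFINE_XTS_ALG(vaes_avx2, "xts-aes-vaes-avx2", 600);
  #endif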

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/x86/Kconfig.assembler | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
index 8ad41da301e5..59aedf32c4ea 100644
--- a/arch/x86/Kconfig.assembler
+++ b/arch/x86/Kconfig.assembler
@@ -23,9 +23,19 @@ config AS_TPAUSE
 config AS_GFNI
 	def_bool $(as-instr,vgf2p8mulb %xmm0$(comma)%xmm1$(comma)%xmm2)
 	help
 	  Supported by binutils >= 2.30 and LLVM integrated assembler
 
+config AS_VAES
+	def_bool $(as-instr,vaesenc %ymm0$(comma)%ymm1$(comma)%ymm2)
+	help
+	  Supported by binutils >= 2.30 and LLVM integrated assembler
+
+config AS_VPCLMULQDQ
+	def_bool $(as-instr,vpclmulqdq \$0x10$(comma)%ymm0$(comma)%ymm1$(comma)%ymm2)
+	help
+	  Supported by binutils >= 2.30 and LLVM integrated assembler
+
 config AS_WRUSS
 	def_bool $(as-instr,wrussq %rax$(comma)(%rbx))
 	help
 	  Supported by binutils >= 2.31 and LLVM integrated assembler
-- 
2.44.0



* [PATCH 2/6] crypto: x86/aes-xts - add AES-XTS assembly macro for modern CPUs
  2024-03-26  8:02 [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs Eric Biggers
  2024-03-26  8:02 ` [PATCH 1/6] x86: add kconfig symbols for assembler VAES and VPCLMULQDQ support Eric Biggers
@ 2024-03-26  8:03 ` Eric Biggers
  2024-03-26  8:03 ` [PATCH 3/6] crypto: x86/aes-xts - wire up AESNI + AVX implementation Eric Biggers
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Eric Biggers @ 2024-03-26  8:03 UTC (permalink / raw)
  To: linux-crypto, x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Chang S . Bae

From: Eric Biggers <ebiggers@google.com>

Add an assembly file aes-xts-avx-x86_64.S which contains a macro that
expands into AES-XTS implementations for x86_64 CPUs that support at
least AES-NI and AVX, optionally also taking advantage of VAES,
VPCLMULQDQ, and AVX512 or AVX10.

This patch doesn't expand the macro at all.  Later patches will do so,
adding each implementation individually so that the motivation and use
case for each individual implementation can be fully presented.
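
As background for reviewers, the per-block tweak update that the
_next_tweak macro in this file performs is multiplication of the 128-bit
tweak by x in GF(2^128), reducing by x^7 + x^2 + x + 1.  A C model of
that step, illustrative only (the type and function names are made up,
and the assembly uses a shuffle/shift trick rather than a branch):

  #include <stdint.h>

  struct xts_tweak {
          uint64_t lo;    /* first 8 bytes of the tweak (little-endian) */
          uint64_t hi;    /* last 8 bytes of the tweak */
  };

  static void next_tweak(struct xts_tweak *t)
  {
          uint64_t carry = t->hi >> 63;   /* bit carried out of the top */

          t->hi = (t->hi << 1) | (t->lo >> 63);
          t->lo <<= 1;
          if (carry)
                  t->lo ^= 0x87;          /* x^7 + x^2 + x + 1 */
  }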

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/x86/crypto/Makefile             |   3 +-
 arch/x86/crypto/aes-xts-avx-x86_64.S | 758 +++++++++++++++++++++++++++
 2 files changed, 760 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/crypto/aes-xts-avx-x86_64.S

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 9aa46093c91b..9c5ce5613738 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -46,11 +46,12 @@ obj-$(CONFIG_CRYPTO_CHACHA20_X86_64) += chacha-x86_64.o
 chacha-x86_64-y := chacha-avx2-x86_64.o chacha-ssse3-x86_64.o chacha_glue.o
 chacha-x86_64-$(CONFIG_AS_AVX512) += chacha-avx512vl-x86_64.o
 
 obj-$(CONFIG_CRYPTO_AES_NI_INTEL) += aesni-intel.o
 aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o
-aesni-intel-$(CONFIG_64BIT) += aesni-intel_avx-x86_64.o aes_ctrby8_avx-x86_64.o
+aesni-intel-$(CONFIG_64BIT) += aesni-intel_avx-x86_64.o \
+	aes_ctrby8_avx-x86_64.o aes-xts-avx-x86_64.o
 
 obj-$(CONFIG_CRYPTO_SHA1_SSSE3) += sha1-ssse3.o
 sha1-ssse3-y := sha1_avx2_x86_64_asm.o sha1_ssse3_asm.o sha1_ssse3_glue.o
 sha1-ssse3-$(CONFIG_AS_SHA1_NI) += sha1_ni_asm.o
 
diff --git a/arch/x86/crypto/aes-xts-avx-x86_64.S b/arch/x86/crypto/aes-xts-avx-x86_64.S
new file mode 100644
index 000000000000..92f1580e1eb0
--- /dev/null
+++ b/arch/x86/crypto/aes-xts-avx-x86_64.S
@@ -0,0 +1,758 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * AES-XTS for modern x86_64 CPUs
+ *
+ * Copyright 2024 Google LLC
+ *
+ * Author: Eric Biggers <ebiggers@google.com>
+ */
+
+/*
+ * This file implements AES-XTS for modern x86_64 CPUs.  To handle the
+ * complexities of coding for x86 SIMD, e.g. where every vector length needs
+ * different code, it uses a macro to generate several implementations that
+ * share similar source code but are targeted at different CPUs, listed below:
+ *
+ * AES-NI + AVX
+ *    - 128-bit vectors (1 AES block per vector)
+ *    - VEX-coded instructions
+ *    - xmm0-xmm15
+ *    - This is for older CPUs that lack VAES but do have AVX.
+ *
+ * VAES + VPCLMULQDQ + AVX2
+ *    - 256-bit vectors (2 AES blocks per vector)
+ *    - VEX-coded instructions
+ *    - ymm0-ymm15
+ *    - This is for CPUs that have VAES but lack AVX512 or AVX10,
+ *      e.g. Intel's Alder Lake and AMD's Zen 3.
+ *
+ * VAES + VPCLMULQDQ + AVX10/256 + BMI2
+ *    - 256-bit vectors (2 AES blocks per vector)
+ *    - EVEX-coded instructions
+ *    - ymm0-ymm31
+ *    - This is for CPUs that have AVX512 but where using zmm registers causes
+ *      downclocking, and for CPUs that have AVX10/256 but not AVX10/512.
+ *    - By "AVX10/256" we really mean (AVX512BW + AVX512VL) || AVX10/256.
+ *      To avoid confusion with 512-bit, we just write AVX10/256.
+ *
+ * VAES + VPCLMULQDQ + AVX10/512 + BMI2
+ *    - Same as the previous one, but upgrades to 512-bit vectors
+ *      (4 AES blocks per vector) in zmm0-zmm31.
+ *    - This is for CPUs that have good AVX512 or AVX10/512 support.
+ *
+ * This file doesn't have an implementation for AES-NI alone (without AVX), as
+ * the lack of VEX would make all the assembly code different.
+ *
+ * When we use VAES, we also use VPCLMULQDQ to parallelize the computation of
+ * the XTS tweaks.  This avoids a bottleneck.  Currently there don't seem to be
+ * any CPUs that support VAES but not VPCLMULQDQ.  If that changes, we might
+ * need to start also providing an implementation using VAES alone.
+ *
+ * The AES-XTS implementations in this file support everything required by the
+ * crypto API, including support for arbitrary input lengths and multi-part
+ * processing.  However, they are most heavily optimized for the common case of
+ * power-of-2 length inputs that are processed in a single part (disk sectors).
+ */
+
+#include <linux/linkage.h>
+#include <linux/cfi_types.h>
+
+.section .rodata
+.p2align 4
+.Lgf_poly:
+	// The low 64 bits of this value represent the polynomial x^7 + x^2 + x
+	// + 1.  It is the value that must be XOR'd into the low 64 bits of the
+	// tweak each time a 1 is carried out of the high 64 bits.
+	//
+	// The high 64 bits of this value is just the internal carry bit that
+	// exists when there's a carry out of the low 64 bits of the tweak.
+	.quad	0x87, 1
+
+	// This table contains constants for vpshufb and vpblendvb, used to
+	// handle variable byte shifts and blending during ciphertext stealing
+	// on CPUs that don't support AVX10-style masking.
+.Lcts_permute_table:
+	.byte	0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80
+	.byte	0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80
+	.byte	0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07
+	.byte	0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
+	.byte	0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80
+	.byte	0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80
+.text
+
+// Function parameters
+.set	KEY,		%rdi	// Initially points to aesni_xts_ctx, then is
+				// advanced to point directly to the round keys
+.set	SRC,		%rsi	// Pointer to next source data
+.set	DST,		%rdx	// Pointer to next destination data
+.set	LEN,		%rcx	// Remaining length in bytes
+.set	IV,		%r8	// Pointer to IV
+.set	FLAGS,		%r9d	// XTS_* flags
+
+// Flags for the 'int flags' parameter.  Keep in sync with C file.
+#define XTS_FIRST	0x1
+#define XTS_UPDATE_IV	0x2
+
+// r10d holds the AES key length in bytes.
+.set	KEYLEN,		%r10d
+
+// %rax and %r11 are available as temporaries.
+
+// Move a vector between memory and a register.
+.macro	_vmovdqu	src, dst
+.if VL < 64
+	vmovdqu		\src, \dst
+.else
+	vmovdqu8	\src, \dst
+.endif
+.endm
+
+// Broadcast a 128-bit value into a vector.
+.macro	_vbroadcast128	src, dst
+.if VL == 16 && !USE_AVX10
+	vmovdqu		\src, \dst
+.elseif VL == 32 && !USE_AVX10
+	vbroadcasti128	\src, \dst
+.else
+	vbroadcasti32x4	\src, \dst
+.endif
+.endm
+
+// XOR two vectors together.
+.macro	_vpxor	src1, src2, dst
+.if USE_AVX10
+	vpxord		\src1, \src2, \dst
+.else
+	vpxor		\src1, \src2, \dst
+.endif
+.endm
+
+// XOR three vectors together.
+.macro	_xor3	src1, src2, src3_and_dst
+.if USE_AVX10
+	// vpternlogd with immediate 0x96 is a three-argument XOR.
+	vpternlogd	$0x96, \src1, \src2, \src3_and_dst
+.else
+	vpxor		\src1, \src3_and_dst, \src3_and_dst
+	vpxor		\src2, \src3_and_dst, \src3_and_dst
+.endif
+.endm
+
+.macro	_define_Vi	i
+.if VL == 16
+	.set	V\i,		%xmm\i
+.elseif VL == 32
+	.set	V\i,		%ymm\i
+.elseif VL == 64
+	.set	V\i,		%zmm\i
+.else
+	.error "Unsupported Vector Length (VL)"
+.endif
+.endm
+
+.macro _define_aliases
+	// Define register aliases V0-V15, or V0-V31 if all 32 SIMD registers
+	// are available, that map to the xmm, ymm, or zmm registers according
+	// to the selected Vector Length (VL).
+	_define_Vi	0
+	_define_Vi	1
+	_define_Vi	2
+	_define_Vi	3
+	_define_Vi	4
+	_define_Vi	5
+	_define_Vi	6
+	_define_Vi	7
+	_define_Vi	8
+	_define_Vi	9
+	_define_Vi	10
+	_define_Vi	11
+	_define_Vi	12
+	_define_Vi	13
+	_define_Vi	14
+	_define_Vi	15
+.if USE_AVX10
+	_define_Vi	16
+	_define_Vi	17
+	_define_Vi	18
+	_define_Vi	19
+	_define_Vi	20
+	_define_Vi	21
+	_define_Vi	22
+	_define_Vi	23
+	_define_Vi	24
+	_define_Vi	25
+	_define_Vi	26
+	_define_Vi	27
+	_define_Vi	28
+	_define_Vi	29
+	_define_Vi	30
+	_define_Vi	31
+.endif
+
+	// V0-V7 hold temporary values.
+
+	// V8-V11 hold XTS tweaks.  Each 128-bit lane holds one tweak.
+	.set	TWEAK0_XMM,	%xmm8
+	.set	TWEAK0,		V8
+	.set	TWEAK1_XMM,	%xmm9
+	.set	TWEAK1,		V9
+	.set	TWEAK2,		V10
+	.set	TWEAK3,		V11
+
+	// V12-V14 hold the first 3 AES round keys, copied to all 128-bit lanes.
+	.set	KEY0_XMM,	%xmm12
+	.set	KEY0,		V12
+	.set	KEY1_XMM,	%xmm13
+	.set	KEY1,		V13
+	.set	KEY2_XMM,	%xmm14
+	.set	KEY2,		V14
+
+	// V15 holds the constant from .Lgf_poly, copied to all 128-bit lanes.
+	.set	GF_POLY_XMM,	%xmm15
+	.set	GF_POLY,	V15
+
+	// If 32 SIMD registers are available, then V16-V27 hold the remaining
+	// AES round keys, copied to all 128-bit lanes.
+.if USE_AVX10
+	.set	KEY3_XMM,	%xmm16
+	.set	KEY3,		V16
+	.set	KEY4_XMM,	%xmm17
+	.set	KEY4,		V17
+	.set	KEY5_XMM,	%xmm18
+	.set	KEY5,		V18
+	.set	KEY6_XMM,	%xmm19
+	.set	KEY6,		V19
+	.set	KEY7_XMM,	%xmm20
+	.set	KEY7,		V20
+	.set	KEY8_XMM,	%xmm21
+	.set	KEY8,		V21
+	.set	KEY9_XMM,	%xmm22
+	.set	KEY9,		V22
+	.set	KEY10_XMM,	%xmm23
+	.set	KEY10,		V23
+	.set	KEY11_XMM,	%xmm24
+	.set	KEY11,		V24
+	.set	KEY12_XMM,	%xmm25
+	.set	KEY12,		V25
+	.set	KEY13_XMM,	%xmm26
+	.set	KEY13,		V26
+	.set	KEY14_XMM,	%xmm27
+	.set	KEY14,		V27
+.endif
+	// V28-V31 are currently unused.
+.endm
+
+// Do a single round of AES encryption (if \enc==1) or decryption (if \enc==0)
+// on the block(s) in \data using the round key(s) in \key.  The register length
+// determines the number of AES blocks en/decrypted.
+.macro	_vaes	enc, last, key, data
+.if \enc
+.if \last
+	vaesenclast	\key, \data, \data
+.else
+	vaesenc		\key, \data, \data
+.endif
+.else
+.if \last
+	vaesdeclast	\key, \data, \data
+.else
+	vaesdec		\key, \data, \data
+.endif
+.endif
+.endm
+
+// Do a single round of AES en/decryption on the block(s) in \data, using the
+// same key for all block(s).  The round key is loaded from the appropriate
+// register or memory location for round \i.  May clobber V4.
+.macro _vaes_1x		enc, last, i, xmm_suffix, data
+.if \i < NR_CACHED_ROUND_KEYS
+	_vaes		\enc, \last, KEY\i\xmm_suffix, \data
+.else
+.ifnb \xmm_suffix
+	_vaes		\enc, \last, \i*16(KEY), \data
+.else
+	_vbroadcast128	\i*16(KEY), V4
+	_vaes		\enc, \last, V4, \data
+.endif
+.endif
+.endm
+
+// Do a single round of AES en/decryption on the blocks in registers V0-V3,
+// using the same key for all blocks.  The round key is loaded from the
+// appropriate register or memory location for round \i.  May clobber V4.
+.macro	_vaes_4x	enc, last, i
+.if \i < NR_CACHED_ROUND_KEYS
+	_vaes		\enc, \last, KEY\i, V0
+	_vaes		\enc, \last, KEY\i, V1
+	_vaes		\enc, \last, KEY\i, V2
+	_vaes		\enc, \last, KEY\i, V3
+.else
+	_vbroadcast128	\i*16(KEY), V4
+	_vaes		\enc, \last, V4, V0
+	_vaes		\enc, \last, V4, V1
+	_vaes		\enc, \last, V4, V2
+	_vaes		\enc, \last, V4, V3
+.endif
+.endm
+
+// Do tweaked AES en/decryption (i.e., XOR with \tweak, then AES en/decrypt,
+// then XOR with \tweak again) of the block(s) in \data.  To process a single
+// block, use xmm registers and set \xmm_suffix=_XMM.  To process a vector of
+// length VL, use V* registers and leave \xmm_suffix empty.  May clobber V4.
+.macro	_aes_crypt	enc, xmm_suffix, tweak, data
+	_xor3		KEY0\xmm_suffix, \tweak, \data
+	_vaes_1x	\enc, 0, 1, \xmm_suffix, \data
+	_vaes_1x	\enc, 0, 2, \xmm_suffix, \data
+	_vaes_1x	\enc, 0, 3, \xmm_suffix, \data
+	_vaes_1x	\enc, 0, 4, \xmm_suffix, \data
+	_vaes_1x	\enc, 0, 5, \xmm_suffix, \data
+	_vaes_1x	\enc, 0, 6, \xmm_suffix, \data
+	_vaes_1x	\enc, 0, 7, \xmm_suffix, \data
+	_vaes_1x	\enc, 0, 8, \xmm_suffix, \data
+	_vaes_1x	\enc, 0, 9, \xmm_suffix, \data
+	cmp		$24, KEYLEN
+	jle		.Laes_128_or_192\@
+	_vaes_1x	\enc, 0, 10, \xmm_suffix, \data
+	_vaes_1x	\enc, 0, 11, \xmm_suffix, \data
+	_vaes_1x	\enc, 0, 12, \xmm_suffix, \data
+	_vaes_1x	\enc, 0, 13, \xmm_suffix, \data
+	_vaes_1x	\enc, 1, 14, \xmm_suffix, \data
+	jmp		.Laes_done\@
+.Laes_128_or_192\@:
+	je		.Laes_192\@
+	_vaes_1x	\enc, 1, 10, \xmm_suffix, \data
+	jmp		.Laes_done\@
+.Laes_192\@:
+	_vaes_1x	\enc, 0, 10, \xmm_suffix, \data
+	_vaes_1x	\enc, 0, 11, \xmm_suffix, \data
+	_vaes_1x	\enc, 1, 12, \xmm_suffix, \data
+.Laes_done\@:
+	_vpxor		\tweak, \data, \data
+.endm
+
+// Load the round keys: just the first 3 if !USE_AVX10, otherwise all of them.
+.macro	_load_round_keys
+	_vbroadcast128	0*16(KEY), KEY0
+	_vbroadcast128	1*16(KEY), KEY1
+	_vbroadcast128	2*16(KEY), KEY2
+	.set	NR_CACHED_ROUND_KEYS,	3
+.if USE_AVX10
+	_vbroadcast128	3*16(KEY), KEY3
+	_vbroadcast128	4*16(KEY), KEY4
+	_vbroadcast128	5*16(KEY), KEY5
+	_vbroadcast128	6*16(KEY), KEY6
+	_vbroadcast128	7*16(KEY), KEY7
+	_vbroadcast128	8*16(KEY), KEY8
+	_vbroadcast128	9*16(KEY), KEY9
+	_vbroadcast128	10*16(KEY), KEY10
+	// Note: if it's AES-128 or AES-192, the last several round keys won't
+	// be used.  We do the loads anyway to save a conditional jump.
+	_vbroadcast128	11*16(KEY), KEY11
+	_vbroadcast128	12*16(KEY), KEY12
+	_vbroadcast128	13*16(KEY), KEY13
+	_vbroadcast128	14*16(KEY), KEY14
+	.set	NR_CACHED_ROUND_KEYS,	15
+.endif
+.endm
+
+// Given a 128-bit XTS tweak in the xmm register \src, compute the next tweak
+// (by multiplying by the polynomial 'x') and write it to \dst.
+.macro	_next_tweak	src, tmp, dst
+	vpshufd		$0x13, \src, \tmp
+	vpaddq		\src, \src, \dst
+	vpsrad		$31, \tmp, \tmp
+	vpand		GF_POLY_XMM, \tmp, \tmp
+	vpxor		\tmp, \dst, \dst
+.endm
+
+// Given the XTS tweak(s) in the vector \src, compute the next vector of
+// tweak(s) (by multiplying by the polynomial 'x^(VL/16)') and write it to \dst.
+//
+// If VL > 16, then there are multiple tweaks, and we use vpclmulqdq to compute
+// all tweaks in the vector in parallel.  If VL=16, we just do the regular
+// computation without vpclmulqdq, as it's the faster method for a single tweak.
+.macro	_next_tweakvec	src, tmp1, tmp2, dst
+.if VL == 16
+	_next_tweak	\src, \tmp1, \dst
+.else
+	vpsrlq		$64 - VL/16, \src, \tmp1
+	vpclmulqdq	$0x01, GF_POLY, \tmp1, \tmp2
+	vpslldq		$8, \tmp1, \tmp1
+	vpsllq		$VL/16, \src, \dst
+	_xor3		\tmp1, \tmp2, \dst
+.endif
+.endm
+
+// Given an XTS tweak in TWEAK0_XMM, compute the following tweaks and store them
+// in the vector registers TWEAK0-TWEAK3.  Clobbers V0-V7.
+.macro	_compute_first_set_of_tweaks
+.if VL == 16
+	// With VL=16, multiplying by x serially is fastest.
+	_next_tweak	TWEAK0, %xmm0, TWEAK1
+	_next_tweak	TWEAK1, %xmm0, TWEAK2
+	_next_tweak	TWEAK2, %xmm0, TWEAK3
+.else
+.if VL == 32
+	// Compute the second block of TWEAK0.
+	_next_tweak	TWEAK0_XMM, %xmm0, %xmm1
+	vinserti128	$1, %xmm1, TWEAK0, TWEAK0
+.elseif VL == 64
+	// Compute the remaining blocks of TWEAK0.
+	_next_tweak	TWEAK0_XMM, %xmm0, %xmm1
+	_next_tweak	%xmm1, %xmm0, %xmm2
+	_next_tweak	%xmm2, %xmm0, %xmm3
+	vinserti32x4	$1, %xmm1, TWEAK0, TWEAK0
+	vinserti32x4	$2, %xmm2, TWEAK0, TWEAK0
+	vinserti32x4	$3, %xmm3, TWEAK0, TWEAK0
+.endif
+	// Compute TWEAK[1-3] from TWEAK0.
+	vpsrlq		$64 - 1*VL/16, TWEAK0, V0
+	vpsrlq		$64 - 2*VL/16, TWEAK0, V2
+	vpsrlq		$64 - 3*VL/16, TWEAK0, V4
+	vpclmulqdq	$0x01, GF_POLY, V0, V1
+	vpclmulqdq	$0x01, GF_POLY, V2, V3
+	vpclmulqdq	$0x01, GF_POLY, V4, V5
+	vpslldq		$8, V0, V0
+	vpslldq		$8, V2, V2
+	vpslldq		$8, V4, V4
+	vpsllq		$1*VL/16, TWEAK0, TWEAK1
+	vpsllq		$2*VL/16, TWEAK0, TWEAK2
+	vpsllq		$3*VL/16, TWEAK0, TWEAK3
+.if USE_AVX10
+	vpternlogd	$0x96, V0, V1, TWEAK1
+	vpternlogd	$0x96, V2, V3, TWEAK2
+	vpternlogd	$0x96, V4, V5, TWEAK3
+.else
+	vpxor		V0, TWEAK1, TWEAK1
+	vpxor		V2, TWEAK2, TWEAK2
+	vpxor		V4, TWEAK3, TWEAK3
+	vpxor		V1, TWEAK1, TWEAK1
+	vpxor		V3, TWEAK2, TWEAK2
+	vpxor		V5, TWEAK3, TWEAK3
+.endif
+.endif
+.endm
+
+// Advance the set of XTS tweaks in TWEAK0-TWEAK3 to the next set.
+.macro	_compute_next_set_of_tweaks
+.if VL == 16
+	// With VL=16, multiplying by x serially is fastest.
+	_next_tweak	TWEAK3, %xmm0, TWEAK0
+	_next_tweak	TWEAK0, %xmm0, TWEAK1
+	_next_tweak	TWEAK1, %xmm0, TWEAK2
+	_next_tweak	TWEAK2, %xmm0, TWEAK3
+.else
+	// Multiply each tweak by x^(4*VL/16) in parallel.
+	vpsrlq		$64 - 4*VL/16, TWEAK0, V0
+	vpsrlq		$64 - 4*VL/16, TWEAK1, V1
+	vpsrlq		$64 - 4*VL/16, TWEAK2, V2
+	vpsrlq		$64 - 4*VL/16, TWEAK3, V3
+	vpclmulqdq	$0x01, GF_POLY, V0, V4
+	vpclmulqdq	$0x01, GF_POLY, V1, V5
+	vpclmulqdq	$0x01, GF_POLY, V2, V6
+	vpclmulqdq	$0x01, GF_POLY, V3, V7
+	vpslldq		$8, V0, V0
+	vpslldq		$8, V1, V1
+	vpslldq		$8, V2, V2
+	vpslldq		$8, V3, V3
+	vpsllq		$4*VL/16, TWEAK0, TWEAK0
+	vpsllq		$4*VL/16, TWEAK1, TWEAK1
+	vpsllq		$4*VL/16, TWEAK2, TWEAK2
+	vpsllq		$4*VL/16, TWEAK3, TWEAK3
+.if USE_AVX10
+	vpternlogd	$0x96, V0, V4, TWEAK0
+	vpternlogd	$0x96, V1, V5, TWEAK1
+	vpternlogd	$0x96, V2, V6, TWEAK2
+	vpternlogd	$0x96, V3, V7, TWEAK3
+.else
+	vpxor		V0, TWEAK0, TWEAK0
+	vpxor		V1, TWEAK1, TWEAK1
+	vpxor		V2, TWEAK2, TWEAK2
+	vpxor		V3, TWEAK3, TWEAK3
+	vpxor		V4, TWEAK0, TWEAK0
+	vpxor		V5, TWEAK1, TWEAK1
+	vpxor		V6, TWEAK2, TWEAK2
+	vpxor		V7, TWEAK3, TWEAK3
+.endif
+.endif
+.endm
+
+.macro	aes_xts_crypt	enc
+	_define_aliases
+
+	// Load the AES key length: 16 (AES-128), 24 (AES-192), or 32 (AES-256).
+	mov		480(KEY), KEYLEN
+
+	// Check whether the data length is a multiple of the AES block length.
+	test		$15, LEN
+	jnz		.Lneed_cts\@
+
+.Lxts_init\@:
+	// Load the IV into TWEAK0_XMM, and if (flags & XTS_FIRST) encrypt it
+	// with the tweak key to get the first tweak.  If !(flags & XTS_FIRST),
+	// then this is a continuation call and the IV was already encrypted.
+	vmovdqu		(IV), TWEAK0_XMM
+	test		$XTS_FIRST, FLAGS
+	jz		.Lencrypt_iv_done\@
+	vpxor		0*16(KEY), TWEAK0_XMM, TWEAK0_XMM
+	vaesenc		1*16(KEY), TWEAK0_XMM, TWEAK0_XMM
+	vaesenc		2*16(KEY), TWEAK0_XMM, TWEAK0_XMM
+	vaesenc		3*16(KEY), TWEAK0_XMM, TWEAK0_XMM
+	vaesenc		4*16(KEY), TWEAK0_XMM, TWEAK0_XMM
+	vaesenc		5*16(KEY), TWEAK0_XMM, TWEAK0_XMM
+	vaesenc		6*16(KEY), TWEAK0_XMM, TWEAK0_XMM
+	vaesenc		7*16(KEY), TWEAK0_XMM, TWEAK0_XMM
+	vaesenc		8*16(KEY), TWEAK0_XMM, TWEAK0_XMM
+	vaesenc		9*16(KEY), TWEAK0_XMM, TWEAK0_XMM
+	cmp		$24, KEYLEN
+	jle		.Lencrypt_iv_aes_128_or_192\@
+	vaesenc		10*16(KEY), TWEAK0_XMM, TWEAK0_XMM
+	vaesenc		11*16(KEY), TWEAK0_XMM, TWEAK0_XMM
+	vaesenc		12*16(KEY), TWEAK0_XMM, TWEAK0_XMM
+	vaesenc		13*16(KEY), TWEAK0_XMM, TWEAK0_XMM
+	vaesenclast	14*16(KEY), TWEAK0_XMM, TWEAK0_XMM
+.Lencrypt_iv_done\@:
+
+	// Advance KEY from tweak_ctx to crypt_ctx::key_enc (for encryption)
+	// or crypt_ctx::key_dec (for decryption).
+.if \enc
+	add		$496, KEY
+.else
+	add		$496 + 240, KEY
+.endif
+
+	// Load the gf_poly constant.
+	_vbroadcast128	.Lgf_poly(%rip), GF_POLY
+
+	// Compute the first set of tweaks TWEAK[0-3].
+	_compute_first_set_of_tweaks
+
+	// Cache as many round keys as possible.
+	_load_round_keys
+
+	sub		$4*VL, LEN
+	jl		.Lhandle_remainder_have_tweak0\@
+
+.Lmain_loop\@:
+	// This is the main loop, en/decrypting 4*VL bytes per iteration.
+
+	// Load the next set of source blocks.
+	_vmovdqu	0*VL(SRC), V0
+	_vmovdqu	1*VL(SRC), V1
+	_vmovdqu	2*VL(SRC), V2
+	_vmovdqu	3*VL(SRC), V3
+
+	// XOR each block with its tweak and the first round key.
+.if USE_AVX10
+	vpternlogd	$0x96, TWEAK0, KEY0, V0
+	vpternlogd	$0x96, TWEAK1, KEY0, V1
+	vpternlogd	$0x96, TWEAK2, KEY0, V2
+	vpternlogd	$0x96, TWEAK3, KEY0, V3
+.else
+	vpxor		TWEAK0, V0, V0
+	vpxor		TWEAK1, V1, V1
+	vpxor		TWEAK2, V2, V2
+	vpxor		TWEAK3, V3, V3
+	vpxor		KEY0, V0, V0
+	vpxor		KEY0, V1, V1
+	vpxor		KEY0, V2, V2
+	vpxor		KEY0, V3, V3
+.endif
+
+	// Do the AES rounds.
+	_vaes_4x	\enc, 0, 1
+	_vaes_4x	\enc, 0, 2
+	_vaes_4x	\enc, 0, 3
+	_vaes_4x	\enc, 0, 4
+	_vaes_4x	\enc, 0, 5
+	_vaes_4x	\enc, 0, 6
+	_vaes_4x	\enc, 0, 7
+	_vaes_4x	\enc, 0, 8
+	_vaes_4x	\enc, 0, 9
+	// Try to optimize for AES-256 by keeping the code for AES-128 and
+	// AES-192 out-of-line.
+	cmp		$24, KEYLEN
+	jle		.Lencrypt_4x_aes_128_or_192\@
+	_vaes_4x	\enc, 0, 10
+	_vaes_4x	\enc, 0, 11
+	_vaes_4x	\enc, 0, 12
+	_vaes_4x	\enc, 0, 13
+	_vaes_4x	\enc, 1, 14
+.Lencrypt_4x_done\@:
+
+	// XOR in the tweaks again.
+	_vpxor		TWEAK0, V0, V0
+	_vpxor		TWEAK1, V1, V1
+	_vpxor		TWEAK2, V2, V2
+	_vpxor		TWEAK3, V3, V3
+
+	// Store the destination blocks.
+	_vmovdqu	V0, 0*VL(DST)
+	_vmovdqu	V1, 1*VL(DST)
+	_vmovdqu	V2, 2*VL(DST)
+	_vmovdqu	V3, 3*VL(DST)
+
+	add		$4*VL, SRC
+	add		$4*VL, DST
+	sub		$4*VL, LEN
+	jl		.Lmain_loop_done\@
+
+	// Another iteration of the main loop is needed, so advance the tweaks.
+	_compute_next_set_of_tweaks
+
+	jmp		.Lmain_loop\@
+
+.Lmain_loop_done\@:
+	// Check for less common cases: the data length isn't a multiple of 4*VL
+	// and/or the caller needs the next tweak to be returned.  Optimize for
+	// the common case by falling through to the ret in that case.
+	test		$4*VL-1, LEN
+	jnz		.Lhandle_remainder\@
+	test		$XTS_UPDATE_IV, FLAGS
+	jnz		.Lhandle_remainder\@
+.Ldone\@:
+.if VL > 16
+	vzeroupper
+.endif
+	RET
+
+.Lhandle_remainder\@:
+	// Compute the next vector of tweaks and store it in TWEAK0.
+	_next_tweakvec	TWEAK3, V0, V1, TWEAK0
+.Lhandle_remainder_have_tweak0\@:
+	add		$4*VL, LEN	// Undo the extra sub from earlier.
+
+	// En/decrypt any remaining full blocks, one vector at a time.
+.if VL > 16
+	sub		$VL, LEN
+	jl		.Lvec_at_a_time_done\@
+.Lvec_at_a_time\@:
+	_vmovdqu	(SRC), V0
+	_aes_crypt	\enc, , TWEAK0, V0
+	_vmovdqu	V0, (DST)
+	_next_tweakvec	TWEAK0, V0, V1, TWEAK0
+	add		$VL, SRC
+	add		$VL, DST
+	sub		$VL, LEN
+	jge		.Lvec_at_a_time\@
+.Lvec_at_a_time_done\@:
+	add		$VL-16, LEN
+.else
+	sub		$16, LEN
+.endif
+
+	// En/decrypt any remaining full blocks, one at a time.
+	jl		.Lblock_at_a_time_done\@
+.Lblock_at_a_time\@:
+	vmovdqu		(SRC), %xmm0
+	_aes_crypt	\enc, _XMM, TWEAK0_XMM, %xmm0
+	vmovdqu		%xmm0, (DST)
+	_next_tweak	TWEAK0_XMM, %xmm0, TWEAK0_XMM
+	add		$16, SRC
+	add		$16, DST
+	sub		$16, LEN
+	jge		.Lblock_at_a_time\@
+.Lblock_at_a_time_done\@:
+	add		$16, LEN
+
+.Lfull_blocks_done\@:
+	// Now 0 <= LEN <= 15.  If LEN is nonzero, do ciphertext stealing to
+	// process the last 16 + LEN bytes.  If LEN is zero, we're done.
+	test		LEN, LEN
+	jnz		.Lcts\@
+
+	// Store the next tweak back to *IV to support continuation calls.
+	vmovdqu		TWEAK0_XMM, (IV)
+	jmp		.Ldone\@
+
+	// Out-of-line handling of AES-128 and AES-192
+.Lencrypt_iv_aes_128_or_192\@:
+	jz		.Lencrypt_iv_aes_192\@
+	vaesenclast	10*16(KEY), TWEAK0_XMM, TWEAK0_XMM
+	jmp		.Lencrypt_iv_done\@
+.Lencrypt_iv_aes_192\@:
+	vaesenc		10*16(KEY), TWEAK0_XMM, TWEAK0_XMM
+	vaesenc		11*16(KEY), TWEAK0_XMM, TWEAK0_XMM
+	vaesenclast	12*16(KEY), TWEAK0_XMM, TWEAK0_XMM
+	jmp		.Lencrypt_iv_done\@
+
+	// Out-of-line handling of AES-128 and AES-192
+.Lencrypt_4x_aes_128_or_192\@:
+	jz		.Lencrypt_4x_aes_192\@
+	_vaes_4x	\enc, 1, 10
+	jmp		.Lencrypt_4x_done\@
+.Lencrypt_4x_aes_192\@:
+	_vaes_4x	\enc, 0, 10
+	_vaes_4x	\enc, 0, 11
+	_vaes_4x	\enc, 1, 12
+	jmp		.Lencrypt_4x_done\@
+
+.Lneed_cts\@:
+	// The data length isn't a multiple of the AES block length, so
+	// ciphertext stealing (CTS) will be needed.  Subtract one block from
+	// LEN so that the main loop doesn't process the last full block.  The
+	// CTS step will process it specially along with the partial block.
+	sub		$16, LEN
+	jmp		.Lxts_init\@
+
+.Lcts\@:
+	// Do ciphertext stealing (CTS) to en/decrypt the last full block and
+	// the partial block.  CTS needs two tweaks.  TWEAK0_XMM contains the
+	// next tweak; compute the one after that.  Decryption uses these two
+	// tweaks in reverse order, so also define aliases to handle that.
+	_next_tweak	TWEAK0_XMM, %xmm0, TWEAK1_XMM
+.if \enc
+	.set		CTS_TWEAK0,	TWEAK0_XMM
+	.set		CTS_TWEAK1,	TWEAK1_XMM
+.else
+	.set		CTS_TWEAK0,	TWEAK1_XMM
+	.set		CTS_TWEAK1,	TWEAK0_XMM
+.endif
+
+	// En/decrypt the last full block.
+	vmovdqu		(SRC), %xmm0
+	_aes_crypt	\enc, _XMM, CTS_TWEAK0, %xmm0
+
+.if USE_AVX10
+	// Create a mask that has the first LEN bits set.
+	mov		$-1, %rax
+	bzhi		LEN, %rax, %rax
+	kmovq		%rax, %k1
+
+	// Swap the first LEN bytes of the above result with the partial block.
+	// Note that to support in-place en/decryption, the load from the src
+	// partial block must happen before the store to the dst partial block.
+	vmovdqa		%xmm0, %xmm1
+	vmovdqu8	16(SRC), %xmm0{%k1}
+	vmovdqu8	%xmm1, 16(DST){%k1}
+.else
+	lea		.Lcts_permute_table(%rip), %rax
+
+	// Load the src partial block, left-aligned.  Note that to support
+	// in-place en/decryption, this must happen before the store to the dst
+	// partial block.
+	vmovdqu		(SRC, LEN, 1), %xmm1
+
+	// Shift the first LEN bytes of the en/decryption of the last full block
+	// to the end of a register, then store it to DST+LEN.  This stores the
+	// dst partial block.  It also writes to the second part of the dst last
+	// full block, but that part is overwritten later.
+	vpshufb		(%rax, LEN, 1), %xmm0, %xmm2
+	vmovdqu		%xmm2, (DST, LEN, 1)
+
+	// Make xmm3 contain [16-LEN,16-LEN+1,...,14,15,0x80,0x80,...].
+	sub		LEN, %rax
+	vmovdqu		32(%rax), %xmm3
+
+	// Shift the src partial block to the beginning of its register.
+	vpshufb		%xmm3, %xmm1, %xmm1
+
+	// Do a blend to generate the src partial block followed by the second
+	// part of the en/decryption of the last full block.
+	vpblendvb	%xmm3, %xmm0, %xmm1, %xmm0
+.endif
+	// En/decrypt again and store the last full block.
+	_aes_crypt	\enc, _XMM, CTS_TWEAK1, %xmm0
+	vmovdqu		%xmm0, (DST)
+	jmp		.Ldone\@
+.endm
-- 
2.44.0



* [PATCH 3/6] crypto: x86/aes-xts - wire up AESNI + AVX implementation
  2024-03-26  8:02 [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs Eric Biggers
  2024-03-26  8:02 ` [PATCH 1/6] x86: add kconfig symbols for assembler VAES and VPCLMULQDQ support Eric Biggers
  2024-03-26  8:03 ` [PATCH 2/6] crypto: x86/aes-xts - add AES-XTS assembly macro for modern CPUs Eric Biggers
@ 2024-03-26  8:03 ` Eric Biggers
  2024-03-26  8:03 ` [PATCH 4/6] crypto: x86/aes-xts - wire up VAES + AVX2 implementation Eric Biggers
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Eric Biggers @ 2024-03-26  8:03 UTC (permalink / raw)
  To: linux-crypto, x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Chang S . Bae

From: Eric Biggers <ebiggers@google.com>

Add an AES-XTS implementation "xts-aes-aesni-avx" for x86_64 CPUs that
have the AES-NI and AVX extensions but not VAES.  It's similar to the
existing xts-aes-aesni in that it uses xmm registers to operate on one AES
block at a time.  It differs from xts-aes-aesni in the following ways:

- It uses the VEX-coded (non-destructive) instructions from AVX.
  This improves performance slightly.
- It supports only 64-bit (x86_64).
- It incorporates some small extra optimizations such as handling the
  tweak encryption more efficiently and caching some of the round keys.
- It's generated by an assembly macro that will also be used to generate
  VAES-based implementations.

The performance improvement over xts-aes-aesni varies from negligible to
substantial, depending on the CPU and other factors such as the size of
the messages en/decrypted.  For example, the following increases in
AES-256-XTS decryption throughput are seen on the following CPUs:

                   | 4096-byte messages | 512-byte messages |
    ---------------+--------------------+-------------------+
    Intel Skylake  |        1%          |       11%         |
    AMD Zen 1      |        25%         |       20%         |
    AMD Zen 2      |        26%         |       20%         |

(The above CPUs don't support VAES, so they can't use VAES instead.)

While this isn't as large an improvement as what VAES provides, this
still seems worthwhile.  This implementation is fairly easy to provide
based on the assembly macro that's needed for VAES anyway, and it will
be the best implementation on a large number of CPUs (very roughly, the
CPUs launched by Intel and AMD from 2011 to 2018).

This makes the existing xts-aes-aesni *mostly* obsolete.  For now, leave
it in place to support 32-bit kernels and also CPUs like Intel Westmere
that support AES-NI but not AVX.  (We could potentially remove it anyway
and just rely on the indirect acceleration via ecb-aes-aesni in those
cases, but that change will need to be considered separately.)
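
(A note on the glue code added below: when the message length isn't a
multiple of the AES block size, xts_crypt_slowpath() splits the request
so that the last full block and the partial block reach the assembly
function in the same call, as ciphertext stealing requires.  For a
hypothetical 600-byte request: tail = 600 % 16 = 8, so the first pass
covers 600 - 8 - 16 = 576 bytes and the final call processes the
remaining 16 + 8 = 24 bytes.)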

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/x86/crypto/aes-xts-avx-x86_64.S |   9 ++
 arch/x86/crypto/aesni-intel_glue.c   | 198 ++++++++++++++++++++++++++-
 2 files changed, 206 insertions(+), 1 deletion(-)

diff --git a/arch/x86/crypto/aes-xts-avx-x86_64.S b/arch/x86/crypto/aes-xts-avx-x86_64.S
index 92f1580e1eb0..a8003fea97b7 100644
--- a/arch/x86/crypto/aes-xts-avx-x86_64.S
+++ b/arch/x86/crypto/aes-xts-avx-x86_64.S
@@ -754,5 +754,14 @@
 	// En/decrypt again and store the last full block.
 	_aes_crypt	\enc, _XMM, CTS_TWEAK1, %xmm0
 	vmovdqu		%xmm0, (DST)
 	jmp		.Ldone\@
 .endm
+
+.set	VL, 16
+.set	USE_AVX10, 0
+SYM_TYPED_FUNC_START(aes_xts_encrypt_aesni_avx)
+	aes_xts_crypt	1
+SYM_FUNC_END(aes_xts_encrypt_aesni_avx)
+SYM_TYPED_FUNC_START(aes_xts_decrypt_aesni_avx)
+	aes_xts_crypt	0
+SYM_FUNC_END(aes_xts_decrypt_aesni_avx)
diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index b1d90c25975a..d5e33c396b3e 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -1135,10 +1135,197 @@ static struct skcipher_alg aesni_xctr = {
 	.encrypt	= xctr_crypt,
 	.decrypt	= xctr_crypt,
 };
 
 static struct simd_skcipher_alg *aesni_simd_xctr;
+
+// Flags for the 'int flags' parameter.  Keep in sync with asm file.
+#define XTS_FIRST	0x1
+#define XTS_UPDATE_IV	0x2
+
+typedef void (*xts_asm_func)(const struct aesni_xts_ctx *key,
+			     const u8 *src, u8 *dst, size_t len,
+			     u8 iv[AES_BLOCK_SIZE], int flags);
+
+/*
+ * This handles cases where the full message isn't available in one step of the
+ * scatterlist walk.
+ */
+static noinline int
+xts_crypt_slowpath(struct skcipher_request *req,
+		   struct skcipher_walk *walk, xts_asm_func asm_func)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	const struct aesni_xts_ctx *ctx = aes_xts_ctx(tfm);
+	int tail = req->cryptlen % AES_BLOCK_SIZE;
+	struct scatterlist sg_src[2], sg_dst[2];
+	struct skcipher_request subreq;
+	struct scatterlist *src, *dst;
+	int flags = XTS_FIRST | XTS_UPDATE_IV;
+	int err;
+
+	/*
+	 * If the message length isn't divisible by the AES block size, then
+	 * separate off the last full block and the partial block.  This ensures
+	 * that they are processed in the same call to the assembly function,
+	 * which is required for ciphertext stealing.
+	 */
+	if (tail) {
+		skcipher_walk_abort(walk);
+
+		skcipher_request_set_tfm(&subreq, tfm);
+		skcipher_request_set_callback(&subreq,
+					      skcipher_request_flags(req),
+					      NULL, NULL);
+		skcipher_request_set_crypt(&subreq, req->src, req->dst,
+					   req->cryptlen - tail - AES_BLOCK_SIZE,
+					   req->iv);
+		req = &subreq;
+		err = skcipher_walk_virt(walk, req, false);
+	}
+
+	while (walk->nbytes) {
+		unsigned int nbytes = walk->nbytes;
+
+		if (nbytes < walk->total)
+			nbytes = round_down(nbytes, AES_BLOCK_SIZE);
+
+		kernel_fpu_begin();
+		(*asm_func)(ctx, walk->src.virt.addr, walk->dst.virt.addr,
+			    nbytes, req->iv, flags);
+		kernel_fpu_end();
+		flags &= ~XTS_FIRST;
+		err = skcipher_walk_done(walk, walk->nbytes - nbytes);
+	}
+
+	if (err || !tail)
+		return err;
+
+	/* Do ciphertext stealing with the last full block and partial block. */
+
+	dst = src = scatterwalk_ffwd(sg_src, req->src, req->cryptlen);
+	if (req->dst != req->src)
+		dst = scatterwalk_ffwd(sg_dst, req->dst, req->cryptlen);
+
+	skcipher_request_set_crypt(req, src, dst, AES_BLOCK_SIZE + tail,
+				   req->iv);
+
+	err = skcipher_walk_virt(walk, req, false);
+	if (err)
+		return err;
+
+	kernel_fpu_begin();
+	(*asm_func)(ctx, walk->src.virt.addr, walk->dst.virt.addr, walk->nbytes,
+		    req->iv, flags);
+	kernel_fpu_end();
+
+	return skcipher_walk_done(walk, 0);
+}
+
+/* __always_inline to avoid indirect call in fastpath */
+static __always_inline int
+xts_crypt2(struct skcipher_request *req, xts_asm_func asm_func)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	const struct aesni_xts_ctx *ctx = aes_xts_ctx(tfm);
+	struct skcipher_walk walk;
+	int err;
+
+	/* The assembly code assumes these field offsets in the key struct. */
+	BUILD_BUG_ON(offsetof(struct aesni_xts_ctx, tweak_ctx) != 0);
+	BUILD_BUG_ON(offsetof(struct aesni_xts_ctx, tweak_ctx.key_enc) != 0);
+	BUILD_BUG_ON(offsetof(struct aesni_xts_ctx, tweak_ctx.key_length) != 480);
+	BUILD_BUG_ON(offsetof(struct aesni_xts_ctx, crypt_ctx) != 496);
+	BUILD_BUG_ON(offsetof(struct aesni_xts_ctx, crypt_ctx.key_enc) != 496);
+	BUILD_BUG_ON(offsetof(struct aesni_xts_ctx, crypt_ctx.key_dec) != 736);
+
+	if (req->cryptlen < AES_BLOCK_SIZE)
+		return -EINVAL;
+
+	err = skcipher_walk_virt(&walk, req, false);
+	if (err)
+		return err;
+	if (likely(walk.nbytes == walk.total)) {
+		kernel_fpu_begin();
+		(*asm_func)(ctx, walk.src.virt.addr, walk.dst.virt.addr,
+			    walk.nbytes, req->iv, XTS_FIRST);
+		kernel_fpu_end();
+		return skcipher_walk_done(&walk, 0);
+	}
+	return xts_crypt_slowpath(req, &walk, asm_func);
+}
+
+#define DEFINE_XTS_ALG(suffix, driver_name, priority)			       \
+									       \
+asmlinkage void aes_xts_encrypt_##suffix(const struct aesni_xts_ctx *key,      \
+					 const u8 *src, u8 *dst, size_t len,   \
+					 u8 iv[AES_BLOCK_SIZE], int flags);    \
+asmlinkage void aes_xts_decrypt_##suffix(const struct aesni_xts_ctx *key,      \
+					 const u8 *src, u8 *dst, size_t len,   \
+					 u8 iv[AES_BLOCK_SIZE], int flags);    \
+									       \
+static int xts_encrypt_##suffix(struct skcipher_request *req)		       \
+{									       \
+	return xts_crypt2(req, aes_xts_encrypt_##suffix);		       \
+}									       \
+									       \
+static int xts_decrypt_##suffix(struct skcipher_request *req)		       \
+{									       \
+	return xts_crypt2(req, aes_xts_decrypt_##suffix);		       \
+}									       \
+									       \
+static struct skcipher_alg aes_xts_alg_##suffix = {			       \
+	.base = {							       \
+		.cra_name		= "__xts(aes)",			       \
+		.cra_driver_name	= "__" driver_name,		       \
+		.cra_priority		= priority,			       \
+		.cra_flags		= CRYPTO_ALG_INTERNAL,		       \
+		.cra_blocksize		= AES_BLOCK_SIZE,		       \
+		.cra_ctxsize		= XTS_AES_CTX_SIZE,		       \
+		.cra_module		= THIS_MODULE,			       \
+	},								       \
+	.min_keysize	= 2 * AES_MIN_KEY_SIZE,				       \
+	.max_keysize	= 2 * AES_MAX_KEY_SIZE,				       \
+	.ivsize		= AES_BLOCK_SIZE,				       \
+	.walksize	= 2 * AES_BLOCK_SIZE,				       \
+	.setkey		= xts_aesni_setkey,				       \
+	.encrypt	= xts_encrypt_##suffix,				       \
+	.decrypt	= xts_decrypt_##suffix,				       \
+};									       \
+									       \
+static struct simd_skcipher_alg *aes_xts_simdalg_##suffix
+
+DEFINE_XTS_ALG(aesni_avx, "xts-aes-aesni-avx", 500);
+
+static int __init register_xts_algs(void)
+{
+	int err;
+
+	if (!boot_cpu_has(X86_FEATURE_AVX))
+		return 0;
+	err = simd_register_skciphers_compat(&aes_xts_alg_aesni_avx, 1,
+					     &aes_xts_simdalg_aesni_avx);
+	if (err)
+		return err;
+	return 0;
+}
+
+static void unregister_xts_algs(void)
+{
+	if (aes_xts_simdalg_aesni_avx)
+		simd_unregister_skciphers(&aes_xts_alg_aesni_avx, 1,
+					  &aes_xts_simdalg_aesni_avx);
+}
+#else
+static int __init register_xts_algs(void)
+{
+	return 0;
+}
+
+static void unregister_xts_algs(void)
+{
+}
 #endif /* CONFIG_X86_64 */
 
 #ifdef CONFIG_X86_64
 static int generic_gcmaes_set_key(struct crypto_aead *aead, const u8 *key,
 				  unsigned int key_len)
@@ -1274,17 +1461,25 @@ static int __init aesni_init(void)
 						     &aesni_simd_xctr);
 	if (err)
 		goto unregister_aeads;
 #endif /* CONFIG_X86_64 */
 
+	err = register_xts_algs();
+	if (err)
+		goto unregister_xts;
+
 	return 0;
 
+unregister_xts:
+	unregister_xts_algs();
 #ifdef CONFIG_X86_64
+	if (aesni_simd_xctr)
+		simd_unregister_skciphers(&aesni_xctr, 1, &aesni_simd_xctr);
 unregister_aeads:
+#endif /* CONFIG_X86_64 */
 	simd_unregister_aeads(aesni_aeads, ARRAY_SIZE(aesni_aeads),
 				aesni_simd_aeads);
-#endif /* CONFIG_X86_64 */
 
 unregister_skciphers:
 	simd_unregister_skciphers(aesni_skciphers, ARRAY_SIZE(aesni_skciphers),
 				  aesni_simd_skciphers);
 unregister_cipher:
@@ -1301,10 +1496,11 @@ static void __exit aesni_exit(void)
 	crypto_unregister_alg(&aesni_cipher_alg);
 #ifdef CONFIG_X86_64
 	if (boot_cpu_has(X86_FEATURE_AVX))
 		simd_unregister_skciphers(&aesni_xctr, 1, &aesni_simd_xctr);
 #endif /* CONFIG_X86_64 */
+	unregister_xts_algs();
 }
 
 late_initcall(aesni_init);
 module_exit(aesni_exit);
 
-- 
2.44.0



* [PATCH 4/6] crypto: x86/aes-xts - wire up VAES + AVX2 implementation
  2024-03-26  8:02 [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs Eric Biggers
                   ` (2 preceding siblings ...)
  2024-03-26  8:03 ` [PATCH 3/6] crypto: x86/aes-xts - wire up AESNI + AVX implementation Eric Biggers
@ 2024-03-26  8:03 ` Eric Biggers
  2024-03-26  8:03 ` [PATCH 5/6] crypto: x86/aes-xts - wire up VAES + AVX10/256 implementation Eric Biggers
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Eric Biggers @ 2024-03-26  8:03 UTC (permalink / raw)
  To: linux-crypto, x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Chang S . Bae

From: Eric Biggers <ebiggers@google.com>

Add an AES-XTS implementation "xts-aes-vaes-avx2" for x86_64 CPUs with
the VAES, VPCLMULQDQ, and AVX2 extensions, but not AVX512 or AVX10.
This implementation uses ymm registers to operate on two AES blocks at a
time.  The assembly code is instantiated using a macro so that most of
the source code is shared with other implementations.

This is the optimal implementation on AMD Zen 3.  It should also be the
optimal implementation on Intel Alder Lake, which similarly supports
VAES but not AVX512.  Comparing to xts-aes-aesni-avx on Zen 3,
xts-aes-vaes-avx2 provides 51% higher AES-256-XTS decryption throughput
with 4096-byte messages, or 19% higher with 512-byte messages.

A large improvement is also seen with CPUs that do support AVX512 (e.g.,
74% higher AES-256-XTS decryption throughput on Ice Lake with 4096-byte
messages), though the following patches add AVX512 optimized
implementations to get a bit more performance on those CPUs.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/x86/crypto/aes-xts-avx-x86_64.S | 11 +++++++++++
 arch/x86/crypto/aesni-intel_glue.c   | 18 ++++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/arch/x86/crypto/aes-xts-avx-x86_64.S b/arch/x86/crypto/aes-xts-avx-x86_64.S
index a8003fea97b7..87ae2139b7ca 100644
--- a/arch/x86/crypto/aes-xts-avx-x86_64.S
+++ b/arch/x86/crypto/aes-xts-avx-x86_64.S
@@ -763,5 +763,16 @@ SYM_TYPED_FUNC_START(aes_xts_encrypt_aesni_avx)
 	aes_xts_crypt	1
 SYM_FUNC_END(aes_xts_encrypt_aesni_avx)
 SYM_TYPED_FUNC_START(aes_xts_decrypt_aesni_avx)
 	aes_xts_crypt	0
 SYM_FUNC_END(aes_xts_decrypt_aesni_avx)
+
+#if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ)
+.set	VL, 32
+.set	USE_AVX10, 0
+SYM_TYPED_FUNC_START(aes_xts_encrypt_vaes_avx2)
+	aes_xts_crypt	1
+SYM_FUNC_END(aes_xts_encrypt_vaes_avx2)
+SYM_TYPED_FUNC_START(aes_xts_decrypt_vaes_avx2)
+	aes_xts_crypt	0
+SYM_FUNC_END(aes_xts_decrypt_vaes_avx2)
+#endif /* CONFIG_AS_VAES && CONFIG_AS_VPCLMULQDQ */
diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index d5e33c396b3e..d958aa073c14 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -1293,10 +1293,13 @@ static struct skcipher_alg aes_xts_alg_##suffix = {			       \
 };									       \
 									       \
 static struct simd_skcipher_alg *aes_xts_simdalg_##suffix
 
 DEFINE_XTS_ALG(aesni_avx, "xts-aes-aesni-avx", 500);
+#if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ)
+DEFINE_XTS_ALG(vaes_avx2, "xts-aes-vaes-avx2", 600);
+#endif
 
 static int __init register_xts_algs(void)
 {
 	int err;
 
@@ -1304,18 +1307,33 @@ static int __init register_xts_algs(void)
 		return 0;
 	err = simd_register_skciphers_compat(&aes_xts_alg_aesni_avx, 1,
 					     &aes_xts_simdalg_aesni_avx);
 	if (err)
 		return err;
+#if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ)
+	if (!boot_cpu_has(X86_FEATURE_AVX2) ||
+	    !boot_cpu_has(X86_FEATURE_VAES) ||
+	    !boot_cpu_has(X86_FEATURE_VPCLMULQDQ) ||
+	    !boot_cpu_has(X86_FEATURE_PCLMULQDQ) ||
+	    !cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL))
+		return 0;
+	err = simd_register_skciphers_compat(&aes_xts_alg_vaes_avx2, 1,
+					     &aes_xts_simdalg_vaes_avx2);
+	if (err)
+		return err;
+#endif
 	return 0;
 }
 
 static void unregister_xts_algs(void)
 {
 	if (aes_xts_simdalg_aesni_avx)
 		simd_unregister_skciphers(&aes_xts_alg_aesni_avx, 1,
 					  &aes_xts_simdalg_aesni_avx);
+	if (aes_xts_simdalg_vaes_avx2)
+		simd_unregister_skciphers(&aes_xts_alg_vaes_avx2, 1,
+					  &aes_xts_simdalg_vaes_avx2);
 }
 #else
 static int __init register_xts_algs(void)
 {
 	return 0;
-- 
2.44.0



* [PATCH 5/6] crypto: x86/aes-xts - wire up VAES + AVX10/256 implementation
  2024-03-26  8:02 [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs Eric Biggers
                   ` (3 preceding siblings ...)
  2024-03-26  8:03 ` [PATCH 4/6] crypto: x86/aes-xts - wire up VAES + AVX2 implementation Eric Biggers
@ 2024-03-26  8:03 ` Eric Biggers
  2024-03-26  8:03 ` [PATCH 6/6] crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation Eric Biggers
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Eric Biggers @ 2024-03-26  8:03 UTC (permalink / raw)
  To: linux-crypto, x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Chang S . Bae

From: Eric Biggers <ebiggers@google.com>

Add an AES-XTS implementation "xts-aes-vaes-avx10_256" for x86_64 CPUs
with the VAES, VPCLMULQDQ, and either AVX10/256 or AVX512BW + AVX512VL
extensions.  This implementation avoids using zmm registers, instead
using ymm registers to operate on two AES blocks at a time.  The
assembly code is instantiated using a macro so that most of the source
code is shared with other implementations.

This is the optimal implementation on CPUs that support VAES and AVX512
but where the zmm registers should not be used due to downclocking
effects, for example Intel's Ice Lake.  It should also be the optimal
implementation on future CPUs that support AVX10/256 but not AVX10/512.

The performance is slightly better than that of xts-aes-vaes-avx2, which
uses the same vector length, due to factors such as being able to use
ymm16-ymm31 to cache the AES round keys.  For example, on Ice Lake, the
throughput of decrypting 4096-byte messages with AES-256-XTS is 5.8%
higher with xts-aes-vaes-avx10_256 than with xts-aes-vaes-avx2.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/x86/crypto/aes-xts-avx-x86_64.S |  9 +++++++++
 arch/x86/crypto/aesni-intel_glue.c   | 16 ++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/arch/x86/crypto/aes-xts-avx-x86_64.S b/arch/x86/crypto/aes-xts-avx-x86_64.S
index 87ae2139b7ca..c868b9af443b 100644
--- a/arch/x86/crypto/aes-xts-avx-x86_64.S
+++ b/arch/x86/crypto/aes-xts-avx-x86_64.S
@@ -773,6 +773,15 @@ SYM_TYPED_FUNC_START(aes_xts_encrypt_vaes_avx2)
 	aes_xts_crypt	1
 SYM_FUNC_END(aes_xts_encrypt_vaes_avx2)
 SYM_TYPED_FUNC_START(aes_xts_decrypt_vaes_avx2)
 	aes_xts_crypt	0
 SYM_FUNC_END(aes_xts_decrypt_vaes_avx2)
+
+.set	VL, 32
+.set	USE_AVX10, 1
+SYM_TYPED_FUNC_START(aes_xts_encrypt_vaes_avx10_256)
+	aes_xts_crypt	1
+SYM_FUNC_END(aes_xts_encrypt_vaes_avx10_256)
+SYM_TYPED_FUNC_START(aes_xts_decrypt_vaes_avx10_256)
+	aes_xts_crypt	0
+SYM_FUNC_END(aes_xts_decrypt_vaes_avx10_256)
 #endif /* CONFIG_AS_VAES && CONFIG_AS_VPCLMULQDQ */
diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index d958aa073c14..ac45e0b952b7 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -1295,10 +1295,11 @@ static struct skcipher_alg aes_xts_alg_##suffix = {			       \
 static struct simd_skcipher_alg *aes_xts_simdalg_##suffix
 
 DEFINE_XTS_ALG(aesni_avx, "xts-aes-aesni-avx", 500);
 #if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ)
 DEFINE_XTS_ALG(vaes_avx2, "xts-aes-vaes-avx2", 600);
+DEFINE_XTS_ALG(vaes_avx10_256, "xts-aes-vaes-avx10_256", 700);
 #endif
 
 static int __init register_xts_algs(void)
 {
 	int err;
@@ -1318,10 +1319,22 @@ static int __init register_xts_algs(void)
 		return 0;
 	err = simd_register_skciphers_compat(&aes_xts_alg_vaes_avx2, 1,
 					     &aes_xts_simdalg_vaes_avx2);
 	if (err)
 		return err;
+
+	if (!boot_cpu_has(X86_FEATURE_AVX512BW) ||
+	    !boot_cpu_has(X86_FEATURE_AVX512VL) ||
+	    !boot_cpu_has(X86_FEATURE_BMI2) ||
+	    !cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM |
+			       XFEATURE_MASK_AVX512, NULL))
+		return 0;
+
+	err = simd_register_skciphers_compat(&aes_xts_alg_vaes_avx10_256, 1,
+					     &aes_xts_simdalg_vaes_avx10_256);
+	if (err)
+		return err;
 #endif
 	return 0;
 }
 
 static void unregister_xts_algs(void)
@@ -1330,10 +1343,13 @@ static void unregister_xts_algs(void)
 		simd_unregister_skciphers(&aes_xts_alg_aesni_avx, 1,
 					  &aes_xts_simdalg_aesni_avx);
 	if (aes_xts_simdalg_vaes_avx2)
 		simd_unregister_skciphers(&aes_xts_alg_vaes_avx2, 1,
 					  &aes_xts_simdalg_vaes_avx2);
+	if (aes_xts_simdalg_vaes_avx10_256)
+		simd_unregister_skciphers(&aes_xts_alg_vaes_avx10_256, 1,
+					  &aes_xts_simdalg_vaes_avx10_256);
 }
 #else
 static int __init register_xts_algs(void)
 {
 	return 0;
-- 
2.44.0



* [PATCH 6/6] crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation
  2024-03-26  8:02 [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs Eric Biggers
                   ` (4 preceding siblings ...)
  2024-03-26  8:03 ` [PATCH 5/6] crypto: x86/aes-xts - wire up VAES + AVX10/256 implementation Eric Biggers
@ 2024-03-26  8:03 ` Eric Biggers
  2024-03-26  8:51 ` [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs Ard Biesheuvel
  2024-04-05  7:58 ` Herbert Xu
  7 siblings, 0 replies; 19+ messages in thread
From: Eric Biggers @ 2024-03-26  8:03 UTC (permalink / raw)
  To: linux-crypto, x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Chang S . Bae

From: Eric Biggers <ebiggers@google.com>

Add an AES-XTS implementation "xts-aes-vaes-avx10_512" for x86_64 CPUs
with the VAES, VPCLMULQDQ, and either AVX10/512 or AVX512BW + AVX512VL
extensions.  This implementation uses zmm registers to operate on four
AES blocks at a time.  The assembly code is instantiated using a macro
so that most of the source code is shared with other implementations.

To avoid downclocking on older Intel CPU models, an exclusion list is
used to prevent this 512-bit implementation from being used by default
on some CPU models.  They will use xts-aes-vaes-avx10_256 instead.  For
now, this exclusion list is simply coded into aesni-intel_glue.c.  It
may make sense to eventually move it into a more central location.

xts-aes-vaes-avx10_512 is slightly faster than xts-aes-vaes-avx10_256 on
some current CPUs.  E.g., on Intel Sapphire Rapids, AES-256-XTS
decryption throughput increases by 5.6% with 4096-byte inputs, or 8.9%
with 512-byte inputs.  On AMD Genoa, AES-256-XTS decryption throughput
increases by 15.3% with 4096-byte inputs, or 7.6% with 512-byte inputs.

Future CPUs may provide stronger 512-bit support, in which case a larger
benefit should be seen.
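
Concretely, on a CPU in the exclusion list xts-aes-vaes-avx10_512 is
still registered, just with cra_priority lowered to 1, so the
priority-700 xts-aes-vaes-avx10_256 stays the default for "xts(aes)";
the 512-bit variant remains selectable by its driver name, e.g. for
benchmarking.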

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/x86/crypto/aes-xts-avx-x86_64.S |  9 ++++++++
 arch/x86/crypto/aesni-intel_glue.c   | 33 +++++++++++++++++++++++++++-
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/arch/x86/crypto/aes-xts-avx-x86_64.S b/arch/x86/crypto/aes-xts-avx-x86_64.S
index c868b9af443b..024fc12c9a94 100644
--- a/arch/x86/crypto/aes-xts-avx-x86_64.S
+++ b/arch/x86/crypto/aes-xts-avx-x86_64.S
@@ -782,6 +782,15 @@ SYM_TYPED_FUNC_START(aes_xts_encrypt_vaes_avx10_256)
 	aes_xts_crypt	1
 SYM_FUNC_END(aes_xts_encrypt_vaes_avx10_256)
 SYM_TYPED_FUNC_START(aes_xts_decrypt_vaes_avx10_256)
 	aes_xts_crypt	0
 SYM_FUNC_END(aes_xts_decrypt_vaes_avx10_256)
+
+.set	VL, 64
+.set	USE_AVX10, 1
+SYM_TYPED_FUNC_START(aes_xts_encrypt_vaes_avx10_512)
+	aes_xts_crypt	1
+SYM_FUNC_END(aes_xts_encrypt_vaes_avx10_512)
+SYM_TYPED_FUNC_START(aes_xts_decrypt_vaes_avx10_512)
+	aes_xts_crypt	0
+SYM_FUNC_END(aes_xts_decrypt_vaes_avx10_512)
 #endif /* CONFIG_AS_VAES && CONFIG_AS_VPCLMULQDQ */
diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index ac45e0b952b7..49b259dff81f 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -1296,12 +1296,32 @@ static struct simd_skcipher_alg *aes_xts_simdalg_##suffix
 
 DEFINE_XTS_ALG(aesni_avx, "xts-aes-aesni-avx", 500);
 #if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ)
 DEFINE_XTS_ALG(vaes_avx2, "xts-aes-vaes-avx2", 600);
 DEFINE_XTS_ALG(vaes_avx10_256, "xts-aes-vaes-avx10_256", 700);
+DEFINE_XTS_ALG(vaes_avx10_512, "xts-aes-vaes-avx10_512", 800);
 #endif
 
+/*
+ * This is a list of CPU models that are known to suffer from downclocking when
+ * zmm registers (512-bit vectors) are used.  On these CPUs, the AES-XTS
+ * implementation with zmm registers won't be used by default.  An
+ * implementation with ymm registers (256-bit vectors) will be used instead.
+ */
+static const struct x86_cpu_id zmm_exclusion_list[] = {
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_SKYLAKE_X },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_X },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_D },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_L },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_NNPI },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_TIGERLAKE_L },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_TIGERLAKE },
+	/* Allow Rocket Lake and later, and Sapphire Rapids and later. */
+	/* Also allow AMD CPUs (starting with Zen 4, the first with AVX-512). */
+};
+
 static int __init register_xts_algs(void)
 {
 	int err;
 
 	if (!boot_cpu_has(X86_FEATURE_AVX))
@@ -1331,11 +1351,19 @@ static int __init register_xts_algs(void)
 
 	err = simd_register_skciphers_compat(&aes_xts_alg_vaes_avx10_256, 1,
 					     &aes_xts_simdalg_vaes_avx10_256);
 	if (err)
 		return err;
-#endif
+
+	if (x86_match_cpu(zmm_exclusion_list))
+		aes_xts_alg_vaes_avx10_512.base.cra_priority = 1;
+
+	err = simd_register_skciphers_compat(&aes_xts_alg_vaes_avx10_512, 1,
+					     &aes_xts_simdalg_vaes_avx10_512);
+	if (err)
+		return err;
+#endif /* CONFIG_AS_VAES && CONFIG_AS_VPCLMULQDQ */
 	return 0;
 }
 
 static void unregister_xts_algs(void)
 {
@@ -1346,10 +1374,13 @@ static void unregister_xts_algs(void)
 		simd_unregister_skciphers(&aes_xts_alg_vaes_avx2, 1,
 					  &aes_xts_simdalg_vaes_avx2);
 	if (aes_xts_simdalg_vaes_avx10_256)
 		simd_unregister_skciphers(&aes_xts_alg_vaes_avx10_256, 1,
 					  &aes_xts_simdalg_vaes_avx10_256);
+	if (aes_xts_simdalg_vaes_avx10_512)
+		simd_unregister_skciphers(&aes_xts_alg_vaes_avx10_512, 1,
+					  &aes_xts_simdalg_vaes_avx10_512);
 }
 #else
 static int __init register_xts_algs(void)
 {
 	return 0;
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/6] x86: add kconfig symbols for assembler VAES and VPCLMULQDQ support
  2024-03-26  8:02 ` [PATCH 1/6] x86: add kconfig symbols for assembler VAES and VPCLMULQDQ support Eric Biggers
@ 2024-03-26  8:10   ` Ingo Molnar
  2024-03-26  8:18     ` Eric Biggers
  0 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2024-03-26  8:10 UTC (permalink / raw)
  To: Eric Biggers
  Cc: linux-crypto, x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Chang S . Bae


* Eric Biggers <ebiggers@kernel.org> wrote:

> From: Eric Biggers <ebiggers@google.com>
> 
> Add config symbols AS_VAES and AS_VPCLMULQDQ that expose whether the
> assembler supports the vector AES and carryless multiplication
> cryptographic extensions.
> 
> Signed-off-by: Eric Biggers <ebiggers@google.com>
> ---
>  arch/x86/Kconfig.assembler | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
> index 8ad41da301e5..59aedf32c4ea 100644
> --- a/arch/x86/Kconfig.assembler
> +++ b/arch/x86/Kconfig.assembler
> @@ -23,9 +23,19 @@ config AS_TPAUSE
>  config AS_GFNI
>  	def_bool $(as-instr,vgf2p8mulb %xmm0$(comma)%xmm1$(comma)%xmm2)
>  	help
>  	  Supported by binutils >= 2.30 and LLVM integrated assembler
>  
> +config AS_VAES
> +	def_bool $(as-instr,vaesenc %ymm0$(comma)%ymm1$(comma)%ymm2)
> +	help
> +	  Supported by binutils >= 2.30 and LLVM integrated assembler

Nit: any reason it isn't called AS_VAESENC, like the instruction itself?

The other new AS_ Kconfig symbols follow the same nomenclature:

> +config AS_VPCLMULQDQ
> +	def_bool $(as-instr,vpclmulqdq \$0x10$(comma)%ymm0$(comma)%ymm1$(comma)%ymm2)
> +	help
> +	  Supported by binutils >= 2.30 and LLVM integrated assembler
> +
>  config AS_WRUSS
>  	def_bool $(as-instr,wrussq %rax$(comma)(%rbx))
>  	help
>  	  Supported by binutils >= 2.31 and LLVM integrated assembler

With the nit above fixed:

  Reviewed-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/6] x86: add kconfig symbols for assembler VAES and VPCLMULQDQ support
  2024-03-26  8:10   ` Ingo Molnar
@ 2024-03-26  8:18     ` Eric Biggers
  2024-03-26  8:28       ` Ingo Molnar
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Biggers @ 2024-03-26  8:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-crypto, x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Chang S . Bae

On Tue, Mar 26, 2024 at 09:10:13AM +0100, Ingo Molnar wrote:
> 
> * Eric Biggers <ebiggers@kernel.org> wrote:
> 
> > From: Eric Biggers <ebiggers@google.com>
> > 
> > Add config symbols AS_VAES and AS_VPCLMULQDQ that expose whether the
> > assembler supports the vector AES and carryless multiplication
> > cryptographic extensions.
> > 
> > Signed-off-by: Eric Biggers <ebiggers@google.com>
> > ---
> >  arch/x86/Kconfig.assembler | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
> > index 8ad41da301e5..59aedf32c4ea 100644
> > --- a/arch/x86/Kconfig.assembler
> > +++ b/arch/x86/Kconfig.assembler
> > @@ -23,9 +23,19 @@ config AS_TPAUSE
> >  config AS_GFNI
> >  	def_bool $(as-instr,vgf2p8mulb %xmm0$(comma)%xmm1$(comma)%xmm2)
> >  	help
> >  	  Supported by binutils >= 2.30 and LLVM integrated assembler
> >  
> > +config AS_VAES
> > +	def_bool $(as-instr,vaesenc %ymm0$(comma)%ymm1$(comma)%ymm2)
> > +	help
> > +	  Supported by binutils >= 2.30 and LLVM integrated assembler
> 
> Nit: any reason it isn't called AS_VAESENC, like the instruction itself?
> 
> The other new AS_ Kconfig symbols follow the same nomenclature:

The CPU feature flag is called VAES.  It guards the vaesenc, vaesenclast,
vaesdec, and vaesdeclast instructions when used on ymm and zmm registers.

So the name AS_VAES seems fine as-is.

I think you may have been confused by AS_VPCLMULQDQ, because in that case the
feature happens to provide a single instruction with the same name as the CPU
feature flag.

- Eric

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/6] x86: add kconfig symbols for assembler VAES and VPCLMULQDQ support
  2024-03-26  8:18     ` Eric Biggers
@ 2024-03-26  8:28       ` Ingo Molnar
  0 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2024-03-26  8:28 UTC (permalink / raw)
  To: Eric Biggers
  Cc: linux-crypto, x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Chang S . Bae


* Eric Biggers <ebiggers@kernel.org> wrote:

> On Tue, Mar 26, 2024 at 09:10:13AM +0100, Ingo Molnar wrote:
> > 
> > * Eric Biggers <ebiggers@kernel.org> wrote:
> > 
> > > From: Eric Biggers <ebiggers@google.com>
> > > 
> > > Add config symbols AS_VAES and AS_VPCLMULQDQ that expose whether the
> > > assembler supports the vector AES and carryless multiplication
> > > cryptographic extensions.
> > > 
> > > Signed-off-by: Eric Biggers <ebiggers@google.com>
> > > ---
> > >  arch/x86/Kconfig.assembler | 10 ++++++++++
> > >  1 file changed, 10 insertions(+)
> > > 
> > > diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
> > > index 8ad41da301e5..59aedf32c4ea 100644
> > > --- a/arch/x86/Kconfig.assembler
> > > +++ b/arch/x86/Kconfig.assembler
> > > @@ -23,9 +23,19 @@ config AS_TPAUSE
> > >  config AS_GFNI
> > >  	def_bool $(as-instr,vgf2p8mulb %xmm0$(comma)%xmm1$(comma)%xmm2)
> > >  	help
> > >  	  Supported by binutils >= 2.30 and LLVM integrated assembler
> > >  
> > > +config AS_VAES
> > > +	def_bool $(as-instr,vaesenc %ymm0$(comma)%ymm1$(comma)%ymm2)
> > > +	help
> > > +	  Supported by binutils >= 2.30 and LLVM integrated assembler
> > 
> > Nit: any reason it isn't called AS_VAESENC, like the instruction itself?
> > 
> > The other new AS_ Kconfig symbols follow the same nomenclature:
> 
> The CPU feature flag is called VAES.  It guards the vaesenc, vaesenclast,
> vaesdec, and vaesdeclast instructions when used on ymm and zmm registers.

I see - fair enough:

   Reviewed-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs
  2024-03-26  8:02 [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs Eric Biggers
                   ` (5 preceding siblings ...)
  2024-03-26  8:03 ` [PATCH 6/6] crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation Eric Biggers
@ 2024-03-26  8:51 ` Ard Biesheuvel
  2024-03-26 16:47   ` Eric Biggers
  2024-04-05  7:58 ` Herbert Xu
  7 siblings, 1 reply; 19+ messages in thread
From: Ard Biesheuvel @ 2024-03-26  8:51 UTC (permalink / raw)
  To: Eric Biggers
  Cc: linux-crypto, x86, linux-kernel, Andy Lutomirski, Chang S . Bae

On Tue, 26 Mar 2024 at 10:06, Eric Biggers <ebiggers@kernel.org> wrote:
>
> This patchset adds new AES-XTS implementations that accelerate disk and
> file encryption on modern x86_64 CPUs.
>
> The largest improvements are seen on CPUs that support the VAES
> extension: Intel Ice Lake (2019) and later, and AMD Zen 3 (2020) and
> later.  However, an implementation using plain AESNI + AVX is also added
> and provides a small boost on older CPUs too.
>
> To try to handle the mess that is x86 SIMD, the code for all the new
> AES-XTS implementations is generated from an assembly macro.  This makes
> it so that we e.g. don't have to have entirely different source code
> just for different vector lengths (xmm, ymm, zmm).
>
> To avoid downclocking effects, zmm registers aren't used on certain
> Intel CPU models such as Ice Lake.  These CPU models default to an
> implementation using ymm registers instead.
>
> This patchset increases the throughput of AES-256-XTS decryption by the
> following amounts on the following CPUs:
>
>                           | 4096-byte messages | 512-byte messages |
>     ----------------------+--------------------+-------------------+
>     Intel Skylake         |        1%          |       11%         |
>     Intel Ice Lake        |        92%         |       59%         |
>     Intel Sapphire Rapids |       115%         |       78%         |
>     AMD Zen 1             |        25%         |       20%         |
>     AMD Zen 2             |        26%         |       20%         |
>     AMD Zen 3             |        82%         |       40%         |
>     AMD Zen 4             |       118%         |       48%         |
>
> (The results for encryption are very similar to decryption.  I just tend
> to measure decryption because decryption performance is more important.)
>
> There's no separate kconfig option for the new AES-XTS implementations,
> as they are included in the existing option CONFIG_CRYPTO_AES_NI_INTEL.
>
> To make testing easier, all four new AES-XTS implementations are
> registered separately with the crypto API.  They are prioritized
> appropriately so that the best one for the CPU is used by default.
>

This is very nice work!

I didn't check the performance delta on my system (it's Intel but I
have no idea which uarch), but it supports all flavours that you
implemented here, and all pass the selftests with
CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y, so for the series

Tested-by: Ard Biesheuvel <ardb@kernel.org>

I will try to make time to review the code as well.

> Open questions:
>
> - Is the policy that I implemented for preferring ymm registers to zmm
>   registers the right one?  arch/x86/crypto/poly1305_glue.c thinks that
>   only Skylake has the bad downclocking.  My current proposal is a bit
>   more conservative; it also excludes Ice Lake and Tiger Lake.  Those
>   CPUs supposedly still have some downclocking, though not as much.
>
> - Should the policy on the use of zmm registers be in a centralized
>   place?  It probably doesn't make sense to have random different
>   policies for different crypto algorithms (AES, Poly1305, ARIA, etc.).
>
> - Are there any other known issues with using AVX512 in kernel mode?  It
>   seems to work, and technically it's not new because Poly1305 and ARIA
>   already use AVX512, including the mask registers and zmm registers up
>   to 31.  So if there was a major issue, like the new registers not
>   being properly saved and restored, it probably would have already been
>   found.  But AES-XTS support would introduce a wider use of it.
>

I don't have much input here, except that I think we should just
disable AVX512 kernel-wide on systems where there is no benefit in
terms of throughput. I suspect this might change with algorithms that
rely more heavily on the masking, but so far, we have been making
quite effective use of simple permute vectors and overlapping loads
and stores to do the same. And as Eric points out, the only relevant
use case in the kernel is blocks of size 2^n where n is at least 9.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs
  2024-03-26  8:51 ` [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs Ard Biesheuvel
@ 2024-03-26 16:47   ` Eric Biggers
  2024-04-03  8:12     ` David Laight
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Biggers @ 2024-03-26 16:47 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-crypto, x86, linux-kernel, Andy Lutomirski, Chang S . Bae

On Tue, Mar 26, 2024 at 10:51:48AM +0200, Ard Biesheuvel wrote:
> > Open questions:
> >
> > - Is the policy that I implemented for preferring ymm registers to zmm
> >   registers the right one?  arch/x86/crypto/poly1305_glue.c thinks that
> >   only Skylake has the bad downclocking.  My current proposal is a bit
> >   more conservative; it also excludes Ice Lake and Tiger Lake.  Those
> >   CPUs supposedly still have some downclocking, though not as much.
> >
> > - Should the policy on the use of zmm registers be in a centralized
> >   place?  It probably doesn't make sense to have random different
> >   policies for different crypto algorithms (AES, Poly1305, ARIA, etc.).
> >
> > - Are there any other known issues with using AVX512 in kernel mode?  It
> >   seems to work, and technically it's not new because Poly1305 and ARIA
> >   already use AVX512, including the mask registers and zmm registers up
> >   to 31.  So if there was a major issue, like the new registers not
> >   being properly saved and restored, it probably would have already been
> >   found.  But AES-XTS support would introduce a wider use of it.
> >
> 
> I don't have much input here, except that I think we should just
> disable AVX512 kernel-wide on systems where there is no benefit in
> terms of throughput. I suspect this might change with algorithms that
> rely more heavily on the masking, but so far, we have been making
> quite effective use of simple permute vectors and overlapping loads
> and stores to do the same. And as Eric points out, the only relevant
> use case in the kernel is blocks of size 2^n where n is at least 9.

There are several benefits to AVX512 besides the 512-bit zmm registers.  Besides
masking, there are also twice as many SIMD registers which make it possible to
cache all the AES round keys.  There are also other new instructions such as
vpternlogd which I've used in AES-XTS to XOR values together more efficiently.
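
(For illustration only, here is the intrinsics equivalent of that three-way
XOR, not the actual assembly from the patchset; the 0x96 immediate encodes
a ^ b ^ c:)

#include <immintrin.h>

/* One vpternlogd replaces two vpxord instructions: returns a ^ b ^ c. */
static inline __m512i xor3(__m512i a, __m512i b, __m512i c)
{
	return _mm512_ternarylogic_epi32(a, b, c, 0x96);
}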

That's why this patchset adds both xts-aes-vaes-avx10_256 and
xts-aes-vaes-avx10_512.  And I've adopted the new "AVX10" naming, maybe a bit
early, to emphasize that it's not just about 512-bit...

Consider Intel Ice Lake for example, these are the AES-256-XTS encryption speeds
on 4096-byte messages in MB/s I'm seeing:

    xts-aes-aesni                  5136
    xts-aes-aesni-avx              5366
    xts-aes-vaes-avx2              9337
    xts-aes-vaes-avx10_256         9876
    xts-aes-vaes-avx10_512         10215

So yes, on that CPU the biggest boost comes just from VAES, staying on AVX2.
But taking advantage of AVX512 does help a bit more, first from the parts other
than 512-bit registers, then a bit more from 512-bit registers.

I do have Ice Lake on the exclusion list from xts-aes-vaes-avx10_512 anyway,
since the concern with downclocking is not really about the performance of the
code itself but rather the impact on unrelated code running on the CPU.

And I *think* the right policy is to just disable the use of the zmm registers,
as opposed to AVX512 entirely.  As AVX512 was originally presented it did tie
these together, but they don't have to be.  AVX10 (which supposedly future
x86_64 CPUs will have) explicitly moves away from that by repackaging the
existing AVX512 features and making the zmm registers optional.

- Eric

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs
  2024-03-26 16:47   ` Eric Biggers
@ 2024-04-03  8:12     ` David Laight
  2024-04-04  1:35       ` Eric Biggers
  0 siblings, 1 reply; 19+ messages in thread
From: David Laight @ 2024-04-03  8:12 UTC (permalink / raw)
  To: 'Eric Biggers', Ard Biesheuvel
  Cc: linux-crypto, x86, linux-kernel, Andy Lutomirski, Chang S . Bae

From: Eric Biggers
> Sent: 26 March 2024 16:48
....
> Consider Intel Ice Lake for example, these are the AES-256-XTS encryption speeds
> on 4096-byte messages in MB/s I'm seeing:
> 
>     xts-aes-aesni                  5136
>     xts-aes-aesni-avx              5366
>     xts-aes-vaes-avx2              9337
>     xts-aes-vaes-avx10_256         9876
>     xts-aes-vaes-avx10_512         10215
> 
> So yes, on that CPU the biggest boost comes just from VAES, staying on AVX2.
> But taking advantage of AVX512 does help a bit more, first from the parts other
> than 512-bit registers, then a bit more from 512-bit registers.

How much does the kernel_fpu_begin() cost on real workloads?
(ie when the registers are live and it forces an extra save/restore)

I've not looked at the code but I often see what looks like
excessive inlining in crypto code.
This will speed up benchmarks but can have a negative effect
on real code both because of the time taken to load the
code and the effect of displacing other code.

It might be that this code is a simple loop....

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs
  2024-04-03  8:12     ` David Laight
@ 2024-04-04  1:35       ` Eric Biggers
  2024-04-04  7:53         ` David Laight
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Biggers @ 2024-04-04  1:35 UTC (permalink / raw)
  To: David Laight
  Cc: Ard Biesheuvel, linux-crypto, x86, linux-kernel, Andy Lutomirski,
	Chang S . Bae

Hi David,

On Wed, Apr 03, 2024 at 08:12:09AM +0000, David Laight wrote:
> From: Eric Biggers
> > Sent: 26 March 2024 16:48
> ....
> > Consider Intel Ice Lake for example, these are the AES-256-XTS encryption speeds
> > on 4096-byte messages in MB/s I'm seeing:
> > 
> >     xts-aes-aesni                  5136
> >     xts-aes-aesni-avx              5366
> >     xts-aes-vaes-avx2              9337
> >     xts-aes-vaes-avx10_256         9876
> >     xts-aes-vaes-avx10_512         10215
> > 
> > So yes, on that CPU the biggest boost comes just from VAES, staying on AVX2.
> > But taking advantage of AVX512 does help a bit more, first from the parts other
> > than 512-bit registers, then a bit more from 512-bit registers.
> 
> How much does the kernel_fpu_begin() cost on real workloads?
> (ie when the registers are live and it forces an extra save/restore)

x86 Linux does lazy restore of the FPU state.  The first kernel_fpu_begin() can
have a significant cost, as it issues an XSAVE (or equivalent) instruction and
causes an XRSTOR (or equivalent) instruction to be issued when returning to
userspace when it otherwise might not be needed.  Additional kernel_fpu_begin()
/ kernel_fpu_end() pairs without returning to userspace have only a small cost,
as they don't cause any more saves or restores of the FPU state to be done.

My new xts(aes) implementations have one kernel_fpu_begin() / kernel_fpu_end()
pair per message (if the message doesn't span any page boundaries, which is
almost always the case).  That's exactly the same as the current xts-aes-aesni.
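
(Schematically, the fast path is just the following -- a simplified sketch
with a made-up asm function name, not the actual glue code:)

#include <linux/linkage.h>
#include <linux/types.h>
#include <crypto/aes.h>
#include <asm/fpu/api.h>

/* Hypothetical asm entry point; the name is made up for this sketch. */
asmlinkage void aes_xts_encrypt_asm(const struct crypto_aes_ctx *key,
				    const u8 *src, u8 *dst,
				    unsigned int len, u8 iv[16]);

/* One kernel_fpu_begin() / kernel_fpu_end() pair per message. */
static void xts_encrypt_one_message(const struct crypto_aes_ctx *key,
				    const u8 *src, u8 *dst,
				    unsigned int len, u8 iv[16])
{
	kernel_fpu_begin();	/* SIMD registers become usable here */
	aes_xts_encrypt_asm(key, src, dst, len, iv);
	kernel_fpu_end();
}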

I think what you may really be asking is how much the overhead of the XSAVE /
XRSTOR pair associated with kernel-mode use of the FPU *increases* if the kernel
clobbers AVX or AVX512 state, instead of just SSE state as xts-aes-aesni does.
That's much more relevant to this patchset.

I think the answer is that there is no additional overhead.  This is because the
XSAVE / XRSTOR pair happens regardless of the type of state the kernel clobbers,
and it operates on the userspace state, not the kernel's.  Some of the newer
variants of XSAVE (XSAVEOPT and XSAVES) do have a "modified" optimization where
they don't save parts of the state that are unmodified since the last XRSTOR;
however, that is unimportant here because the kernel's FPU state is never saved.

(This would change if x86 Linux were to support preemption of kernel-mode FPU
code.  In that case, we may need to take more care to minimize use of AVX and
AVX512 state.  That being said, AES-XTS tends to be used for bulk data anyway.)

This is based on theory, though.  I'll do a test to confirm that there's indeed
no additional overhead.  And also, even if there's no additional overhead, what
the existing overhead actually is.

> I've not looked at the code but I often see what looks like
> excessive inlining in crypto code.
> This will speed up benchmarks but can have a negative effect
> on real code both because of the time taken to load the
> code and the effect of displacing other code.
> 
> It might be that this code is a simple loop....

This is a different topic.  By "inlining" I assume that you also mean things
like loop unrolling.  I totally agree that some of the crypto assembly code goes
way overboard on this, resulting in an unreasonably large machine code size.
The AVX implementation of AES-GCM (aesni-intel_avx-x86_64.S), which was written
by Intel, is the worst offender by far, generating 256011 bytes of machine code.
In OpenSSL, Intel has even taken that to the next level with their VAES
optimized implementation of AES-GCM generating 696040 bytes of machine code.

For my AES-XTS code I've limited the code size to a much more reasonable level
by focusing on the things that make the most difference.  My assembly file
compiles to 14386 bytes of machine code (less than 6% of AES-GCM).  It consists
of encryption and decryption functions for each of the four included
implementations, and also the short function aes_xts_encrypt_iv().  On a
particular CPU model, only one implementation is actually used, resulting in at
most 3500-4000 bytes being actually used at runtime.  However, roughly half of
that is code to handle messages that aren't a multiple of 256 bytes, which
aren't really encountered in practice.  I've placed that code out-of-line to try
to prevent it from polluting the CPU's instruction cache.

On the C side in aesni-intel-glue.c, I have roughly ~600 bytes of code per
implementation for the inlined fast path: half for encryption, half for
decryption.  There are ~600 additional bytes for the rarely-executed slow
path of page-spanning messages shared by all implementations.

So in practice, at runtime just over 2 KB of AES-XTS code will get executed,
half for encryption and half for decryption.  That seems reasonable for
something as performance-critical as disk and file encryption.

There are changes that could be made to make the code smaller, for example
rolling up the AES rounds, making encryption and decryption share more code,
doing 1x-wide instead of 4x-wide, etc.  We could also skip the AVX512
implementations and top out at VAES + AVX2.  There are issues with these changes
though -- either they straight up hurt performance on CPUs that I tested, or
they demand a lot more out of the CPU (e.g. relying much more heavily on the
branch predictor) and I was concerned about issues on non-tested or future CPUs.

So, I think my current proposal is at a reasonable place regarding compiled code
size, especially when it's compared to the monstrosity that is some of the
existing crypto assembly code.  But let me know if there are any specific
choices I've made that you may have a different opinion on.

- Eric

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs
  2024-04-04  1:35       ` Eric Biggers
@ 2024-04-04  7:53         ` David Laight
  2024-04-05 19:19           ` Eric Biggers
  0 siblings, 1 reply; 19+ messages in thread
From: David Laight @ 2024-04-04  7:53 UTC (permalink / raw)
  To: 'Eric Biggers'
  Cc: Ard Biesheuvel, linux-crypto, x86, linux-kernel, Andy Lutomirski,
	Chang S . Bae

From: Eric Biggers
> Sent: 04 April 2024 02:35
> 
> Hi David,
> 
> On Wed, Apr 03, 2024 at 08:12:09AM +0000, David Laight wrote:
> > From: Eric Biggers
> > > Sent: 26 March 2024 16:48
> > ....
> > > Consider Intel Ice Lake for example, these are the AES-256-XTS encryption speeds
> > > on 4096-byte messages in MB/s I'm seeing:
> > >
> > >     xts-aes-aesni                  5136
> > >     xts-aes-aesni-avx              5366
> > >     xts-aes-vaes-avx2              9337
> > >     xts-aes-vaes-avx10_256         9876
> > >     xts-aes-vaes-avx10_512         10215
> > >
> > > So yes, on that CPU the biggest boost comes just from VAES, staying on AVX2.
> > > But taking advantage of AVX512 does help a bit more, first from the parts other
> > > than 512-bit registers, then a bit more from 512-bit registers.
> >
> > How much does the kernel_fpu_begin() cost on real workloads?
> > (ie when the registers are live and it forces an extra save/restore)
> 
> x86 Linux does lazy restore of the FPU state.  The first kernel_fpu_begin() can
> have a significant cost, as it issues an XSAVE (or equivalent) instruction and
> causes an XRSTOR (or equivalent) instruction to be issued when returning to
> userspace when it otherwise might not be needed.  Additional kernel_fpu_begin()
> / kernel_fpu_end() pairs without returning to userspace have only a small cost,
> as they don't cause any more saves or restores of the FPU state to be done.
> 
> My new xts(aes) implementations have one kernel_fpu_begin() / kernel_fpu_end()
> pair per message (if the message doesn't span any page boundaries, which is
> almost always the case).  That's exactly the same as the current xts-aes-aesni.

I realised after sending it that the code almost certainly already did
kernel_fpu_begin() - so there probably isn't a difference because all the
fpu state is always saved.
(I'm sure there should be a way of getting access to (say) 2 ymm registers
by providing an on-stack save area to allow wide data copies or special
instructions - but that is a different issue.)

> I think what you may really be asking is how much the overhead of the XSAVE /
> XRSTOR pair associated with kernel-mode use of the FPU *increases* if the kernel
> clobbers AVX or AVX512 state, instead of just SSE state as xts-aes-aesni does.
> That's much more relevant to this patchset.

It depends on what has to be saved, not on what is used.
Although, since all the x/y/zmm registers are caller-saved I think they could
be 'zapped' on syscall entry (and restored as zero later).
Trouble is I suspect there is a single piece of code somewhere that relies
on them being preserved across an inlined system call.

> I think the answer is that there is no additional overhead.  This is because the
> XSAVE / XRSTOR pair happens regardless of the type of state the kernel clobbers,
> and it operates on the userspace state, not the kernel's.  Some of the newer
> variants of XSAVE (XSAVEOPT and XSAVES) do have a "modified" optimization where
> they don't save parts of the state that are unmodified since the last XRSTOR;
> however, that is unimportant here because the kernel's FPU state is never saved.
> 
> (This would change if x86 Linux were to support preemption of kernel-mode FPU
> code.  In that case, we may need to take more care to minimize use of AVX and
> AVX512 state.  That being said, AES-XTS tends to be used for bulk data anyway.)
> 
> This is based on theory, though.  I'll do a test to confirm that there's indeed
> no additional overhead.  And also, even if there's no additional overhead, what
> the existing overhead actually is.

Yes, I was wondering how it is used for 'real applications'.
If a system call that would normally return immediately (or at least without
a full process switch) hits the aes code it gets the cost of the XSAVE added.
Whereas the benchmark probably doesn't do anywhere near as many.

OTOH this is probably no different.

> 
> > I've not looked at the code but I often see what looks like
> > excessive inlining in crypto code.
> > This will speed up benchmarks but can have a negative effect
> > on real code both because of the time taken to load the
> > code and the effect of displacing other code.
> >
> > It might be that this code is a simple loop....
> 
> This is a different topic.  By "inlining" I assume that you also mean things
> like loop unrolling.  I totally agree that some of the crypto assembly code goes
> way overboard on this, resulting in an unreasonably large machine code size.
> The AVX implementation of AES-GCM (aesni-intel_avx-x86_64.S), which was written
> by Intel, is the worst offender by far, generating 256011 bytes of machine code.
> In OpenSSL, Intel has even taken that to the next level with their VAES
> optimized implementation of AES-GCM generating 696040 bytes of machine code.

That is truly stunning!
I can't believe anything that big is actually 'optimised'.
Just think of all the TLB misses :-)
Unless it is slightly faster if you are encrypting several TB of data.

...
> So, I think my current proposal is at a reasonable place regarding compiled code
> size, especially when it's compared to the monstrosity that is some of the
> existing crypto assembly code.  But let me know if there are any specific
> choices I've made that you may have a different opinion on.

At least you've thought about code size.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs
  2024-03-26  8:02 [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs Eric Biggers
                   ` (6 preceding siblings ...)
  2024-03-26  8:51 ` [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs Ard Biesheuvel
@ 2024-04-05  7:58 ` Herbert Xu
  7 siblings, 0 replies; 19+ messages in thread
From: Herbert Xu @ 2024-04-05  7:58 UTC (permalink / raw)
  To: Eric Biggers; +Cc: linux-crypto, x86, linux-kernel, ardb, luto, chang.seok.bae

Eric Biggers <ebiggers@kernel.org> wrote:
> This patchset adds new AES-XTS implementations that accelerate disk and
> file encryption on modern x86_64 CPUs.
> 
> The largest improvements are seen on CPUs that support the VAES
> extension: Intel Ice Lake (2019) and later, and AMD Zen 3 (2020) and
> later.  However, an implementation using plain AESNI + AVX is also added
> and provides a small boost on older CPUs too.
> 
> To try to handle the mess that is x86 SIMD, the code for all the new
> AES-XTS implementations is generated from an assembly macro.  This makes
> it so that we e.g. don't have to have entirely different source code
> just for different vector lengths (xmm, ymm, zmm).
> 
> To avoid downclocking effects, zmm registers aren't used on certain
> Intel CPU models such as Ice Lake.  These CPU models default to an
> implementation using ymm registers instead.
> 
> This patchset increases the throughput of AES-256-XTS decryption by the
> following amounts on the following CPUs:
>                            
>                          | 4096-byte messages | 512-byte messages |
>    ----------------------+--------------------+-------------------+
>    Intel Skylake         |        1%          |       11%         |
>    Intel Ice Lake        |        92%         |       59%         |
>    Intel Sapphire Rapids |       115%         |       78%         |
>    AMD Zen 1             |        25%         |       20%         |
>    AMD Zen 2             |        26%         |       20%         |
>    AMD Zen 3             |        82%         |       40%         |
>    AMD Zen 4             |       118%         |       48%         |
> 
> (The results for encryption are very similar to decryption.  I just tend
> to measure decryption because decryption performance is more important.)
> 
> There's no separate kconfig option for the new AES-XTS implementations,
> as they are included in the existing option CONFIG_CRYPTO_AES_NI_INTEL.
> 
> To make testing easier, all four new AES-XTS implementations are
> registered separately with the crypto API.  They are prioritized
> appropriately so that the best one for the CPU is used by default.
> 
> Open questions:
> 
> - Is the policy that I implemented for preferring ymm registers to zmm
>  registers the right one?  arch/x86/crypto/poly1305_glue.c thinks that
>  only Skylake has the bad downclocking.  My current proposal is a bit
>  more conservative; it also excludes Ice Lake and Tiger Lake.  Those
>  CPUs supposedly still have some downclocking, though not as much.
> 
> - Should the policy on the use of zmm registers be in a centralized
>  place?  It probably doesn't make sense to have random different
>  policies for different crypto algorithms (AES, Poly1305, ARIA, etc.).
> 
> - Are there any other known issues with using AVX512 in kernel mode?  It
>  seems to work, and technically it's not new because Poly1305 and ARIA
>  already use AVX512, including the mask registers and zmm registers up
>  to 31.  So if there was a major issue, like the new registers not
>  being properly saved and restored, it probably would have already been
>  found.  But AES-XTS support would introduce a wider use of it.
> 
> Eric Biggers (6):
>  x86: add kconfig symbols for assembler VAES and VPCLMULQDQ support
>  crypto: x86/aes-xts - add AES-XTS assembly macro for modern CPUs
>  crypto: x86/aes-xts - wire up AESNI + AVX implementation
>  crypto: x86/aes-xts - wire up VAES + AVX2 implementation
>  crypto: x86/aes-xts - wire up VAES + AVX10/256 implementation
>  crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation
> 
> arch/x86/Kconfig.assembler           |  10 +
> arch/x86/crypto/Makefile             |   3 +-
> arch/x86/crypto/aes-xts-avx-x86_64.S | 796 +++++++++++++++++++++++++++
> arch/x86/crypto/aesni-intel_glue.c   | 263 ++++++++-
> 4 files changed, 1070 insertions(+), 2 deletions(-)
> create mode 100644 arch/x86/crypto/aes-xts-avx-x86_64.S
> 
> 
> base-commit: 4cece764965020c22cff7665b18a012006359095

All applied.  Thanks.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs
  2024-04-04  7:53         ` David Laight
@ 2024-04-05 19:19           ` Eric Biggers
  2024-04-08  7:41             ` David Laight
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Biggers @ 2024-04-05 19:19 UTC (permalink / raw)
  To: David Laight
  Cc: Ard Biesheuvel, linux-crypto, x86, linux-kernel, Andy Lutomirski,
	Chang S . Bae

On Thu, Apr 04, 2024 at 07:53:48AM +0000, David Laight wrote:
> > >
> > > How much does the kernel_fpu_begin() cost on real workloads?
> > > (ie when the registers are live and it forces an extra save/restore)
> > 
> > x86 Linux does lazy restore of the FPU state.  The first kernel_fpu_begin() can
> > have a significant cost, as it issues an XSAVE (or equivalent) instruction and
> > causes an XRSTOR (or equivalent) instruction to be issued when returning to
> > userspace when it otherwise might not be needed.  Additional kernel_fpu_begin()
> > / kernel_fpu_end() pairs without returning to userspace have only a small cost,
> > as they don't cause any more saves or restores of the FPU state to be done.
> > 
> > My new xts(aes) implementations have one kernel_fpu_begin() / kernel_fpu_end()
> > pair per message (if the message doesn't span any page boundaries, which is
> > almost always the case).  That's exactly the same as the current xts-aes-aesni.
> 
> I realised after sending it that the code almost certainly already did
> kernel_fpu_begin() - so there probably isn't a difference because all the
> fpu state is always saved.
> (I'm sure there should be a way of getting access to (say) 2 ymm registers
> by providing an on-stack save area to allow wide data copies or special
> instructions - but that is a different issue.)
> 
> > I think what you may really be asking is how much the overhead of the XSAVE /
> > XRSTOR pair associated with kernel-mode use of the FPU *increases* if the kernel
> > clobbers AVX or AVX512 state, instead of just SSE state as xts-aes-aesni does.
> > That's much more relevant to this patchset.
> 
> It depends on what has to be saved, not on what is used.
> Although, since all the x/y/zmm registers are caller-saved I think they could
> be 'zapped' on syscall entry (and restored as zero later).
> Trouble is I suspect there is a single piece of code somewhere that relies
> on them being preserved across an inlined system call.
> 
> > I think the answer is that there is no additional overhead.  This is because the
> > XSAVE / XRSTOR pair happens regardless of the type of state the kernel clobbers,
> > and it operates on the userspace state, not the kernel's.  Some of the newer
> > variants of XSAVE (XSAVEOPT and XSAVES) do have a "modified" optimization where
> > they don't save parts of the state that are unmodified since the last XRSTOR;
> > however, that is unimportant here because the kernel's FPU state is never saved.
> > 
> > (This would change if x86 Linux were to support preemption of kernel-mode FPU
> > code.  In that case, we may need to take more care to minimize use of AVX and
> > AVX512 state.  That being said, AES-XTS tends to be used for bulk data anyway.)
> > 
> > This is based on theory, though.  I'll do a test to confirm that there's indeed
> > no additional overhead.  And also, even if there's no additional overhead, what
> > the existing overhead actually is.
> 
> Yes, I was wondering how it is used for 'real applications'.
> If a system call that would normally return immediately (or at least without
> a full process switch) hits the aes code it gets the cost of the XSAVE added.
> Whereas the benchmark probably doesn't do anywhere near as many.
> 
> OTOH this is probably no different.

I did some tests on Sapphire Rapids using a system call that I customized to do
nothing except possibly a kernel_fpu_begin / kernel_fpu_end pair.
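
(The test syscall was essentially of this shape -- a rough sketch with an
arbitrary name, not the exact code I ran:)

#include <linux/syscalls.h>
#include <asm/fpu/api.h>

/* Do nothing except optionally enter and leave kernel-mode FPU. */
SYSCALL_DEFINE1(fpu_noop, int, use_fpu)
{
	if (use_fpu) {
		kernel_fpu_begin();
		kernel_fpu_end();
	}
	return 0;
}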

On average the bare syscall took 70 ns.  The syscall with the kernel_fpu_begin /
kernel_fpu_end pair took 160 ns if the userspace program used xmm only, 340 ns
if it used ymm, or 360 ns if it used zmm.  I also tried making the kernel
clobber different registers in the kernel_fpu_begin / kernel_fpu_end section,
and as I expected this did not make any difference.

Note that without the kernel_fpu_begin / kernel_fpu_end pair, AES-NI
instructions cannot be used and the alternative would be xts(ecb(aes-generic)).
On the same CPU, encrypting a single 512-byte sector with xts(ecb(aes-generic))
takes about 2235ns.  With xts-aes-vaes-avx10_512 it takes 75 ns.  (Not a typo --
it really is almost 30 times faster!)  So it seems clear the FPU state save and
restore is worth it even just for a single sector using the traditional 512-byte
sector size, let alone a 4096-byte sector size which is recommended these days.

- Eric

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs
  2024-04-05 19:19           ` Eric Biggers
@ 2024-04-08  7:41             ` David Laight
  2024-04-08 12:31               ` Eric Biggers
  0 siblings, 1 reply; 19+ messages in thread
From: David Laight @ 2024-04-08  7:41 UTC (permalink / raw)
  To: 'Eric Biggers'
  Cc: Ard Biesheuvel, linux-crypto, x86, linux-kernel, Andy Lutomirski,
	Chang S . Bae

From: Eric Biggers
> Sent: 05 April 2024 20:19
...
> I did some tests on Sapphire Rapids using a system call that I customized to do
> nothing except possibly a kernel_fpu_begin / kernel_fpu_end pair.
> 
> On average the bare syscall took 70 ns.  The syscall with the kernel_fpu_begin /
> kernel_fpu_end pair took 160 ns if the userspace program used xmm only, 340 ns
> if it used ymm, or 360 ns if it used zmm...
> 
> Note that without the kernel_fpu_begin / kernel_fpu_end pair, AES-NI
> instructions cannot be used and the alternative would be xts(ecb(aes-generic)).
> On the same CPU, encrypting a single 512-byte sector with xts(ecb(aes-generic))
> takes about 2235ns.  With xts-aes-vaes-avx10_512 it takes 75 ns...

So most of the cost of a single 512-byte sector is the kernel_fpu_begin().
But it is so much slower any other way that it is still faster.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs
  2024-04-08  7:41             ` David Laight
@ 2024-04-08 12:31               ` Eric Biggers
  0 siblings, 0 replies; 19+ messages in thread
From: Eric Biggers @ 2024-04-08 12:31 UTC (permalink / raw)
  To: David Laight
  Cc: Ard Biesheuvel, linux-crypto, x86, linux-kernel, Andy Lutomirski,
	Chang S . Bae

On Mon, Apr 08, 2024 at 07:41:44AM +0000, David Laight wrote:
> From: Eric Biggers
> > Sent: 05 April 2024 20:19
> ...
> > I did some tests on Sapphire Rapids using a system call that I customized to do
> > nothing except possibly a kernel_fpu_begin / kernel_fpu_end pair.
> > 
> > On average the bare syscall took 70 ns.  The syscall with the kernel_fpu_begin /
> > kernel_fpu_end pair took 160 ns if the userspace program used xmm only, 340 ns
> > if it used ymm, or 360 ns if it used zmm...
> > 
> > Note that without the kernel_fpu_begin / kernel_fpu_end pair, AES-NI
> > instructions cannot be used and the alternative would be xts(ecb(aes-generic)).
> > On the same CPU, encrypting a single 512-byte sector with xts(ecb(aes-generic))
> > takes about 2235ns.  With xts-aes-vaes-avx10_512 it takes 75 ns...
> 
> So most of the cost of a single 512-byte sector is the kernel_fpu_begin().
> But it is so much slower any other way that it is still faster.
> 

Yes.  To clarify, the 75 ns time I mentioned for a 512-byte sector is the
average for repeated calls, amortizing the XSAVE and XRSTOR.  For a real single
512-byte sector that eats the entire cost of the XSAVE and XRSTOR by itself, if
all state is in use, it should be about 75 + (360 - 70) = 365 ns (based on the
syscall benchmarks I did), with the XSAVE and XRSTOR accounting for 80% of that
time.  But yes, that's still over 6 times faster than the scalar alternative.

- Eric

^ permalink raw reply	[flat|nested] 19+ messages in thread
