* [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
@ 2017-08-29 18:05 Josh Poimboeuf
  2017-08-29 18:05 ` [PATCH 01/12] x86/crypto: Fix RBP usage in blowfish-x86_64-asm_64.S Josh Poimboeuf
                   ` (12 more replies)
  0 siblings, 13 replies; 31+ messages in thread
From: Josh Poimboeuf @ 2017-08-29 18:05 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Tim Chen, Mathias Krause, Chandramouli Narayanan,
	Jussi Kivilinna, Peter Zijlstra, Herbert Xu, David S. Miller,
	linux-crypto, Eric Biggers, Andy Lutomirski, Jiri Slaby

Many of the x86 crypto functions use RBP as a temporary register.  This
breaks frame pointer convention, and breaks stack traces when unwinding
from an interrupt in the crypto code.
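
For anyone unfamiliar with the convention: with CONFIG_FRAME_POINTER, the
unwinder walks the chain of saved RBP values, roughly as in this minimal
illustrative sketch (not taken from any of the files below; the label is
made up):

	ENTRY(frame_example)
		pushq	%rbp		/* save the caller's frame pointer */
		movq	%rsp, %rbp	/* RBP now anchors this frame */
		...			/* RBP must stay untouched in here */
		popq	%rbp		/* restore the caller's frame pointer */
		ret
	ENDPROC(frame_example)

If the body repurposes RBP as scratch (e.g. "movq %rdx, %rbp"), the
saved-RBP chain is broken and a stack trace taken from an interrupt inside
that window can't get past the frame.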

Convert most* of them to leave RBP alone.

These pass the crypto boot tests for me.  Any further testing would be
appreciated!

[*] There are still a few crypto files left that need fixing, but the
    fixes weren't trivial and nobody reported unwinder warnings about
    them yet, so I'm skipping them for now.

Josh Poimboeuf (12):
  x86/crypto: Fix RBP usage in blowfish-x86_64-asm_64.S
  x86/crypto: Fix RBP usage in camellia-x86_64-asm_64.S
  x86/crypto: Fix RBP usage in cast5-avx-x86_64-asm_64.S
  x86/crypto: Fix RBP usage in cast6-avx-x86_64-asm_64.S
  x86/crypto: Fix RBP usage in des3_ede-asm_64.S
  x86/crypto: Fix RBP usage in sha1_avx2_x86_64_asm.S
  x86/crypto: Fix RBP usage in sha1_ssse3_asm.S
  x86/crypto: Fix RBP usage in sha256-avx-asm.S
  x86/crypto: Fix RBP usage in sha256-avx2-asm.S
  x86/crypto: Fix RBP usage in sha256-ssse3-asm.S
  x86/crypto: Fix RBP usage in sha512-avx2-asm.S
  x86/crypto: Fix RBP usage in twofish-avx-x86_64-asm_64.S

 arch/x86/crypto/blowfish-x86_64-asm_64.S    | 48 ++++++++++++++-------------
 arch/x86/crypto/camellia-x86_64-asm_64.S    | 26 +++++++--------
 arch/x86/crypto/cast5-avx-x86_64-asm_64.S   | 47 +++++++++++++++++----------
 arch/x86/crypto/cast6-avx-x86_64-asm_64.S   | 50 ++++++++++++++++++++---------
 arch/x86/crypto/des3_ede-asm_64.S           | 15 +++++----
 arch/x86/crypto/sha1_avx2_x86_64_asm.S      |  4 +--
 arch/x86/crypto/sha1_ssse3_asm.S            | 11 +++----
 arch/x86/crypto/sha256-avx-asm.S            | 15 ++++-----
 arch/x86/crypto/sha256-avx2-asm.S           | 16 ++++-----
 arch/x86/crypto/sha256-ssse3-asm.S          | 15 ++++-----
 arch/x86/crypto/sha512-avx2-asm.S           | 21 ++++++++----
 arch/x86/crypto/twofish-avx-x86_64-asm_64.S | 12 +++----
 12 files changed, 160 insertions(+), 120 deletions(-)

-- 
2.13.5

* [PATCH 01/12] x86/crypto: Fix RBP usage in blowfish-x86_64-asm_64.S
  2017-08-29 18:05 [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files Josh Poimboeuf
@ 2017-08-29 18:05 ` Josh Poimboeuf
  2017-08-29 18:05 ` [PATCH 02/12] x86/crypto: Fix RBP usage in camellia-x86_64-asm_64.S Josh Poimboeuf
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: Josh Poimboeuf @ 2017-08-29 18:05 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Tim Chen, Mathias Krause, Chandramouli Narayanan,
	Jussi Kivilinna, Peter Zijlstra, Herbert Xu, David S. Miller,
	linux-crypto, Eric Biggers, Andy Lutomirski, Jiri Slaby

Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Use R12 instead of RBP.  R12 can't be used as the RT0 register because
of x86 instruction encoding limitations.  So use R12 for CTX and RDI for
RT0.  This means that CTX is no longer an implicit function argument.
Instead it needs to be explicitly copied from RDI.
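
To spell out the encoding limitation (an illustrative sketch, not code from
the file): the round helpers extract bytes through the legacy high-byte
registers (RX0bh is %ah, and so on), and an instruction that names a
high-byte register can't also carry the REX prefix required to name R8-R15.
So the first line below assembles while the second does not:

	movzbl %ah, %edi	/* encodable: no REX prefix needed */
	movzbl %ah, %r12d	/* not encodable: %r12d needs REX, %ah forbids it */

That's why RT0 moves to RDI (a legacy register) and R12 takes over CTX.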

Reported-by: Eric Biggers <ebiggers@google.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/crypto/blowfish-x86_64-asm_64.S | 48 +++++++++++++++++---------------
 1 file changed, 26 insertions(+), 22 deletions(-)

diff --git a/arch/x86/crypto/blowfish-x86_64-asm_64.S b/arch/x86/crypto/blowfish-x86_64-asm_64.S
index 246c67006ed0..8c1fcb6bad21 100644
--- a/arch/x86/crypto/blowfish-x86_64-asm_64.S
+++ b/arch/x86/crypto/blowfish-x86_64-asm_64.S
@@ -33,7 +33,7 @@
 #define s3	((16 + 2 + (3 * 256)) * 4)
 
 /* register macros */
-#define CTX %rdi
+#define CTX %r12
 #define RIO %rsi
 
 #define RX0 %rax
@@ -56,12 +56,12 @@
 #define RX2bh %ch
 #define RX3bh %dh
 
-#define RT0 %rbp
+#define RT0 %rdi
 #define RT1 %rsi
 #define RT2 %r8
 #define RT3 %r9
 
-#define RT0d %ebp
+#define RT0d %edi
 #define RT1d %esi
 #define RT2d %r8d
 #define RT3d %r9d
@@ -120,13 +120,14 @@
 
 ENTRY(__blowfish_enc_blk)
 	/* input:
-	 *	%rdi: ctx, CTX
+	 *	%rdi: ctx
 	 *	%rsi: dst
 	 *	%rdx: src
 	 *	%rcx: bool, if true: xor output
 	 */
-	movq %rbp, %r11;
+	movq %r12, %r11;
 
+	movq %rdi, CTX;
 	movq %rsi, %r10;
 	movq %rdx, RIO;
 
@@ -142,7 +143,7 @@ ENTRY(__blowfish_enc_blk)
 	round_enc(14);
 	add_roundkey_enc(16);
 
-	movq %r11, %rbp;
+	movq %r11, %r12;
 
 	movq %r10, RIO;
 	test %cl, %cl;
@@ -157,12 +158,13 @@ ENDPROC(__blowfish_enc_blk)
 
 ENTRY(blowfish_dec_blk)
 	/* input:
-	 *	%rdi: ctx, CTX
+	 *	%rdi: ctx
 	 *	%rsi: dst
 	 *	%rdx: src
 	 */
-	movq %rbp, %r11;
+	movq %r12, %r11;
 
+	movq %rdi, CTX;
 	movq %rsi, %r10;
 	movq %rdx, RIO;
 
@@ -181,7 +183,7 @@ ENTRY(blowfish_dec_blk)
 	movq %r10, RIO;
 	write_block();
 
-	movq %r11, %rbp;
+	movq %r11, %r12;
 
 	ret;
 ENDPROC(blowfish_dec_blk)
@@ -298,20 +300,21 @@ ENDPROC(blowfish_dec_blk)
 
 ENTRY(__blowfish_enc_blk_4way)
 	/* input:
-	 *	%rdi: ctx, CTX
+	 *	%rdi: ctx
 	 *	%rsi: dst
 	 *	%rdx: src
 	 *	%rcx: bool, if true: xor output
 	 */
-	pushq %rbp;
+	pushq %r12;
 	pushq %rbx;
 	pushq %rcx;
 
-	preload_roundkey_enc(0);
-
+	movq %rdi, CTX;
 	movq %rsi, %r11;
 	movq %rdx, RIO;
 
+	preload_roundkey_enc(0);
+
 	read_block4();
 
 	round_enc4(0);
@@ -324,39 +327,40 @@ ENTRY(__blowfish_enc_blk_4way)
 	round_enc4(14);
 	add_preloaded_roundkey4();
 
-	popq %rbp;
+	popq %r12;
 	movq %r11, RIO;
 
-	test %bpl, %bpl;
+	test %r12b, %r12b;
 	jnz .L__enc_xor4;
 
 	write_block4();
 
 	popq %rbx;
-	popq %rbp;
+	popq %r12;
 	ret;
 
 .L__enc_xor4:
 	xor_block4();
 
 	popq %rbx;
-	popq %rbp;
+	popq %r12;
 	ret;
 ENDPROC(__blowfish_enc_blk_4way)
 
 ENTRY(blowfish_dec_blk_4way)
 	/* input:
-	 *	%rdi: ctx, CTX
+	 *	%rdi: ctx
 	 *	%rsi: dst
 	 *	%rdx: src
 	 */
-	pushq %rbp;
+	pushq %r12;
 	pushq %rbx;
-	preload_roundkey_dec(17);
 
-	movq %rsi, %r11;
+	movq %rdi, CTX;
+	movq %rsi, %r11;
 	movq %rdx, RIO;
 
+	preload_roundkey_dec(17);
 	read_block4();
 
 	round_dec4(17);
@@ -373,7 +377,7 @@ ENTRY(blowfish_dec_blk_4way)
 	write_block4();
 
 	popq %rbx;
-	popq %rbp;
+	popq %r12;
 
 	ret;
 ENDPROC(blowfish_dec_blk_4way)
-- 
2.13.5

* [PATCH 02/12] x86/crypto: Fix RBP usage in camellia-x86_64-asm_64.S
  2017-08-29 18:05 [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files Josh Poimboeuf
  2017-08-29 18:05 ` [PATCH 01/12] x86/crypto: Fix RBP usage in blowfish-x86_64-asm_64.S Josh Poimboeuf
@ 2017-08-29 18:05 ` Josh Poimboeuf
  2017-08-29 18:05 ` [PATCH 03/12] x86/crypto: Fix RBP usage in cast5-avx-x86_64-asm_64.S Josh Poimboeuf
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: Josh Poimboeuf @ 2017-08-29 18:05 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Tim Chen, Mathias Krause, Chandramouli Narayanan,
	Jussi Kivilinna, Peter Zijlstra, Herbert Xu, David S. Miller,
	linux-crypto, Eric Biggers, Andy Lutomirski, Jiri Slaby

Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Use R12 instead of RBP.  Both are callee-saved registers, so the
substitution is straightforward.

Reported-by: Eric Biggers <ebiggers@google.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/crypto/camellia-x86_64-asm_64.S | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/x86/crypto/camellia-x86_64-asm_64.S b/arch/x86/crypto/camellia-x86_64-asm_64.S
index 310319c601ed..95ba6956a7f6 100644
--- a/arch/x86/crypto/camellia-x86_64-asm_64.S
+++ b/arch/x86/crypto/camellia-x86_64-asm_64.S
@@ -75,17 +75,17 @@
 #define RCD1bh %dh
 
 #define RT0 %rsi
-#define RT1 %rbp
+#define RT1 %r12
 #define RT2 %r8
 
 #define RT0d %esi
-#define RT1d %ebp
+#define RT1d %r12d
 #define RT2d %r8d
 
 #define RT2bl %r8b
 
 #define RXOR %r9
-#define RRBP %r10
+#define RR12 %r10
 #define RDST %r11
 
 #define RXORd %r9d
@@ -197,7 +197,7 @@ ENTRY(__camellia_enc_blk)
 	 *	%rdx: src
 	 *	%rcx: bool xor
 	 */
-	movq %rbp, RRBP;
+	movq %r12, RR12;
 
 	movq %rcx, RXOR;
 	movq %rsi, RDST;
@@ -227,13 +227,13 @@ ENTRY(__camellia_enc_blk)
 
 	enc_outunpack(mov, RT1);
 
-	movq RRBP, %rbp;
+	movq RR12, %r12;
 	ret;
 
 .L__enc_xor:
 	enc_outunpack(xor, RT1);
 
-	movq RRBP, %rbp;
+	movq RR12, %r12;
 	ret;
 ENDPROC(__camellia_enc_blk)
 
@@ -248,7 +248,7 @@ ENTRY(camellia_dec_blk)
 	movl $24, RXORd;
 	cmovel RXORd, RT2d; /* max */
 
-	movq %rbp, RRBP;
+	movq %r12, RR12;
 	movq %rsi, RDST;
 	movq %rdx, RIO;
 
@@ -271,7 +271,7 @@ ENTRY(camellia_dec_blk)
 
 	dec_outunpack();
 
-	movq RRBP, %rbp;
+	movq RR12, %r12;
 	ret;
 ENDPROC(camellia_dec_blk)
 
@@ -433,7 +433,7 @@ ENTRY(__camellia_enc_blk_2way)
 	 */
 	pushq %rbx;
 
-	movq %rbp, RRBP;
+	movq %r12, RR12;
 	movq %rcx, RXOR;
 	movq %rsi, RDST;
 	movq %rdx, RIO;
@@ -461,14 +461,14 @@ ENTRY(__camellia_enc_blk_2way)
 
 	enc_outunpack2(mov, RT2);
 
-	movq RRBP, %rbp;
+	movq RR12, %r12;
 	popq %rbx;
 	ret;
 
 .L__enc2_xor:
 	enc_outunpack2(xor, RT2);
 
-	movq RRBP, %rbp;
+	movq RR12, %r12;
 	popq %rbx;
 	ret;
 ENDPROC(__camellia_enc_blk_2way)
@@ -485,7 +485,7 @@ ENTRY(camellia_dec_blk_2way)
 	cmovel RXORd, RT2d; /* max */
 
 	movq %rbx, RXOR;
-	movq %rbp, RRBP;
+	movq %r12, RR12;
 	movq %rsi, RDST;
 	movq %rdx, RIO;
 
@@ -508,7 +508,7 @@ ENTRY(camellia_dec_blk_2way)
 
 	dec_outunpack2();
 
-	movq RRBP, %rbp;
+	movq RR12, %r12;
 	movq RXOR, %rbx;
 	ret;
 ENDPROC(camellia_dec_blk_2way)
-- 
2.13.5

* [PATCH 03/12] x86/crypto: Fix RBP usage in cast5-avx-x86_64-asm_64.S
  2017-08-29 18:05 [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files Josh Poimboeuf
  2017-08-29 18:05 ` [PATCH 01/12] x86/crypto: Fix RBP usage in blowfish-x86_64-asm_64.S Josh Poimboeuf
  2017-08-29 18:05 ` [PATCH 02/12] x86/crypto: Fix RBP usage in camellia-x86_64-asm_64.S Josh Poimboeuf
@ 2017-08-29 18:05 ` Josh Poimboeuf
  2017-08-29 18:05 ` [PATCH 04/12] x86/crypto: Fix RBP usage in cast6-avx-x86_64-asm_64.S Josh Poimboeuf
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: Josh Poimboeuf @ 2017-08-29 18:05 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Tim Chen, Mathias Krause, Chandramouli Narayanan,
	Jussi Kivilinna, Peter Zijlstra, Herbert Xu, David S. Miller,
	linux-crypto, Eric Biggers, Andy Lutomirski, Jiri Slaby

Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Use R15 instead of RBP.  R15 can't be used as the RID1 register because
of x86 instruction encoding limitations.  So use R15 for CTX and RDI for
RID1.  This means that CTX is no longer an implicit function argument.
Instead it needs to be explicitly copied from RDI.

Reported-by: Eric Biggers <ebiggers@google.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/crypto/cast5-avx-x86_64-asm_64.S | 47 ++++++++++++++++++++-----------
 1 file changed, 30 insertions(+), 17 deletions(-)

diff --git a/arch/x86/crypto/cast5-avx-x86_64-asm_64.S b/arch/x86/crypto/cast5-avx-x86_64-asm_64.S
index b4a8806234ea..86107c961bb4 100644
--- a/arch/x86/crypto/cast5-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/cast5-avx-x86_64-asm_64.S
@@ -47,7 +47,7 @@
 /**********************************************************************
   16-way AVX cast5
  **********************************************************************/
-#define CTX %rdi
+#define CTX %r15
 
 #define RL1 %xmm0
 #define RR1 %xmm1
@@ -70,8 +70,8 @@
 
 #define RTMP %xmm15
 
-#define RID1  %rbp
-#define RID1d %ebp
+#define RID1  %rdi
+#define RID1d %edi
 #define RID2  %rsi
 #define RID2d %esi
 
@@ -226,7 +226,7 @@
 .align 16
 __cast5_enc_blk16:
 	/* input:
-	 *	%rdi: ctx, CTX
+	 *	%rdi: ctx
 	 *	RL1: blocks 1 and 2
 	 *	RR1: blocks 3 and 4
 	 *	RL2: blocks 5 and 6
@@ -246,9 +246,11 @@ __cast5_enc_blk16:
 	 *	RR4: encrypted blocks 15 and 16
 	 */
 
-	pushq %rbp;
+	pushq %r15;
 	pushq %rbx;
 
+	movq %rdi, CTX;
+
 	vmovdqa .Lbswap_mask, RKM;
 	vmovd .Lfirst_mask, R1ST;
 	vmovd .L32_mask, R32;
@@ -283,7 +285,7 @@ __cast5_enc_blk16:
 
 .L__skip_enc:
 	popq %rbx;
-	popq %rbp;
+	popq %r15;
 
 	vmovdqa .Lbswap_mask, RKM;
 
@@ -298,7 +300,7 @@ ENDPROC(__cast5_enc_blk16)
 .align 16
 __cast5_dec_blk16:
 	/* input:
-	 *	%rdi: ctx, CTX
+	 *	%rdi: ctx
 	 *	RL1: encrypted blocks 1 and 2
 	 *	RR1: encrypted blocks 3 and 4
 	 *	RL2: encrypted blocks 5 and 6
@@ -318,9 +320,11 @@ __cast5_dec_blk16:
 	 *	RR4: decrypted blocks 15 and 16
 	 */
 
-	pushq %rbp;
+	pushq %r15;
 	pushq %rbx;
 
+	movq %rdi, CTX;
+
 	vmovdqa .Lbswap_mask, RKM;
 	vmovd .Lfirst_mask, R1ST;
 	vmovd .L32_mask, R32;
@@ -356,7 +360,7 @@ __cast5_dec_blk16:
 
 	vmovdqa .Lbswap_mask, RKM;
 	popq %rbx;
-	popq %rbp;
+	popq %r15;
 
 	outunpack_blocks(RR1, RL1, RTMP, RX, RKM);
 	outunpack_blocks(RR2, RL2, RTMP, RX, RKM);
@@ -372,12 +376,14 @@ ENDPROC(__cast5_dec_blk16)
 
 ENTRY(cast5_ecb_enc_16way)
 	/* input:
-	 *	%rdi: ctx, CTX
+	 *	%rdi: ctx
 	 *	%rsi: dst
 	 *	%rdx: src
 	 */
 	FRAME_BEGIN
+	pushq %r15;
 
+	movq %rdi, CTX;
 	movq %rsi, %r11;
 
 	vmovdqu (0*4*4)(%rdx), RL1;
@@ -400,18 +406,22 @@ ENTRY(cast5_ecb_enc_16way)
 	vmovdqu RR4, (6*4*4)(%r11);
 	vmovdqu RL4, (7*4*4)(%r11);
 
+	popq %r15;
 	FRAME_END
 	ret;
 ENDPROC(cast5_ecb_enc_16way)
 
 ENTRY(cast5_ecb_dec_16way)
 	/* input:
-	 *	%rdi: ctx, CTX
+	 *	%rdi: ctx
 	 *	%rsi: dst
 	 *	%rdx: src
 	 */
 
 	FRAME_BEGIN
+	pushq %r15;
+
+	movq %rdi, CTX;
 	movq %rsi, %r11;
 
 	vmovdqu (0*4*4)(%rdx), RL1;
@@ -434,20 +444,22 @@ ENTRY(cast5_ecb_dec_16way)
 	vmovdqu RR4, (6*4*4)(%r11);
 	vmovdqu RL4, (7*4*4)(%r11);
 
+	popq %r15;
 	FRAME_END
 	ret;
 ENDPROC(cast5_ecb_dec_16way)
 
 ENTRY(cast5_cbc_dec_16way)
 	/* input:
-	 *	%rdi: ctx, CTX
+	 *	%rdi: ctx
 	 *	%rsi: dst
 	 *	%rdx: src
 	 */
 	FRAME_BEGIN
-
 	pushq %r12;
+	pushq %r15;
 
+	movq %rdi, CTX;
 	movq %rsi, %r11;
 	movq %rdx, %r12;
 
@@ -483,23 +495,24 @@ ENTRY(cast5_cbc_dec_16way)
 	vmovdqu RR4, (6*16)(%r11);
 	vmovdqu RL4, (7*16)(%r11);
 
+	popq %r15;
 	popq %r12;
-
 	FRAME_END
 	ret;
 ENDPROC(cast5_cbc_dec_16way)
 
 ENTRY(cast5_ctr_16way)
 	/* input:
-	 *	%rdi: ctx, CTX
+	 *	%rdi: ctx
 	 *	%rsi: dst
 	 *	%rdx: src
 	 *	%rcx: iv (big endian, 64bit)
 	 */
 	FRAME_BEGIN
-
 	pushq %r12;
+	pushq %r15;
 
+	movq %rdi, CTX;
 	movq %rsi, %r11;
 	movq %rdx, %r12;
 
@@ -558,8 +571,8 @@ ENTRY(cast5_ctr_16way)
 	vmovdqu RR4, (6*16)(%r11);
 	vmovdqu RL4, (7*16)(%r11);
 
+	popq %r15;
 	popq %r12;
-
 	FRAME_END
 	ret;
 ENDPROC(cast5_ctr_16way)
-- 
2.13.5

* [PATCH 04/12] x86/crypto: Fix RBP usage in cast6-avx-x86_64-asm_64.S
  2017-08-29 18:05 [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files Josh Poimboeuf
                   ` (2 preceding siblings ...)
  2017-08-29 18:05 ` [PATCH 03/12] x86/crypto: Fix RBP usage in cast5-avx-x86_64-asm_64.S Josh Poimboeuf
@ 2017-08-29 18:05 ` Josh Poimboeuf
  2017-08-29 18:05 ` [PATCH 05/12] x86/crypto: Fix RBP usage in des3_ede-asm_64.S Josh Poimboeuf
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: Josh Poimboeuf @ 2017-08-29 18:05 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Tim Chen, Mathias Krause, Chandramouli Narayanan,
	Jussi Kivilinna, Peter Zijlstra, Herbert Xu, David S. Miller,
	linux-crypto, Eric Biggers, Andy Lutomirski, Jiri Slaby

Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Use R15 instead of RBP.  R15 can't be used as the RID1 register because
of x86 instruction encoding limitations.  So use R15 for CTX and RDI for
RID1.  This means that CTX is no longer an implicit function argument.
Instead it needs to be explicitly copied from RDI.

Reported-by: Eric Biggers <ebiggers@google.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/crypto/cast6-avx-x86_64-asm_64.S | 50 +++++++++++++++++++++----------
 1 file changed, 34 insertions(+), 16 deletions(-)

diff --git a/arch/x86/crypto/cast6-avx-x86_64-asm_64.S b/arch/x86/crypto/cast6-avx-x86_64-asm_64.S
index 952d3156a933..7f30b6f0d72c 100644
--- a/arch/x86/crypto/cast6-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/cast6-avx-x86_64-asm_64.S
@@ -47,7 +47,7 @@
 /**********************************************************************
   8-way AVX cast6
  **********************************************************************/
-#define CTX %rdi
+#define CTX %r15
 
 #define RA1 %xmm0
 #define RB1 %xmm1
@@ -70,8 +70,8 @@
 
 #define RTMP %xmm15
 
-#define RID1  %rbp
-#define RID1d %ebp
+#define RID1  %rdi
+#define RID1d %edi
 #define RID2  %rsi
 #define RID2d %esi
 
@@ -264,15 +264,17 @@
 .align 8
 __cast6_enc_blk8:
 	/* input:
-	 *	%rdi: ctx, CTX
+	 *	%rdi: ctx
 	 *	RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2: blocks
 	 * output:
 	 *	RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2: encrypted blocks
 	 */
 
-	pushq %rbp;
+	pushq %r15;
 	pushq %rbx;
 
+	movq %rdi, CTX;
+
 	vmovdqa .Lbswap_mask, RKM;
 	vmovd .Lfirst_mask, R1ST;
 	vmovd .L32_mask, R32;
@@ -297,7 +299,7 @@ __cast6_enc_blk8:
 	QBAR(11);
 
 	popq %rbx;
-	popq %rbp;
+	popq %r15;
 
 	vmovdqa .Lbswap_mask, RKM;
 
@@ -310,15 +312,17 @@ ENDPROC(__cast6_enc_blk8)
 .align 8
 __cast6_dec_blk8:
 	/* input:
-	 *	%rdi: ctx, CTX
+	 *	%rdi: ctx
 	 *	RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2: encrypted blocks
 	 * output:
 	 *	RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2: decrypted blocks
 	 */
 
-	pushq %rbp;
+	pushq %r15;
 	pushq %rbx;
 
+	movq %rdi, CTX;
+
 	vmovdqa .Lbswap_mask, RKM;
 	vmovd .Lfirst_mask, R1ST;
 	vmovd .L32_mask, R32;
@@ -343,7 +347,7 @@ __cast6_dec_blk8:
 	QBAR(0);
 
 	popq %rbx;
-	popq %rbp;
+	popq %r15;
 
 	vmovdqa .Lbswap_mask, RKM;
 	outunpack_blocks(RA1, RB1, RC1, RD1, RTMP, RX, RKRF, RKM);
@@ -354,12 +358,14 @@ ENDPROC(__cast6_dec_blk8)
 
 ENTRY(cast6_ecb_enc_8way)
 	/* input:
-	 *	%rdi: ctx, CTX
+	 *	%rdi: ctx
 	 *	%rsi: dst
 	 *	%rdx: src
 	 */
 	FRAME_BEGIN
+	pushq %r15;
 
+	movq %rdi, CTX;
 	movq %rsi, %r11;
 
 	load_8way(%rdx, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2);
@@ -368,18 +374,21 @@ ENTRY(cast6_ecb_enc_8way)
 
 	store_8way(%r11, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2);
 
+	popq %r15;
 	FRAME_END
 	ret;
 ENDPROC(cast6_ecb_enc_8way)
 
 ENTRY(cast6_ecb_dec_8way)
 	/* input:
-	 *	%rdi: ctx, CTX
+	 *	%rdi: ctx
 	 *	%rsi: dst
 	 *	%rdx: src
 	 */
 	FRAME_BEGIN
+	pushq %r15;
 
+	movq %rdi, CTX;
 	movq %rsi, %r11;
 
 	load_8way(%rdx, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2);
@@ -388,20 +397,22 @@ ENTRY(cast6_ecb_dec_8way)
 
 	store_8way(%r11, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2);
 
+	popq %r15;
 	FRAME_END
 	ret;
 ENDPROC(cast6_ecb_dec_8way)
 
 ENTRY(cast6_cbc_dec_8way)
 	/* input:
-	 *	%rdi: ctx, CTX
+	 *	%rdi: ctx
 	 *	%rsi: dst
 	 *	%rdx: src
 	 */
 	FRAME_BEGIN
-
 	pushq %r12;
+	pushq %r15;
 
+	movq %rdi, CTX;
 	movq %rsi, %r11;
 	movq %rdx, %r12;
 
@@ -411,8 +422,8 @@ ENTRY(cast6_cbc_dec_8way)
 
 	store_cbc_8way(%r12, %r11, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2);
 
+	popq %r15;
 	popq %r12;
-
 	FRAME_END
 	ret;
 ENDPROC(cast6_cbc_dec_8way)
@@ -425,9 +436,10 @@ ENTRY(cast6_ctr_8way)
 	 *	%rcx: iv (little endian, 128bit)
 	 */
 	FRAME_BEGIN
-
 	pushq %r12;
+	pushq %r15;
 
+	movq %rdi, CTX;
 	movq %rsi, %r11;
 	movq %rdx, %r12;
 
@@ -438,8 +450,8 @@ ENTRY(cast6_ctr_8way)
 
 	store_ctr_8way(%r12, %r11, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2);
 
+	popq %r15;
 	popq %r12;
-
 	FRAME_END
 	ret;
 ENDPROC(cast6_ctr_8way)
@@ -452,7 +464,9 @@ ENTRY(cast6_xts_enc_8way)
 	 *	%rcx: iv (t ⊕ αⁿ ∈ GF(2¹²⁸))
 	 */
 	FRAME_BEGIN
+	pushq %r15;
 
+	movq %rdi, CTX;
 	movq %rsi, %r11;
 
 	/* regs <= src, dst <= IVs, regs <= regs xor IVs */
@@ -464,6 +478,7 @@ ENTRY(cast6_xts_enc_8way)
 	/* dst <= regs xor IVs(in dst) */
 	store_xts_8way(%r11, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2);
 
+	popq %r15;
 	FRAME_END
 	ret;
 ENDPROC(cast6_xts_enc_8way)
@@ -476,7 +491,9 @@ ENTRY(cast6_xts_dec_8way)
 	 *	%rcx: iv (t ⊕ αⁿ ∈ GF(2¹²⁸))
 	 */
 	FRAME_BEGIN
+	pushq %r15;
 
+	movq %rdi, CTX;
 	movq %rsi, %r11;
 
 	/* regs <= src, dst <= IVs, regs <= regs xor IVs */
@@ -488,6 +505,7 @@ ENTRY(cast6_xts_dec_8way)
 	/* dst <= regs xor IVs(in dst) */
 	store_xts_8way(%r11, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2);
 
+	popq %r15;
 	FRAME_END
 	ret;
 ENDPROC(cast6_xts_dec_8way)
-- 
2.13.5

* [PATCH 05/12] x86/crypto: Fix RBP usage in des3_ede-asm_64.S
  2017-08-29 18:05 [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files Josh Poimboeuf
                   ` (3 preceding siblings ...)
  2017-08-29 18:05 ` [PATCH 04/12] x86/crypto: Fix RBP usage in cast6-avx-x86_64-asm_64.S Josh Poimboeuf
@ 2017-08-29 18:05 ` Josh Poimboeuf
  2017-08-29 18:05 ` [PATCH 06/12] x86/crypto: Fix RBP usage in sha1_avx2_x86_64_asm.S Josh Poimboeuf
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: Josh Poimboeuf @ 2017-08-29 18:05 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Tim Chen, Mathias Krause, Chandramouli Narayanan,
	Jussi Kivilinna, Peter Zijlstra, Herbert Xu, David S. Miller,
	linux-crypto, Eric Biggers, Andy Lutomirski, Jiri Slaby

Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Use RSI instead of RBP for RT1.  Since RSI is also used as the 'dst'
function argument, it needs to be saved on the stack until the argument
is needed.

Reported-by: Eric Biggers <ebiggers@google.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/crypto/des3_ede-asm_64.S | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/crypto/des3_ede-asm_64.S b/arch/x86/crypto/des3_ede-asm_64.S
index f3e91647ca27..8e49ce117494 100644
--- a/arch/x86/crypto/des3_ede-asm_64.S
+++ b/arch/x86/crypto/des3_ede-asm_64.S
@@ -64,12 +64,12 @@
 #define RW2bh %ch
 
 #define RT0 %r15
-#define RT1 %rbp
+#define RT1 %rsi
 #define RT2 %r14
 #define RT3 %rdx
 
 #define RT0d %r15d
-#define RT1d %ebp
+#define RT1d %esi
 #define RT2d %r14d
 #define RT3d %edx
 
@@ -177,13 +177,14 @@ ENTRY(des3_ede_x86_64_crypt_blk)
 	 *	%rsi: dst
 	 *	%rdx: src
 	 */
-	pushq %rbp;
 	pushq %rbx;
 	pushq %r12;
 	pushq %r13;
 	pushq %r14;
 	pushq %r15;
 
+	pushq %rsi; /* dst */
+
 	read_block(%rdx, RL0, RR0);
 	initial_permutation(RL0, RR0);
 
@@ -241,6 +242,8 @@ ENTRY(des3_ede_x86_64_crypt_blk)
 	round1(32+15, RL0, RR0, dummy2);
 
 	final_permutation(RR0, RL0);
+
+	popq %rsi; /* dst */
 	write_block(%rsi, RR0, RL0);
 
 	popq %r15;
@@ -248,7 +251,6 @@ ENTRY(des3_ede_x86_64_crypt_blk)
 	popq %r13;
 	popq %r12;
 	popq %rbx;
-	popq %rbp;
 
 	ret;
 ENDPROC(des3_ede_x86_64_crypt_blk)
@@ -432,13 +434,14 @@ ENTRY(des3_ede_x86_64_crypt_blk_3way)
 	 *	%rdx: src (3 blocks)
 	 */
 
-	pushq %rbp;
 	pushq %rbx;
 	pushq %r12;
 	pushq %r13;
 	pushq %r14;
 	pushq %r15;
 
+	pushq %rsi; /* dst */
+
 	/* load input */
 	movl 0 * 4(%rdx), RL0d;
 	movl 1 * 4(%rdx), RR0d;
@@ -520,6 +523,7 @@ ENTRY(des3_ede_x86_64_crypt_blk_3way)
 	bswapl RR2d;
 	bswapl RL2d;
 
+	popq %rsi; /* dst */
 	movl RR0d, 0 * 4(%rsi);
 	movl RL0d, 1 * 4(%rsi);
 	movl RR1d, 2 * 4(%rsi);
@@ -532,7 +536,6 @@ ENTRY(des3_ede_x86_64_crypt_blk_3way)
 	popq %r13;
 	popq %r12;
 	popq %rbx;
-	popq %rbp;
 
 	ret;
 ENDPROC(des3_ede_x86_64_crypt_blk_3way)
-- 
2.13.5

* [PATCH 06/12] x86/crypto: Fix RBP usage in sha1_avx2_x86_64_asm.S
  2017-08-29 18:05 [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files Josh Poimboeuf
                   ` (4 preceding siblings ...)
  2017-08-29 18:05 ` [PATCH 05/12] x86/crypto: Fix RBP usage in des3_ede-asm_64.S Josh Poimboeuf
@ 2017-08-29 18:05 ` Josh Poimboeuf
  2017-09-06 16:11   ` Tim Chen
  2017-08-29 18:05 ` [PATCH 07/12] x86/crypto: Fix RBP usage in sha1_ssse3_asm.S Josh Poimboeuf
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 31+ messages in thread
From: Josh Poimboeuf @ 2017-08-29 18:05 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Tim Chen, Mathias Krause, Chandramouli Narayanan,
	Jussi Kivilinna, Peter Zijlstra, Herbert Xu, David S. Miller,
	linux-crypto, Eric Biggers, Andy Lutomirski, Jiri Slaby

Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Use R11 instead of RBP.  Since R11 isn't a callee-saved register, it
doesn't need to be saved and restored on the stack.
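
For reference (a summary of the SysV AMD64 convention, not code from the
file): RBX, RBP and R12-R15 are callee-saved, while RAX, RCX, RDX, RSI, RDI
and R8-R11 are scratch.  The difference in the prologue/epilogue looks like
this (illustrative only):

	/* callee-saved scratch, e.g. %r12: costs a push/pop pair */
	push	%r12
	...
	pop	%r12
	ret

	/* caller-saved scratch, e.g. %r11d as REG_T1: nothing to save */
	...
	ret

which is why this patch can simply drop the push/pop of %rbp instead of
trading it for another pair.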

Reported-by: Eric Biggers <ebiggers@google.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/crypto/sha1_avx2_x86_64_asm.S | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/x86/crypto/sha1_avx2_x86_64_asm.S b/arch/x86/crypto/sha1_avx2_x86_64_asm.S
index 1eab79c9ac48..9f712a7dfd79 100644
--- a/arch/x86/crypto/sha1_avx2_x86_64_asm.S
+++ b/arch/x86/crypto/sha1_avx2_x86_64_asm.S
@@ -89,7 +89,7 @@
 #define	REG_RE	%rdx
 #define	REG_RTA	%r12
 #define	REG_RTB	%rbx
-#define	REG_T1	%ebp
+#define	REG_T1	%r11d
 #define	xmm_mov	vmovups
 #define	avx2_zeroupper	vzeroupper
 #define	RND_F1	1
@@ -637,7 +637,6 @@ _loop3:
 	ENTRY(\name)
 
 	push	%rbx
-	push	%rbp
 	push	%r12
 	push	%r13
 	push	%r14
@@ -673,7 +672,6 @@ _loop3:
 	pop	%r14
 	pop	%r13
 	pop	%r12
-	pop	%rbp
 	pop	%rbx
 
 	ret
-- 
2.13.5

* [PATCH 07/12] x86/crypto: Fix RBP usage in sha1_ssse3_asm.S
  2017-08-29 18:05 [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files Josh Poimboeuf
                   ` (5 preceding siblings ...)
  2017-08-29 18:05 ` [PATCH 06/12] x86/crypto: Fix RBP usage in sha1_avx2_x86_64_asm.S Josh Poimboeuf
@ 2017-08-29 18:05 ` Josh Poimboeuf
  2017-08-29 18:05 ` [PATCH 08/12] x86/crypto: Fix RBP usage in sha256-avx-asm.S Josh Poimboeuf
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: Josh Poimboeuf @ 2017-08-29 18:05 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Tim Chen, Mathias Krause, Chandramouli Narayanan,
	Jussi Kivilinna, Peter Zijlstra, Herbert Xu, David S. Miller,
	linux-crypto, Eric Biggers, Andy Lutomirski, Jiri Slaby

Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Swap the usages of R12 and RBP.  Use R12 for the REG_D register, and use
RBP to store the pre-aligned stack pointer.

Reported-by: Eric Biggers <ebiggers@google.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/crypto/sha1_ssse3_asm.S | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/crypto/sha1_ssse3_asm.S b/arch/x86/crypto/sha1_ssse3_asm.S
index a4109506a5e8..6204bd53528c 100644
--- a/arch/x86/crypto/sha1_ssse3_asm.S
+++ b/arch/x86/crypto/sha1_ssse3_asm.S
@@ -37,7 +37,7 @@
 #define REG_A	%ecx
 #define REG_B	%esi
 #define REG_C	%edi
-#define REG_D	%ebp
+#define REG_D	%r12d
 #define REG_E	%edx
 
 #define REG_T1	%eax
@@ -74,10 +74,10 @@
 	ENTRY(\name)
 
 	push	%rbx
-	push	%rbp
 	push	%r12
+	push	%rbp
+	mov	%rsp, %rbp
 
-	mov	%rsp, %r12
 	sub	$64, %rsp		# allocate workspace
 	and	$~15, %rsp		# align stack
 
@@ -99,10 +99,9 @@
 	xor	%rax, %rax
 	rep stosq
 
-	mov	%r12, %rsp		# deallocate workspace
-
-	pop	%r12
+	mov	%rbp, %rsp		# deallocate workspace
 	pop	%rbp
+	pop	%r12
 	pop	%rbx
 	ret
 
-- 
2.13.5

* [PATCH 08/12] x86/crypto: Fix RBP usage in sha256-avx-asm.S
  2017-08-29 18:05 [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files Josh Poimboeuf
                   ` (6 preceding siblings ...)
  2017-08-29 18:05 ` [PATCH 07/12] x86/crypto: Fix RBP usage in sha1_ssse3_asm.S Josh Poimboeuf
@ 2017-08-29 18:05 ` Josh Poimboeuf
  2017-08-29 18:05 ` [PATCH 09/12] x86/crypto: Fix RBP usage in sha256-avx2-asm.S Josh Poimboeuf
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: Josh Poimboeuf @ 2017-08-29 18:05 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Tim Chen, Mathias Krause, Chandramouli Narayanan,
	Jussi Kivilinna, Peter Zijlstra, Herbert Xu, David S. Miller,
	linux-crypto, Eric Biggers, Andy Lutomirski, Jiri Slaby

Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Swap the usages of R12 and RBP.  Use R12 for the TBL register, and use
RBP to store the pre-aligned stack pointer.

Reported-by: Eric Biggers <ebiggers@google.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/crypto/sha256-avx-asm.S | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/x86/crypto/sha256-avx-asm.S b/arch/x86/crypto/sha256-avx-asm.S
index e08888a1a5f2..001bbcf93c79 100644
--- a/arch/x86/crypto/sha256-avx-asm.S
+++ b/arch/x86/crypto/sha256-avx-asm.S
@@ -103,7 +103,7 @@ SRND = %rsi       # clobbers INP
 c = %ecx
 d = %r8d
 e = %edx
-TBL = %rbp
+TBL = %r12
 a = %eax
 b = %ebx
 
@@ -350,13 +350,13 @@ a = TMP_
 ENTRY(sha256_transform_avx)
 .align 32
 	pushq   %rbx
-	pushq   %rbp
+	pushq   %r12
 	pushq   %r13
 	pushq   %r14
 	pushq   %r15
-	pushq   %r12
+	pushq	%rbp
+	movq	%rsp, %rbp
 
-	mov	%rsp, %r12
 	subq    $STACK_SIZE, %rsp	# allocate stack space
 	and	$~15, %rsp		# align stack pointer
 
@@ -452,13 +452,12 @@ loop2:
 
 done_hash:
 
-	mov	%r12, %rsp
-
-	popq	%r12
+	mov	%rbp, %rsp
+	popq	%rbp
 	popq    %r15
 	popq    %r14
 	popq    %r13
-	popq    %rbp
+	popq	%r12
 	popq    %rbx
 	ret
 ENDPROC(sha256_transform_avx)
-- 
2.13.5

* [PATCH 09/12] x86/crypto: Fix RBP usage in sha256-avx2-asm.S
  2017-08-29 18:05 [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files Josh Poimboeuf
                   ` (7 preceding siblings ...)
  2017-08-29 18:05 ` [PATCH 08/12] x86/crypto: Fix RBP usage in sha256-avx-asm.S Josh Poimboeuf
@ 2017-08-29 18:05 ` Josh Poimboeuf
  2017-08-29 18:05 ` [PATCH 10/12] x86/crypto: Fix RBP usage in sha256-ssse3-asm.S Josh Poimboeuf
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: Josh Poimboeuf @ 2017-08-29 18:05 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Tim Chen, Mathias Krause, Chandramouli Narayanan,
	Jussi Kivilinna, Peter Zijlstra, Herbert Xu, David S. Miller,
	linux-crypto, Eric Biggers, Andy Lutomirski, Jiri Slaby

Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Use R12 instead of RBP for the TBL register.  Since R12 is also used as
another temporary register (T1), it gets clobbered in each round of
computation.  So the table address needs to be freshly reloaded into R12
each time it's used.

Reported-by: Eric Biggers <ebiggers@google.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/crypto/sha256-avx2-asm.S | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/crypto/sha256-avx2-asm.S b/arch/x86/crypto/sha256-avx2-asm.S
index 89c8f09787d2..cdd647231fa9 100644
--- a/arch/x86/crypto/sha256-avx2-asm.S
+++ b/arch/x86/crypto/sha256-avx2-asm.S
@@ -99,7 +99,7 @@ e       = %edx	# clobbers NUM_BLKS
 y3	= %esi	# clobbers INP
 
 
-TBL	= %rbp
+TBL	= %r12	# clobbered by T1
 SRND	= CTX	# SRND is same register as CTX
 
 a = %eax
@@ -531,7 +531,6 @@ STACK_SIZE	= _RSP      + _RSP_SIZE
 ENTRY(sha256_transform_rorx)
 .align 32
 	pushq	%rbx
-	pushq	%rbp
 	pushq	%r12
 	pushq	%r13
 	pushq	%r14
@@ -568,8 +567,6 @@ ENTRY(sha256_transform_rorx)
 	mov	CTX, _CTX(%rsp)
 
 loop0:
-	lea     K256(%rip), TBL
-
 	## Load first 16 dwords from two blocks
 	VMOVDQ	0*32(INP),XTMP0
 	VMOVDQ	1*32(INP),XTMP1
@@ -597,18 +594,22 @@ last_block_enter:
 
 .align 16
 loop1:
+	lea	K256(%rip), TBL
 	vpaddd	0*32(TBL, SRND), X0, XFER
 	vmovdqa XFER, 0*32+_XFER(%rsp, SRND)
 	FOUR_ROUNDS_AND_SCHED	_XFER + 0*32
 
+	lea	K256(%rip), TBL
 	vpaddd	1*32(TBL, SRND), X0, XFER
 	vmovdqa XFER, 1*32+_XFER(%rsp, SRND)
 	FOUR_ROUNDS_AND_SCHED	_XFER + 1*32
 
+	lea	K256(%rip), TBL
 	vpaddd	2*32(TBL, SRND), X0, XFER
 	vmovdqa XFER, 2*32+_XFER(%rsp, SRND)
 	FOUR_ROUNDS_AND_SCHED	_XFER + 2*32
 
+	lea	K256(%rip), TBL
 	vpaddd	3*32(TBL, SRND), X0, XFER
 	vmovdqa XFER, 3*32+_XFER(%rsp, SRND)
 	FOUR_ROUNDS_AND_SCHED	_XFER + 3*32
@@ -619,9 +620,12 @@ loop1:
 
 loop2:
 	## Do last 16 rounds with no scheduling
+	lea	K256(%rip), TBL
 	vpaddd	0*32(TBL, SRND), X0, XFER
 	vmovdqa XFER, 0*32+_XFER(%rsp, SRND)
 	DO_4ROUNDS	_XFER + 0*32
+
+	lea	K256(%rip), TBL
 	vpaddd	1*32(TBL, SRND), X1, XFER
 	vmovdqa XFER, 1*32+_XFER(%rsp, SRND)
 	DO_4ROUNDS	_XFER + 1*32
@@ -676,9 +680,6 @@ loop3:
 	ja	done_hash
 
 do_last_block:
-	#### do last block
-	lea	K256(%rip), TBL
-
 	VMOVDQ	0*16(INP),XWORD0
 	VMOVDQ	1*16(INP),XWORD1
 	VMOVDQ	2*16(INP),XWORD2
@@ -718,7 +719,6 @@ done_hash:
 	popq	%r14
 	popq	%r13
 	popq	%r12
-	popq	%rbp
 	popq	%rbx
 	ret
 ENDPROC(sha256_transform_rorx)
-- 
2.13.5

* [PATCH 10/12] x86/crypto: Fix RBP usage in sha256-ssse3-asm.S
  2017-08-29 18:05 [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files Josh Poimboeuf
                   ` (8 preceding siblings ...)
  2017-08-29 18:05 ` [PATCH 09/12] x86/crypto: Fix RBP usage in sha256-avx2-asm.S Josh Poimboeuf
@ 2017-08-29 18:05 ` Josh Poimboeuf
  2017-08-29 18:05 ` [PATCH 11/12] x86/crypto: Fix RBP usage in sha512-avx2-asm.S Josh Poimboeuf
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: Josh Poimboeuf @ 2017-08-29 18:05 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Tim Chen, Mathias Krause, Chandramouli Narayanan,
	Jussi Kivilinna, Peter Zijlstra, Herbert Xu, David S. Miller,
	linux-crypto, Eric Biggers, Andy Lutomirski, Jiri Slaby

Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Swap the usages of R12 and RBP.  Use R12 for the TBL register, and use
RBP to store the pre-aligned stack pointer.

Reported-by: Eric Biggers <ebiggers@google.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/crypto/sha256-ssse3-asm.S | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/x86/crypto/sha256-ssse3-asm.S b/arch/x86/crypto/sha256-ssse3-asm.S
index 39b83c93e7fd..c6c05ed2c16a 100644
--- a/arch/x86/crypto/sha256-ssse3-asm.S
+++ b/arch/x86/crypto/sha256-ssse3-asm.S
@@ -95,7 +95,7 @@ SRND = %rsi       # clobbers INP
 c = %ecx
 d = %r8d
 e = %edx
-TBL = %rbp
+TBL = %r12
 a = %eax
 b = %ebx
 
@@ -356,13 +356,13 @@ a = TMP_
 ENTRY(sha256_transform_ssse3)
 .align 32
 	pushq   %rbx
-	pushq   %rbp
+	pushq   %r12
 	pushq   %r13
 	pushq   %r14
 	pushq   %r15
-	pushq   %r12
+	pushq   %rbp
+	mov	%rsp, %rbp
 
-	mov	%rsp, %r12
 	subq    $STACK_SIZE, %rsp
 	and	$~15, %rsp
 
@@ -462,13 +462,12 @@ loop2:
 
 done_hash:
 
-	mov	%r12, %rsp
-
-	popq    %r12
+	mov	%rbp, %rsp
+	popq	%rbp
 	popq    %r15
 	popq    %r14
 	popq    %r13
-	popq    %rbp
+	popq    %r12
 	popq    %rbx
 
 	ret
-- 
2.13.5

* [PATCH 11/12] x86/crypto: Fix RBP usage in sha512-avx2-asm.S
  2017-08-29 18:05 [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files Josh Poimboeuf
                   ` (9 preceding siblings ...)
  2017-08-29 18:05 ` [PATCH 10/12] x86/crypto: Fix RBP usage in sha256-ssse3-asm.S Josh Poimboeuf
@ 2017-08-29 18:05 ` Josh Poimboeuf
  2017-08-29 18:05 ` [PATCH 12/12] x86/crypto: Fix RBP usage in twofish-avx-x86_64-asm_64.S Josh Poimboeuf
  2017-09-02  0:09 ` [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files Eric Biggers
  12 siblings, 0 replies; 31+ messages in thread
From: Josh Poimboeuf @ 2017-08-29 18:05 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Tim Chen, Mathias Krause, Chandramouli Narayanan,
	Jussi Kivilinna, Peter Zijlstra, Herbert Xu, David S. Miller,
	linux-crypto, Eric Biggers, Andy Lutomirski, Jiri Slaby

Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Use R12 instead of RBP for the TBL register.  Since R12 is also used as
another temporary register (T1), it gets clobbered in each round of
computation.  So the TBL value needs to be freshly reloaded into R12
each time it's used.  And since the table pointer itself is advanced as the
loop progresses, its current value is kept in a new stack slot at the
frame_TBL offset rather than in a register.

Also remove the unused y4 variable.

Reported-by: Eric Biggers <ebiggers3@gmail.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/crypto/sha512-avx2-asm.S | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/arch/x86/crypto/sha512-avx2-asm.S b/arch/x86/crypto/sha512-avx2-asm.S
index 7f5f6c6ec72e..37cfc2004abd 100644
--- a/arch/x86/crypto/sha512-avx2-asm.S
+++ b/arch/x86/crypto/sha512-avx2-asm.S
@@ -81,7 +81,7 @@ d           = %r8
 e           = %rdx
 y3          = %rsi
 
-TBL   = %rbp
+TBL   = %r12 # clobbered by T1
 
 a     = %rax
 b     = %rbx
@@ -96,11 +96,10 @@ y0    = %r13
 y1    = %r14
 y2    = %r15
 
-y4    = %r12
-
 # Local variables (stack frame)
 XFER_SIZE = 4*8
 SRND_SIZE = 1*8
+TBL_SIZE = 1*8
 INP_SIZE = 1*8
 INPEND_SIZE = 1*8
 RSPSAVE_SIZE = 1*8
@@ -108,7 +107,8 @@ GPRSAVE_SIZE = 6*8
 
 frame_XFER = 0
 frame_SRND = frame_XFER + XFER_SIZE
-frame_INP = frame_SRND + SRND_SIZE
+frame_TBL = frame_SRND + SRND_SIZE
+frame_INP = frame_TBL + TBL_SIZE
 frame_INPEND = frame_INP + INP_SIZE
 frame_RSPSAVE = frame_INPEND + INPEND_SIZE
 frame_GPRSAVE = frame_RSPSAVE + RSPSAVE_SIZE
@@ -601,7 +601,7 @@ ENTRY(sha512_transform_rorx)
 	vmovdqa	PSHUFFLE_BYTE_FLIP_MASK(%rip), BYTE_FLIP_MASK
 
 loop0:
-	lea	K512(%rip), TBL
+	movq	$K512, frame_TBL(%rsp)
 
 	## byte swap first 16 dwords
 	COPY_YMM_AND_BSWAP	Y_0, (INP), BYTE_FLIP_MASK
@@ -616,39 +616,46 @@ loop0:
 
 .align 16
 loop1:
+	mov frame_TBL(%rsp), TBL
 	vpaddq	(TBL), Y_0, XFER
 	vmovdqa XFER, frame_XFER(%rsp)
 	FOUR_ROUNDS_AND_SCHED
 
+	mov frame_TBL(%rsp), TBL
 	vpaddq	1*32(TBL), Y_0, XFER
 	vmovdqa XFER, frame_XFER(%rsp)
 	FOUR_ROUNDS_AND_SCHED
 
+	mov frame_TBL(%rsp), TBL
 	vpaddq	2*32(TBL), Y_0, XFER
 	vmovdqa XFER, frame_XFER(%rsp)
 	FOUR_ROUNDS_AND_SCHED
 
+	mov frame_TBL(%rsp), TBL
 	vpaddq	3*32(TBL), Y_0, XFER
 	vmovdqa XFER, frame_XFER(%rsp)
-	add	$(4*32), TBL
 	FOUR_ROUNDS_AND_SCHED
 
+	addq	$(4*32), frame_TBL(%rsp)
 	subq	$1, frame_SRND(%rsp)
 	jne	loop1
 
 	movq	$2, frame_SRND(%rsp)
 loop2:
+	mov frame_TBL(%rsp), TBL
 	vpaddq	(TBL), Y_0, XFER
 	vmovdqa XFER, frame_XFER(%rsp)
 	DO_4ROUNDS
+
+	mov frame_TBL(%rsp), TBL
 	vpaddq	1*32(TBL), Y_1, XFER
 	vmovdqa XFER, frame_XFER(%rsp)
-	add	$(2*32), TBL
 	DO_4ROUNDS
 
 	vmovdqa	Y_2, Y_0
 	vmovdqa	Y_3, Y_1
 
+	add	$(2*32), frame_TBL(%rsp)
 	subq	$1, frame_SRND(%rsp)
 	jne	loop2
 
-- 
2.13.5

* [PATCH 12/12] x86/crypto: Fix RBP usage in twofish-avx-x86_64-asm_64.S
  2017-08-29 18:05 [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files Josh Poimboeuf
                   ` (10 preceding siblings ...)
  2017-08-29 18:05 ` [PATCH 11/12] x86/crypto: Fix RBP usage in sha512-avx2-asm.S Josh Poimboeuf
@ 2017-08-29 18:05 ` Josh Poimboeuf
  2017-09-02  0:09 ` [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files Eric Biggers
  12 siblings, 0 replies; 31+ messages in thread
From: Josh Poimboeuf @ 2017-08-29 18:05 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, Tim Chen, Mathias Krause, Chandramouli Narayanan,
	Jussi Kivilinna, Peter Zijlstra, Herbert Xu, David S. Miller,
	linux-crypto, Eric Biggers, Andy Lutomirski, Jiri Slaby

Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Use R13 instead of RBP.  Both are callee-saved registers, so the
substitution is straightforward.

Reported-by: Eric Biggers <ebiggers@google.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/crypto/twofish-avx-x86_64-asm_64.S | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
index b3f49d286348..73b471da3622 100644
--- a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
@@ -76,8 +76,8 @@
 #define RT %xmm14
 #define RR %xmm15
 
-#define RID1  %rbp
-#define RID1d %ebp
+#define RID1  %r13
+#define RID1d %r13d
 #define RID2  %rsi
 #define RID2d %esi
 
@@ -259,7 +259,7 @@ __twofish_enc_blk8:
 
 	vmovdqu w(CTX), RK1;
 
-	pushq %rbp;
+	pushq %r13;
 	pushq %rbx;
 	pushq %rcx;
 
@@ -282,7 +282,7 @@ __twofish_enc_blk8:
 
 	popq %rcx;
 	popq %rbx;
-	popq %rbp;
+	popq %r13;
 
 	outunpack_blocks(RC1, RD1, RA1, RB1, RK1, RX0, RY0, RK2);
 	outunpack_blocks(RC2, RD2, RA2, RB2, RK1, RX0, RY0, RK2);
@@ -301,7 +301,7 @@ __twofish_dec_blk8:
 
 	vmovdqu (w+4*4)(CTX), RK1;
 
-	pushq %rbp;
+	pushq %r13;
 	pushq %rbx;
 
 	inpack_blocks(RC1, RD1, RA1, RB1, RK1, RX0, RY0, RK2);
@@ -322,7 +322,7 @@ __twofish_dec_blk8:
 	vmovdqu (w)(CTX), RK1;
 
 	popq %rbx;
-	popq %rbp;
+	popq %r13;
 
 	outunpack_blocks(RA1, RB1, RC1, RD1, RK1, RX0, RY0, RK2);
 	outunpack_blocks(RA2, RB2, RC2, RD2, RK1, RX0, RY0, RK2);
-- 
2.13.5

* Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
  2017-08-29 18:05 [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files Josh Poimboeuf
                   ` (11 preceding siblings ...)
  2017-08-29 18:05 ` [PATCH 12/12] x86/crypto: Fix RBP usage in twofish-avx-x86_64-asm_64.S Josh Poimboeuf
@ 2017-09-02  0:09 ` Eric Biggers
  2017-09-07  0:15   ` Josh Poimboeuf
  2017-09-07  7:15   ` Ingo Molnar
  12 siblings, 2 replies; 31+ messages in thread
From: Eric Biggers @ 2017-09-02  0:09 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, linux-kernel, Tim Chen, Mathias Krause,
	Chandramouli Narayanan, Jussi Kivilinna, Peter Zijlstra,
	Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	Andy Lutomirski, Jiri Slaby

Hi Josh,

On Tue, Aug 29, 2017 at 01:05:33PM -0500, Josh Poimboeuf wrote:
> Many of the x86 crypto functions use RBP as a temporary register.  This
> breaks frame pointer convention, and breaks stack traces when unwinding
> from an interrupt in the crypto code.
> 
> Convert most* of them to leave RBP alone.
> 
> These pass the crypto boot tests for me.  Any further testing would be
> appreciated!
> 
> [*] There are still a few crypto files left that need fixing, but the
>     fixes weren't trivial and nobody reported unwinder warnings about
>     them yet, so I'm skipping them for now.
> 
> Josh Poimboeuf (12):
>   x86/crypto: Fix RBP usage in blowfish-x86_64-asm_64.S
>   x86/crypto: Fix RBP usage in camellia-x86_64-asm_64.S
>   x86/crypto: Fix RBP usage in cast5-avx-x86_64-asm_64.S
>   x86/crypto: Fix RBP usage in cast6-avx-x86_64-asm_64.S
>   x86/crypto: Fix RBP usage in des3_ede-asm_64.S
>   x86/crypto: Fix RBP usage in sha1_avx2_x86_64_asm.S
>   x86/crypto: Fix RBP usage in sha1_ssse3_asm.S
>   x86/crypto: Fix RBP usage in sha256-avx-asm.S
>   x86/crypto: Fix RBP usage in sha256-avx2-asm.S
>   x86/crypto: Fix RBP usage in sha256-ssse3-asm.S
>   x86/crypto: Fix RBP usage in sha512-avx2-asm.S
>   x86/crypto: Fix RBP usage in twofish-avx-x86_64-asm_64.S
> 

Thanks for fixing these!  I don't have time to review these in detail, but I ran
the crypto self-tests on the affected algorithms, and they all pass.  I also
benchmarked them before and after; the only noticeable performance difference was
that sha256-avx2 and sha512-avx2 became a few percent slower.  I don't suppose
there is a way around that?  Otherwise it's probably not a big deal.

For reference, this was the list of affected algorithms I used:

shash: sha1-ssse3, sha1-avx, sha1-avx2, sha256-ssse3, sha256-avx, sha256-avx2,
	sha512-ssse3, sha512-avx, sha512-avx2

cipher: blowfish-asm, camellia-asm, des3_ede-asm

skcipher: ecb-blowfish-asm, cbc-blowfish-asm, ctr-blowfish-asm, ecb-cast5-avx,
	cbc-cast5-avx, ctr-cast5-avx, ecb-cast6-avx, cbc-cast6-avx, ctr-cast6-avx,
	lrw-cast6-avx, xts-cast6-avx, ecb-camellia-asm, cbc-camellia-asm,
	ctr-camellia-asm, lrw-camellia-asm, xts-camellia-asm, ecb-des3_ede-asm,
	cbc-des3_ede-asm, ctr-des3_ede-asm, ecb-twofish-avx, cbc-twofish-avx,
	ctr-twofish-avx, lrw-twofish-avx, xts-twofish-avx

Eric

* Re: [PATCH 06/12] x86/crypto: Fix RBP usage in sha1_avx2_x86_64_asm.S
  2017-08-29 18:05 ` [PATCH 06/12] x86/crypto: Fix RBP usage in sha1_avx2_x86_64_asm.S Josh Poimboeuf
@ 2017-09-06 16:11   ` Tim Chen
  0 siblings, 0 replies; 31+ messages in thread
From: Tim Chen @ 2017-09-06 16:11 UTC (permalink / raw)
  To: Josh Poimboeuf, x86
  Cc: linux-kernel, Mathias Krause, Chandramouli Narayanan,
	Jussi Kivilinna, Peter Zijlstra, Herbert Xu, David S. Miller,
	linux-crypto, Eric Biggers, Andy Lutomirski, Jiri Slaby

On 08/29/2017 11:05 AM, Josh Poimboeuf wrote:
> Using RBP as a temporary register breaks frame pointer convention and
> breaks stack traces when unwinding from an interrupt in the crypto code.
> 
> Use R11 instead of RBP.  Since R11 isn't a callee-saved register, it
> doesn't need to be saved and restored on the stack.

These changes seem okay.

Thanks.

Tim

> 
> Reported-by: Eric Biggers <ebiggers@google.com>
> Reported-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> ---
>  arch/x86/crypto/sha1_avx2_x86_64_asm.S | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/arch/x86/crypto/sha1_avx2_x86_64_asm.S b/arch/x86/crypto/sha1_avx2_x86_64_asm.S
> index 1eab79c9ac48..9f712a7dfd79 100644
> --- a/arch/x86/crypto/sha1_avx2_x86_64_asm.S
> +++ b/arch/x86/crypto/sha1_avx2_x86_64_asm.S
> @@ -89,7 +89,7 @@
>  #define	REG_RE	%rdx
>  #define	REG_RTA	%r12
>  #define	REG_RTB	%rbx
> -#define	REG_T1	%ebp
> +#define	REG_T1	%r11d
>  #define	xmm_mov	vmovups
>  #define	avx2_zeroupper	vzeroupper
>  #define	RND_F1	1
> @@ -637,7 +637,6 @@ _loop3:
>  	ENTRY(\name)
>  
>  	push	%rbx
> -	push	%rbp
>  	push	%r12
>  	push	%r13
>  	push	%r14
> @@ -673,7 +672,6 @@ _loop3:
>  	pop	%r14
>  	pop	%r13
>  	pop	%r12
> -	pop	%rbp
>  	pop	%rbx
>  
>  	ret
> 

* Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
  2017-09-02  0:09 ` [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files Eric Biggers
@ 2017-09-07  0:15   ` Josh Poimboeuf
  2017-09-07  7:15   ` Ingo Molnar
  1 sibling, 0 replies; 31+ messages in thread
From: Josh Poimboeuf @ 2017-09-07  0:15 UTC (permalink / raw)
  To: Eric Biggers
  Cc: x86, linux-kernel, Tim Chen, Mathias Krause,
	Chandramouli Narayanan, Jussi Kivilinna, Peter Zijlstra,
	Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	Andy Lutomirski, Jiri Slaby

On Fri, Sep 01, 2017 at 05:09:19PM -0700, Eric Biggers wrote:
> Hi Josh,
> 
> On Tue, Aug 29, 2017 at 01:05:33PM -0500, Josh Poimboeuf wrote:
> > Many of the x86 crypto functions use RBP as a temporary register.  This
> > breaks frame pointer convention, and breaks stack traces when unwinding
> > from an interrupt in the crypto code.
> > 
> > Convert most* of them to leave RBP alone.
> > 
> > These pass the crypto boot tests for me.  Any further testing would be
> > appreciated!
> > 
> > [*] There are still a few crypto files left that need fixing, but the
> >     fixes weren't trivial and nobody reported unwinder warnings about
> >     them yet, so I'm skipping them for now.
> > 
> > Josh Poimboeuf (12):
> >   x86/crypto: Fix RBP usage in blowfish-x86_64-asm_64.S
> >   x86/crypto: Fix RBP usage in camellia-x86_64-asm_64.S
> >   x86/crypto: Fix RBP usage in cast5-avx-x86_64-asm_64.S
> >   x86/crypto: Fix RBP usage in cast6-avx-x86_64-asm_64.S
> >   x86/crypto: Fix RBP usage in des3_ede-asm_64.S
> >   x86/crypto: Fix RBP usage in sha1_avx2_x86_64_asm.S
> >   x86/crypto: Fix RBP usage in sha1_ssse3_asm.S
> >   x86/crypto: Fix RBP usage in sha256-avx-asm.S
> >   x86/crypto: Fix RBP usage in sha256-avx2-asm.S
> >   x86/crypto: Fix RBP usage in sha256-ssse3-asm.S
> >   x86/crypto: Fix RBP usage in sha512-avx2-asm.S
> >   x86/crypto: Fix RBP usage in twofish-avx-x86_64-asm_64.S
> > 
> 
> Thanks for fixing these!  I don't have time to review these in detail, but I ran
> the crypto self-tests on the affected algorithms, and they all pass.  I also
> benchmarked them before and after; the only noticeable performance difference was
> that sha256-avx2 and sha512-avx2 became a few percent slower.  I don't suppose
> there is a way around that?  Otherwise it's probably not a big deal.

Thanks for testing.  I might have a way to make sha256-avx2 faster, but
not sha512-avx2.  I'll let you know when I have a patch.

-- 
Josh

* Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
  2017-09-02  0:09 ` [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files Eric Biggers
  2017-09-07  0:15   ` Josh Poimboeuf
@ 2017-09-07  7:15   ` Ingo Molnar
  2017-09-07 17:58     ` Eric Biggers
  1 sibling, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2017-09-07  7:15 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Josh Poimboeuf, x86, linux-kernel, Tim Chen, Mathias Krause,
	Chandramouli Narayanan, Jussi Kivilinna, Peter Zijlstra,
	Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	Andy Lutomirski, Jiri Slaby


* Eric Biggers <ebiggers3@gmail.com> wrote:

> Thanks for fixing these!  I don't have time to review these in detail, but I ran
> the crypto self-tests on the affected algorithms, and they all pass.  I also
> benchmarked them before and after; the only noticeable performance difference was
> that sha256-avx2 and sha512-avx2 became a few percent slower.  I don't suppose
> there is a way around that?  Otherwise it's probably not a big deal.

Which CPU model did you use for the test?

Thanks,

	Ingo

* Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
  2017-09-07  7:15   ` Ingo Molnar
@ 2017-09-07 17:58     ` Eric Biggers
  2017-09-07 21:26       ` Ingo Molnar
  0 siblings, 1 reply; 31+ messages in thread
From: Eric Biggers @ 2017-09-07 17:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Josh Poimboeuf, x86, linux-kernel, Tim Chen, Mathias Krause,
	Chandramouli Narayanan, Jussi Kivilinna, Peter Zijlstra,
	Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	Andy Lutomirski, Jiri Slaby

On Thu, Sep 07, 2017 at 09:15:34AM +0200, Ingo Molnar wrote:
> 
> * Eric Biggers <ebiggers3@gmail.com> wrote:
> 
> > Thanks for fixing these!  I don't have time to review these in detail, but I ran
> > the crypto self-tests on the affected algorithms, and they all pass.  I also
> > benchmarked them before and after; the only noticeable performance difference was
> > that sha256-avx2 and sha512-avx2 became a few percent slower.  I don't suppose
> > there is a way around that?  Otherwise it's probably not a big deal.
> 
> Which CPU model did you use for the test?
> 
> Thanks,
> 
> 	Ingo

This was on Haswell, "Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz".

Eric

* Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
  2017-09-07 17:58     ` Eric Biggers
@ 2017-09-07 21:26       ` Ingo Molnar
  2017-09-08 17:57         ` Eric Biggers
  0 siblings, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2017-09-07 21:26 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Josh Poimboeuf, x86, linux-kernel, Tim Chen, Mathias Krause,
	Chandramouli Narayanan, Jussi Kivilinna, Peter Zijlstra,
	Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	Andy Lutomirski, Jiri Slaby


* Eric Biggers <ebiggers3@gmail.com> wrote:

> On Thu, Sep 07, 2017 at 09:15:34AM +0200, Ingo Molnar wrote:
> > 
> > * Eric Biggers <ebiggers3@gmail.com> wrote:
> > 
> > > Thanks for fixing these!  I don't have time to review these in detail, but I ran
> > > the crypto self-tests on the affected algorithms, and they all pass.  I also
> > > benchmarked them before and after; the only noticeable performance difference was
> > > that sha256-avx2 and sha512-avx2 became a few percent slower.  I don't suppose
> > > there is a way around that?  Otherwise it's probably not a big deal.
> > 
> > Which CPU model did you use for the test?
> > 
> > Thanks,
> > 
> > 	Ingo
> 
> This was on Haswell, "Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz".

Any chance to test this with the latest microarchitecture - any Skylake derivative 
Intel CPU you have access to?

Thanks,

	Ingo

* Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
  2017-09-07 21:26       ` Ingo Molnar
@ 2017-09-08 17:57         ` Eric Biggers
  2017-09-13 21:24           ` Josh Poimboeuf
  0 siblings, 1 reply; 31+ messages in thread
From: Eric Biggers @ 2017-09-08 17:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Josh Poimboeuf, x86, linux-kernel, Tim Chen, Mathias Krause,
	Chandramouli Narayanan, Jussi Kivilinna, Peter Zijlstra,
	Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	Andy Lutomirski, Jiri Slaby

On Thu, Sep 07, 2017 at 11:26:47PM +0200, Ingo Molnar wrote:
> 
> * Eric Biggers <ebiggers3@gmail.com> wrote:
> 
> > On Thu, Sep 07, 2017 at 09:15:34AM +0200, Ingo Molnar wrote:
> > > 
> > > * Eric Biggers <ebiggers3@gmail.com> wrote:
> > > 
> > > > Thanks for fixing these!  I don't have time to review these in detail, but I ran
> > > > the crypto self-tests on the affected algorithms, and they all pass.  I also
> > > > benchmarked them before and after; the only noticeable performance difference was
> > > > that sha256-avx2 and sha512-avx2 became a few percent slower.  I don't suppose
> > > > there is a way around that?  Otherwise it's probably not a big deal.
> > > 
> > > Which CPU model did you use for the test?
> > > 
> > > Thanks,
> > > 
> > > 	Ingo
> > 
> > This was on Haswell, "Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz".
> 
> Any chance to test this with the latest microarchitecture - any Skylake derivative 
> Intel CPU you have access to?
> 
> Thanks,
> 
> 	Ingo

Tested with Skylake, "Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz".  The results
were as follows, and seemed a bit worse than on Haswell:

	sha256-avx2 became 3.5% slower
	sha512-avx2 became 7.5% slower

Note: it's tricky to benchmark this, especially with just a few percent
difference, so don't read too much into the exact numbers.

Eric

* Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
  2017-09-08 17:57         ` Eric Biggers
@ 2017-09-13 21:24           ` Josh Poimboeuf
  2017-09-13 22:33             ` Josh Poimboeuf
  2017-09-14  9:16             ` Ingo Molnar
  0 siblings, 2 replies; 31+ messages in thread
From: Josh Poimboeuf @ 2017-09-13 21:24 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Ingo Molnar, x86, linux-kernel, Tim Chen, Mathias Krause,
	Chandramouli Narayanan, Jussi Kivilinna, Peter Zijlstra,
	Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	Andy Lutomirski, Jiri Slaby

On Fri, Sep 08, 2017 at 10:57:05AM -0700, Eric Biggers wrote:
> On Thu, Sep 07, 2017 at 11:26:47PM +0200, Ingo Molnar wrote:
> > 
> > * Eric Biggers <ebiggers3@gmail.com> wrote:
> > 
> > > On Thu, Sep 07, 2017 at 09:15:34AM +0200, Ingo Molnar wrote:
> > > > 
> > > > * Eric Biggers <ebiggers3@gmail.com> wrote:
> > > > 
> > > > > Thanks for fixing these!  I don't have time to review these in detail, but I ran
> > > > > the crypto self-tests on the affected algorithms, and they all pass.  I also
> > > > > benchmarked them before and after; the only noticable performance difference was
> > > > > that sha256-avx2 and sha512-avx2 became a few percent slower.  I don't suppose
> > > > > there is a way around that?  Otherwise it's probably not a big deal.
> > > > 
> > > > Which CPU model did you use for the test?
> > > > 
> > > > Thanks,
> > > > 
> > > > 	Ingo
> > > 
> > > This was on Haswell, "Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz".
> > 
> > Any chance to test this with the latest microarchitecture - any Skylake derivative 
> > Intel CPU you have access to?
> > 
> > Thanks,
> > 
> > 	Ingo
> 
> Tested with Skylake, "Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz".  The results
> were the following which seemed a bit worse than Haswell:
> 
> 	sha256-avx2 became 3.5% slower
> 	sha512-avx2 became 7.5% slower
> 
> Note: it's tricky to benchmark this, especially with just a few percent
> difference, so don't read too much into the exact numbers.

Here's a v2 of the sha256-avx2 patch; would you mind seeing if it
closes the performance gap?

I'm still looking at the other one (sha512-avx2), but so far I haven't
found a way to speed it back up.

From: Josh Poimboeuf <jpoimboe@redhat.com>
Subject: [PATCH] x86/crypto: Fix RBP usage in sha256-avx2-asm.S

Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.
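
For reference, the frame pointer convention the unwinder relies on looks roughly
like this (a minimal sketch, not code from this file):

	func:
		pushq	%rbp		# save the caller's frame pointer
		movq	%rsp, %rbp	# %rbp now anchors this function's frame
		# ... body: %rbp must stay untouched here ...
		popq	%rbp		# restore the caller's frame pointer
		ret

If a function instead uses %rbp as a scratch register, an interrupt that lands in
the middle of it sees a bogus frame pointer and the stack trace stops there (or
wanders off into unrelated memory).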

There's no need to use RBP as a temporary register for the TBL value,
because it always stores the same value: the address of the K256 table.
Instead just reference the address of K256 directly.
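
In other words (a minimal before/after sketch, lifted from the hunks below):

	# before: a GPR (TBL, i.e. %rbp) permanently holds the address of K256
	lea	K256(%rip), TBL
	vpaddd	0*32(TBL, SRND), X0, XFER

	# after: fold the constant K256 base into the displacement instead,
	# so no register is needed to carry it
	vpaddd	K256+0*32(SRND), X0, XFER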

Reported-by: Eric Biggers <ebiggers@google.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/crypto/sha256-avx2-asm.S | 22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/arch/x86/crypto/sha256-avx2-asm.S b/arch/x86/crypto/sha256-avx2-asm.S
index 89c8f09787d2..1420db15dcdd 100644
--- a/arch/x86/crypto/sha256-avx2-asm.S
+++ b/arch/x86/crypto/sha256-avx2-asm.S
@@ -98,8 +98,6 @@ d	= %r8d
 e       = %edx	# clobbers NUM_BLKS
 y3	= %esi	# clobbers INP
 
-
-TBL	= %rbp
 SRND	= CTX	# SRND is same register as CTX
 
 a = %eax
@@ -531,7 +529,6 @@ STACK_SIZE	= _RSP      + _RSP_SIZE
 ENTRY(sha256_transform_rorx)
 .align 32
 	pushq	%rbx
-	pushq	%rbp
 	pushq	%r12
 	pushq	%r13
 	pushq	%r14
@@ -568,8 +565,6 @@ ENTRY(sha256_transform_rorx)
 	mov	CTX, _CTX(%rsp)
 
 loop0:
-	lea     K256(%rip), TBL
-
 	## Load first 16 dwords from two blocks
 	VMOVDQ	0*32(INP),XTMP0
 	VMOVDQ	1*32(INP),XTMP1
@@ -597,19 +592,19 @@ last_block_enter:
 
 .align 16
 loop1:
-	vpaddd	0*32(TBL, SRND), X0, XFER
+	vpaddd	K256+0*32(SRND), X0, XFER
 	vmovdqa XFER, 0*32+_XFER(%rsp, SRND)
 	FOUR_ROUNDS_AND_SCHED	_XFER + 0*32
 
-	vpaddd	1*32(TBL, SRND), X0, XFER
+	vpaddd	K256+1*32(SRND), X0, XFER
 	vmovdqa XFER, 1*32+_XFER(%rsp, SRND)
 	FOUR_ROUNDS_AND_SCHED	_XFER + 1*32
 
-	vpaddd	2*32(TBL, SRND), X0, XFER
+	vpaddd	K256+2*32(SRND), X0, XFER
 	vmovdqa XFER, 2*32+_XFER(%rsp, SRND)
 	FOUR_ROUNDS_AND_SCHED	_XFER + 2*32
 
-	vpaddd	3*32(TBL, SRND), X0, XFER
+	vpaddd	K256+3*32(SRND), X0, XFER
 	vmovdqa XFER, 3*32+_XFER(%rsp, SRND)
 	FOUR_ROUNDS_AND_SCHED	_XFER + 3*32
 
@@ -619,10 +614,11 @@ loop1:
 
 loop2:
 	## Do last 16 rounds with no scheduling
-	vpaddd	0*32(TBL, SRND), X0, XFER
+	vpaddd	K256+0*32(SRND), X0, XFER
 	vmovdqa XFER, 0*32+_XFER(%rsp, SRND)
 	DO_4ROUNDS	_XFER + 0*32
-	vpaddd	1*32(TBL, SRND), X1, XFER
+
+	vpaddd	K256+1*32(SRND), X1, XFER
 	vmovdqa XFER, 1*32+_XFER(%rsp, SRND)
 	DO_4ROUNDS	_XFER + 1*32
 	add	$2*32, SRND
@@ -676,9 +672,6 @@ loop3:
 	ja	done_hash
 
 do_last_block:
-	#### do last block
-	lea	K256(%rip), TBL
-
 	VMOVDQ	0*16(INP),XWORD0
 	VMOVDQ	1*16(INP),XWORD1
 	VMOVDQ	2*16(INP),XWORD2
@@ -718,7 +711,6 @@ done_hash:
 	popq	%r14
 	popq	%r13
 	popq	%r12
-	popq	%rbp
 	popq	%rbx
 	ret
 ENDPROC(sha256_transform_rorx)
-- 
2.13.5


* Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
  2017-09-13 21:24           ` Josh Poimboeuf
@ 2017-09-13 22:33             ` Josh Poimboeuf
  2017-09-15  4:54               ` Eric Biggers
  2017-09-14  9:16             ` Ingo Molnar
  1 sibling, 1 reply; 31+ messages in thread
From: Josh Poimboeuf @ 2017-09-13 22:33 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Ingo Molnar, x86, linux-kernel, Tim Chen, Mathias Krause,
	Chandramouli Narayanan, Jussi Kivilinna, Peter Zijlstra,
	Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	Andy Lutomirski, Jiri Slaby

On Wed, Sep 13, 2017 at 04:24:28PM -0500, Josh Poimboeuf wrote:
> On Fri, Sep 08, 2017 at 10:57:05AM -0700, Eric Biggers wrote:
> > On Thu, Sep 07, 2017 at 11:26:47PM +0200, Ingo Molnar wrote:
> > > 
> > > * Eric Biggers <ebiggers3@gmail.com> wrote:
> > > 
> > > > On Thu, Sep 07, 2017 at 09:15:34AM +0200, Ingo Molnar wrote:
> > > > > 
> > > > > * Eric Biggers <ebiggers3@gmail.com> wrote:
> > > > > 
> > > > > > Thanks for fixing these!  I don't have time to review these in detail, but I ran
> > > > > > the crypto self-tests on the affected algorithms, and they all pass.  I also
> > > > > > benchmarked them before and after; the only noticable performance difference was
> > > > > > that sha256-avx2 and sha512-avx2 became a few percent slower.  I don't suppose
> > > > > > there is a way around that?  Otherwise it's probably not a big deal.
> > > > > 
> > > > > Which CPU model did you use for the test?
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > 	Ingo
> > > > 
> > > > This was on Haswell, "Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz".
> > > 
> > > Any chance to test this with the latest microarchitecture - any Skylake derivative 
> > > Intel CPU you have access to?
> > > 
> > > Thanks,
> > > 
> > > 	Ingo
> > 
> > Tested with Skylake, "Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz".  The results
> > were the following which seemed a bit worse than Haswell:
> > 
> > 	sha256-avx2 became 3.5% slower
> > 	sha512-avx2 became 7.5% slower
> > 
> > Note: it's tricky to benchmark this, especially with just a few percent
> > difference, so don't read too much into the exact numbers.
> 
> Here's a v2 for the sha256-avx2 patch, would you mind seeing if this
> closes the performance gap?
> 
> I'm still looking at the other one (sha512-avx2), but so far I haven't
> found a way to speed it back up.

And here's v2 of the sha512-avx2 patch.  It should hopefully gain back
most of the performance lost by v1.

From: Josh Poimboeuf <jpoimboe@redhat.com>
Subject: [PATCH] x86/crypto: Fix RBP usage in sha512-avx2-asm.S

Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Mix things up a little bit to get rid of the RBP usage, without
destroying performance.  Use RDI instead of RBP for the TBL pointer.
That will clobber CTX, so save CTX on the stack and use RDI as CTX
before it gets clobbered, and R12 as CTX after it gets clobbered.

Also remove the unused y4 variable.

Reported-by: Eric Biggers <ebiggers3@gmail.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/crypto/sha512-avx2-asm.S | 75 ++++++++++++++++++++-------------------
 1 file changed, 39 insertions(+), 36 deletions(-)

diff --git a/arch/x86/crypto/sha512-avx2-asm.S b/arch/x86/crypto/sha512-avx2-asm.S
index 7f5f6c6ec72e..b16d56005162 100644
--- a/arch/x86/crypto/sha512-avx2-asm.S
+++ b/arch/x86/crypto/sha512-avx2-asm.S
@@ -69,8 +69,9 @@ XFER  = YTMP0
 
 BYTE_FLIP_MASK  = %ymm9
 
-# 1st arg
-CTX         = %rdi
+# 1st arg is %rdi, which is saved to the stack and accessed later via %r12
+CTX1        = %rdi
+CTX2        = %r12
 # 2nd arg
 INP         = %rsi
 # 3rd arg
@@ -81,7 +82,7 @@ d           = %r8
 e           = %rdx
 y3          = %rsi
 
-TBL   = %rbp
+TBL   = %rdi # clobbers CTX1
 
 a     = %rax
 b     = %rbx
@@ -91,26 +92,26 @@ g     = %r10
 h     = %r11
 old_h = %r11
 
-T1    = %r12
+T1    = %r12 # clobbers CTX2
 y0    = %r13
 y1    = %r14
 y2    = %r15
 
-y4    = %r12
-
 # Local variables (stack frame)
 XFER_SIZE = 4*8
 SRND_SIZE = 1*8
 INP_SIZE = 1*8
 INPEND_SIZE = 1*8
+CTX_SIZE = 1*8
 RSPSAVE_SIZE = 1*8
-GPRSAVE_SIZE = 6*8
+GPRSAVE_SIZE = 5*8
 
 frame_XFER = 0
 frame_SRND = frame_XFER + XFER_SIZE
 frame_INP = frame_SRND + SRND_SIZE
 frame_INPEND = frame_INP + INP_SIZE
-frame_RSPSAVE = frame_INPEND + INPEND_SIZE
+frame_CTX = frame_INPEND + INPEND_SIZE
+frame_RSPSAVE = frame_CTX + CTX_SIZE
 frame_GPRSAVE = frame_RSPSAVE + RSPSAVE_SIZE
 frame_size = frame_GPRSAVE + GPRSAVE_SIZE
 
@@ -576,12 +577,11 @@ ENTRY(sha512_transform_rorx)
 	mov	%rax, frame_RSPSAVE(%rsp)
 
 	# Save GPRs
-	mov	%rbp, frame_GPRSAVE(%rsp)
-	mov	%rbx, 8*1+frame_GPRSAVE(%rsp)
-	mov	%r12, 8*2+frame_GPRSAVE(%rsp)
-	mov	%r13, 8*3+frame_GPRSAVE(%rsp)
-	mov	%r14, 8*4+frame_GPRSAVE(%rsp)
-	mov	%r15, 8*5+frame_GPRSAVE(%rsp)
+	mov	%rbx, 8*0+frame_GPRSAVE(%rsp)
+	mov	%r12, 8*1+frame_GPRSAVE(%rsp)
+	mov	%r13, 8*2+frame_GPRSAVE(%rsp)
+	mov	%r14, 8*3+frame_GPRSAVE(%rsp)
+	mov	%r15, 8*4+frame_GPRSAVE(%rsp)
 
 	shl	$7, NUM_BLKS	# convert to bytes
 	jz	done_hash
@@ -589,14 +589,17 @@ ENTRY(sha512_transform_rorx)
 	mov	NUM_BLKS, frame_INPEND(%rsp)
 
 	## load initial digest
-	mov	8*0(CTX),a
-	mov	8*1(CTX),b
-	mov	8*2(CTX),c
-	mov	8*3(CTX),d
-	mov	8*4(CTX),e
-	mov	8*5(CTX),f
-	mov	8*6(CTX),g
-	mov	8*7(CTX),h
+	mov	8*0(CTX1), a
+	mov	8*1(CTX1), b
+	mov	8*2(CTX1), c
+	mov	8*3(CTX1), d
+	mov	8*4(CTX1), e
+	mov	8*5(CTX1), f
+	mov	8*6(CTX1), g
+	mov	8*7(CTX1), h
+
+	# save %rdi (CTX) before it gets clobbered
+	mov	%rdi, frame_CTX(%rsp)
 
 	vmovdqa	PSHUFFLE_BYTE_FLIP_MASK(%rip), BYTE_FLIP_MASK
 
@@ -652,14 +655,15 @@ loop2:
 	subq	$1, frame_SRND(%rsp)
 	jne	loop2
 
-	addm	8*0(CTX),a
-	addm	8*1(CTX),b
-	addm	8*2(CTX),c
-	addm	8*3(CTX),d
-	addm	8*4(CTX),e
-	addm	8*5(CTX),f
-	addm	8*6(CTX),g
-	addm	8*7(CTX),h
+	mov	frame_CTX(%rsp), CTX2
+	addm	8*0(CTX2), a
+	addm	8*1(CTX2), b
+	addm	8*2(CTX2), c
+	addm	8*3(CTX2), d
+	addm	8*4(CTX2), e
+	addm	8*5(CTX2), f
+	addm	8*6(CTX2), g
+	addm	8*7(CTX2), h
 
 	mov	frame_INP(%rsp), INP
 	add	$128, INP
@@ -669,12 +673,11 @@ loop2:
 done_hash:
 
 # Restore GPRs
-	mov	frame_GPRSAVE(%rsp)     ,%rbp
-	mov	8*1+frame_GPRSAVE(%rsp) ,%rbx
-	mov	8*2+frame_GPRSAVE(%rsp) ,%r12
-	mov	8*3+frame_GPRSAVE(%rsp) ,%r13
-	mov	8*4+frame_GPRSAVE(%rsp) ,%r14
-	mov	8*5+frame_GPRSAVE(%rsp) ,%r15
+	mov	8*0+frame_GPRSAVE(%rsp), %rbx
+	mov	8*1+frame_GPRSAVE(%rsp), %r12
+	mov	8*2+frame_GPRSAVE(%rsp), %r13
+	mov	8*3+frame_GPRSAVE(%rsp), %r14
+	mov	8*4+frame_GPRSAVE(%rsp), %r15
 
 	# Restore Stack Pointer
 	mov	frame_RSPSAVE(%rsp), %rsp
-- 
2.13.5


* Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
  2017-09-13 21:24           ` Josh Poimboeuf
  2017-09-13 22:33             ` Josh Poimboeuf
@ 2017-09-14  9:16             ` Ingo Molnar
  2017-09-14  9:28               ` Ingo Molnar
  2017-09-14 13:28               ` Josh Poimboeuf
  1 sibling, 2 replies; 31+ messages in thread
From: Ingo Molnar @ 2017-09-14  9:16 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Eric Biggers, x86, linux-kernel, Tim Chen, Mathias Krause,
	Chandramouli Narayanan, Jussi Kivilinna, Peter Zijlstra,
	Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	Andy Lutomirski, Jiri Slaby


* Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> I'm still looking at the other one (sha512-avx2), but so far I haven't
> found a way to speed it back up.

Here are a couple of very quick observations with possible optimizations:

AFAICS the main effect of the RBP fixes is the introduction of a memory load into
the critical path, into the body of the unrolled loop:

+       mov frame_TBL(%rsp), TBL
        vpaddq  (TBL), Y_0, XFER
        vmovdqa XFER, frame_XFER(%rsp)
        FOUR_ROUNDS_AND_SCHED

Both 'TBL' and 'T1' are mapped to R12, which is why TBL has to be spilled to the
stack and reloaded.

1)

Note how R12 is used immediately, right in the next instruction:

        vpaddq  (TBL), Y_0, XFER

I.e. the RBP fixes lengthen the program-order data dependencies - that's a new
constraint and a few extra cycles per loop iteration if the workload is
address-generation bandwidth limited.

A simple way to ease that constraint would be to move the 'TBL' load up in the
loop body, to the point where 'T1' is used for the last time - which is:


        mov     a, T1           # T1 = a                                # MAJB
        and     c, T1           # T1 = a&c                              # MAJB

        add     y0, y2          # y2 = S1 + CH                          # --
        or      T1, y3          # y3 = MAJ = (a|c)&b)|(a&c)             # MAJ

+       mov frame_TBL(%rsp), TBL

        add     y1, h           # h = k + w + h + S0                    # --

        add     y2, d           # d = k + w + h + d + S1 + CH = d + t1  # --

        add     y2, h           # h = k + w + h + S0 + S1 + CH = t1 + S0# --
        add     y3, h           # h = t1 + S0 + MAJ                     # --

Note how this moves up the 'TBL' reload by 4 instructions.

2)

If this does not get back performance, then maybe another reason is that it's 
cache access latency limited, in which case a more involved optimization would be 
to look at the register patterns and usages:


			first-use	last-use		use-length
	a:		#10		#29			20
	b:		#24		#24			 1
	c:		#14		#30			17
	d:		#23		#34			12
	e:		#11		#20			10
	f:		#15		#15			 1
	g:		#18		#27			10
	h:		#13		#36			24

	y0:		#11		#31			21
	y1:		#12		#33			22
	y2:		#15		#35			21
	y3:		#10		#36			27

	T1:		#16		#32			17

The 'first-use' column shows the number of the instruction within the loop body
at which the register is first used - with '#1' denoting the first instruction and
#36 the last one - the 'last-use' column shows the last instruction that uses the
register, and the 'use-length' column shows the 'window' in which the register is live.

What we want are the registers that are used the most tightly, i.e. these two:

	b:		#24		#24			 1
	f:		#15		#15			 1

Of these two 'f' is the best one, because it has an earlier use and longer 
cooldown.

If we alias 'TBL' with 'f' then we could reload 'TBL' for the next iteration very
early on:

        mov     f, y2           # y2 = f                                # CH
+       mov frame_TBL(%rsp), TBL
        rorx    $34, a, T1      # T1 = a >> 34                          # S0B

And there will be 21 instructions that don't depend on TBL after this, plenty of
time for the load to be generated and propagated.

NOTE: my pseudo-patch is naive, due to the complication caused by the RotateState
macro name rotation. It's still fundamentally possible I believe; it's just that
'TBL' has to be rotated too, together with the other variables.

3)

If even this does not help, because the workload is ucode-cache limited and the
extra reloads pushed the critical path just beyond some cache limit, then another
experiment to try would be to roll the loop _back_ up a bit: instead of 4 unrolled
FOUR_ROUNDS_AND_SCHED blocks, try just having 2.

The CPU should still be smart enough with 2x interleaving of the loop body, the
extra branch overhead should be relatively small, and we could get back some
performance.

In theory ...
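
For what it's worth, a rough sketch of what a 2x-unrolled body might look like
(the offsets, the pointer advance and the SRND bookkeeping are assumptions on my
part, not the real code - the macro and frame slot names are the ones used above):

	.align 16
	loop1:
		mov	frame_TBL(%rsp), TBL		# one reload per iteration
		vpaddq	0*32(TBL), Y_0, XFER
		vmovdqa	XFER, frame_XFER(%rsp)
		FOUR_ROUNDS_AND_SCHED

		vpaddq	1*32(TBL), Y_0, XFER
		vmovdqa	XFER, frame_XFER(%rsp)
		FOUR_ROUNDS_AND_SCHED

		addq	$2*32, frame_TBL(%rsp)		# assumed: advance the saved pointer
		subq	$1, frame_SRND(%rsp)		# assumed: SRND doubled at init
		jne	loop1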

4)

If the workload is fundamentally cache-port bandwidth limited, then the extra
loads from memory to reload 'TBL' take away valuable bandwidth. There's no easy
fix for that, other than finding an unused register.

Here are the (initial, pre-rotation) integer register mappings:

	a:	RAX
	b:	RBX
	c:	RCX
	d:	R8
	e:	RDX
	f:	R9
	g:	R10
	h:	R11

	y0:	R13
	y1:	R14
	y2:	R15
	y3:	RSI

	T1:	R12

	TBL:	R12 # aliased to T1

Look what's missing: I don't see RDI being used in the loop.

RDI is allocated to 'CTX', but that's only used in higher-level glue code; it does
not appear to be used in the inner loops (explicitly, at least).

So if this observation of mine is true, we could go back to the old code for the
hot path, but use RDI for TBL and not reload it there.
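
A minimal sketch of that idea (the K512 symbol name and the frame_CTX save slot
are assumptions here, used only for illustration):

	mov	%rdi, frame_CTX(%rsp)	# free up %rdi for the hot path
	lea	K512(%rip), TBL		# TBL lives in %rdi from here on, never reloaded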

Thanks,

	Ingo


* Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
  2017-09-14  9:16             ` Ingo Molnar
@ 2017-09-14  9:28               ` Ingo Molnar
  2017-09-14 13:28               ` Josh Poimboeuf
  1 sibling, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2017-09-14  9:28 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Eric Biggers, x86, linux-kernel, Tim Chen, Mathias Krause,
	Chandramouli Narayanan, Jussi Kivilinna, Peter Zijlstra,
	Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	Andy Lutomirski, Jiri Slaby


* Ingo Molnar <mingo@kernel.org> wrote:

> 1)
> 
> Note how R12 is used immediately, right in the next instruction:
> 
>         vpaddq  (TBL), Y_0, XFER
> 
> I.e. the RBP fixes lengthen the program order data dependencies - that's a new 
> constraint and a few extra cycles per loop iteration if the workload is 
> address-generator bandwidth limited on that.
> 
> A simple way to ease that constraint would be to move the 'TLB' load up into the 
> loop, body, to the point where 'T1' is used for the last time - which is:
> 
> 
>         mov     a, T1           # T1 = a                                # MAJB
>         and     c, T1           # T1 = a&c                              # MAJB
> 
>         add     y0, y2          # y2 = S1 + CH                          # --
>         or      T1, y3          # y3 = MAJ = (a|c)&b)|(a&c)             # MAJ
> 
> +       mov frame_TBL(%rsp), TBL
> 
>         add     y1, h           # h = k + w + h + S0                    # --
> 
>         add     y2, d           # d = k + w + h + d + S1 + CH = d + t1  # --
> 
>         add     y2, h           # h = k + w + h + S0 + S1 + CH = t1 + S0# --
>         add     y3, h           # h = t1 + S0 + MAJ                     # --
> 
> Note how this moves up the 'TLB' reload by 4 instructions.

Note that in this case 'TBL' would have to be initialized before the 1st 
iteration, via something like:

        movq    $4, frame_SRND(%rsp)

+	mov frame_TBL(%rsp), TBL

.align 16
loop1:
        vpaddq  (TBL), Y_0, XFER
        vmovdqa XFER, frame_XFER(%rsp)
        FOUR_ROUNDS_AND_SCHED

Thanks,

	Ingo


* Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
  2017-09-14  9:16             ` Ingo Molnar
  2017-09-14  9:28               ` Ingo Molnar
@ 2017-09-14 13:28               ` Josh Poimboeuf
  2017-09-15  5:37                 ` Ingo Molnar
  1 sibling, 1 reply; 31+ messages in thread
From: Josh Poimboeuf @ 2017-09-14 13:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Biggers, x86, linux-kernel, Tim Chen, Mathias Krause,
	Chandramouli Narayanan, Jussi Kivilinna, Peter Zijlstra,
	Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	Andy Lutomirski, Jiri Slaby

On Thu, Sep 14, 2017 at 11:16:12AM +0200, Ingo Molnar wrote:
> 
> * Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> > I'm still looking at the other one (sha512-avx2), but so far I haven't
> > found a way to speed it back up.
> 
> Here's a couple of very quick observations with possible optimizations:
> 
> AFAICS the main effect of the RBP fixes is the introduction of a memory load into 
> the critical path, into the body unrolled loop:
> 
> +       mov frame_TBL(%rsp), TBL
>         vpaddq  (TBL), Y_0, XFER
>         vmovdqa XFER, frame_XFER(%rsp)
>         FOUR_ROUNDS_AND_SCHED
> 
> Both 'TLB' and 'T1' are mapped to R12, which is why TBL has to be spilled to be 
> reloaded from the stack.
> 
> 1)
> 
> Note how R12 is used immediately, right in the next instruction:
> 
>         vpaddq  (TBL), Y_0, XFER
> 
> I.e. the RBP fixes lengthen the program order data dependencies - that's a new 
> constraint and a few extra cycles per loop iteration if the workload is 
> address-generator bandwidth limited on that.
> 
> A simple way to ease that constraint would be to move the 'TLB' load up into the 
> loop, body, to the point where 'T1' is used for the last time - which is:
> 
> 
>         mov     a, T1           # T1 = a                                # MAJB
>         and     c, T1           # T1 = a&c                              # MAJB
> 
>         add     y0, y2          # y2 = S1 + CH                          # --
>         or      T1, y3          # y3 = MAJ = (a|c)&b)|(a&c)             # MAJ
> 
> +       mov frame_TBL(%rsp), TBL
> 
>         add     y1, h           # h = k + w + h + S0                    # --
> 
>         add     y2, d           # d = k + w + h + d + S1 + CH = d + t1  # --
> 
>         add     y2, h           # h = k + w + h + S0 + S1 + CH = t1 + S0# --
>         add     y3, h           # h = t1 + S0 + MAJ                     # --
> 
> Note how this moves up the 'TLB' reload by 4 instructions.
> 
> 2)
> 
> If this does not get back performance, then maybe another reason is that it's 
> cache access latency limited, in which case a more involved optimization would be 
> to look at the register patterns and usages:
> 
> 
> 			first-use	last-use		use-length
> 	a:		#10		#29			20
> 	b:		#24		#24			 1
> 	c:		#14		#30			17
> 	d:		#23		#34			12
> 	e:		#11		#20			10
> 	f:		#15		#15			 1
> 	g:		#18		#27			10
> 	h:		#13		#36			24
> 
> 	y0:		#11		#31			21
> 	y1:		#12		#33			22
> 	y2:		#15		#35			21
> 	y3:		#10		#36			27
> 
> 	T1:		#16		#32			17
> 
> The 'first-use' colums shows the number of the instruction within the loop body 
> that the register gets used - with '#1' denoting the first instruction ad #36 the 
> last instruction, the 'last-use' column is showing the last instruction, and the 
> 'use-length' colum shows the 'window' in which a register is used.
> 
> What we want are the registers that are used the most tightly, i.e. these two:
> 
> 	b:		#24		#24			 1
> 	f:		#15		#15			 1
> 
> Of these two 'f' is the best one, because it has an earlier use and longer 
> cooldown.
> 
> If alias 'TBL' with 'f' then we could reload 'TLB' for the next iteration very 
> early on:
> 
>         mov     f, y2           # y2 = f                                # CH
> +       mov frame_TBL(%rsp), TBL
>         rorx    $34, a, T1      # T1 = a >> 34                          # S0B
> 
> And there will be 21 instructions that don't depend on TLB after this, plenty of 
> time for the load to be generated and propagated.
> 
> NOTE: my pseudo-patch is naive, due to the complication caused by the RotateState 
> macro name rotation. It's still fundamentally possible I believe, it's just that 
> 'TBL' has to be rotated too, together with the other varibles.
> 
> 3)
> 
> If even this does not help, because the workload is ucode-cache limited, and the 
> extra reloads pushed the critical path just beyond some cache limit, then another 
> experiment to try would be to roll _back_ the loop some more: instead of 4x 
> FOUR_ROUNDS_AND_SCHED unrolled loops, try just having 2.
> 
> The CPU should still be smart enough with 2x interleaving of the loop body, and 
> the extra branches should be relatively small and we could get back some 
> performance.
> 
> In theory ...
> 
> 4)
> 
> If the workload is fundamentally cache-port bandwidth limited, then the extra 
> loads from memory to reload 'TLB' take away valuable bandwidth. There's no easy 
> fix for that, but to find an unused register.
> 
> Here's the (initial, pre-rotation) integer register mappings:
> 
> 	a:	RAX
> 	b:	RBX
> 	c:	RCX
> 	d:	R8
> 	e:	RDX
> 	f:	R9
> 	g:	R10
> 	h:	R11
> 
> 	y0:	R13
> 	y1:	R14
> 	y2:	R15
> 	y3:	RSI
> 
> 	T1:	R12
> 
> 	TLB:	R12 # aliased to T1
> 
> Look what's missing: I don't see RDI being used in the loop.
> 
> RDI is allocated to 'CTX', but that's only used in higher level glue code, it does 
> not appear to be used in the inner loops (explicitly at least).
> 
> So if this observation of mine is true we could go back to the old code for the 
> hotpath, but use RDI for TBL and not reload it in the hotpath.

Thanks for the excellent breakdown.

When I looked at the patch again, I came to the same conclusion as your
#4, which is that RDI isn't being used in the inner loops.  It *is* used
in the outermost loop, however.

So v2 of my sha512-avx2-asm.S patch spilled CTX onto the stack, instead
of TBL:

  https://lkml.kernel.org/r/20170913223303.pskmy2v7nto6rvtg@treble

That should result in fewer stack accesses, since it's only loaded from
the stack once in the outer loop.
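
Concretely, the only per-block stack traffic for CTX in that version is (sketch,
lines lifted from the v2 diff above):

	mov	%rdi, frame_CTX(%rsp)		# save CTX once, before %rdi is reused as TBL
	...
	mov	frame_CTX(%rsp), CTX2		# one reload per block, in the outer loop
	addm	8*0(CTX2), a			# digest add-back uses the reloaded copy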

If we still need more performance after that, then we can attempt
something similar to your other suggestions, except with CTX instead of
TBL.

-- 
Josh


* Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
  2017-09-13 22:33             ` Josh Poimboeuf
@ 2017-09-15  4:54               ` Eric Biggers
  2017-09-15  5:34                 ` Ingo Molnar
  0 siblings, 1 reply; 31+ messages in thread
From: Eric Biggers @ 2017-09-15  4:54 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Ingo Molnar, x86, linux-kernel, Tim Chen, Mathias Krause,
	Chandramouli Narayanan, Jussi Kivilinna, Peter Zijlstra,
	Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	Andy Lutomirski, Jiri Slaby

Hi Josh,

On Wed, Sep 13, 2017 at 05:33:03PM -0500, Josh Poimboeuf wrote:
> And here's v2 of the sha512-avx2 patch.  It should hopefully gain back
> most of the performance lost by v1.
> 
> From: Josh Poimboeuf <jpoimboe@redhat.com>
> Subject: [PATCH] x86/crypto: Fix RBP usage in sha512-avx2-asm.S
> 
> Using RBP as a temporary register breaks frame pointer convention and
> breaks stack traces when unwinding from an interrupt in the crypto code.
> 
> Mix things up a little bit to get rid of the RBP usage, without
> destroying performance.  Use RDI instead of RBP for the TBL pointer.
> That will clobber CTX, so save CTX on the stack and use RDI as CTX
> before it gets clobbered, and R12 as CTX after it gets clobbered.
> 
> Also remove the unused y4 variable.
> 

I tested the v2 patches for both sha256-avx2 and sha512-avx2 on Skylake.  They
both pass the crypto self-tests, and there was no noticeable performance
difference compared to the unpatched versions.  Thanks!

Eric


* Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
  2017-09-15  4:54               ` Eric Biggers
@ 2017-09-15  5:34                 ` Ingo Molnar
  2017-09-15 16:07                   ` Eric Biggers
  0 siblings, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2017-09-15  5:34 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Josh Poimboeuf, x86, linux-kernel, Tim Chen, Mathias Krause,
	Chandramouli Narayanan, Jussi Kivilinna, Peter Zijlstra,
	Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	Andy Lutomirski, Jiri Slaby


* Eric Biggers <ebiggers3@gmail.com> wrote:

> Hi Josh,
> 
> On Wed, Sep 13, 2017 at 05:33:03PM -0500, Josh Poimboeuf wrote:
> > And here's v2 of the sha512-avx2 patch.  It should hopefully gain back
> > most of the performance lost by v1.
> > 
> > From: Josh Poimboeuf <jpoimboe@redhat.com>
> > Subject: [PATCH] x86/crypto: Fix RBP usage in sha512-avx2-asm.S
> > 
> > Using RBP as a temporary register breaks frame pointer convention and
> > breaks stack traces when unwinding from an interrupt in the crypto code.
> > 
> > Mix things up a little bit to get rid of the RBP usage, without
> > destroying performance.  Use RDI instead of RBP for the TBL pointer.
> > That will clobber CTX, so save CTX on the stack and use RDI as CTX
> > before it gets clobbered, and R12 as CTX after it gets clobbered.
> > 
> > Also remove the unused y4 variable.
> > 
> 
> I tested the v2 patches for both sha256-avx2 and sha512-avx2 on Skylake.  They
> both pass the crypto self-tests, and there was no noticable performance
> difference compared to the unpatched versions.  Thanks!

Cool, thanks for the review and the testing! Can we add your Tested-by + Acked-by
tags to the patches?

Thanks,

	Ingo


* Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
  2017-09-14 13:28               ` Josh Poimboeuf
@ 2017-09-15  5:37                 ` Ingo Molnar
  0 siblings, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2017-09-15  5:37 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Eric Biggers, x86, linux-kernel, Tim Chen, Mathias Krause,
	Chandramouli Narayanan, Jussi Kivilinna, Peter Zijlstra,
	Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	Andy Lutomirski, Jiri Slaby


* Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> > So if this observation of mine is true we could go back to the old code for the 
> > hotpath, but use RDI for TBL and not reload it in the hotpath.
> 
> Thanks for the excellent breakdown.
> 
> When I looked at the patch again, I came to the same conclusion as your
> #4, which is that RDI isn't being used in the inner loops.  It *is* used
> in the outermost loop, however.
> 
> So v2 of my sha512-avx2-asm.S patch spilled CTX onto the stack, instead
> of TBL:
> 
>   https://lkml.kernel.org/r/20170913223303.pskmy2v7nto6rvtg@treble

Indeed - I should have checked your v2 patch, but somehow missed it. Would have 
saved me some looking+typing ;-)

Thanks,

	Ingo


* Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
  2017-09-15  5:34                 ` Ingo Molnar
@ 2017-09-15 16:07                   ` Eric Biggers
  2017-09-15 21:06                     ` Ingo Molnar
  0 siblings, 1 reply; 31+ messages in thread
From: Eric Biggers @ 2017-09-15 16:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Josh Poimboeuf, x86, linux-kernel, Tim Chen, Mathias Krause,
	Chandramouli Narayanan, Jussi Kivilinna, Peter Zijlstra,
	Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	Andy Lutomirski, Jiri Slaby

On Fri, Sep 15, 2017 at 07:34:31AM +0200, Ingo Molnar wrote:
> 
> * Eric Biggers <ebiggers3@gmail.com> wrote:
> 
> > Hi Josh,
> > 
> > On Wed, Sep 13, 2017 at 05:33:03PM -0500, Josh Poimboeuf wrote:
> > > And here's v2 of the sha512-avx2 patch.  It should hopefully gain back
> > > most of the performance lost by v1.
> > > 
> > > From: Josh Poimboeuf <jpoimboe@redhat.com>
> > > Subject: [PATCH] x86/crypto: Fix RBP usage in sha512-avx2-asm.S
> > > 
> > > Using RBP as a temporary register breaks frame pointer convention and
> > > breaks stack traces when unwinding from an interrupt in the crypto code.
> > > 
> > > Mix things up a little bit to get rid of the RBP usage, without
> > > destroying performance.  Use RDI instead of RBP for the TBL pointer.
> > > That will clobber CTX, so save CTX on the stack and use RDI as CTX
> > > before it gets clobbered, and R12 as CTX after it gets clobbered.
> > > 
> > > Also remove the unused y4 variable.
> > > 
> > 
> > I tested the v2 patches for both sha256-avx2 and sha512-avx2 on Skylake.  They
> > both pass the crypto self-tests, and there was no noticable performance
> > difference compared to the unpatched versions.  Thanks!
> 
> Cool, thanks for review and the testing! Can we add your Tested-by + Acked-by tags 
> to the patches?
> 

Yes, that's fine for all the patches in the series.

Will these patches go in through the crypto tree or through the x86 tree?

Eric


* Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
  2017-09-15 16:07                   ` Eric Biggers
@ 2017-09-15 21:06                     ` Ingo Molnar
  2017-09-19  3:00                       ` Herbert Xu
  0 siblings, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2017-09-15 21:06 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Josh Poimboeuf, x86, linux-kernel, Tim Chen, Mathias Krause,
	Chandramouli Narayanan, Jussi Kivilinna, Peter Zijlstra,
	Herbert Xu, David S. Miller, linux-crypto, Eric Biggers,
	Andy Lutomirski, Jiri Slaby


* Eric Biggers <ebiggers3@gmail.com> wrote:

> On Fri, Sep 15, 2017 at 07:34:31AM +0200, Ingo Molnar wrote:
> > 
> > * Eric Biggers <ebiggers3@gmail.com> wrote:
> > 
> > > Hi Josh,
> > > 
> > > On Wed, Sep 13, 2017 at 05:33:03PM -0500, Josh Poimboeuf wrote:
> > > > And here's v2 of the sha512-avx2 patch.  It should hopefully gain back
> > > > most of the performance lost by v1.
> > > > 
> > > > From: Josh Poimboeuf <jpoimboe@redhat.com>
> > > > Subject: [PATCH] x86/crypto: Fix RBP usage in sha512-avx2-asm.S
> > > > 
> > > > Using RBP as a temporary register breaks frame pointer convention and
> > > > breaks stack traces when unwinding from an interrupt in the crypto code.
> > > > 
> > > > Mix things up a little bit to get rid of the RBP usage, without
> > > > destroying performance.  Use RDI instead of RBP for the TBL pointer.
> > > > That will clobber CTX, so save CTX on the stack and use RDI as CTX
> > > > before it gets clobbered, and R12 as CTX after it gets clobbered.
> > > > 
> > > > Also remove the unused y4 variable.
> > > > 
> > > 
> > > I tested the v2 patches for both sha256-avx2 and sha512-avx2 on Skylake.  They
> > > both pass the crypto self-tests, and there was no noticable performance
> > > difference compared to the unpatched versions.  Thanks!
> > 
> > Cool, thanks for review and the testing! Can we add your Tested-by + Acked-by tags 
> > to the patches?
> > 
> 
> Yes, that's fine for all the patches in the series.
> 
> Will these patches go in through the crypto tree or through the x86 tree?

Indeed, I suspect they should go through the crypto tree; these fixes are
independent and don't depend on anything in the x86 tree.

Thanks,

	Ingo


* Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
  2017-09-15 21:06                     ` Ingo Molnar
@ 2017-09-19  3:00                       ` Herbert Xu
  0 siblings, 0 replies; 31+ messages in thread
From: Herbert Xu @ 2017-09-19  3:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Biggers, Josh Poimboeuf, x86, linux-kernel, Tim Chen,
	Mathias Krause, Chandramouli Narayanan, Jussi Kivilinna,
	Peter Zijlstra, David S. Miller, linux-crypto, Eric Biggers,
	Andy Lutomirski, Jiri Slaby

On Fri, Sep 15, 2017 at 11:06:29PM +0200, Ingo Molnar wrote:
>
> Indeed, I suspect they should go through the crypto tree, these fixes are 
> independent, they don't depend on anything in the x86 tree.

Sure, I can pick them up through cryptodev.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt



Thread overview: 31+ messages
2017-08-29 18:05 [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files Josh Poimboeuf
2017-08-29 18:05 ` [PATCH 01/12] x86/crypto: Fix RBP usage in blowfish-x86_64-asm_64.S Josh Poimboeuf
2017-08-29 18:05 ` [PATCH 02/12] x86/crypto: Fix RBP usage in camellia-x86_64-asm_64.S Josh Poimboeuf
2017-08-29 18:05 ` [PATCH 03/12] x86/crypto: Fix RBP usage in cast5-avx-x86_64-asm_64.S Josh Poimboeuf
2017-08-29 18:05 ` [PATCH 04/12] x86/crypto: Fix RBP usage in cast6-avx-x86_64-asm_64.S Josh Poimboeuf
2017-08-29 18:05 ` [PATCH 05/12] x86/crypto: Fix RBP usage in des3_ede-asm_64.S Josh Poimboeuf
2017-08-29 18:05 ` [PATCH 06/12] x86/crypto: Fix RBP usage in sha1_avx2_x86_64_asm.S Josh Poimboeuf
2017-09-06 16:11   ` Tim Chen
2017-08-29 18:05 ` [PATCH 07/12] x86/crypto: Fix RBP usage in sha1_ssse3_asm.S Josh Poimboeuf
2017-08-29 18:05 ` [PATCH 08/12] x86/crypto: Fix RBP usage in sha256-avx-asm.S Josh Poimboeuf
2017-08-29 18:05 ` [PATCH 09/12] x86/crypto: Fix RBP usage in sha256-avx2-asm.S Josh Poimboeuf
2017-08-29 18:05 ` [PATCH 10/12] x86/crypto: Fix RBP usage in sha256-ssse3-asm.S Josh Poimboeuf
2017-08-29 18:05 ` [PATCH 11/12] x86/crypto: Fix RBP usage in sha512-avx2-asm.S Josh Poimboeuf
2017-08-29 18:05 ` [PATCH 12/12] x86/crypto: Fix RBP usage in twofish-avx-x86_64-asm_64.S Josh Poimboeuf
2017-09-02  0:09 ` [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files Eric Biggers
2017-09-07  0:15   ` Josh Poimboeuf
2017-09-07  7:15   ` Ingo Molnar
2017-09-07 17:58     ` Eric Biggers
2017-09-07 21:26       ` Ingo Molnar
2017-09-08 17:57         ` Eric Biggers
2017-09-13 21:24           ` Josh Poimboeuf
2017-09-13 22:33             ` Josh Poimboeuf
2017-09-15  4:54               ` Eric Biggers
2017-09-15  5:34                 ` Ingo Molnar
2017-09-15 16:07                   ` Eric Biggers
2017-09-15 21:06                     ` Ingo Molnar
2017-09-19  3:00                       ` Herbert Xu
2017-09-14  9:16             ` Ingo Molnar
2017-09-14  9:28               ` Ingo Molnar
2017-09-14 13:28               ` Josh Poimboeuf
2017-09-15  5:37                 ` Ingo Molnar
