* [PATCH v4 0/17] x86: entry.S optimizations
@ 2014-08-08 17:44 Denys Vlasenko
  2014-08-08 17:44 ` [PATCH 01/17] x86: entry_64.S: delete unused code Denys Vlasenko
                   ` (17 more replies)
  0 siblings, 18 replies; 28+ messages in thread
From: Denys Vlasenko @ 2014-08-08 17:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Andy Lutomirski, Frederic Weisbecker, X86 ML, Alexei Starovoitov,
	Will Drewry, Kees Cook

Version 4 of the patchset.

Please consider applying at least the first two patches; they are
definitely safe, and the second one fixes a latent bug.

Changes since v3:
= simplified iret stack handling on SYSCALL64 fastpath:
  got rid of FIXUP_TOP_OF_STACK and RESTORE_TOP_OF_STACK,
  got rid of thread_struct::usersp.
= save_paranoid cleaned up.
= folded test_in_nmi and IA32_ARG_FIXUP macros into their users.
= got rid of KERNEL_STACK_OFFSET.
= other small optimizations.
= fixed comments about SYSCALL from 32-bit mode.

Changes since v2:
= fixed pre-existing latent bug: wrong symbolic constant usage: R11->ARGOFFSET
= per Oleg's request, added comments about various syscall instructions

Changes since v1 (mainly in patch 4/6):
= Reverted changes to "partial pt_regs saving" on interrupt path.
= Folded PARTIAL_FRAME macro into DEFAULT_FRAME. Patch v1
  had a bug (incorrect dwarf debug info generation).
= Corrected more comments.
= Added use of SAVE_* macros to error_entry and save_paranoid.
= Added another cleanup rename patch.

TODO:
= The "idtentry" macro uses a subroutine to factor out a largish
  common code block, whereas the "interrupt" macro inlines a similar
  block every time (~20 instances). Maybe factor it out too?

CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Oleg Nesterov <oleg@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: X86 ML <x86@kernel.org>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: linux-kernel@vger.kernel.org


Denys Vlasenko (17):
  x86: entry_64.S: delete unused code
  x86: ia32entry.S: fix wrong symbolic constant usage: R11->ARGOFFSET
  x86: open-code register save/restore in trace_hardirqs thunks
  x86: entry_64.S: fold SAVE_ARGS_IRQ macro into its sole user
  x86: entry_64.S: always allocate complete "struct pt_regs"
  x86: mass removal of ARGOFFSET
  x86: rename some macros and labels, no code changes
  x86: add comments about various syscall instructions, no code changes
  x86: entry_64.S: move save_paranoid and ret_from_fork closer to their
    users
  x86: entry_64.S: rename save_paranoid to paranoid_entry, no code
    changes
  x86: entry_64.S: fold test_in_nmi macro into its only user
  x86: get rid of KERNEL_STACK_OFFSET
  x86: ia32entry.S: fold IA32_ARG_FIXUP macro into its callers
  x86: ia32entry.S: use mov instead of push/pop where possible
  x86: code shrink in paranoid_exit
  x86: entry_64.S: trivial optimization for ENOSYS
  x86: simplify iret stack handling on SYSCALL64 fastpath

 arch/x86/ia32/ia32entry.S              | 366 +++++++++--------
 arch/x86/include/asm/calling.h         | 226 +++++------
 arch/x86/include/asm/compat.h          |   2 +-
 arch/x86/include/asm/irqflags.h        |   4 +-
 arch/x86/include/asm/processor.h       |   1 -
 arch/x86/include/asm/ptrace.h          |  21 +-
 arch/x86/include/asm/thread_info.h     |   8 +-
 arch/x86/include/uapi/asm/ptrace-abi.h |  16 +-
 arch/x86/include/uapi/asm/ptrace.h     |  13 +-
 arch/x86/kernel/cpu/common.c           |   2 +-
 arch/x86/kernel/entry_64.S             | 708 ++++++++++++++-------------------
 arch/x86/kernel/preempt.S              |  16 +-
 arch/x86/kernel/process_32.c           |   3 +-
 arch/x86/kernel/process_64.c           |  11 +-
 arch/x86/kernel/smpboot.c              |   3 +-
 arch/x86/lib/thunk_64.S                |  29 +-
 arch/x86/syscalls/syscall_64.tbl       |   2 +-
 arch/x86/um/sys_call_table_64.c        |   2 +-
 arch/x86/xen/smp.c                     |   3 +-
 19 files changed, 699 insertions(+), 737 deletions(-)

-- 
1.8.1.4



* [PATCH 01/17] x86: entry_64.S: delete unused code
  2014-08-08 17:44 [PATCH v4 0/17] x86: entry.S optimizations Denys Vlasenko
@ 2014-08-08 17:44 ` Denys Vlasenko
  2014-08-08 17:44 ` [PATCH 02/17] x86: ia32entry.S: fix wrong symbolic constant usage: R11->ARGOFFSET Denys Vlasenko
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Denys Vlasenko @ 2014-08-08 17:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Andy Lutomirski, Frederic Weisbecker, X86 ML, Alexei Starovoitov,
	Will Drewry, Kees Cook

A define, two macros and an unreferenced bit of assembly are gone.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Oleg Nesterov <oleg@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: X86 ML <x86@kernel.org>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/calling.h |  1 -
 arch/x86/kernel/entry_64.S     | 34 ----------------------------------
 2 files changed, 35 deletions(-)

diff --git a/arch/x86/include/asm/calling.h b/arch/x86/include/asm/calling.h
index cb4c73b..e176cea 100644
--- a/arch/x86/include/asm/calling.h
+++ b/arch/x86/include/asm/calling.h
@@ -83,7 +83,6 @@ For 32-bit we have the following conventions - kernel is built with
 #define SS		160
 
 #define ARGOFFSET	R11
-#define SWFRAME		ORIG_RAX
 
 	.macro SAVE_ARGS addskip=0, save_rcx=1, save_r891011=1
 	subq  $9*8+\addskip, %rsp
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index b25ca96..dbfd037 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -155,27 +155,6 @@ ENDPROC(native_usergs_sysret64)
 	movq \tmp,R11+\offset(%rsp)
 	.endm
 
-	.macro FAKE_STACK_FRAME child_rip
-	/* push in order ss, rsp, eflags, cs, rip */
-	xorl %eax, %eax
-	pushq_cfi $__KERNEL_DS /* ss */
-	/*CFI_REL_OFFSET	ss,0*/
-	pushq_cfi %rax /* rsp */
-	CFI_REL_OFFSET	rsp,0
-	pushq_cfi $(X86_EFLAGS_IF|X86_EFLAGS_FIXED) /* eflags - interrupts on */
-	/*CFI_REL_OFFSET	rflags,0*/
-	pushq_cfi $__KERNEL_CS /* cs */
-	/*CFI_REL_OFFSET	cs,0*/
-	pushq_cfi \child_rip /* rip */
-	CFI_REL_OFFSET	rip,0
-	pushq_cfi %rax /* orig rax */
-	.endm
-
-	.macro UNFAKE_STACK_FRAME
-	addq $8*6, %rsp
-	CFI_ADJUST_CFA_OFFSET	-(6*8)
-	.endm
-
 /*
  * initial frame state for interrupts (and exceptions without error code)
  */
@@ -640,19 +619,6 @@ END(\label)
 	FORK_LIKE  vfork
 	FIXED_FRAME stub_iopl, sys_iopl
 
-ENTRY(ptregscall_common)
-	DEFAULT_FRAME 1 8	/* offset 8: return address */
-	RESTORE_TOP_OF_STACK %r11, 8
-	movq_cfi_restore R15+8, r15
-	movq_cfi_restore R14+8, r14
-	movq_cfi_restore R13+8, r13
-	movq_cfi_restore R12+8, r12
-	movq_cfi_restore RBP+8, rbp
-	movq_cfi_restore RBX+8, rbx
-	ret $REST_SKIP		/* pop extended registers */
-	CFI_ENDPROC
-END(ptregscall_common)
-
 ENTRY(stub_execve)
 	CFI_STARTPROC
 	addq $8, %rsp
-- 
1.8.1.4



* [PATCH 02/17] x86: ia32entry.S: fix wrong symbolic constant usage: R11->ARGOFFSET
  2014-08-08 17:44 [PATCH v4 0/17] x86: entry.S optimizations Denys Vlasenko
  2014-08-08 17:44 ` [PATCH 01/17] x86: entry_64.S: delete unused code Denys Vlasenko
@ 2014-08-08 17:44 ` Denys Vlasenko
  2014-08-08 17:44 ` [PATCH 03/17] x86: open-code register save/restore in trace_hardirqs thunks Denys Vlasenko
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Denys Vlasenko @ 2014-08-08 17:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Andy Lutomirski, Frederic Weisbecker, X86 ML, Alexei Starovoitov,
	Will Drewry, Kees Cook

The values of these two constants are the same, but their meanings
are different.
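
For illustration (the relevant defines in calling.h, quoted here for
context, not part of this patch):

	#define R11		48	/* stack slot of the saved %r11 */
	#define ARGOFFSET	R11	/* base of the partial frame */

so e.g. EFLAGS-R11 happened to assemble to the same offset as
EFLAGS-ARGOFFSET, which is why the bug stayed latent.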

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Oleg Nesterov <oleg@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: X86 ML <x86@kernel.org>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: linux-kernel@vger.kernel.org
---
 arch/x86/ia32/ia32entry.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 4299eb0..59c8d3a 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -169,8 +169,8 @@ sysenter_dispatch:
 sysexit_from_sys_call:
 	andl    $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP-ARGOFFSET)
 	/* clear IF, that popfq doesn't enable interrupts early */
-	andl  $~0x200,EFLAGS-R11(%rsp) 
-	movl	RIP-R11(%rsp),%edx		/* User %eip */
+	andl	$~0x200,EFLAGS-ARGOFFSET(%rsp)
+	movl	RIP-ARGOFFSET(%rsp),%edx		/* User %eip */
 	CFI_REGISTER rip,rdx
 	RESTORE_ARGS 0,24,0,0,0,0
 	xorq	%r8,%r8
-- 
1.8.1.4



* [PATCH 03/17] x86: open-code register save/restore in trace_hardirqs thunks
  2014-08-08 17:44 [PATCH v4 0/17] x86: entry.S optimizations Denys Vlasenko
  2014-08-08 17:44 ` [PATCH 01/17] x86: entry_64.S: delete unused code Denys Vlasenko
  2014-08-08 17:44 ` [PATCH 02/17] x86: ia32entry.S: fix wrong symbolic constant usage: R11->ARGOFFSET Denys Vlasenko
@ 2014-08-08 17:44 ` Denys Vlasenko
  2014-08-08 17:44 ` [PATCH 04/17] x86: entry_64.S: fold SAVE_ARGS_IRQ macro into its sole user Denys Vlasenko
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Denys Vlasenko @ 2014-08-08 17:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Andy Lutomirski, Frederic Weisbecker, X86 ML, Alexei Starovoitov,
	Will Drewry, Kees Cook

This is a preparatory patch for a change in "struct pt_regs"
handling in entry_64.S.

The trace_hardirqs thunks were (ab)using part of the pt_regs
handling code, namely the SAVE_ARGS/RESTORE_ARGS macros,
to save/restore registers across C function calls.

Since SAVE_ARGS is going to be changed, open-code the
register saving/restoring here.

Incidentally, this removes a bit of dead code:
one SAVE_ARGS was there just to emit CFI annotations,
but it also generated unreachable assembly insns.
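
For reference, the movq_cfi/movq_cfi_restore helpers used in the
open-coded version pair each move with a DWARF annotation. Roughly
(a sketch from memory; the real definitions live in asm/dwarf2.h):

	.macro movq_cfi reg offset=0
	movq %\reg, \offset(%rsp)
	CFI_REL_OFFSET \reg, \offset
	.endm

	.macro movq_cfi_restore offset reg
	movq \offset(%rsp), %\reg
	CFI_RESTORE \reg
	.endm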

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Oleg Nesterov <oleg@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: X86 ML <x86@kernel.org>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: linux-kernel@vger.kernel.org
---
 arch/x86/lib/thunk_64.S | 29 ++++++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/arch/x86/lib/thunk_64.S b/arch/x86/lib/thunk_64.S
index 92d9fea..c1c9131 100644
--- a/arch/x86/lib/thunk_64.S
+++ b/arch/x86/lib/thunk_64.S
@@ -16,10 +16,20 @@
 \name:
 	CFI_STARTPROC
 
-	/* this one pushes 9 elems, the next one would be %rIP */
-	SAVE_ARGS
+	subq  $9*8, %rsp
+	CFI_ADJUST_CFA_OFFSET 9*8
+	movq_cfi rdi, 8*8
+	movq_cfi rsi, 7*8
+	movq_cfi rdx, 6*8
+	movq_cfi rcx, 5*8
+	movq_cfi rax, 4*8
+	movq_cfi r8,  3*8
+	movq_cfi r9,  2*8
+	movq_cfi r10, 1*8
+	movq_cfi r11, 0*8
 
 	.if \put_ret_addr_in_rdi
+	/* 9*8(%rsp) is return addr on stack */
 	movq_cfi_restore 9*8, rdi
 	.endif
 
@@ -38,11 +48,20 @@
 	THUNK lockdep_sys_exit_thunk,lockdep_sys_exit
 #endif
 
-	/* SAVE_ARGS below is used only for the .cfi directives it contains. */
 	CFI_STARTPROC
-	SAVE_ARGS
+	CFI_ADJUST_CFA_OFFSET 9*8
 restore:
-	RESTORE_ARGS
+	movq_cfi_restore 0*8, r11
+	movq_cfi_restore 1*8, r10
+	movq_cfi_restore 2*8, r9
+	movq_cfi_restore 3*8, r8
+	movq_cfi_restore 4*8, rax
+	movq_cfi_restore 5*8, rcx
+	movq_cfi_restore 6*8, rdx
+	movq_cfi_restore 7*8, rsi
+	movq_cfi_restore 8*8, rdi
+	addq $9*8, %rsp
+	CFI_ADJUST_CFA_OFFSET -9*8
 	ret
 	CFI_ENDPROC
 	_ASM_NOKPROBE(restore)
-- 
1.8.1.4



* [PATCH 04/17] x86: entry_64.S: fold SAVE_ARGS_IRQ macro into its sole user
  2014-08-08 17:44 [PATCH v4 0/17] x86: entry.S optimizations Denys Vlasenko
                   ` (2 preceding siblings ...)
  2014-08-08 17:44 ` [PATCH 03/17] x86: open-code register save/restore in trace_hardirqs thunks Denys Vlasenko
@ 2014-08-08 17:44 ` Denys Vlasenko
  2014-08-08 17:44 ` [PATCH 05/17] x86: entry_64.S: always allocate complete "struct pt_regs" Denys Vlasenko
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Denys Vlasenko @ 2014-08-08 17:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Andy Lutomirski, Frederic Weisbecker, X86 ML, Alexei Starovoitov,
	Will Drewry, Kees Cook

No code changes.

This is a preparatory patch for a change in "struct pt_regs" handling.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Oleg Nesterov <oleg@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: X86 ML <x86@kernel.org>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/entry_64.S | 88 ++++++++++++++++++++++------------------------
 1 file changed, 42 insertions(+), 46 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index dbfd037..37f7d95 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -218,51 +218,6 @@ ENDPROC(native_usergs_sysret64)
 	CFI_REL_OFFSET r15, R15+\offset
 	.endm
 
-/* save partial stack frame */
-	.macro SAVE_ARGS_IRQ
-	cld
-	/* start from rbp in pt_regs and jump over */
-	movq_cfi rdi, (RDI-RBP)
-	movq_cfi rsi, (RSI-RBP)
-	movq_cfi rdx, (RDX-RBP)
-	movq_cfi rcx, (RCX-RBP)
-	movq_cfi rax, (RAX-RBP)
-	movq_cfi  r8,  (R8-RBP)
-	movq_cfi  r9,  (R9-RBP)
-	movq_cfi r10, (R10-RBP)
-	movq_cfi r11, (R11-RBP)
-
-	/* Save rbp so that we can unwind from get_irq_regs() */
-	movq_cfi rbp, 0
-
-	/* Save previous stack value */
-	movq %rsp, %rsi
-
-	leaq -RBP(%rsp),%rdi	/* arg1 for handler */
-	testl $3, CS-RBP(%rsi)
-	je 1f
-	SWAPGS
-	/*
-	 * irq_count is used to check if a CPU is already on an interrupt stack
-	 * or not. While this is essentially redundant with preempt_count it is
-	 * a little cheaper to use a separate counter in the PDA (short of
-	 * moving irq_enter into assembly, which would be too much work)
-	 */
-1:	incl PER_CPU_VAR(irq_count)
-	cmovzq PER_CPU_VAR(irq_stack_ptr),%rsp
-	CFI_DEF_CFA_REGISTER	rsi
-
-	/* Store previous stack value */
-	pushq %rsi
-	CFI_ESCAPE	0x0f /* DW_CFA_def_cfa_expression */, 6, \
-			0x77 /* DW_OP_breg7 */, 0, \
-			0x06 /* DW_OP_deref */, \
-			0x08 /* DW_OP_const1u */, SS+8-RBP, \
-			0x22 /* DW_OP_plus */
-	/* We entered an interrupt context - irqs are off: */
-	TRACE_IRQS_OFF
-	.endm
-
 ENTRY(save_paranoid)
 	XCPT_FRAME 1 RDI+8
 	cld
@@ -731,7 +686,48 @@ END(interrupt)
 	/* reserve pt_regs for scratch regs and rbp */
 	subq $ORIG_RAX-RBP, %rsp
 	CFI_ADJUST_CFA_OFFSET ORIG_RAX-RBP
-	SAVE_ARGS_IRQ
+	cld
+	/* start from rbp in pt_regs and jump over */
+	movq_cfi rdi, (RDI-RBP)
+	movq_cfi rsi, (RSI-RBP)
+	movq_cfi rdx, (RDX-RBP)
+	movq_cfi rcx, (RCX-RBP)
+	movq_cfi rax, (RAX-RBP)
+	movq_cfi  r8,  (R8-RBP)
+	movq_cfi  r9,  (R9-RBP)
+	movq_cfi r10, (R10-RBP)
+	movq_cfi r11, (R11-RBP)
+
+	/* Save rbp so that we can unwind from get_irq_regs() */
+	movq_cfi rbp, 0
+
+	/* Save previous stack value */
+	movq %rsp, %rsi
+
+	leaq -RBP(%rsp),%rdi	/* arg1 for handler */
+	testl $3, CS-RBP(%rsi)
+	je 1f
+	SWAPGS
+	/*
+	 * irq_count is used to check if a CPU is already on an interrupt stack
+	 * or not. While this is essentially redundant with preempt_count it is
+	 * a little cheaper to use a separate counter in the PDA (short of
+	 * moving irq_enter into assembly, which would be too much work)
+	 */
+1:	incl PER_CPU_VAR(irq_count)
+	cmovzq PER_CPU_VAR(irq_stack_ptr),%rsp
+	CFI_DEF_CFA_REGISTER	rsi
+
+	/* Store previous stack value */
+	pushq %rsi
+	CFI_ESCAPE	0x0f /* DW_CFA_def_cfa_expression */, 6, \
+			0x77 /* DW_OP_breg7 */, 0, \
+			0x06 /* DW_OP_deref */, \
+			0x08 /* DW_OP_const1u */, SS+8-RBP, \
+			0x22 /* DW_OP_plus */
+	/* We entered an interrupt context - irqs are off: */
+	TRACE_IRQS_OFF
+
 	call \func
 	.endm
 
-- 
1.8.1.4



* [PATCH 05/17] x86: entry_64.S: always allocate complete "struct pt_regs"
  2014-08-08 17:44 [PATCH v4 0/17] x86: entry.S optimizations Denys Vlasenko
                   ` (3 preceding siblings ...)
  2014-08-08 17:44 ` [PATCH 04/17] x86: entry_64.S: fold SAVE_ARGS_IRQ macro into its sole user Denys Vlasenko
@ 2014-08-08 17:44 ` Denys Vlasenko
  2014-08-08 17:44 ` [PATCH 06/17] x86: mass removal of ARGOFFSET Denys Vlasenko
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Denys Vlasenko @ 2014-08-08 17:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Andy Lutomirski, Frederic Weisbecker, X86 ML, Alexei Starovoitov,
	Will Drewry, Kees Cook

The 64-bit code was using six fewer stack slots by not saving/restoring
registers which are callee-preserved according to the C ABI,
and by not allocating space for them.
The complete area was allocated and filled in only when
a syscall needed a complete "struct pt_regs".
As an additional twist, on interrupt entry a "slightly less truncated
pt_regs" trick is used to make nested interrupt stacks easier to unwind.

This proved to be a source of significant obfuscation and subtle bugs.
For example, stub_fork had to pop the return address,
extend the struct, save registers, and push the return address back. Ugly.
ia32_ptregs_common pops the return address and "returns" via a jmp insn,
throwing a wrench into the CPU's return stack cache.
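
For reference, the dance being removed (quoted from stub_\func and
ptregscall_common in the diffs):

	popq	%r11			/* save return address */
	PARTIAL_FRAME 0
	SAVE_REST
	pushq	%r11			/* put it back on stack */
	...
	ret $REST_SKIP			/* pop extended registers */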

This patch changes the code to always allocate a complete "struct pt_regs".
The saving of registers is still done lazily.

"Partial pt_regs" trick on interrupt stack is retained, but with added comments
which explain what we are doing, and why. Existing comments are improved.
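
Abridged from the "interrupt" macro in the diff below, the interrupt
path now allocates the truncated frame like this:

	ALLOC_PTREGS_ON_STACK -RBP	/* frame stops at the rbp slot */
	SAVE_C_REGS -RBP		/* scratch regs, biased by -RBP */
	SAVE_EXTRA_REGS_RBP -RBP	/* rbp lands at 0(%rsp) for unwinder */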

Macros which manipulate "struct pt_regs" on stack are reworked:
ALLOC_PTREGS_ON_STACK allocates the structure.
SAVE_C_REGS saves to it those registers which are clobbered by C code.
SAVE_EXTRA_REGS saves to it all other registers.
Corresponding RESTORE_* and REMOVE_PTREGS_FROM_STACK macros reverse it.
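
Taken together, the 64-bit syscall fast path composes them like this
(abridged from system_call/sysret_check in the diff below):

	ALLOC_PTREGS_ON_STACK 8			/* reserve pt_regs, skip 8 */
	SAVE_C_REGS				/* callee-clobbered regs */
	movq	%rax,ORIG_RAX-ARGOFFSET(%rsp)	/* syscall# -> orig_ax slot */
	...					/* dispatch; on exit: */
	RESTORE_C_REGS_EXCEPT_RCX
	movq	RIP-ARGOFFSET(%rsp),%rcx	/* user RIP for sysret */
	USERGS_SYSRET64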

ia32_ptregs_common, stub_fork and friends lost their ugly dance with
the return pointer.

LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets
instead of magic numbers.
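
E.g. (from the ia32entry.S diff below):

	movl \offset+40(%rsp),%ecx	/* before: magic number */
	movl \offset+RCX(%rsp),%ecx	/* after: symbolic offset */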

error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS
instead of open-coding the same sequence yet again.

Misleading and slightly wrong comments in "struct pt_regs" are fixed
(four instances).

The patch was run-tested: 64-bit executables, 32-bit executables,
and strace all work.
Timing tests did not show a measurable difference in 32-bit
and 64-bit syscalls.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Oleg Nesterov <oleg@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: X86 ML <x86@kernel.org>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: linux-kernel@vger.kernel.org
---
 arch/x86/ia32/ia32entry.S              |  47 +++----
 arch/x86/include/asm/calling.h         | 217 ++++++++++++++++-----------------
 arch/x86/include/asm/irqflags.h        |   4 +-
 arch/x86/include/asm/ptrace.h          |  13 +-
 arch/x86/include/uapi/asm/ptrace-abi.h |  16 ++-
 arch/x86/include/uapi/asm/ptrace.h     |  13 +-
 arch/x86/kernel/entry_64.S             | 191 +++++++++++------------------
 arch/x86/kernel/preempt.S              |  16 ++-
 8 files changed, 249 insertions(+), 268 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 59c8d3a..d0ca63b 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -62,12 +62,12 @@
 	 */
 	.macro LOAD_ARGS32 offset, _r9=0
 	.if \_r9
-	movl \offset+16(%rsp),%r9d
+	movl \offset+R9(%rsp),%r9d
 	.endif
-	movl \offset+40(%rsp),%ecx
-	movl \offset+48(%rsp),%edx
-	movl \offset+56(%rsp),%esi
-	movl \offset+64(%rsp),%edi
+	movl \offset+RCX(%rsp),%ecx
+	movl \offset+RDX(%rsp),%edx
+	movl \offset+RSI(%rsp),%esi
+	movl \offset+RDI(%rsp),%edi
 	movl %eax,%eax			/* zero extension */
 	.endm
 	
@@ -144,7 +144,8 @@ ENTRY(ia32_sysenter_target)
 	CFI_REL_OFFSET rip,0
 	pushq_cfi %rax
 	cld
-	SAVE_ARGS 0,1,0
+	ALLOC_PTREGS_ON_STACK
+	SAVE_C_REGS_EXCEPT_R891011
  	/* no need to do an access_ok check here because rbp has been
  	   32bit zero extended */ 
 	ASM_STAC
@@ -172,7 +173,8 @@ sysexit_from_sys_call:
 	andl	$~0x200,EFLAGS-ARGOFFSET(%rsp)
 	movl	RIP-ARGOFFSET(%rsp),%edx		/* User %eip */
 	CFI_REGISTER rip,rdx
-	RESTORE_ARGS 0,24,0,0,0,0
+	RESTORE_RSI_RDI
+	REMOVE_PTREGS_FROM_STACK 8*3
 	xorq	%r8,%r8
 	xorq	%r9,%r9
 	xorq	%r10,%r10
@@ -240,13 +242,13 @@ sysenter_tracesys:
 	testl	$(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
 	jz	sysenter_auditsys
 #endif
-	SAVE_REST
+	SAVE_EXTRA_REGS
 	CLEAR_RREGS
 	movq	$-ENOSYS,RAX(%rsp)/* ptrace can change this for a bad syscall */
 	movq	%rsp,%rdi        /* &pt_regs -> arg1 */
 	call	syscall_trace_enter
 	LOAD_ARGS32 ARGOFFSET  /* reload args from stack in case ptrace changed it */
-	RESTORE_REST
+	RESTORE_EXTRA_REGS
 	cmpq	$(IA32_NR_syscalls-1),%rax
 	ja	int_ret_from_sys_call /* sysenter_tracesys has set RAX(%rsp) */
 	jmp	sysenter_do_call
@@ -288,7 +290,8 @@ ENTRY(ia32_cstar_target)
 	 * disabled irqs and here we enable it straight after entry:
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
-	SAVE_ARGS 8,0,0
+	ALLOC_PTREGS_ON_STACK 8
+	SAVE_C_REGS_EXCEPT_RCX_R891011
 	movl 	%eax,%eax	/* zero extension */
 	movq	%rax,ORIG_RAX-ARGOFFSET(%rsp)
 	movq	%rcx,RIP-ARGOFFSET(%rsp)
@@ -325,7 +328,7 @@ cstar_dispatch:
 	jnz sysretl_audit
 sysretl_from_sys_call:
 	andl $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP-ARGOFFSET)
-	RESTORE_ARGS 0,-ARG_SKIP,0,0,0
+	RESTORE_RSI_RDI_RDX
 	movl RIP-ARGOFFSET(%rsp),%ecx
 	CFI_REGISTER rip,rcx
 	movl EFLAGS-ARGOFFSET(%rsp),%r11d	
@@ -356,13 +359,13 @@ cstar_tracesys:
 	jz cstar_auditsys
 #endif
 	xchgl %r9d,%ebp
-	SAVE_REST
+	SAVE_EXTRA_REGS
 	CLEAR_RREGS 0, r9
 	movq $-ENOSYS,RAX(%rsp)	/* ptrace can change this for a bad syscall */
 	movq %rsp,%rdi        /* &pt_regs -> arg1 */
 	call syscall_trace_enter
 	LOAD_ARGS32 ARGOFFSET, 1  /* reload args from stack in case ptrace changed it */
-	RESTORE_REST
+	RESTORE_EXTRA_REGS
 	xchgl %ebp,%r9d
 	cmpq $(IA32_NR_syscalls-1),%rax
 	ja int_ret_from_sys_call /* cstar_tracesys has set RAX(%rsp) */
@@ -417,7 +420,8 @@ ENTRY(ia32_syscall)
 	cld
 	/* note the registers are not zero extended to the sf.
 	   this could be a problem. */
-	SAVE_ARGS 0,1,0
+	ALLOC_PTREGS_ON_STACK
+	SAVE_C_REGS_EXCEPT_R891011
 	orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP-ARGOFFSET)
 	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
 	jnz ia32_tracesys
@@ -430,16 +434,16 @@ ia32_sysret:
 	movq %rax,RAX-ARGOFFSET(%rsp)
 ia32_ret_from_sys_call:
 	CLEAR_RREGS -ARGOFFSET
-	jmp int_ret_from_sys_call 
+	jmp int_ret_from_sys_call
 
-ia32_tracesys:			 
-	SAVE_REST
+ia32_tracesys:
+	SAVE_EXTRA_REGS
 	CLEAR_RREGS
 	movq $-ENOSYS,RAX(%rsp)	/* ptrace can change this for a bad syscall */
 	movq %rsp,%rdi        /* &pt_regs -> arg1 */
 	call syscall_trace_enter
 	LOAD_ARGS32 ARGOFFSET  /* reload args from stack in case ptrace changed it */
-	RESTORE_REST
+	RESTORE_EXTRA_REGS
 	cmpq $(IA32_NR_syscalls-1),%rax
 	ja  int_ret_from_sys_call	/* ia32_tracesys has set RAX(%rsp) */
 	jmp ia32_do_call
@@ -475,7 +479,6 @@ GLOBAL(stub32_clone)
 
 	ALIGN
 ia32_ptregs_common:
-	popq %r11
 	CFI_ENDPROC
 	CFI_STARTPROC32	simple
 	CFI_SIGNAL_FRAME
@@ -490,9 +493,9 @@ ia32_ptregs_common:
 /*	CFI_REL_OFFSET	rflags,EFLAGS-ARGOFFSET*/
 	CFI_REL_OFFSET	rsp,RSP-ARGOFFSET
 /*	CFI_REL_OFFSET	ss,SS-ARGOFFSET*/
-	SAVE_REST
+	SAVE_EXTRA_REGS 8
 	call *%rax
-	RESTORE_REST
-	jmp  ia32_sysret	/* misbalances the return cache */
+	RESTORE_EXTRA_REGS 8
+	ret
 	CFI_ENDPROC
 END(ia32_ptregs_common)
diff --git a/arch/x86/include/asm/calling.h b/arch/x86/include/asm/calling.h
index e176cea..7642948 100644
--- a/arch/x86/include/asm/calling.h
+++ b/arch/x86/include/asm/calling.h
@@ -52,142 +52,135 @@ For 32-bit we have the following conventions - kernel is built with
 
 /*
  * 64-bit system call stack frame layout defines and helpers,
- * for assembly code:
+ * for assembly code.
  */
 
-#define R15		  0
-#define R14		  8
-#define R13		 16
-#define R12		 24
-#define RBP		 32
-#define RBX		 40
-
-/* arguments: interrupts/non tracing syscalls only save up to here: */
-#define R11		 48
-#define R10		 56
-#define R9		 64
-#define R8		 72
-#define RAX		 80
-#define RCX		 88
-#define RDX		 96
-#define RSI		104
-#define RDI		112
-#define ORIG_RAX	120       /* + error_code */
-/* end of arguments */
-
-/* cpu exception frame or undefined in case of fast syscall: */
-#define RIP		128
-#define CS		136
-#define EFLAGS		144
-#define RSP		152
-#define SS		160
-
-#define ARGOFFSET	R11
-
-	.macro SAVE_ARGS addskip=0, save_rcx=1, save_r891011=1
-	subq  $9*8+\addskip, %rsp
-	CFI_ADJUST_CFA_OFFSET	9*8+\addskip
-	movq_cfi rdi, 8*8
-	movq_cfi rsi, 7*8
-	movq_cfi rdx, 6*8
-
-	.if \save_rcx
-	movq_cfi rcx, 5*8
-	.endif
-
-	movq_cfi rax, 4*8
+/* The layout forms the "struct pt_regs" on the stack: */
+/*
+ * C ABI says these regs are callee-preserved. They aren't saved on kernel entry
+ * unless syscall needs a complete, fully filled "struct pt_regs".
+ */
+#define R15		0*8
+#define R14		1*8
+#define R13		2*8
+#define R12		3*8
+#define RBP		4*8
+#define RBX		5*8
+/* These regs are callee-clobbered. Always saved on kernel entry. */
+#define R11		6*8
+#define R10		7*8
+#define R9		8*8
+#define R8		9*8
+#define RAX		10*8
+#define RCX		11*8
+#define RDX		12*8
+#define RSI		13*8
+#define RDI		14*8
+/*
+ * On syscall entry, this is syscall#. On CPU exception, this is error code.
+ * On hw interrupt, it's IRQ number:
+ */
+#define ORIG_RAX	15*8
+/* Return frame for iretq */
+#define RIP		16*8
+#define CS		17*8
+#define EFLAGS		18*8
+#define RSP		19*8
+#define SS		20*8
+
+#define ARGOFFSET	0
+
+	.macro ALLOC_PTREGS_ON_STACK addskip=0
+	subq	$15*8+\addskip, %rsp
+	CFI_ADJUST_CFA_OFFSET 15*8+\addskip
+	.endm
 
-	.if \save_r891011
-	movq_cfi r8,  3*8
-	movq_cfi r9,  2*8
-	movq_cfi r10, 1*8
-	movq_cfi r11, 0*8
+	.macro SAVE_C_REGS_HELPER offset=0 rcx=1 r8plus=1
+	movq_cfi rdi, 14*8+\offset
+	movq_cfi rsi, 13*8+\offset
+	movq_cfi rdx, 12*8+\offset
+	.if \rcx
+	movq_cfi rcx, 11*8+\offset
+	.endif
+	movq_cfi rax, 10*8+\offset
+	.if \r8plus
+	movq_cfi r8,  9*8+\offset
+	movq_cfi r9,  8*8+\offset
+	movq_cfi r10, 7*8+\offset
+	movq_cfi r11, 6*8+\offset
 	.endif
+	.endm
+	.macro SAVE_C_REGS offset=0
+	SAVE_C_REGS_HELPER \offset, 1, 1
+	.endm
+	.macro SAVE_C_REGS_EXCEPT_R891011
+	SAVE_C_REGS_HELPER 0, 1, 0
+	.endm
+	.macro SAVE_C_REGS_EXCEPT_RCX_R891011
+	SAVE_C_REGS_HELPER 0, 0, 0
+	.endm
 
+	.macro SAVE_EXTRA_REGS offset=0
+	movq_cfi rbx, 5*8+\offset
+	movq_cfi rbp, 4*8+\offset
+	movq_cfi r12, 3*8+\offset
+	movq_cfi r13, 2*8+\offset
+	movq_cfi r14, 1*8+\offset
+	movq_cfi r15, 0*8+\offset
+	.endm
+	.macro SAVE_EXTRA_REGS_RBP offset=0
+	movq_cfi rbp, 4*8+\offset
 	.endm
 
-#define ARG_SKIP	(9*8)
+	.macro RESTORE_EXTRA_REGS offset=0
+	movq_cfi_restore 0*8+\offset, r15
+	movq_cfi_restore 1*8+\offset, r14
+	movq_cfi_restore 2*8+\offset, r13
+	movq_cfi_restore 3*8+\offset, r12
+	movq_cfi_restore 4*8+\offset, rbp
+	movq_cfi_restore 5*8+\offset, rbx
+	.endm
 
-	.macro RESTORE_ARGS rstor_rax=1, addskip=0, rstor_rcx=1, rstor_r11=1, \
-			    rstor_r8910=1, rstor_rdx=1
+	.macro RESTORE_C_REGS_HELPER rstor_rax=1, rstor_rcx=1, rstor_r11=1, rstor_r8910=1, rstor_rdx=1
 	.if \rstor_r11
-	movq_cfi_restore 0*8, r11
+	movq_cfi_restore 6*8, r11
 	.endif
-
 	.if \rstor_r8910
-	movq_cfi_restore 1*8, r10
-	movq_cfi_restore 2*8, r9
-	movq_cfi_restore 3*8, r8
+	movq_cfi_restore 7*8, r10
+	movq_cfi_restore 8*8, r9
+	movq_cfi_restore 9*8, r8
 	.endif
-
 	.if \rstor_rax
-	movq_cfi_restore 4*8, rax
+	movq_cfi_restore 10*8, rax
 	.endif
-
 	.if \rstor_rcx
-	movq_cfi_restore 5*8, rcx
+	movq_cfi_restore 11*8, rcx
 	.endif
-
 	.if \rstor_rdx
-	movq_cfi_restore 6*8, rdx
-	.endif
-
-	movq_cfi_restore 7*8, rsi
-	movq_cfi_restore 8*8, rdi
-
-	.if ARG_SKIP+\addskip > 0
-	addq $ARG_SKIP+\addskip, %rsp
-	CFI_ADJUST_CFA_OFFSET	-(ARG_SKIP+\addskip)
+	movq_cfi_restore 12*8, rdx
 	.endif
+	movq_cfi_restore 13*8, rsi
+	movq_cfi_restore 14*8, rdi
 	.endm
-
-	.macro LOAD_ARGS offset, skiprax=0
-	movq \offset(%rsp),    %r11
-	movq \offset+8(%rsp),  %r10
-	movq \offset+16(%rsp), %r9
-	movq \offset+24(%rsp), %r8
-	movq \offset+40(%rsp), %rcx
-	movq \offset+48(%rsp), %rdx
-	movq \offset+56(%rsp), %rsi
-	movq \offset+64(%rsp), %rdi
-	.if \skiprax
-	.else
-	movq \offset+72(%rsp), %rax
-	.endif
+	.macro RESTORE_C_REGS
+	RESTORE_C_REGS_HELPER 1,1,1,1,1
 	.endm
-
-#define REST_SKIP	(6*8)
-
-	.macro SAVE_REST
-	subq $REST_SKIP, %rsp
-	CFI_ADJUST_CFA_OFFSET	REST_SKIP
-	movq_cfi rbx, 5*8
-	movq_cfi rbp, 4*8
-	movq_cfi r12, 3*8
-	movq_cfi r13, 2*8
-	movq_cfi r14, 1*8
-	movq_cfi r15, 0*8
+	.macro RESTORE_C_REGS_EXCEPT_RAX
+	RESTORE_C_REGS_HELPER 0,1,1,1,1
 	.endm
-
-	.macro RESTORE_REST
-	movq_cfi_restore 0*8, r15
-	movq_cfi_restore 1*8, r14
-	movq_cfi_restore 2*8, r13
-	movq_cfi_restore 3*8, r12
-	movq_cfi_restore 4*8, rbp
-	movq_cfi_restore 5*8, rbx
-	addq $REST_SKIP, %rsp
-	CFI_ADJUST_CFA_OFFSET	-(REST_SKIP)
+	.macro RESTORE_C_REGS_EXCEPT_RCX
+	RESTORE_C_REGS_HELPER 1,0,1,1,1
 	.endm
-
-	.macro SAVE_ALL
-	SAVE_ARGS
-	SAVE_REST
+	.macro RESTORE_RSI_RDI
+	RESTORE_C_REGS_HELPER 0,0,0,0,0
+	.endm
+	.macro RESTORE_RSI_RDI_RDX
+	RESTORE_C_REGS_HELPER 0,0,0,0,1
 	.endm
 
-	.macro RESTORE_ALL addskip=0
-	RESTORE_REST
-	RESTORE_ARGS 1, \addskip
+	.macro REMOVE_PTREGS_FROM_STACK addskip=0
+	addq $15*8+\addskip, %rsp
+	CFI_ADJUST_CFA_OFFSET -(15*8+\addskip)
 	.endm
 
 	.macro icebp
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index bba3cf8..6f98c16 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void)
 #define ARCH_LOCKDEP_SYS_EXIT_IRQ	\
 	TRACE_IRQS_ON; \
 	sti; \
-	SAVE_REST; \
+	SAVE_EXTRA_REGS; \
 	LOCKDEP_SYS_EXIT; \
-	RESTORE_REST; \
+	RESTORE_EXTRA_REGS; \
 	cli; \
 	TRACE_IRQS_OFF;
 
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 6205f0c..c822b35 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -31,13 +31,17 @@ struct pt_regs {
 #else /* __i386__ */
 
 struct pt_regs {
+/*
+ * C ABI says these regs are callee-preserved. They aren't saved on kernel entry
+ * unless syscall needs a complete, fully filled "struct pt_regs".
+ */
 	unsigned long r15;
 	unsigned long r14;
 	unsigned long r13;
 	unsigned long r12;
 	unsigned long bp;
 	unsigned long bx;
-/* arguments: non interrupts/non tracing syscalls only save up to here*/
+/* These regs are callee-clobbered. Always saved on kernel entry. */
 	unsigned long r11;
 	unsigned long r10;
 	unsigned long r9;
@@ -47,9 +51,12 @@ struct pt_regs {
 	unsigned long dx;
 	unsigned long si;
 	unsigned long di;
+/*
+ * On syscall entry, this is syscall#. On CPU exception, this is error code.
+ * On hw interrupt, it's IRQ number:
+ */
 	unsigned long orig_ax;
-/* end of arguments */
-/* cpu exception frame or undefined */
+/* Return frame for iretq */
 	unsigned long ip;
 	unsigned long cs;
 	unsigned long flags;
diff --git a/arch/x86/include/uapi/asm/ptrace-abi.h b/arch/x86/include/uapi/asm/ptrace-abi.h
index 7b0a55a..580aee3 100644
--- a/arch/x86/include/uapi/asm/ptrace-abi.h
+++ b/arch/x86/include/uapi/asm/ptrace-abi.h
@@ -25,13 +25,17 @@
 #else /* __i386__ */
 
 #if defined(__ASSEMBLY__) || defined(__FRAME_OFFSETS)
+/*
+ * C ABI says these regs are callee-preserved. They aren't saved on kernel entry
+ * unless syscall needs a complete, fully filled "struct pt_regs".
+ */
 #define R15 0
 #define R14 8
 #define R13 16
 #define R12 24
 #define RBP 32
 #define RBX 40
-/* arguments: interrupts/non tracing syscalls only save up to here*/
+/* These regs are callee-clobbered. Always saved on kernel entry. */
 #define R11 48
 #define R10 56
 #define R9 64
@@ -41,15 +45,17 @@
 #define RDX 96
 #define RSI 104
 #define RDI 112
-#define ORIG_RAX 120       /* = ERROR */
-/* end of arguments */
-/* cpu exception frame or undefined in case of fast syscall. */
+/*
+ * On syscall entry, this is syscall#. On CPU exception, this is error code.
+ * On hw interrupt, it's IRQ number:
+ */
+#define ORIG_RAX 120
+/* Return frame for iretq */
 #define RIP 128
 #define CS 136
 #define EFLAGS 144
 #define RSP 152
 #define SS 160
-#define ARGOFFSET R11
 #endif /* __ASSEMBLY__ */
 
 /* top of stack page */
diff --git a/arch/x86/include/uapi/asm/ptrace.h b/arch/x86/include/uapi/asm/ptrace.h
index ac4b9aa..bc16115 100644
--- a/arch/x86/include/uapi/asm/ptrace.h
+++ b/arch/x86/include/uapi/asm/ptrace.h
@@ -41,13 +41,17 @@ struct pt_regs {
 #ifndef __KERNEL__
 
 struct pt_regs {
+/*
+ * C ABI says these regs are callee-preserved. They aren't saved on kernel entry
+ * unless syscall needs a complete, fully filled "struct pt_regs".
+ */
 	unsigned long r15;
 	unsigned long r14;
 	unsigned long r13;
 	unsigned long r12;
 	unsigned long rbp;
 	unsigned long rbx;
-/* arguments: non interrupts/non tracing syscalls only save up to here*/
+/* These regs are callee-clobbered. Always saved on kernel entry. */
 	unsigned long r11;
 	unsigned long r10;
 	unsigned long r9;
@@ -57,9 +61,12 @@ struct pt_regs {
 	unsigned long rdx;
 	unsigned long rsi;
 	unsigned long rdi;
+/*
+ * On syscall entry, this is syscall#. On CPU exception, this is error code.
+ * On hw interrupt, it's IRQ number:
+ */
 	unsigned long orig_rax;
-/* end of arguments */
-/* cpu exception frame or undefined */
+/* Return frame for iretq */
 	unsigned long rip;
 	unsigned long cs;
 	unsigned long eflags;
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 37f7d95..c489a2d 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -14,9 +14,6 @@
  * NOTE: This code handles signal-recognition, which happens every time
  * after an interrupt and after each system call.
  *
- * Normal syscalls and interrupts don't save a full stack frame, this is
- * only done for syscall tracing, signals or fork/exec et.al.
- *
  * A note on terminology:
  * - top of stack: Architecture defined interrupt frame from SS to RIP
  * at the top of the kernel process stack.
@@ -26,12 +23,6 @@
  * Some macro usage:
  * - CFI macros are used to generate dwarf2 unwind information for better
  * backtraces. They don't change any code.
- * - SAVE_ALL/RESTORE_ALL - Save/restore all registers
- * - SAVE_ARGS/RESTORE_ARGS - Save/restore registers that C functions modify.
- * There are unfortunately lots of special cases where some registers
- * not touched. The macro is a big mess that should be cleaned up.
- * - SAVE_REST/RESTORE_REST - Handle the registers not saved by SAVE_ARGS.
- * Gives a full stack frame.
  * - ENTRY/END Define functions in the symbol table.
  * - FIXUP_TOP_OF_STACK/RESTORE_TOP_OF_STACK - Fix up the hardware stack
  * frame that is otherwise undefined after a SYSCALL
@@ -156,7 +147,7 @@ ENDPROC(native_usergs_sysret64)
 	.endm
 
 /*
- * initial frame state for interrupts (and exceptions without error code)
+ * empty frame
  */
 	.macro EMPTY_FRAME start=1 offset=0
 	.if \start
@@ -190,9 +181,9 @@ ENDPROC(native_usergs_sysret64)
 	.endm
 
 /*
- * frame that enables calling into C.
+ * frame that enables passing a complete pt_regs to a C function.
  */
-	.macro PARTIAL_FRAME start=1 offset=0
+	.macro DEFAULT_FRAME start=1 offset=0
 	XCPT_FRAME \start, ORIG_RAX+\offset-ARGOFFSET
 	CFI_REL_OFFSET rdi, RDI+\offset-ARGOFFSET
 	CFI_REL_OFFSET rsi, RSI+\offset-ARGOFFSET
@@ -203,13 +194,6 @@ ENDPROC(native_usergs_sysret64)
 	CFI_REL_OFFSET r9, R9+\offset-ARGOFFSET
 	CFI_REL_OFFSET r10, R10+\offset-ARGOFFSET
 	CFI_REL_OFFSET r11, R11+\offset-ARGOFFSET
-	.endm
-
-/*
- * frame that enables passing a complete pt_regs to a C function.
- */
-	.macro DEFAULT_FRAME start=1 offset=0
-	PARTIAL_FRAME \start, R11+\offset-R15
 	CFI_REL_OFFSET rbx, RBX+\offset
 	CFI_REL_OFFSET rbp, RBP+\offset
 	CFI_REL_OFFSET r12, R12+\offset
@@ -221,21 +205,8 @@ ENDPROC(native_usergs_sysret64)
 ENTRY(save_paranoid)
 	XCPT_FRAME 1 RDI+8
 	cld
-	movq_cfi rdi, RDI+8
-	movq_cfi rsi, RSI+8
-	movq_cfi rdx, RDX+8
-	movq_cfi rcx, RCX+8
-	movq_cfi rax, RAX+8
-	movq_cfi r8, R8+8
-	movq_cfi r9, R9+8
-	movq_cfi r10, R10+8
-	movq_cfi r11, R11+8
-	movq_cfi rbx, RBX+8
-	movq_cfi rbp, RBP+8
-	movq_cfi r12, R12+8
-	movq_cfi r13, R13+8
-	movq_cfi r14, R14+8
-	movq_cfi r15, R15+8
+	SAVE_C_REGS 8
+	SAVE_EXTRA_REGS 8
 	movl $1,%ebx
 	movl $MSR_GS_BASE,%ecx
 	rdmsr
@@ -264,7 +235,7 @@ ENTRY(ret_from_fork)
 
 	GET_THREAD_INFO(%rcx)
 
-	RESTORE_REST
+	RESTORE_EXTRA_REGS
 
 	testl $3, CS-ARGOFFSET(%rsp)		# from kernel_thread?
 	jz   1f
@@ -276,12 +247,10 @@ ENTRY(ret_from_fork)
 	jmp ret_from_sys_call			# go to the SYSRET fastpath
 
 1:
-	subq $REST_SKIP, %rsp	# leave space for volatiles
-	CFI_ADJUST_CFA_OFFSET	REST_SKIP
 	movq %rbp, %rdi
 	call *%rbx
 	movl $0, RAX(%rsp)
-	RESTORE_REST
+	RESTORE_EXTRA_REGS
 	jmp int_ret_from_sys_call
 	CFI_ENDPROC
 END(ret_from_fork)
@@ -339,7 +308,8 @@ GLOBAL(system_call_after_swapgs)
 	 * and short:
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
-	SAVE_ARGS 8,0
+	ALLOC_PTREGS_ON_STACK 8
+	SAVE_C_REGS
 	movq  %rax,ORIG_RAX-ARGOFFSET(%rsp)
 	movq  %rcx,RIP-ARGOFFSET(%rsp)
 	CFI_REL_OFFSET rip,RIP-ARGOFFSET
@@ -375,9 +345,9 @@ sysret_check:
 	 * sysretq will re-enable interrupts:
 	 */
 	TRACE_IRQS_ON
+	RESTORE_C_REGS_EXCEPT_RCX
 	movq RIP-ARGOFFSET(%rsp),%rcx
 	CFI_REGISTER	rip,rcx
-	RESTORE_ARGS 1,-ARG_SKIP,0
 	/*CFI_REGISTER	rflags,r11*/
 	movq	PER_CPU_VAR(old_rsp), %rsp
 	USERGS_SYSRET64
@@ -429,7 +399,7 @@ auditsys:
 	movq %rax,%rsi			/* 2nd arg: syscall number */
 	movl $AUDIT_ARCH_X86_64,%edi	/* 1st arg: audit arch */
 	call __audit_syscall_entry
-	LOAD_ARGS 0		/* reload call-clobbered registers */
+	RESTORE_C_REGS			/* reload call-clobbered registers */
 	jmp system_call_fastpath
 
 	/*
@@ -453,7 +423,7 @@ tracesys:
 	testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
 	jz auditsys
 #endif
-	SAVE_REST
+	SAVE_EXTRA_REGS
 	movq $-ENOSYS,RAX(%rsp) /* ptrace can change this for a bad syscall */
 	FIXUP_TOP_OF_STACK %rdi
 	movq %rsp,%rdi
@@ -463,8 +433,8 @@ tracesys:
 	 * We don't reload %rax because syscall_trace_enter() returned
 	 * the value it wants us to use in the table lookup.
 	 */
-	LOAD_ARGS ARGOFFSET, 1
-	RESTORE_REST
+	RESTORE_C_REGS_EXCEPT_RAX
+	RESTORE_EXTRA_REGS
 #if __SYSCALL_MASK == ~0
 	cmpq $__NR_syscall_max,%rax
 #else
@@ -515,7 +485,7 @@ int_very_careful:
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
 int_check_syscall_exit_work:
-	SAVE_REST
+	SAVE_EXTRA_REGS
 	/* Check for syscall exit trace */
 	testl $_TIF_WORK_SYSCALL_EXIT,%edx
 	jz int_signal
@@ -534,7 +504,7 @@ int_signal:
 	call do_notify_resume
 1:	movl $_TIF_WORK_MASK,%edi
 int_restore_rest:
-	RESTORE_REST
+	RESTORE_EXTRA_REGS
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
 	jmp int_with_check
@@ -544,15 +514,12 @@ END(system_call)
 	.macro FORK_LIKE func
 ENTRY(stub_\func)
 	CFI_STARTPROC
-	popq	%r11			/* save return address */
-	PARTIAL_FRAME 0
-	SAVE_REST
-	pushq	%r11			/* put it back on stack */
+	DEFAULT_FRAME 0, 8		/* offset 8: return address */
+	SAVE_EXTRA_REGS 8
 	FIXUP_TOP_OF_STACK %r11, 8
-	DEFAULT_FRAME 0 8		/* offset 8: return address */
 	call sys_\func
 	RESTORE_TOP_OF_STACK %r11, 8
-	ret $REST_SKIP		/* pop extended registers */
+	ret
 	CFI_ENDPROC
 END(stub_\func)
 	.endm
@@ -560,7 +527,7 @@ END(stub_\func)
 	.macro FIXED_FRAME label,func
 ENTRY(\label)
 	CFI_STARTPROC
-	PARTIAL_FRAME 0 8		/* offset 8: return address */
+	DEFAULT_FRAME 0, 8		/* offset 8: return address */
 	FIXUP_TOP_OF_STACK %r11, 8-ARGOFFSET
 	call \func
 	RESTORE_TOP_OF_STACK %r11, 8-ARGOFFSET
@@ -577,12 +544,12 @@ END(\label)
 ENTRY(stub_execve)
 	CFI_STARTPROC
 	addq $8, %rsp
-	PARTIAL_FRAME 0
-	SAVE_REST
+	DEFAULT_FRAME 0
+	SAVE_EXTRA_REGS
 	FIXUP_TOP_OF_STACK %r11
 	call sys_execve
 	movq %rax,RAX(%rsp)
-	RESTORE_REST
+	RESTORE_EXTRA_REGS
 	jmp int_ret_from_sys_call
 	CFI_ENDPROC
 END(stub_execve)
@@ -594,12 +561,12 @@ END(stub_execve)
 ENTRY(stub_rt_sigreturn)
 	CFI_STARTPROC
 	addq $8, %rsp
-	PARTIAL_FRAME 0
-	SAVE_REST
+	DEFAULT_FRAME 0
+	SAVE_EXTRA_REGS
 	FIXUP_TOP_OF_STACK %r11
 	call sys_rt_sigreturn
 	movq %rax,RAX(%rsp) # fixme, this could be done at the higher layer
-	RESTORE_REST
+	RESTORE_EXTRA_REGS
 	jmp int_ret_from_sys_call
 	CFI_ENDPROC
 END(stub_rt_sigreturn)
@@ -608,12 +575,12 @@ END(stub_rt_sigreturn)
 ENTRY(stub_x32_rt_sigreturn)
 	CFI_STARTPROC
 	addq $8, %rsp
-	PARTIAL_FRAME 0
-	SAVE_REST
+	DEFAULT_FRAME 0
+	SAVE_EXTRA_REGS
 	FIXUP_TOP_OF_STACK %r11
 	call sys32_x32_rt_sigreturn
 	movq %rax,RAX(%rsp) # fixme, this could be done at the higher layer
-	RESTORE_REST
+	RESTORE_EXTRA_REGS
 	jmp int_ret_from_sys_call
 	CFI_ENDPROC
 END(stub_x32_rt_sigreturn)
@@ -621,13 +588,13 @@ END(stub_x32_rt_sigreturn)
 ENTRY(stub_x32_execve)
 	CFI_STARTPROC
 	addq $8, %rsp
-	PARTIAL_FRAME 0
-	SAVE_REST
+	DEFAULT_FRAME 0
+	SAVE_EXTRA_REGS
 	FIXUP_TOP_OF_STACK %r11
 	call compat_sys_execve
 	RESTORE_TOP_OF_STACK %r11
 	movq %rax,RAX(%rsp)
-	RESTORE_REST
+	RESTORE_EXTRA_REGS
 	jmp int_ret_from_sys_call
 	CFI_ENDPROC
 END(stub_x32_execve)
@@ -683,42 +650,36 @@ END(interrupt)
 
 /* 0(%rsp): ~(interrupt number) */
 	.macro interrupt func
-	/* reserve pt_regs for scratch regs and rbp */
-	subq $ORIG_RAX-RBP, %rsp
-	CFI_ADJUST_CFA_OFFSET ORIG_RAX-RBP
 	cld
-	/* start from rbp in pt_regs and jump over */
-	movq_cfi rdi, (RDI-RBP)
-	movq_cfi rsi, (RSI-RBP)
-	movq_cfi rdx, (RDX-RBP)
-	movq_cfi rcx, (RCX-RBP)
-	movq_cfi rax, (RAX-RBP)
-	movq_cfi  r8,  (R8-RBP)
-	movq_cfi  r9,  (R9-RBP)
-	movq_cfi r10, (R10-RBP)
-	movq_cfi r11, (R11-RBP)
-
-	/* Save rbp so that we can unwind from get_irq_regs() */
-	movq_cfi rbp, 0
-
-	/* Save previous stack value */
-	movq %rsp, %rsi
+	/*
+	 * Since nothing in interrupt handling code touches r12...r15 members
+	 * of "struct pt_regs", and since interrupts can nest, we can save
+	 * four stack slots and simultaneously provide
+	 * an unwind-friendly stack layout by saving "truncated" pt_regs
+	 * exactly up to rbp slot, without these members.
+	 */
+	ALLOC_PTREGS_ON_STACK -RBP
+	SAVE_C_REGS -RBP
+	/* this goes to 0(%rsp) for unwinder, not for saving the value: */
+	SAVE_EXTRA_REGS_RBP -RBP
 
-	leaq -RBP(%rsp),%rdi	/* arg1 for handler */
-	testl $3, CS-RBP(%rsi)
+	leaq -RBP(%rsp),%rdi	/* arg1 for \func (pointer to pt_regs) */
+
+	testl $3, CS-RBP(%rsp)
 	je 1f
 	SWAPGS
+1:
 	/*
+	 * Save previous stack pointer, optionally switch to interrupt stack.
 	 * irq_count is used to check if a CPU is already on an interrupt stack
 	 * or not. While this is essentially redundant with preempt_count it is
 	 * a little cheaper to use a separate counter in the PDA (short of
 	 * moving irq_enter into assembly, which would be too much work)
 	 */
-1:	incl PER_CPU_VAR(irq_count)
+	movq %rsp, %rsi
+	incl PER_CPU_VAR(irq_count)
 	cmovzq PER_CPU_VAR(irq_stack_ptr),%rsp
 	CFI_DEF_CFA_REGISTER	rsi
-
-	/* Store previous stack value */
 	pushq %rsi
 	CFI_ESCAPE	0x0f /* DW_CFA_def_cfa_expression */, 6, \
 			0x77 /* DW_OP_breg7 */, 0, \
@@ -750,6 +711,7 @@ ret_from_intr:
 	/* Restore saved previous stack */
 	popq %rsi
 	CFI_DEF_CFA rsi,SS+8-RBP	/* reg/off reset after def_cfa_expr */
+	/* return code expects complete pt_regs - adjust rsp accordingly: */
 	leaq ARGOFFSET-RBP(%rsi), %rsp
 	CFI_DEF_CFA_REGISTER	rsp
 	CFI_ADJUST_CFA_OFFSET	RBP-ARGOFFSET
@@ -761,7 +723,7 @@ exit_intr:
 
 	/* Interrupt came from user space */
 	/*
-	 * Has a correct top of stack, but a partial stack frame
+	 * Has a correct top of stack.
 	 * %rcx: thread info. Interrupts off.
 	 */
 retint_with_reschedule:
@@ -789,7 +751,8 @@ retint_restore_args:	/* return to kernel space */
 	 */
 	TRACE_IRQS_IRETQ
 restore_args:
-	RESTORE_ARGS 1,8,1
+	RESTORE_C_REGS
+	REMOVE_PTREGS_FROM_STACK 8
 
 irq_return:
 	/*
@@ -876,12 +839,12 @@ retint_signal:
 	jz    retint_swapgs
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
-	SAVE_REST
+	SAVE_EXTRA_REGS
 	movq $-1,ORIG_RAX(%rsp)
 	xorl %esi,%esi		# oldset
 	movq %rsp,%rdi		# &pt_regs
 	call do_notify_resume
-	RESTORE_REST
+	RESTORE_EXTRA_REGS
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
 	GET_THREAD_INFO(%rcx)
@@ -1044,8 +1007,7 @@ ENTRY(\sym)
 	pushq_cfi $-1			/* ORIG_RAX: no syscall to restart */
 	.endif
 
-	subq $ORIG_RAX-R15, %rsp
-	CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
+	ALLOC_PTREGS_ON_STACK
 
 	.if \paranoid
 	call save_paranoid
@@ -1256,7 +1218,9 @@ ENTRY(xen_failsafe_callback)
 	addq $0x30,%rsp
 	CFI_ADJUST_CFA_OFFSET -0x30
 	pushq_cfi $-1 /* orig_ax = -1 => not a system call */
-	SAVE_ALL
+	ALLOC_PTREGS_ON_STACK
+	SAVE_C_REGS
+	SAVE_EXTRA_REGS
 	jmp error_exit
 	CFI_ENDPROC
 END(xen_failsafe_callback)
@@ -1313,11 +1277,15 @@ ENTRY(paranoid_exit)
 paranoid_swapgs:
 	TRACE_IRQS_IRETQ 0
 	SWAPGS_UNSAFE_STACK
-	RESTORE_ALL 8
+	RESTORE_EXTRA_REGS
+	RESTORE_C_REGS
+	REMOVE_PTREGS_FROM_STACK 8
 	jmp irq_return
 paranoid_restore:
 	TRACE_IRQS_IRETQ_DEBUG 0
-	RESTORE_ALL 8
+	RESTORE_EXTRA_REGS
+	RESTORE_C_REGS
+	REMOVE_PTREGS_FROM_STACK 8
 	jmp irq_return
 paranoid_userspace:
 	GET_THREAD_INFO(%rcx)
@@ -1357,21 +1325,8 @@ ENTRY(error_entry)
 	CFI_ADJUST_CFA_OFFSET 15*8
 	/* oldrax contains error code */
 	cld
-	movq_cfi rdi, RDI+8
-	movq_cfi rsi, RSI+8
-	movq_cfi rdx, RDX+8
-	movq_cfi rcx, RCX+8
-	movq_cfi rax, RAX+8
-	movq_cfi  r8,  R8+8
-	movq_cfi  r9,  R9+8
-	movq_cfi r10, R10+8
-	movq_cfi r11, R11+8
-	movq_cfi rbx, RBX+8
-	movq_cfi rbp, RBP+8
-	movq_cfi r12, R12+8
-	movq_cfi r13, R13+8
-	movq_cfi r14, R14+8
-	movq_cfi r15, R15+8
+	SAVE_C_REGS 8
+	SAVE_EXTRA_REGS 8
 	xorl %ebx,%ebx
 	testl $3,CS+8(%rsp)
 	je error_kernelspace
@@ -1412,7 +1367,7 @@ END(error_entry)
 ENTRY(error_exit)
 	DEFAULT_FRAME
 	movl %ebx,%eax
-	RESTORE_REST
+	RESTORE_EXTRA_REGS
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
 	GET_THREAD_INFO(%rcx)
@@ -1631,8 +1586,8 @@ end_repeat_nmi:
 	 * so that we repeat another NMI.
 	 */
 	pushq_cfi $-1		/* ORIG_RAX: no syscall to restart */
-	subq $ORIG_RAX-R15, %rsp
-	CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
+	ALLOC_PTREGS_ON_STACK
+
 	/*
 	 * Use save_paranoid to handle SWAPGS, but no need to use paranoid_exit
 	 * as we should not be calling schedule in NMI context.
@@ -1671,8 +1626,10 @@ end_repeat_nmi:
 nmi_swapgs:
 	SWAPGS_UNSAFE_STACK
 nmi_restore:
+	RESTORE_EXTRA_REGS
+	RESTORE_C_REGS
 	/* Pop the extra iret frame at once */
-	RESTORE_ALL 6*8
+	REMOVE_PTREGS_FROM_STACK 6*8
 
 	/* Clear the NMI executing stack variable */
 	movq $0, 5*8(%rsp)
diff --git a/arch/x86/kernel/preempt.S b/arch/x86/kernel/preempt.S
index ca7f0d5..673da2f 100644
--- a/arch/x86/kernel/preempt.S
+++ b/arch/x86/kernel/preempt.S
@@ -6,9 +6,13 @@
 
 ENTRY(___preempt_schedule)
 	CFI_STARTPROC
-	SAVE_ALL
+	ALLOC_PTREGS_ON_STACK
+	SAVE_C_REGS
+	SAVE_EXTRA_REGS
 	call preempt_schedule
-	RESTORE_ALL
+	RESTORE_EXTRA_REGS
+	RESTORE_C_REGS
+	REMOVE_PTREGS_FROM_STACK
 	ret
 	CFI_ENDPROC
 
@@ -16,9 +20,13 @@ ENTRY(___preempt_schedule)
 
 ENTRY(___preempt_schedule_context)
 	CFI_STARTPROC
-	SAVE_ALL
+	ALLOC_PTREGS_ON_STACK
+	SAVE_C_REGS
+	SAVE_EXTRA_REGS
 	call preempt_schedule_context
-	RESTORE_ALL
+	RESTORE_EXTRA_REGS
+	RESTORE_C_REGS
+	REMOVE_PTREGS_FROM_STACK
 	ret
 	CFI_ENDPROC
 
-- 
1.8.1.4



* [PATCH 06/17] x86: mass removal of ARGOFFSET
  2014-08-08 17:44 [PATCH v4 0/17] x86: entry.S optimizations Denys Vlasenko
                   ` (4 preceding siblings ...)
  2014-08-08 17:44 ` [PATCH 05/17] x86: entry_64.S: always allocate complete "struct pt_regs" Denys Vlasenko
@ 2014-08-08 17:44 ` Denys Vlasenko
  2014-08-08 17:44 ` [PATCH 07/17] x86: rename some macros and labels, no code changes Denys Vlasenko
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Denys Vlasenko @ 2014-08-08 17:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Andy Lutomirski, Frederic Weisbecker, X86 ML, Alexei Starovoitov,
	Will Drewry, Kees Cook

ARGOFFSET is now zero, so removing it changes no code.
A few macros lost their "offset" parameter, since it is now always zero too.
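
E.g. (from the diff below) the two forms assemble to identical code:

	movq %rax,RAX-ARGOFFSET(%rsp)	/* before: ARGOFFSET == 0 */
	movq %rax,RAX(%rsp)		/* after */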

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Oleg Nesterov <oleg@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: X86 ML <x86@kernel.org>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: linux-kernel@vger.kernel.org
---
 arch/x86/ia32/ia32entry.S      | 140 ++++++++++++++++++++---------------------
 arch/x86/include/asm/calling.h |   2 -
 arch/x86/kernel/entry_64.S     |  66 +++++++++----------
 3 files changed, 103 insertions(+), 105 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index d0ca63b..20eba33 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -41,13 +41,13 @@
 	movl	%edx,%edx	/* zero extension */
 	.endm 
 
-	/* clobbers %eax */	
-	.macro  CLEAR_RREGS offset=0, _r9=rax
+	/* clobbers %rax */
+	.macro  CLEAR_RREGS _r9=rax
 	xorl 	%eax,%eax
-	movq	%rax,\offset+R11(%rsp)
-	movq	%rax,\offset+R10(%rsp)
-	movq	%\_r9,\offset+R9(%rsp)
-	movq	%rax,\offset+R8(%rsp)
+	movq	%rax,R11(%rsp)
+	movq	%rax,R10(%rsp)
+	movq	%\_r9,R9(%rsp)
+	movq	%rax,R8(%rsp)
 	.endm
 
 	/*
@@ -60,14 +60,14 @@
 	 * If it's -1 to make us punt the syscall, then (u32)-1 is still
 	 * an appropriately invalid value.
 	 */
-	.macro LOAD_ARGS32 offset, _r9=0
+	.macro LOAD_ARGS32 _r9=0
 	.if \_r9
-	movl \offset+R9(%rsp),%r9d
+	movl R9(%rsp),%r9d
 	.endif
-	movl \offset+RCX(%rsp),%ecx
-	movl \offset+RDX(%rsp),%edx
-	movl \offset+RSI(%rsp),%esi
-	movl \offset+RDI(%rsp),%edi
+	movl RCX(%rsp),%ecx
+	movl RDX(%rsp),%edx
+	movl RSI(%rsp),%esi
+	movl RDI(%rsp),%edi
 	movl %eax,%eax			/* zero extension */
 	.endm
 	
@@ -152,8 +152,8 @@ ENTRY(ia32_sysenter_target)
 1:	movl	(%rbp),%ebp
 	_ASM_EXTABLE(1b,ia32_badarg)
 	ASM_CLAC
-	orl     $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP-ARGOFFSET)
-	testl   $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+	orl     $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
+	testl   $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
 	CFI_REMEMBER_STATE
 	jnz  sysenter_tracesys
 	cmpq	$(IA32_NR_syscalls-1),%rax
@@ -162,16 +162,16 @@ sysenter_do_call:
 	IA32_ARG_FIXUP
 sysenter_dispatch:
 	call	*ia32_sys_call_table(,%rax,8)
-	movq	%rax,RAX-ARGOFFSET(%rsp)
+	movq	%rax,RAX(%rsp)
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
-	testl	$_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+	testl	$_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
 	jnz	sysexit_audit
 sysexit_from_sys_call:
-	andl    $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+	andl    $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
 	/* clear IF, that popfq doesn't enable interrupts early */
-	andl	$~0x200,EFLAGS-ARGOFFSET(%rsp)
-	movl	RIP-ARGOFFSET(%rsp),%edx		/* User %eip */
+	andl	$~0x200,EFLAGS(%rsp)
+	movl	RIP(%rsp),%edx		/* User %eip */
 	CFI_REGISTER rip,rdx
 	RESTORE_RSI_RDI
 	REMOVE_PTREGS_FROM_STACK 8*3
@@ -195,18 +195,18 @@ sysexit_from_sys_call:
 	movl %eax,%esi			/* 2nd arg: syscall number */
 	movl $AUDIT_ARCH_I386,%edi	/* 1st arg: audit arch */
 	call __audit_syscall_entry
-	movl RAX-ARGOFFSET(%rsp),%eax	/* reload syscall number */
+	movl RAX(%rsp),%eax		/* reload syscall number */
 	cmpq $(IA32_NR_syscalls-1),%rax
 	ja ia32_badsys
 	movl %ebx,%edi			/* reload 1st syscall arg */
-	movl RCX-ARGOFFSET(%rsp),%esi	/* reload 2nd syscall arg */
-	movl RDX-ARGOFFSET(%rsp),%edx	/* reload 3rd syscall arg */
-	movl RSI-ARGOFFSET(%rsp),%ecx	/* reload 4th syscall arg */
-	movl RDI-ARGOFFSET(%rsp),%r8d	/* reload 5th syscall arg */
+	movl RCX(%rsp),%esi		/* reload 2nd syscall arg */
+	movl RDX(%rsp),%edx		/* reload 3rd syscall arg */
+	movl RSI(%rsp),%ecx		/* reload 4th syscall arg */
+	movl RDI(%rsp),%r8d		/* reload 5th syscall arg */
 	.endm
 
 	.macro auditsys_exit exit
-	testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+	testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP)
 	jnz ia32_ret_from_sys_call
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
@@ -217,13 +217,13 @@ sysexit_from_sys_call:
 1:	setbe %al		/* 1 if error, 0 if not */
 	movzbl %al,%edi		/* zero-extend that into %edi */
 	call __audit_syscall_exit
-	movq RAX-ARGOFFSET(%rsp),%rax	/* reload syscall return value */
+	movq RAX(%rsp),%rax	/* reload syscall return value */
 	movl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),%edi
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
-	testl %edi,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+	testl %edi,TI_flags+THREAD_INFO(%rsp,RIP)
 	jz \exit
-	CLEAR_RREGS -ARGOFFSET
+	CLEAR_RREGS
 	jmp int_with_check
 	.endm
 
@@ -239,7 +239,7 @@ sysexit_audit:
 
 sysenter_tracesys:
 #ifdef CONFIG_AUDITSYSCALL
-	testl	$(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+	testl	$(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP)
 	jz	sysenter_auditsys
 #endif
 	SAVE_EXTRA_REGS
@@ -247,7 +247,7 @@ sysenter_tracesys:
 	movq	$-ENOSYS,RAX(%rsp)/* ptrace can change this for a bad syscall */
 	movq	%rsp,%rdi        /* &pt_regs -> arg1 */
 	call	syscall_trace_enter
-	LOAD_ARGS32 ARGOFFSET  /* reload args from stack in case ptrace changed it */
+	LOAD_ARGS32	/* reload args from stack in case ptrace changed it */
 	RESTORE_EXTRA_REGS
 	cmpq	$(IA32_NR_syscalls-1),%rax
 	ja	int_ret_from_sys_call /* sysenter_tracesys has set RAX(%rsp) */
@@ -293,17 +293,17 @@ ENTRY(ia32_cstar_target)
 	ALLOC_PTREGS_ON_STACK 8
 	SAVE_C_REGS_EXCEPT_RCX_R891011
 	movl 	%eax,%eax	/* zero extension */
-	movq	%rax,ORIG_RAX-ARGOFFSET(%rsp)
-	movq	%rcx,RIP-ARGOFFSET(%rsp)
-	CFI_REL_OFFSET rip,RIP-ARGOFFSET
-	movq	%rbp,RCX-ARGOFFSET(%rsp) /* this lies slightly to ptrace */
+	movq	%rax,ORIG_RAX(%rsp)
+	movq	%rcx,RIP(%rsp)
+	CFI_REL_OFFSET rip,RIP
+	movq	%rbp,RCX(%rsp) /* this lies slightly to ptrace */
 	movl	%ebp,%ecx
-	movq	$__USER32_CS,CS-ARGOFFSET(%rsp)
-	movq	$__USER32_DS,SS-ARGOFFSET(%rsp)
-	movq	%r11,EFLAGS-ARGOFFSET(%rsp)
-	/*CFI_REL_OFFSET rflags,EFLAGS-ARGOFFSET*/
-	movq	%r8,RSP-ARGOFFSET(%rsp)	
-	CFI_REL_OFFSET rsp,RSP-ARGOFFSET
+	movq	$__USER32_CS,CS(%rsp)
+	movq	$__USER32_DS,SS(%rsp)
+	movq	%r11,EFLAGS(%rsp)
+	/*CFI_REL_OFFSET rflags,EFLAGS*/
+	movq	%r8,RSP(%rsp)
+	CFI_REL_OFFSET rsp,RSP
 	/* no need to do an access_ok check here because r8 has been
 	   32bit zero extended */ 
 	/* hardware stack frame is complete now */	
@@ -311,8 +311,8 @@ ENTRY(ia32_cstar_target)
 1:	movl	(%r8),%r9d
 	_ASM_EXTABLE(1b,ia32_badarg)
 	ASM_CLAC
-	orl     $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP-ARGOFFSET)
-	testl   $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+	orl     $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
+	testl   $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
 	CFI_REMEMBER_STATE
 	jnz   cstar_tracesys
 	cmpq $IA32_NR_syscalls-1,%rax
@@ -321,32 +321,32 @@ cstar_do_call:
 	IA32_ARG_FIXUP 1
 cstar_dispatch:
 	call *ia32_sys_call_table(,%rax,8)
-	movq %rax,RAX-ARGOFFSET(%rsp)
+	movq %rax,RAX(%rsp)
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
-	testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+	testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
 	jnz sysretl_audit
 sysretl_from_sys_call:
-	andl $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+	andl $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
 	RESTORE_RSI_RDI_RDX
-	movl RIP-ARGOFFSET(%rsp),%ecx
+	movl RIP(%rsp),%ecx
 	CFI_REGISTER rip,rcx
-	movl EFLAGS-ARGOFFSET(%rsp),%r11d	
+	movl EFLAGS(%rsp),%r11d
 	/*CFI_REGISTER rflags,r11*/
 	xorq	%r10,%r10
 	xorq	%r9,%r9
 	xorq	%r8,%r8
 	TRACE_IRQS_ON
-	movl RSP-ARGOFFSET(%rsp),%esp
+	movl RSP(%rsp),%esp
 	CFI_RESTORE rsp
 	USERGS_SYSRET32
 	
 #ifdef CONFIG_AUDITSYSCALL
 cstar_auditsys:
 	CFI_RESTORE_STATE
-	movl %r9d,R9-ARGOFFSET(%rsp)	/* register to be clobbered by call */
+	movl %r9d,R9(%rsp)	/* register to be clobbered by call */
 	auditsys_entry_common
-	movl R9-ARGOFFSET(%rsp),%r9d	/* reload 6th syscall arg */
+	movl R9(%rsp),%r9d	/* reload 6th syscall arg */
 	jmp cstar_dispatch
 
 sysretl_audit:
@@ -355,16 +355,16 @@ sysretl_audit:
 
 cstar_tracesys:
 #ifdef CONFIG_AUDITSYSCALL
-	testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+	testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP)
 	jz cstar_auditsys
 #endif
 	xchgl %r9d,%ebp
 	SAVE_EXTRA_REGS
-	CLEAR_RREGS 0, r9
+	CLEAR_RREGS r9
 	movq $-ENOSYS,RAX(%rsp)	/* ptrace can change this for a bad syscall */
 	movq %rsp,%rdi        /* &pt_regs -> arg1 */
 	call syscall_trace_enter
-	LOAD_ARGS32 ARGOFFSET, 1  /* reload args from stack in case ptrace changed it */
+	LOAD_ARGS32 1	/* reload args from stack in case ptrace changed it */
 	RESTORE_EXTRA_REGS
 	xchgl %ebp,%r9d
 	cmpq $(IA32_NR_syscalls-1),%rax
@@ -422,8 +422,8 @@ ENTRY(ia32_syscall)
 	   this could be a problem. */
 	ALLOC_PTREGS_ON_STACK
 	SAVE_C_REGS_EXCEPT_R891011
-	orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP-ARGOFFSET)
-	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+	orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
+	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
 	jnz ia32_tracesys
 	cmpq $(IA32_NR_syscalls-1),%rax
 	ja ia32_badsys
@@ -431,9 +431,9 @@ ia32_do_call:
 	IA32_ARG_FIXUP
 	call *ia32_sys_call_table(,%rax,8) # xxx: rip relative
 ia32_sysret:
-	movq %rax,RAX-ARGOFFSET(%rsp)
+	movq %rax,RAX(%rsp)
 ia32_ret_from_sys_call:
-	CLEAR_RREGS -ARGOFFSET
+	CLEAR_RREGS
 	jmp int_ret_from_sys_call
 
 ia32_tracesys:
@@ -442,7 +442,7 @@ ia32_tracesys:
 	movq $-ENOSYS,RAX(%rsp)	/* ptrace can change this for a bad syscall */
 	movq %rsp,%rdi        /* &pt_regs -> arg1 */
 	call syscall_trace_enter
-	LOAD_ARGS32 ARGOFFSET  /* reload args from stack in case ptrace changed it */
+	LOAD_ARGS32	/* reload args from stack in case ptrace changed it */
 	RESTORE_EXTRA_REGS
 	cmpq $(IA32_NR_syscalls-1),%rax
 	ja  int_ret_from_sys_call	/* ia32_tracesys has set RAX(%rsp) */
@@ -450,7 +450,7 @@ ia32_tracesys:
 END(ia32_syscall)
 
 ia32_badsys:
-	movq $0,ORIG_RAX-ARGOFFSET(%rsp)
+	movq $0,ORIG_RAX(%rsp)
 	movq $-ENOSYS,%rax
 	jmp ia32_sysret
 
@@ -482,17 +482,17 @@ ia32_ptregs_common:
 	CFI_ENDPROC
 	CFI_STARTPROC32	simple
 	CFI_SIGNAL_FRAME
-	CFI_DEF_CFA	rsp,SS+8-ARGOFFSET
-	CFI_REL_OFFSET	rax,RAX-ARGOFFSET
-	CFI_REL_OFFSET	rcx,RCX-ARGOFFSET
-	CFI_REL_OFFSET	rdx,RDX-ARGOFFSET
-	CFI_REL_OFFSET	rsi,RSI-ARGOFFSET
-	CFI_REL_OFFSET	rdi,RDI-ARGOFFSET
-	CFI_REL_OFFSET	rip,RIP-ARGOFFSET
-/*	CFI_REL_OFFSET	cs,CS-ARGOFFSET*/
-/*	CFI_REL_OFFSET	rflags,EFLAGS-ARGOFFSET*/
-	CFI_REL_OFFSET	rsp,RSP-ARGOFFSET
-/*	CFI_REL_OFFSET	ss,SS-ARGOFFSET*/
+	CFI_DEF_CFA	rsp,SS+8
+	CFI_REL_OFFSET	rax,RAX
+	CFI_REL_OFFSET	rcx,RCX
+	CFI_REL_OFFSET	rdx,RDX
+	CFI_REL_OFFSET	rsi,RSI
+	CFI_REL_OFFSET	rdi,RDI
+	CFI_REL_OFFSET	rip,RIP
+/*	CFI_REL_OFFSET	cs,CS*/
+/*	CFI_REL_OFFSET	rflags,EFLAGS*/
+	CFI_REL_OFFSET	rsp,RSP
+/*	CFI_REL_OFFSET	ss,SS*/
 	SAVE_EXTRA_REGS 8
 	call *%rax
 	RESTORE_EXTRA_REGS 8
diff --git a/arch/x86/include/asm/calling.h b/arch/x86/include/asm/calling.h
index 7642948..e8e2e41 100644
--- a/arch/x86/include/asm/calling.h
+++ b/arch/x86/include/asm/calling.h
@@ -88,8 +88,6 @@ For 32-bit we have the following conventions - kernel is built with
 #define RSP		19*8
 #define SS		20*8
 
-#define ARGOFFSET	0
-
 	.macro ALLOC_PTREGS_ON_STACK addskip=0
 	subq	$15*8+\addskip, %rsp
 	CFI_ADJUST_CFA_OFFSET 15*8+\addskip
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index c489a2d..8c6a01d 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -73,7 +73,7 @@ ENDPROC(native_usergs_sysret64)
 #endif /* CONFIG_PARAVIRT */
 
 
-.macro TRACE_IRQS_IRETQ offset=ARGOFFSET
+.macro TRACE_IRQS_IRETQ offset=0
 #ifdef CONFIG_TRACE_IRQFLAGS
 	bt   $9,EFLAGS-\offset(%rsp)	/* interrupts off? */
 	jnc  1f
@@ -107,7 +107,7 @@ ENDPROC(native_usergs_sysret64)
 	call debug_stack_reset
 .endm
 
-.macro TRACE_IRQS_IRETQ_DEBUG offset=ARGOFFSET
+.macro TRACE_IRQS_IRETQ_DEBUG offset=0
 	bt   $9,EFLAGS-\offset(%rsp)	/* interrupts off? */
 	jnc  1f
 	TRACE_IRQS_ON_DEBUG
@@ -184,16 +184,16 @@ ENDPROC(native_usergs_sysret64)
  * frame that enables passing a complete pt_regs to a C function.
  */
 	.macro DEFAULT_FRAME start=1 offset=0
-	XCPT_FRAME \start, ORIG_RAX+\offset-ARGOFFSET
-	CFI_REL_OFFSET rdi, RDI+\offset-ARGOFFSET
-	CFI_REL_OFFSET rsi, RSI+\offset-ARGOFFSET
-	CFI_REL_OFFSET rdx, RDX+\offset-ARGOFFSET
-	CFI_REL_OFFSET rcx, RCX+\offset-ARGOFFSET
-	CFI_REL_OFFSET rax, RAX+\offset-ARGOFFSET
-	CFI_REL_OFFSET r8, R8+\offset-ARGOFFSET
-	CFI_REL_OFFSET r9, R9+\offset-ARGOFFSET
-	CFI_REL_OFFSET r10, R10+\offset-ARGOFFSET
-	CFI_REL_OFFSET r11, R11+\offset-ARGOFFSET
+	XCPT_FRAME \start, ORIG_RAX+\offset
+	CFI_REL_OFFSET rdi, RDI+\offset
+	CFI_REL_OFFSET rsi, RSI+\offset
+	CFI_REL_OFFSET rdx, RDX+\offset
+	CFI_REL_OFFSET rcx, RCX+\offset
+	CFI_REL_OFFSET rax, RAX+\offset
+	CFI_REL_OFFSET r8, R8+\offset
+	CFI_REL_OFFSET r9, R9+\offset
+	CFI_REL_OFFSET r10, R10+\offset
+	CFI_REL_OFFSET r11, R11+\offset
 	CFI_REL_OFFSET rbx, RBX+\offset
 	CFI_REL_OFFSET rbp, RBP+\offset
 	CFI_REL_OFFSET r12, R12+\offset
@@ -237,13 +237,13 @@ ENTRY(ret_from_fork)
 
 	RESTORE_EXTRA_REGS
 
-	testl $3, CS-ARGOFFSET(%rsp)		# from kernel_thread?
+	testl $3, CS(%rsp)			# from kernel_thread?
 	jz   1f
 
 	testl $_TIF_IA32, TI_flags(%rcx)	# 32-bit compat task needs IRET
 	jnz  int_ret_from_sys_call
 
-	RESTORE_TOP_OF_STACK %rdi, -ARGOFFSET
+	RESTORE_TOP_OF_STACK %rdi
 	jmp ret_from_sys_call			# go to the SYSRET fastpath
 
 1:
@@ -310,10 +310,10 @@ GLOBAL(system_call_after_swapgs)
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	ALLOC_PTREGS_ON_STACK 8
 	SAVE_C_REGS
-	movq  %rax,ORIG_RAX-ARGOFFSET(%rsp)
-	movq  %rcx,RIP-ARGOFFSET(%rsp)
-	CFI_REL_OFFSET rip,RIP-ARGOFFSET
-	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+	movq  %rax,ORIG_RAX(%rsp)
+	movq  %rcx,RIP(%rsp)
+	CFI_REL_OFFSET rip,RIP
+	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
 	jnz tracesys
 system_call_fastpath:
 #if __SYSCALL_MASK == ~0
@@ -325,7 +325,7 @@ system_call_fastpath:
 	ja badsys
 	movq %r10,%rcx
 	call *sys_call_table(,%rax,8)  # XXX:	 rip relative
-	movq %rax,RAX-ARGOFFSET(%rsp)
+	movq %rax,RAX(%rsp)
 /*
  * Syscall return path ending with SYSRET (fast path)
  * Has incomplete stack frame and undefined top of stack.
@@ -337,7 +337,7 @@ sysret_check:
 	LOCKDEP_SYS_EXIT
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
-	movl TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET),%edx
+	movl TI_flags+THREAD_INFO(%rsp,RIP),%edx
 	andl %edi,%edx
 	jnz  sysret_careful
 	CFI_REMEMBER_STATE
@@ -346,7 +346,7 @@ sysret_check:
 	 */
 	TRACE_IRQS_ON
 	RESTORE_C_REGS_EXCEPT_RCX
-	movq RIP-ARGOFFSET(%rsp),%rcx
+	movq RIP(%rsp),%rcx
 	CFI_REGISTER	rip,rcx
 	/*CFI_REGISTER	rflags,r11*/
 	movq	PER_CPU_VAR(old_rsp), %rsp
@@ -378,11 +378,11 @@ sysret_signal:
 	 * These all wind up with the iret return path anyway,
 	 * so just join that path right now.
 	 */
-	FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
+	FIXUP_TOP_OF_STACK %r11
 	jmp int_check_syscall_exit_work
 
 badsys:
-	movq $-ENOSYS,RAX-ARGOFFSET(%rsp)
+	movq $-ENOSYS,RAX(%rsp)
 	jmp ret_from_sys_call
 
 #ifdef CONFIG_AUDITSYSCALL
@@ -408,7 +408,7 @@ auditsys:
 	 * masked off.
 	 */
 sysret_audit:
-	movq RAX-ARGOFFSET(%rsp),%rsi	/* second arg, syscall return value */
+	movq RAX(%rsp),%rsi	/* second arg, syscall return value */
 	cmpq $-MAX_ERRNO,%rsi	/* is it < -MAX_ERRNO? */
 	setbe %al		/* 1 if so, 0 if not */
 	movzbl %al,%edi		/* zero-extend that into %edi */
@@ -420,7 +420,7 @@ sysret_audit:
 	/* Do syscall tracing */
 tracesys:
 #ifdef CONFIG_AUDITSYSCALL
-	testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+	testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP)
 	jz auditsys
 #endif
 	SAVE_EXTRA_REGS
@@ -444,7 +444,7 @@ tracesys:
 	ja   int_ret_from_sys_call	/* RAX(%rsp) set to -ENOSYS above */
 	movq %r10,%rcx	/* fixup for C */
 	call *sys_call_table(,%rax,8)
-	movq %rax,RAX-ARGOFFSET(%rsp)
+	movq %rax,RAX(%rsp)
 	/* Use IRET because user could have changed frame */
 
 /*
@@ -528,9 +528,9 @@ END(stub_\func)
 ENTRY(\label)
 	CFI_STARTPROC
 	DEFAULT_FRAME 0, 8		/* offset 8: return address */
-	FIXUP_TOP_OF_STACK %r11, 8-ARGOFFSET
+	FIXUP_TOP_OF_STACK %r11, 8
 	call \func
-	RESTORE_TOP_OF_STACK %r11, 8-ARGOFFSET
+	RESTORE_TOP_OF_STACK %r11, 8
 	ret
 	CFI_ENDPROC
 END(\label)
@@ -702,7 +702,7 @@ common_interrupt:
 	ASM_CLAC
 	addq $-0x80,(%rsp)		/* Adjust vector to [-256,-1] range */
 	interrupt do_IRQ
-	/* 0(%rsp): old_rsp-ARGOFFSET */
+	/* 0(%rsp): old_rsp */
 ret_from_intr:
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
@@ -712,13 +712,13 @@ ret_from_intr:
 	popq %rsi
 	CFI_DEF_CFA rsi,SS+8-RBP	/* reg/off reset after def_cfa_expr */
 	/* return code expects complete pt_regs - adjust rsp accordingly: */
-	leaq ARGOFFSET-RBP(%rsi), %rsp
+	leaq -RBP(%rsi), %rsp
 	CFI_DEF_CFA_REGISTER	rsp
-	CFI_ADJUST_CFA_OFFSET	RBP-ARGOFFSET
+	CFI_ADJUST_CFA_OFFSET	RBP
 
 exit_intr:
 	GET_THREAD_INFO(%rcx)
-	testl $3,CS-ARGOFFSET(%rsp)
+	testl $3,CS(%rsp)
 	je retint_kernel
 
 	/* Interrupt came from user space */
@@ -856,7 +856,7 @@ retint_signal:
 ENTRY(retint_kernel)
 	cmpl $0,PER_CPU_VAR(__preempt_count)
 	jnz  retint_restore_args
-	bt   $9,EFLAGS-ARGOFFSET(%rsp)	/* interrupts off? */
+	bt   $9,EFLAGS(%rsp)	/* interrupts off? */
 	jnc  retint_restore_args
 	call preempt_schedule_irq
 	jmp exit_intr
-- 
1.8.1.4


* [PATCH 07/17] x86: rename some macros and labels, no code changes
  2014-08-08 17:44 [PATCH v4 0/17] x86: entry.S optimizations Denys Vlasenko
                   ` (5 preceding siblings ...)
  2014-08-08 17:44 ` [PATCH 06/17] x86: mass removal of ARGOFFSET Denys Vlasenko
@ 2014-08-08 17:44 ` Denys Vlasenko
  2014-08-08 17:44 ` [PATCH 08/17] x86: add comments about various syscall instructions, " Denys Vlasenko
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Denys Vlasenko @ 2014-08-08 17:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Andy Lutomirski, Frederic Weisbecker, X86 ML, Alexei Starovoitov,
	Will Drewry, Kees Cook

Rename LOAD_ARGS32 to RESTORE_REGS32 to match other RESTORE_* macros.
The "ARGS" part was misleading anyway - we are restoring registers,
not arguments.

Similarly, rename [retint_]restore_args to [retint_]restore_c_regs:
at these labels, we restore the registers clobbered by the C ABI;
also rename int_restore_rest to int_restore_extra_regs.
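
For context, the ABI split these names reflect, written out as a
comment (a summary for the reader, not part of the patch):

	/* x86-64 C ABI register classes, as the entry macros use them:
	 *   caller-clobbered ("C regs"):   rax rcx rdx rsi rdi r8-r11
	 *   callee-preserved ("extra"):    rbx rbp r12-r15
	 */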

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Oleg Nesterov <oleg@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: X86 ML <x86@kernel.org>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: linux-kernel@vger.kernel.org
---
 arch/x86/ia32/ia32entry.S  | 10 +++++-----
 arch/x86/kernel/entry_64.S | 16 ++++++++--------
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 20eba33..4402bbe 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -51,7 +51,7 @@
 	.endm
 
 	/*
-	 * Reload arg registers from stack in case ptrace changed them.
+	 * Reload registers from stack in case ptrace changed them.
 	 * We don't reload %eax because syscall_trace_enter() returned
 	 * the %rax value we should see.  Instead, we just truncate that
 	 * value to 32 bits again as we did on entry from user mode.
@@ -60,7 +60,7 @@
 	 * If it's -1 to make us punt the syscall, then (u32)-1 is still
 	 * an appropriately invalid value.
 	 */
-	.macro LOAD_ARGS32 _r9=0
+	.macro RESTORE_REGS32 _r9=0
 	.if \_r9
 	movl R9(%rsp),%r9d
 	.endif
@@ -247,7 +247,7 @@ sysenter_tracesys:
 	movq	$-ENOSYS,RAX(%rsp)/* ptrace can change this for a bad syscall */
 	movq	%rsp,%rdi        /* &pt_regs -> arg1 */
 	call	syscall_trace_enter
-	LOAD_ARGS32	/* reload args from stack in case ptrace changed it */
+	RESTORE_REGS32	/* reload regs from stack in case ptrace changed it */
 	RESTORE_EXTRA_REGS
 	cmpq	$(IA32_NR_syscalls-1),%rax
 	ja	int_ret_from_sys_call /* sysenter_tracesys has set RAX(%rsp) */
@@ -364,7 +364,7 @@ cstar_tracesys:
 	movq $-ENOSYS,RAX(%rsp)	/* ptrace can change this for a bad syscall */
 	movq %rsp,%rdi        /* &pt_regs -> arg1 */
 	call syscall_trace_enter
-	LOAD_ARGS32 1	/* reload args from stack in case ptrace changed it */
+	RESTORE_REGS32 1	/* reload regs from stack in case ptrace changed it */
 	RESTORE_EXTRA_REGS
 	xchgl %ebp,%r9d
 	cmpq $(IA32_NR_syscalls-1),%rax
@@ -442,7 +442,7 @@ ia32_tracesys:
 	movq $-ENOSYS,RAX(%rsp)	/* ptrace can change this for a bad syscall */
 	movq %rsp,%rdi        /* &pt_regs -> arg1 */
 	call syscall_trace_enter
-	LOAD_ARGS32	/* reload args from stack in case ptrace changed it */
+	RESTORE_REGS32	/* reload regs from stack in case ptrace changed it */
 	RESTORE_EXTRA_REGS
 	cmpq $(IA32_NR_syscalls-1),%rax
 	ja  int_ret_from_sys_call	/* ia32_tracesys has set RAX(%rsp) */
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 8c6a01d..e9bbe02 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -62,7 +62,7 @@
 
 
 #ifndef CONFIG_PREEMPT
-#define retint_kernel retint_restore_args
+#define retint_kernel retint_restore_c_regs
 #endif
 
 #ifdef CONFIG_PARAVIRT
@@ -494,7 +494,7 @@ int_check_syscall_exit_work:
 	call syscall_trace_leave
 	popq_cfi %rdi
 	andl $~(_TIF_WORK_SYSCALL_EXIT|_TIF_SYSCALL_EMU),%edi
-	jmp int_restore_rest
+	jmp int_restore_extra_regs
 
 int_signal:
 	testl $_TIF_DO_NOTIFY_MASK,%edx
@@ -503,7 +503,7 @@ int_signal:
 	xorl %esi,%esi		# oldset -> arg2
 	call do_notify_resume
 1:	movl $_TIF_WORK_MASK,%edi
-int_restore_rest:
+int_restore_extra_regs:
 	RESTORE_EXTRA_REGS
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
@@ -742,15 +742,15 @@ retint_swapgs:		/* return to user-space */
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_IRETQ
 	SWAPGS
-	jmp restore_args
+	jmp restore_c_regs
 
-retint_restore_args:	/* return to kernel space */
+retint_restore_c_regs:	/* return to kernel space */
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	/*
 	 * The iretq could re-enable interrupts:
 	 */
 	TRACE_IRQS_IRETQ
-restore_args:
+restore_c_regs:
 	RESTORE_C_REGS
 	REMOVE_PTREGS_FROM_STACK 8
 
@@ -855,9 +855,9 @@ retint_signal:
 	/* rcx:	 threadinfo. interrupts off. */
 ENTRY(retint_kernel)
 	cmpl $0,PER_CPU_VAR(__preempt_count)
-	jnz  retint_restore_args
+	jnz  retint_restore_c_regs
 	bt   $9,EFLAGS(%rsp)	/* interrupts off? */
-	jnc  retint_restore_args
+	jnc  retint_restore_c_regs
 	call preempt_schedule_irq
 	jmp exit_intr
 #endif
-- 
1.8.1.4


* [PATCH 08/17] x86: add comments about various syscall instructions, no code changes
  2014-08-08 17:44 [PATCH v4 0/17] x86: entry.S optimizations Denys Vlasenko
                   ` (6 preceding siblings ...)
  2014-08-08 17:44 ` [PATCH 07/17] x86: rename some macros and labels, no code changes Denys Vlasenko
@ 2014-08-08 17:44 ` Denys Vlasenko
  2014-08-08 17:44 ` [PATCH 09/17] x86: entry_64.S: move save_paranoid and ret_from_fork closer to their users Denys Vlasenko
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Denys Vlasenko @ 2014-08-08 17:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Andy Lutomirski, Frederic Weisbecker, X86 ML, Alexei Starovoitov,
	Will Drewry, Kees Cook

SYSCALL/SYSRET and SYSENTER/SYSEXIT have weird semantics.
Moreover, they differ in 32- and 64-bit mode.
What is saved? What is not? Is rsp set? Are interrupts disabled?
People tend not to remember these details well enough.

This patch adds comments which explain in detail
which registers are modified by each of these instructions.
The comments are placed immediately before the corresponding
entry and exit points.
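
For reference, here is roughly what 64-bit SYSCALL does, written as
pseudo-code (a sketch; MSR names as Linux spells them, corner cases
omitted):

	/* 64-bit SYSCALL, approximately: */
	rcx    = rip;                /* return address */
	r11    = rflags;             /* RF cleared in the saved copy */
	rip    = MSR_LSTAR;
	cs.sel = MSR_STAR[47:32];    /* ss.sel = cs.sel + 8 */
	rflags &= ~MSR_SYSCALL_MASK;
	/* rsp is unchanged; nothing is written to the stack */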

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Oleg Nesterov <oleg@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: X86 ML <x86@kernel.org>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: linux-kernel@vger.kernel.org
---
 arch/x86/ia32/ia32entry.S  | 133 ++++++++++++++++++++++++++++-----------------
 arch/x86/kernel/entry_64.S |  32 ++++++-----
 2 files changed, 102 insertions(+), 63 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 4402bbe..fe6eaf7 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -99,22 +99,25 @@ ENDPROC(native_irq_enable_sysexit)
 /*
  * 32bit SYSENTER instruction entry.
  *
+ * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs.
+ * IF and VM in rflags are cleared (IOW: interrupts are off).
+ * SYSENTER does not save anything on the stack,
+ * and does not save old rip (!!!) and rflags.
+ *
  * Arguments:
- * %eax	System call number.
- * %ebx Arg1
- * %ecx Arg2
- * %edx Arg3
- * %esi Arg4
- * %edi Arg5
- * %ebp user stack
- * 0(%ebp) Arg6	
- * 	
- * Interrupts off.
- *	
+ * eax  system call number
+ * ebx  arg1
+ * ecx  arg2
+ * edx  arg3
+ * esi  arg4
+ * edi  arg5
+ * ebp  user stack
+ * 0(%ebp) arg6
+ *
  * This is purely a fast path. For anything complicated we use the int 0x80
- * path below.	Set up a complete hardware stack frame to share code
+ * path below. We set up a complete hardware stack frame to share code
  * with the int 0x80 path.
- */ 	
+ */
 ENTRY(ia32_sysenter_target)
 	CFI_STARTPROC32	simple
 	CFI_SIGNAL_FRAME
@@ -128,6 +131,7 @@ ENTRY(ia32_sysenter_target)
 	 * disabled irqs, here we enable it straight after entry:
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
+	/* Construct iret frame (ss,rsp,rflags,cs,rip) */
  	movl	%ebp,%ebp		/* zero extension */
 	pushq_cfi $__USER32_DS
 	/*CFI_REL_OFFSET ss,0*/
@@ -140,14 +144,19 @@ ENTRY(ia32_sysenter_target)
 	pushq_cfi $__USER32_CS
 	/*CFI_REL_OFFSET cs,0*/
 	movl	%eax, %eax
+	/* Store thread_info->sysenter_return in rip stack slot */
 	pushq_cfi %r10
 	CFI_REL_OFFSET rip,0
+	/* Store orig_ax */
 	pushq_cfi %rax
+	/* Construct the rest of "struct pt_regs" */
 	cld
 	ALLOC_PTREGS_ON_STACK
 	SAVE_C_REGS_EXCEPT_R891011
- 	/* no need to do an access_ok check here because rbp has been
- 	   32bit zero extended */ 
+	/*
+	 * no need to do an access_ok check here because rbp has been
+	 * 32bit zero extended
+	 */
 	ASM_STAC
 1:	movl	(%rbp),%ebp
 	_ASM_EXTABLE(1b,ia32_badarg)
@@ -174,6 +183,7 @@ sysexit_from_sys_call:
 	movl	RIP(%rsp),%edx		/* User %eip */
 	CFI_REGISTER rip,rdx
 	RESTORE_RSI_RDI
+	/* pop everything except ss,rsp,rflags slots */
 	REMOVE_PTREGS_FROM_STACK 8*3
 	xorq	%r8,%r8
 	xorq	%r9,%r9
@@ -184,6 +194,10 @@ sysexit_from_sys_call:
 	popq_cfi %rcx				/* User %esp */
 	CFI_REGISTER rsp,rcx
 	TRACE_IRQS_ON
+	/*
+	 * 32bit SYSEXIT restores eip from edx, esp from ecx.
+	 * cs and ss are loaded from MSRs.
+	 */
 	ENABLE_INTERRUPTS_SYSEXIT32
 
 #ifdef CONFIG_AUDITSYSCALL
@@ -258,23 +272,33 @@ ENDPROC(ia32_sysenter_target)
 /*
  * 32bit SYSCALL instruction entry.
  *
+ * 32bit SYSCALL saves rip to rcx, clears rflags.RF, then saves rflags to r11,
+ * then loads new ss, cs, and rip from previously programmed MSRs.
+ * rflags gets masked by a value from another MSR (so CLD and CLAC
+ * are not needed). SYSCALL does not save anything on the stack
+ * and does not change rsp.
+ *
+ * Note: rflags saving+masking-with-MSR happens only in Long mode
+ * (in legacy 32bit mode, IF, RF and VM bits are cleared and that's it).
+ * Don't get confused: rflags saving+masking depends on Long Mode Active bit
+ * (EFER.LMA=1), NOT on bitness of userspace where SYSCALL executes
+ * or target CS descriptor's L bit (SYSCALL does not read segment descriptors).
+ *
  * Arguments:
- * %eax	System call number.
- * %ebx Arg1
- * %ecx return EIP 
- * %edx Arg3
- * %esi Arg4
- * %edi Arg5
- * %ebp Arg2    [note: not saved in the stack frame, should not be touched]
- * %esp user stack 
- * 0(%esp) Arg6
- * 	
- * Interrupts off.
- *	
+ * eax  system call number
+ * ecx  return address
+ * ebx  arg1
+ * ebp  arg2	(note: not saved in the stack frame, should not be touched)
+ * edx  arg3
+ * esi  arg4
+ * edi  arg5
+ * esp  user stack
+ * 0(%esp) arg6
+ *
  * This is purely a fast path. For anything complicated we use the int 0x80
- * path below.	Set up a complete hardware stack frame to share code
- * with the int 0x80 path.	
- */ 	
+ * path below. We set up a complete hardware stack frame to share code
+ * with the int 0x80 path.
+ */
 ENTRY(ia32_cstar_target)
 	CFI_STARTPROC32	simple
 	CFI_SIGNAL_FRAME
@@ -290,7 +314,7 @@ ENTRY(ia32_cstar_target)
 	 * disabled irqs and here we enable it straight after entry:
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
-	ALLOC_PTREGS_ON_STACK 8
+	ALLOC_PTREGS_ON_STACK 8	/* +8: space for orig_ax */
 	SAVE_C_REGS_EXCEPT_RCX_R891011
 	movl 	%eax,%eax	/* zero extension */
 	movq	%rax,ORIG_RAX(%rsp)
@@ -304,9 +328,11 @@ ENTRY(ia32_cstar_target)
 	/*CFI_REL_OFFSET rflags,EFLAGS*/
 	movq	%r8,RSP(%rsp)
 	CFI_REL_OFFSET rsp,RSP
-	/* no need to do an access_ok check here because r8 has been
-	   32bit zero extended */ 
-	/* hardware stack frame is complete now */	
+	/* iret stack frame is complete now */
+	/*
+	 * no need to do an access_ok check here because r8 has been
+	 * 32bit zero extended
+	 */
 	ASM_STAC
 1:	movl	(%r8),%r9d
 	_ASM_EXTABLE(1b,ia32_badarg)
@@ -339,8 +365,15 @@ sysretl_from_sys_call:
 	TRACE_IRQS_ON
 	movl RSP(%rsp),%esp
 	CFI_RESTORE rsp
+	/*
+	 * 64bit->32bit SYSRET restores eip from ecx,
+	 * eflags from r11 (but RF and VM bits are forced to 0),
+	 * cs and ss are loaded from MSRs.
+	 * (Note: 32bit->32bit SYSRET is different: since r11
+	 * does not exist, it merely sets eflags.IF=1).
+	 */
 	USERGS_SYSRET32
-	
+
 #ifdef CONFIG_AUDITSYSCALL
 cstar_auditsys:
 	CFI_RESTORE_STATE
@@ -378,26 +411,26 @@ ia32_badarg:
 	jmp ia32_sysret
 	CFI_ENDPROC
 
-/* 
- * Emulated IA32 system calls via int 0x80. 
+/*
+ * Emulated IA32 system calls via int 0x80.
  *
- * Arguments:	 
- * %eax	System call number.
- * %ebx Arg1
- * %ecx Arg2
- * %edx Arg3
- * %esi Arg4
- * %edi Arg5
- * %ebp Arg6    [note: not saved in the stack frame, should not be touched]
+ * Arguments:
+ * eax  system call number
+ * ebx  arg1
+ * ecx  arg2
+ * edx  arg3
+ * esi  arg4
+ * edi  arg5
+ * ebp  arg6	(note: not saved in the stack frame, should not be touched)
  *
  * Notes:
- * Uses the same stack frame as the x86-64 version.	
- * All registers except %eax must be saved (but ptrace may violate that)
+ * Uses the same stack frame as the x86-64 version.
+ * All registers except eax must be saved (but ptrace may violate that).
  * Arguments are zero extended. For system calls that want sign extension and
  * take long arguments a wrapper is needed. Most calls can just be called
  * directly.
- * Assumes it is only called from user space and entered with interrupts off.	
- */ 				
+ * Assumes it is only called from user space and entered with interrupts off.
+ */
 
 ENTRY(ia32_syscall)
 	CFI_STARTPROC32	simple
@@ -416,7 +449,7 @@ ENTRY(ia32_syscall)
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	movl %eax,%eax
-	pushq_cfi %rax
+	pushq_cfi %rax		/* store orig_ax */
 	cld
 	/* note the registers are not zero extended to the sf.
 	   this could be a problem. */
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index e9bbe02..4a2e3ab 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -256,25 +256,25 @@ ENTRY(ret_from_fork)
 END(ret_from_fork)
 
 /*
- * System call entry. Up to 6 arguments in registers are supported.
+ * 64bit SYSCALL instruction entry. Up to 6 arguments in registers.
  *
- * SYSCALL does not save anything on the stack and does not change the
- * stack pointer.  However, it does mask the flags register for us, so
- * CLD and CLAC are not needed.
- */
-
-/*
- * Register setup:
+ * 64bit SYSCALL saves rip to rcx, clears rflags.RF, then saves rflags to r11,
+ * then loads new ss, cs, and rip from previously programmed MSRs.
+ * rflags gets masked by a value from another MSR (so CLD and CLAC
+ * are not needed). SYSCALL does not save anything on the stack
+ * and does not change rsp.
+ *
+ * Registers on entry:
  * rax  system call number
+ * rcx  return address
+ * r11  saved rflags (note: r11 is callee-clobbered register in C ABI)
  * rdi  arg0
- * rcx  return address for syscall/sysret, C arg3
  * rsi  arg1
  * rdx  arg2
- * r10  arg3 	(--> moved to rcx for C)
+ * r10  arg3 (needs to be moved to rcx to conform to C ABI)
  * r8   arg4
  * r9   arg5
- * r11  eflags for syscall/sysret, temporary for C
- * r12-r15,rbp,rbx saved by C code, not touched.
+ * (note: r12-r15,rbp,rbx are callee-preserved in C ABI)
  *
  * Interrupts are off on entry.
  * Only called from user space.
@@ -302,13 +302,14 @@ ENTRY(system_call)
 GLOBAL(system_call_after_swapgs)
 
 	movq	%rsp,PER_CPU_VAR(old_rsp)
+	/* kernel_stack is set so that 5 slots (iret frame) are preallocated */
 	movq	PER_CPU_VAR(kernel_stack),%rsp
 	/*
 	 * No need to follow this irqs off/on section - it's straight
 	 * and short:
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
-	ALLOC_PTREGS_ON_STACK 8
+	ALLOC_PTREGS_ON_STACK 8		/* +8: space for orig_ax */
 	SAVE_C_REGS
 	movq  %rax,ORIG_RAX(%rsp)
 	movq  %rcx,RIP(%rsp)
@@ -350,6 +351,11 @@ sysret_check:
 	CFI_REGISTER	rip,rcx
 	/*CFI_REGISTER	rflags,r11*/
 	movq	PER_CPU_VAR(old_rsp), %rsp
+	/*
+	 * 64bit SYSRET restores rip from rcx,
+	 * rflags from r11 (but RF and VM bits are forced to 0),
+	 * cs and ss are loaded from MSRs.
+	 */
 	USERGS_SYSRET64
 
 	CFI_RESTORE_STATE
-- 
1.8.1.4


* [PATCH 09/17] x86: entry_64.S: move save_paranoid and ret_from_fork closer to their users
  2014-08-08 17:44 [PATCH v4 0/17] x86: entry.S optimizations Denys Vlasenko
                   ` (7 preceding siblings ...)
  2014-08-08 17:44 ` [PATCH 08/17] x86: add comments about various syscall instructions, " Denys Vlasenko
@ 2014-08-08 17:44 ` Denys Vlasenko
  2014-08-08 17:44 ` [PATCH 10/17] x86: entry_64.S: rename save_paranoid to paranoid_entry, no code changes Denys Vlasenko
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Denys Vlasenko @ 2014-08-08 17:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Andy Lutomirski, Frederic Weisbecker, X86 ML, Alexei Starovoitov,
	Will Drewry, Kees Cook

For some odd reason, these two functions are at the very top of the file,
whereas their callers are approximately in the middle of it.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Oleg Nesterov <oleg@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: X86 ML <x86@kernel.org>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/entry_64.S | 106 ++++++++++++++++++++++-----------------------
 1 file changed, 53 insertions(+), 53 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 4a2e3ab..5ad9b28 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -202,59 +202,6 @@ ENDPROC(native_usergs_sysret64)
 	CFI_REL_OFFSET r15, R15+\offset
 	.endm
 
-ENTRY(save_paranoid)
-	XCPT_FRAME 1 RDI+8
-	cld
-	SAVE_C_REGS 8
-	SAVE_EXTRA_REGS 8
-	movl $1,%ebx
-	movl $MSR_GS_BASE,%ecx
-	rdmsr
-	testl %edx,%edx
-	js 1f	/* negative -> in kernel */
-	SWAPGS
-	xorl %ebx,%ebx
-1:	ret
-	CFI_ENDPROC
-END(save_paranoid)
-
-/*
- * A newly forked process directly context switches into this address.
- *
- * rdi: prev task we switched from
- */
-ENTRY(ret_from_fork)
-	DEFAULT_FRAME
-
-	LOCK ; btr $TIF_FORK,TI_flags(%r8)
-
-	pushq_cfi $0x0002
-	popfq_cfi				# reset kernel eflags
-
-	call schedule_tail			# rdi: 'prev' task parameter
-
-	GET_THREAD_INFO(%rcx)
-
-	RESTORE_EXTRA_REGS
-
-	testl $3, CS(%rsp)			# from kernel_thread?
-	jz   1f
-
-	testl $_TIF_IA32, TI_flags(%rcx)	# 32-bit compat task needs IRET
-	jnz  int_ret_from_sys_call
-
-	RESTORE_TOP_OF_STACK %rdi
-	jmp ret_from_sys_call			# go to the SYSRET fastpath
-
-1:
-	movq %rbp, %rdi
-	call *%rbx
-	movl $0, RAX(%rsp)
-	RESTORE_EXTRA_REGS
-	jmp int_ret_from_sys_call
-	CFI_ENDPROC
-END(ret_from_fork)
-
 /*
  * 64bit SYSCALL instruction entry. Up to 6 arguments in registers.
  *
@@ -608,6 +555,43 @@ END(stub_x32_execve)
 #endif
 
 /*
+ * A newly forked process directly context switches into this address.
+ *
+ * rdi: prev task we switched from
+ */
+ENTRY(ret_from_fork)
+	DEFAULT_FRAME
+
+	LOCK ; btr $TIF_FORK,TI_flags(%r8)
+
+	pushq_cfi $0x0002
+	popfq_cfi				# reset kernel eflags
+
+	call schedule_tail			# rdi: 'prev' task parameter
+
+	GET_THREAD_INFO(%rcx)
+
+	RESTORE_EXTRA_REGS
+
+	testl $3, CS(%rsp)			# from kernel_thread?
+	jz   1f
+
+	testl $_TIF_IA32, TI_flags(%rcx)	# 32-bit compat task needs IRET
+	jnz  int_ret_from_sys_call
+
+	RESTORE_TOP_OF_STACK %rdi
+	jmp ret_from_sys_call			# go to the SYSRET fastpath
+
+1:
+	movq %rbp, %rdi
+	call *%rbx
+	movl $0, RAX(%rsp)
+	RESTORE_EXTRA_REGS
+	jmp int_ret_from_sys_call
+	CFI_ENDPROC
+END(ret_from_fork)
+
+/*
  * Build the entry stubs and pointer table with some assembler magic.
  * We pack 7 stubs into a single 32-byte chunk, which will fit in a
  * single cache line on all modern x86 implementations.
@@ -1258,6 +1242,22 @@ idtentry async_page_fault do_async_page_fault has_error_code=1
 idtentry machine_check has_error_code=0 paranoid=1 do_sym=*machine_check_vector(%rip)
 #endif
 
+ENTRY(save_paranoid)
+	XCPT_FRAME 1 RDI+8
+	cld
+	SAVE_C_REGS 8
+	SAVE_EXTRA_REGS 8
+	movl $1,%ebx
+	movl $MSR_GS_BASE,%ecx
+	rdmsr
+	testl %edx,%edx
+	js 1f	/* negative -> in kernel */
+	SWAPGS
+	xorl %ebx,%ebx
+1:	ret
+	CFI_ENDPROC
+END(save_paranoid)
+
 	/*
 	 * "Paranoid" exit path from exception stack.
 	 * Paranoid because this is used by NMIs and cannot take
-- 
1.8.1.4


* [PATCH 10/17] x86: entry_64.S: rename save_paranoid to paranoid_entry, no code changes
  2014-08-08 17:44 [PATCH v4 0/17] x86: entry.S optimizations Denys Vlasenko
                   ` (8 preceding siblings ...)
  2014-08-08 17:44 ` [PATCH 09/17] x86: entry_64.S: move save_paranoid and ret_from_fork closer to their users Denys Vlasenko
@ 2014-08-08 17:44 ` Denys Vlasenko
  2014-08-08 17:44 ` [PATCH 11/17] x86: entry_64.S: fold test_in_nmi macro into its only user Denys Vlasenko
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Denys Vlasenko @ 2014-08-08 17:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Andy Lutomirski, Frederic Weisbecker, X86 ML, Alexei Starovoitov,
	Will Drewry, Kees Cook

This patch does a lot of cleanup in comments and formatting,
but it does not change any code.

Rename save_paranoid to paranoid_entry: this makes naming
similar to its "non-paranoid" sibling, error_entry,
and to its counterpart, paranoid_exit.

Use the same CFI annotation atop paranoid_entry and error_entry.

Fix irregular indentation of assembler operands.

Add/fix comments on top of paranoid_entry and error_entry.
Remove stale comment about "oldrax".
Make comments about "no swapgs" flag in ebx more prominent.
Deindent wrongly indented top-level comment atop paranoid_exit.
Indent wrongly deindented comment inside error_entry.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Oleg Nesterov <oleg@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: X86 ML <x86@kernel.org>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/entry_64.S | 117 +++++++++++++++++++++++----------------------
 1 file changed, 59 insertions(+), 58 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 5ad9b28..decfe25 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1000,10 +1000,11 @@ ENTRY(\sym)
 	ALLOC_PTREGS_ON_STACK
 
 	.if \paranoid
-	call save_paranoid
+	call paranoid_entry
 	.else
 	call error_entry
 	.endif
+	/* returned flag: ebx=0: need swapgs on exit, ebx=1: don't need it */
 
 	DEFAULT_FRAME 0
 
@@ -1034,10 +1035,11 @@ ENTRY(\sym)
 	addq $EXCEPTION_STKSZ, INIT_TSS_IST(\shift_ist)
 	.endif
 
+	/* these procedures expect "no swapgs" flag in ebx */
 	.if \paranoid
-	jmp paranoid_exit		/* %ebx: no swapgs flag */
+	jmp paranoid_exit
 	.else
-	jmp error_exit			/* %ebx: no swapgs flag */
+	jmp error_exit
 	.endif
 
 	CFI_ENDPROC
@@ -1242,36 +1244,40 @@ idtentry async_page_fault do_async_page_fault has_error_code=1
 idtentry machine_check has_error_code=0 paranoid=1 do_sym=*machine_check_vector(%rip)
 #endif
 
-ENTRY(save_paranoid)
-	XCPT_FRAME 1 RDI+8
+/*
+ * Save all registers in pt_regs, and switch gs if needed.
+ * Use slow, but surefire "are we in kernel?" check.
+ * Return: ebx=0: need swapgs on exit, ebx=1: otherwise
+ */
+ENTRY(paranoid_entry)
+	XCPT_FRAME 1 15*8
 	cld
 	SAVE_C_REGS 8
 	SAVE_EXTRA_REGS 8
-	movl $1,%ebx
-	movl $MSR_GS_BASE,%ecx
+	movl	$1,%ebx
+	movl	$MSR_GS_BASE,%ecx
 	rdmsr
-	testl %edx,%edx
-	js 1f	/* negative -> in kernel */
+	testl	%edx,%edx
+	js	1f	/* negative -> in kernel */
 	SWAPGS
-	xorl %ebx,%ebx
+	xorl	%ebx,%ebx
 1:	ret
 	CFI_ENDPROC
-END(save_paranoid)
-
-	/*
-	 * "Paranoid" exit path from exception stack.
-	 * Paranoid because this is used by NMIs and cannot take
-	 * any kernel state for granted.
-	 * We don't do kernel preemption checks here, because only
-	 * NMI should be common and it does not enable IRQs and
-	 * cannot get reschedule ticks.
-	 *
-	 * "trace" is 0 for the NMI handler only, because irq-tracing
-	 * is fundamentally NMI-unsafe. (we cannot change the soft and
-	 * hard flags at once, atomically)
-	 */
+END(paranoid_entry)
 
-	/* ebx:	no swapgs flag */
+/*
+ * "Paranoid" exit path from exception stack.
+ * Paranoid because this is used by NMIs and cannot take
+ * any kernel state for granted.
+ * We don't do kernel preemption checks here, because only
+ * NMI should be common and it does not enable IRQs and
+ * cannot get reschedule ticks.
+ *
+ * "trace" is 0 for the NMI handler only, because irq-tracing
+ * is fundamentally NMI-unsafe. (we cannot change the soft and
+ * hard flags at once, atomically)
+ */
+/* On entry, ebx is "no swapgs" flag (1: don't need swapgs, 0: need it) */
 ENTRY(paranoid_exit)
 	DEFAULT_FRAME
 	DISABLE_INTERRUPTS(CLBR_NONE)
@@ -1323,53 +1329,48 @@ paranoid_schedule:
 END(paranoid_exit)
 
 /*
- * Exception entry point. This expects an error code/orig_rax on the stack.
- * returns in "no swapgs flag" in %ebx.
+ * Save all registers in pt_regs, and switch gs if needed.
+ * Return: ebx=0: need swapgs on exit, ebx=1: otherwise
  */
 ENTRY(error_entry)
-	XCPT_FRAME
-	CFI_ADJUST_CFA_OFFSET 15*8
-	/* oldrax contains error code */
+	XCPT_FRAME 1 15*8
 	cld
 	SAVE_C_REGS 8
 	SAVE_EXTRA_REGS 8
-	xorl %ebx,%ebx
-	testl $3,CS+8(%rsp)
-	je error_kernelspace
+	xorl	%ebx,%ebx
+	testl	$3,CS+8(%rsp)
+	jz error_kernelspace
 error_swapgs:
 	SWAPGS
 error_sti:
 	TRACE_IRQS_OFF
 	ret
-
-/*
- * There are two places in the kernel that can potentially fault with
- * usergs. Handle them here. The exception handlers after iret run with
- * kernel gs again, so don't set the user space flag. B stepping K8s
- * sometimes report an truncated RIP for IRET exceptions returning to
- * compat mode. Check for these here too.
- */
+	/*
+	 * There are two places in the kernel that can potentially fault with
+	 * usergs. Handle them here. The exception handlers after iret run with
+	 * kernel gs again, so don't set the user space flag. B stepping K8s
+	 * sometimes report a truncated RIP for IRET exceptions returning to
+	 * compat mode. Check for these here too.
+	 */
 error_kernelspace:
-	incl %ebx
-	leaq irq_return_iret(%rip),%rcx
-	cmpq %rcx,RIP+8(%rsp)
-	je error_swapgs
-	movl %ecx,%eax	/* zero extend */
-	cmpq %rax,RIP+8(%rsp)
-	je bstep_iret
-	cmpq $gs_change,RIP+8(%rsp)
-	je error_swapgs
-	jmp error_sti
-
+	incl	%ebx
+	leaq	irq_return_iret(%rip),%rcx
+	cmpq	%rcx,RIP+8(%rsp)
+	je	error_swapgs
+	movl	%ecx,%eax	/* zero extend */
+	cmpq	%rax,RIP+8(%rsp)
+	je	bstep_iret
+	cmpq	$gs_change,RIP+8(%rsp)
+	je	error_swapgs
+	jmp	error_sti
 bstep_iret:
 	/* Fix truncated RIP */
-	movq %rcx,RIP+8(%rsp)
-	jmp error_swapgs
+	movq	%rcx,RIP+8(%rsp)
+	jmp	error_swapgs
 	CFI_ENDPROC
 END(error_entry)
 
-
-/* ebx:	no swapgs flag (1: don't need swapgs, 0: need it) */
+/* On entry, ebx is "no swapgs" flag (1: don't need swapgs, 0: need it) */
 ENTRY(error_exit)
 	DEFAULT_FRAME
 	movl %ebx,%eax
@@ -1595,13 +1596,13 @@ end_repeat_nmi:
 	ALLOC_PTREGS_ON_STACK
 
 	/*
-	 * Use save_paranoid to handle SWAPGS, but no need to use paranoid_exit
+	 * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit
 	 * as we should not be calling schedule in NMI context.
 	 * Even with normal interrupts enabled. An NMI should not be
 	 * setting NEED_RESCHED or anything that normal interrupts and
 	 * exceptions might do.
 	 */
-	call save_paranoid
+	call paranoid_entry
 	DEFAULT_FRAME 0
 
 	/*
-- 
1.8.1.4


* [PATCH 11/17] x86: entry_64.S: fold test_in_nmi macro into its only user
  2014-08-08 17:44 [PATCH v4 0/17] x86: entry.S optimizations Denys Vlasenko
                   ` (9 preceding siblings ...)
  2014-08-08 17:44 ` [PATCH 10/17] x86: entry_64.S: rename save_paranoid to paranoid_entry, no code changes Denys Vlasenko
@ 2014-08-08 17:44 ` Denys Vlasenko
  2014-08-08 17:44 ` [PATCH 12/17] x86: get rid of KERNEL_STACK_OFFSET Denys Vlasenko
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Denys Vlasenko @ 2014-08-08 17:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Andy Lutomirski, Frederic Weisbecker, X86 ML, Alexei Starovoitov,
	Will Drewry, Kees Cook

No code changes.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Oleg Nesterov <oleg@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: X86 ML <x86@kernel.org>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/entry_64.S | 24 +++++++++---------------
 1 file changed, 9 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index decfe25..bf882af 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1389,19 +1389,7 @@ ENTRY(error_exit)
 	CFI_ENDPROC
 END(error_exit)
 
-/*
- * Test if a given stack is an NMI stack or not.
- */
-	.macro test_in_nmi reg stack nmi_ret normal_ret
-	cmpq %\reg, \stack
-	ja \normal_ret
-	subq $EXCEPTION_STKSZ, %\reg
-	cmpq %\reg, \stack
-	jb \normal_ret
-	jmp \nmi_ret
-	.endm
-
-	/* runs on exception stack */
+/* Runs on exception stack */
 ENTRY(nmi)
 	INTR_FRAME
 	PARAVIRT_ADJUST_EXCEPTION_FRAME
@@ -1462,8 +1450,14 @@ ENTRY(nmi)
 	 * We check the variable because the first NMI could be in a
 	 * breakpoint routine using a breakpoint stack.
 	 */
-	lea 6*8(%rsp), %rdx
-	test_in_nmi rdx, 4*8(%rsp), nested_nmi, first_nmi
+	lea	6*8(%rsp), %rdx
+	cmpq	%rdx, 4*8(%rsp)
+	ja	first_nmi
+	subq	$EXCEPTION_STKSZ, %rdx
+	cmpq	%rdx, 4*8(%rsp)
+	jb	first_nmi
+	jmp	nested_nmi
+
 	CFI_REMEMBER_STATE
 
 nested_nmi:
-- 
1.8.1.4


* [PATCH 12/17] x86: get rid of KERNEL_STACK_OFFSET
  2014-08-08 17:44 [PATCH v4 0/17] x86: entry.S optimizations Denys Vlasenko
                   ` (10 preceding siblings ...)
  2014-08-08 17:44 ` [PATCH 11/17] x86: entry_64.S: fold test_in_nmi macro into its only user Denys Vlasenko
@ 2014-08-08 17:44 ` Denys Vlasenko
  2014-08-08 17:44 ` [PATCH 13/17] x86: ia32entry.S: fold IA32_ARG_FIXUP macro into its callers Denys Vlasenko
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Denys Vlasenko @ 2014-08-08 17:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Andy Lutomirski, Frederic Weisbecker, X86 ML, Alexei Starovoitov,
	Will Drewry, Kees Cook

PER_CPU_VAR(kernel_stack) was set up so that it points
five stack slots below the top of the stack.

Presumably, this was done to avoid one "sub $5*8,%rsp"
in the syscall/sysenter code paths, where the iret frame
needs to be created by hand.

Ironically, none of those paths benefits from this optimization:
they all need to allocate additional data on the stack
(struct pt_regs), so they still have to perform the subtraction.
And ia32_sysenter_target even needs to *undo* this optimization:
it constructs the iret stack with pushes instead of movs,
so it needs to start right at the top.

This patch eliminates KERNEL_STACK_OFFSET.
PER_CPU_VAR(kernel_stack) now points directly to the top of the stack.
pt_regs allocations are adjusted to allocate the iret frame as well.

Semi-mysterious expressions such as THREAD_INFO(%rsp,RIP) - "why RIP??" -
are now replaced by the more logical THREAD_INFO(%rsp,SIZEOF_PTREGS):
"rsp is SIZEOF_PTREGS bytes below the top of the stack; calculate
thread_info's address using that information."

The net result in code is that one instruction is eliminated,
and the constants in several others have changed.
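
To make the arithmetic concrete (a sketch based on the macro as
redefined by this patch):

	/* THREAD_INFO(reg, off) expands to (off)-THREAD_SIZE(reg),
	 * so with %rsp == top_of_stack - SIZEOF_PTREGS:
	 *   TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
	 * = TI_flags + SIZEOF_PTREGS - THREAD_SIZE(%rsp)
	 * = TI_flags + (top_of_stack - THREAD_SIZE)
	 * i.e. the TI_flags field of the thread_info sitting at the
	 * bottom of the thread stack.
	 */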

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Oleg Nesterov <oleg@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: X86 ML <x86@kernel.org>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: linux-kernel@vger.kernel.org
---
 arch/x86/ia32/ia32entry.S          | 35 +++++++++++++++++------------------
 arch/x86/include/asm/calling.h     |  2 ++
 arch/x86/include/asm/thread_info.h |  8 +++-----
 arch/x86/kernel/cpu/common.c       |  2 +-
 arch/x86/kernel/entry_64.S         | 13 ++++++-------
 arch/x86/kernel/process_32.c       |  3 +--
 arch/x86/kernel/process_64.c       |  3 +--
 arch/x86/kernel/smpboot.c          |  3 +--
 arch/x86/xen/smp.c                 |  3 +--
 9 files changed, 33 insertions(+), 39 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index fe6eaf7..8cd08de 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -125,7 +125,6 @@ ENTRY(ia32_sysenter_target)
 	CFI_REGISTER	rsp,rbp
 	SWAPGS_UNSAFE_STACK
 	movq	PER_CPU_VAR(kernel_stack), %rsp
-	addq	$(KERNEL_STACK_OFFSET),%rsp
 	/*
 	 * No need to follow this irqs on/off section: the syscall
 	 * disabled irqs, here we enable it straight after entry:
@@ -139,7 +138,7 @@ ENTRY(ia32_sysenter_target)
 	CFI_REL_OFFSET rsp,0
 	pushfq_cfi
 	/*CFI_REL_OFFSET rflags,0*/
-	movl	TI_sysenter_return+THREAD_INFO(%rsp,3*8-KERNEL_STACK_OFFSET),%r10d
+	movl	TI_sysenter_return+THREAD_INFO(%rsp,3*8),%r10d
 	CFI_REGISTER rip,r10
 	pushq_cfi $__USER32_CS
 	/*CFI_REL_OFFSET cs,0*/
@@ -161,8 +160,8 @@ ENTRY(ia32_sysenter_target)
 1:	movl	(%rbp),%ebp
 	_ASM_EXTABLE(1b,ia32_badarg)
 	ASM_CLAC
-	orl     $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
-	testl   $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
+	orl     $TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS)
+	testl   $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	CFI_REMEMBER_STATE
 	jnz  sysenter_tracesys
 	cmpq	$(IA32_NR_syscalls-1),%rax
@@ -174,10 +173,10 @@ sysenter_dispatch:
 	movq	%rax,RAX(%rsp)
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
-	testl	$_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
+	testl	$_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	jnz	sysexit_audit
 sysexit_from_sys_call:
-	andl    $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
+	andl    $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	/* clear IF, that popfq doesn't enable interrupts early */
 	andl	$~0x200,EFLAGS(%rsp)
 	movl	RIP(%rsp),%edx		/* User %eip */
@@ -220,7 +219,7 @@ sysexit_from_sys_call:
 	.endm
 
 	.macro auditsys_exit exit
-	testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP)
+	testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	jnz ia32_ret_from_sys_call
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
@@ -235,7 +234,7 @@ sysexit_from_sys_call:
 	movl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),%edi
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
-	testl %edi,TI_flags+THREAD_INFO(%rsp,RIP)
+	testl %edi,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	jz \exit
 	CLEAR_RREGS
 	jmp int_with_check
@@ -253,7 +252,7 @@ sysexit_audit:
 
 sysenter_tracesys:
 #ifdef CONFIG_AUDITSYSCALL
-	testl	$(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP)
+	testl	$(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	jz	sysenter_auditsys
 #endif
 	SAVE_EXTRA_REGS
@@ -302,7 +301,7 @@ ENDPROC(ia32_sysenter_target)
 ENTRY(ia32_cstar_target)
 	CFI_STARTPROC32	simple
 	CFI_SIGNAL_FRAME
-	CFI_DEF_CFA	rsp,KERNEL_STACK_OFFSET
+	CFI_DEF_CFA	rsp,0
 	CFI_REGISTER	rip,rcx
 	/*CFI_REGISTER	rflags,r11*/
 	SWAPGS_UNSAFE_STACK
@@ -314,7 +313,7 @@ ENTRY(ia32_cstar_target)
 	 * disabled irqs and here we enable it straight after entry:
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
-	ALLOC_PTREGS_ON_STACK 8	/* +8: space for orig_ax */
+	ALLOC_PTREGS_ON_STACK 6*8	/* space for orig_ax and iret frame */
 	SAVE_C_REGS_EXCEPT_RCX_R891011
 	movl 	%eax,%eax	/* zero extension */
 	movq	%rax,ORIG_RAX(%rsp)
@@ -337,8 +336,8 @@ ENTRY(ia32_cstar_target)
 1:	movl	(%r8),%r9d
 	_ASM_EXTABLE(1b,ia32_badarg)
 	ASM_CLAC
-	orl     $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
-	testl   $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
+	orl     $TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS)
+	testl   $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	CFI_REMEMBER_STATE
 	jnz   cstar_tracesys
 	cmpq $IA32_NR_syscalls-1,%rax
@@ -350,10 +349,10 @@ cstar_dispatch:
 	movq %rax,RAX(%rsp)
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
-	testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
+	testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	jnz sysretl_audit
 sysretl_from_sys_call:
-	andl $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
+	andl $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	RESTORE_RSI_RDI_RDX
 	movl RIP(%rsp),%ecx
 	CFI_REGISTER rip,rcx
@@ -388,7 +387,7 @@ sysretl_audit:
 
 cstar_tracesys:
 #ifdef CONFIG_AUDITSYSCALL
-	testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP)
+	testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	jz cstar_auditsys
 #endif
 	xchgl %r9d,%ebp
@@ -455,8 +454,8 @@ ENTRY(ia32_syscall)
 	   this could be a problem. */
 	ALLOC_PTREGS_ON_STACK
 	SAVE_C_REGS_EXCEPT_R891011
-	orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
-	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
+	orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS)
+	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	jnz ia32_tracesys
 	cmpq $(IA32_NR_syscalls-1),%rax
 	ja ia32_badsys
diff --git a/arch/x86/include/asm/calling.h b/arch/x86/include/asm/calling.h
index e8e2e41..aa9113e 100644
--- a/arch/x86/include/asm/calling.h
+++ b/arch/x86/include/asm/calling.h
@@ -88,6 +88,8 @@ For 32-bit we have the following conventions - kernel is built with
 #define RSP		19*8
 #define SS		20*8
 
+#define SIZEOF_PTREGS	21*8
+
 	.macro ALLOC_PTREGS_ON_STACK addskip=0
 	subq	$15*8+\addskip, %rsp
 	CFI_ADJUST_CFA_OFFSET 15*8+\addskip
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 8540538..af9b74c 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -151,7 +151,6 @@ struct thread_info {
 #define _TIF_WORK_CTXSW_NEXT (_TIF_WORK_CTXSW)
 
 #define STACK_WARN		(THREAD_SIZE/8)
-#define KERNEL_STACK_OFFSET	(5*(BITS_PER_LONG/8))
 
 /*
  * macros/functions for gaining access to the thread information structure
@@ -165,8 +164,7 @@ DECLARE_PER_CPU(unsigned long, kernel_stack);
 static inline struct thread_info *current_thread_info(void)
 {
 	struct thread_info *ti;
-	ti = (void *)(this_cpu_read_stable(kernel_stack) +
-		      KERNEL_STACK_OFFSET - THREAD_SIZE);
+	ti = (void *)(this_cpu_read_stable(kernel_stack) - THREAD_SIZE);
 	return ti;
 }
 
@@ -175,13 +173,13 @@ static inline struct thread_info *current_thread_info(void)
 /* how to get the thread information struct from ASM */
 #define GET_THREAD_INFO(reg) \
 	_ASM_MOV PER_CPU_VAR(kernel_stack),reg ; \
-	_ASM_SUB $(THREAD_SIZE-KERNEL_STACK_OFFSET),reg ;
+	_ASM_SUB $(THREAD_SIZE),reg ;
 
 /*
  * Same if PER_CPU_VAR(kernel_stack) is, perhaps with some offset, already in
  * a certain register (to be used in assembler memory operands).
  */
-#define THREAD_INFO(reg, off) KERNEL_STACK_OFFSET+(off)-THREAD_SIZE(reg)
+#define THREAD_INFO(reg, off) (off)-THREAD_SIZE(reg)
 
 #endif
 
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index ef1b93f..605034f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1114,7 +1114,7 @@ static __init int setup_disablecpuid(char *arg)
 __setup("clearcpuid=", setup_disablecpuid);
 
 DEFINE_PER_CPU(unsigned long, kernel_stack) =
-	(unsigned long)&init_thread_union - KERNEL_STACK_OFFSET + THREAD_SIZE;
+	(unsigned long)&init_thread_union + THREAD_SIZE;
 EXPORT_PER_CPU_SYMBOL(kernel_stack);
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index bf882af..f652ac6 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -237,7 +237,7 @@ ENDPROC(native_usergs_sysret64)
 ENTRY(system_call)
 	CFI_STARTPROC	simple
 	CFI_SIGNAL_FRAME
-	CFI_DEF_CFA	rsp,KERNEL_STACK_OFFSET
+	CFI_DEF_CFA	rsp,0
 	CFI_REGISTER	rip,rcx
 	/*CFI_REGISTER	rflags,r11*/
 	SWAPGS_UNSAFE_STACK
@@ -249,19 +249,18 @@ ENTRY(system_call)
 GLOBAL(system_call_after_swapgs)
 
 	movq	%rsp,PER_CPU_VAR(old_rsp)
-	/* kernel_stack is set so that 5 slots (iret frame) are preallocated */
 	movq	PER_CPU_VAR(kernel_stack),%rsp
 	/*
 	 * No need to follow this irqs off/on section - it's straight
 	 * and short:
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
-	ALLOC_PTREGS_ON_STACK 8		/* +8: space for orig_ax */
+	ALLOC_PTREGS_ON_STACK 6*8	/* space for orig_ax and iret frame */
 	SAVE_C_REGS
 	movq  %rax,ORIG_RAX(%rsp)
 	movq  %rcx,RIP(%rsp)
 	CFI_REL_OFFSET rip,RIP
-	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
+	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	jnz tracesys
 system_call_fastpath:
 #if __SYSCALL_MASK == ~0
@@ -285,7 +284,7 @@ sysret_check:
 	LOCKDEP_SYS_EXIT
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
-	movl TI_flags+THREAD_INFO(%rsp,RIP),%edx
+	movl TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS),%edx
 	andl %edi,%edx
 	jnz  sysret_careful
 	CFI_REMEMBER_STATE
@@ -373,7 +372,7 @@ sysret_audit:
 	/* Do syscall tracing */
 tracesys:
 #ifdef CONFIG_AUDITSYSCALL
-	testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP)
+	testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	jz auditsys
 #endif
 	SAVE_EXTRA_REGS
@@ -879,7 +878,7 @@ __do_double_fault:
 	jne do_double_fault		/* This shouldn't happen... */
 1:
 	movq PER_CPU_VAR(kernel_stack),%rax
-	subq $(6*8-KERNEL_STACK_OFFSET),%rax	/* Reset to original stack */
+	subq $(6*8),%rax		/* Reset to original stack */
 	movq %rax,RSP(%rdi)
 	movq $0,(%rax)			/* Missing (lost) #GP error code */
 	movq $general_protection,RIP(%rdi)
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 7bc86bb..e72b776 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -315,8 +315,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	arch_end_context_switch(next_p);
 
 	this_cpu_write(kernel_stack,
-		  (unsigned long)task_stack_page(next_p) +
-		  THREAD_SIZE - KERNEL_STACK_OFFSET);
+		(unsigned long)task_stack_page(next_p) + THREAD_SIZE);
 
 	/*
 	 * Restore %gs if needed (which is common)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index ca5b02d..2105487e 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -372,8 +372,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	this_cpu_write(__preempt_count, task_thread_info(next_p)->saved_preempt_count);
 
 	this_cpu_write(kernel_stack,
-		  (unsigned long)task_stack_page(next_p) +
-		  THREAD_SIZE - KERNEL_STACK_OFFSET);
+		(unsigned long)task_stack_page(next_p) + THREAD_SIZE);
 
 	/*
 	 * Now maybe reload the debug registers and handle I/O bitmaps
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 5492798..788e215 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -775,8 +775,7 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
 	initial_gs = per_cpu_offset(cpu);
 #endif
 	per_cpu(kernel_stack, cpu) =
-		(unsigned long)task_stack_page(idle) -
-		KERNEL_STACK_OFFSET + THREAD_SIZE;
+		(unsigned long)task_stack_page(idle) + THREAD_SIZE;
 	early_gdt_descr.address = (unsigned long)get_cpu_gdt_table(cpu);
 	initial_code = (unsigned long)start_secondary;
 	stack_start  = idle->thread.sp;
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 7005974..bf549d5 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -443,8 +443,7 @@ static int xen_cpu_up(unsigned int cpu, struct task_struct *idle)
 	clear_tsk_thread_flag(idle, TIF_FORK);
 #endif
 	per_cpu(kernel_stack, cpu) =
-		(unsigned long)task_stack_page(idle) -
-		KERNEL_STACK_OFFSET + THREAD_SIZE;
+		(unsigned long)task_stack_page(idle) + THREAD_SIZE;
 
 	xen_setup_runstate_info(cpu);
 	xen_setup_timer(cpu);
-- 
1.8.1.4


* [PATCH 13/17] x86: ia32entry.S: fold IA32_ARG_FIXUP macro into its callers
  2014-08-08 17:44 [PATCH v4 0/17] x86: entry.S optimizations Denys Vlasenko
                   ` (11 preceding siblings ...)
  2014-08-08 17:44 ` [PATCH 12/17] x86: get rid of KERNEL_STACK_OFFSET Denys Vlasenko
@ 2014-08-08 17:44 ` Denys Vlasenko
  2014-08-08 17:44 ` [PATCH 14/17] x86: ia32entry.S: use mov instead of push/pop where possible Denys Vlasenko
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Denys Vlasenko @ 2014-08-08 17:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Andy Lutomirski, Frederic Weisbecker, X86 ML, Alexei Starovoitov,
	Will Drewry, Kees Cook

Use of a small macro - one with conditional expansion - does more harm
than good here: it obfuscates the code while providing minimal reuse.

For example, because of the obfuscation it is not obvious that
in ia32_sysenter_target we could optimize the loading of r9 -
currently it is loaded via a detour through ebp.

This patch folds IA32_ARG_FIXUP macro into its callers.

No code changes.
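
For reference, the register mapping being open-coded below is the
following (a sketch; roles per the i386 syscall ABI on the left and
the x86-64 C function ABI on the right):

	# arg1: %ebx  ->  %rdi
	# arg2: %ecx  ->  %rsi
	# arg3: %edx  ->  %rdx	(zero-extended in place)
	# arg4: %esi  ->  %rcx
	# arg5: %edi  ->  %r8
	# arg6: %ebp  ->  %r9	(already loaded on the cstar path)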

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Oleg Nesterov <oleg@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: X86 ML <x86@kernel.org>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: linux-kernel@vger.kernel.org
---
 arch/x86/ia32/ia32entry.S | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 8cd08de..c70c9a0 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -30,17 +30,6 @@
 
 	.section .entry.text, "ax"
 
-	.macro IA32_ARG_FIXUP noebp=0
-	movl	%edi,%r8d
-	.if \noebp
-	.else
-	movl	%ebp,%r9d
-	.endif
-	xchg	%ecx,%esi
-	movl	%ebx,%edi
-	movl	%edx,%edx	/* zero extension */
-	.endm 
-
 	/* clobbers %rax */
 	.macro  CLEAR_RREGS _r9=rax
 	xorl 	%eax,%eax
@@ -167,7 +156,12 @@ ENTRY(ia32_sysenter_target)
 	cmpq	$(IA32_NR_syscalls-1),%rax
 	ja	ia32_badsys
 sysenter_do_call:
-	IA32_ARG_FIXUP
+	/* 32bit syscall -> 64bit C ABI argument conversion */
+	movl	%edi,%r8d	/* arg5 */
+	movl	%ebp,%r9d	/* arg6 */
+	xchg	%ecx,%esi	/* rsi:arg2, rcx:arg4 */
+	movl	%ebx,%edi	/* arg1 */
+	movl	%edx,%edx	/* arg3 (zero extension) */
 sysenter_dispatch:
 	call	*ia32_sys_call_table(,%rax,8)
 	movq	%rax,RAX(%rsp)
@@ -343,7 +337,12 @@ ENTRY(ia32_cstar_target)
 	cmpq $IA32_NR_syscalls-1,%rax
 	ja  ia32_badsys
 cstar_do_call:
-	IA32_ARG_FIXUP 1
+	/* 32bit syscall -> 64bit C ABI argument conversion */
+	movl	%edi,%r8d	/* arg5 */
+	/* r9 already loaded */	/* arg6 */
+	xchg	%ecx,%esi	/* rsi:arg2, rcx:arg4 */
+	movl	%ebx,%edi	/* arg1 */
+	movl	%edx,%edx	/* arg3 (zero extension) */
 cstar_dispatch:
 	call *ia32_sys_call_table(,%rax,8)
 	movq %rax,RAX(%rsp)
@@ -460,7 +459,12 @@ ENTRY(ia32_syscall)
 	cmpq $(IA32_NR_syscalls-1),%rax
 	ja ia32_badsys
 ia32_do_call:
-	IA32_ARG_FIXUP
+	/* 32bit syscall -> 64bit C ABI argument conversion */
+	movl %edi,%r8d	/* arg5 */
+	movl %ebp,%r9d	/* arg6 */
+	xchg %ecx,%esi	/* rsi:arg2, rcx:arg4 */
+	movl %ebx,%edi	/* arg1 */
+	movl %edx,%edx	/* arg3 (zero extension) */
 	call *ia32_sys_call_table(,%rax,8) # xxx: rip relative
 ia32_sysret:
 	movq %rax,RAX(%rsp)
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 14/17] x86: ia32entry.S: use mov instead of push/pop where possible
  2014-08-08 17:44 [PATCH v4 0/17] x86: entry.S optimizations Denys Vlasenko
                   ` (12 preceding siblings ...)
  2014-08-08 17:44 ` [PATCH 13/17] x86: ia32entry.S: fold IA32_ARG_FIXUP macro into its callers Denys Vlasenko
@ 2014-08-08 17:44 ` Denys Vlasenko
  2014-08-08 17:44 ` [PATCH 15/17] x86: code shrink in paranoid_exit Denys Vlasenko
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Denys Vlasenko @ 2014-08-08 17:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Andy Lutomirski, Frederic Weisbecker, X86 ML, Alexei Starovoitov,
	Will Drewry, Kees Cook

mov insns are faster than push/pop insns: some CPUs can execute
two movs per cycle, but only one push or pop.

Logic is not changed by this patch.
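
Concretely, the kind of rewrite this enables (an illustrative sketch,
not the exact kernel code; __USER32_DS is the selector constant used
in ia32entry.S):

	/* before: every push both stores and moves %rsp */
	pushq	$__USER32_DS
	pushq	%rbp
	/* after: one %rsp adjustment, then independent stores */
	subq	$2*8, %rsp
	movq	$__USER32_DS, 1*8(%rsp)
	movq	%rbp, 0*8(%rsp)

The stores in the second form no longer serialize on successive %rsp
updates, which is where the cycle savings cited above would come from.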

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Oleg Nesterov <oleg@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: X86 ML <x86@kernel.org>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: linux-kernel@vger.kernel.org
---
 arch/x86/ia32/ia32entry.S | 54 ++++++++++++++++++++++-------------------------
 1 file changed, 25 insertions(+), 29 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index c70c9a0..844ef4f 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -119,28 +119,25 @@ ENTRY(ia32_sysenter_target)
 	 * disabled irqs, here we enable it straight after entry:
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
-	/* Construct iret frame (ss,rsp,rflags,cs,rip) */
+	/* Construct part of iret frame (ss,rsp,rflags) */
  	movl	%ebp,%ebp		/* zero extension */
 	pushq_cfi $__USER32_DS
 	/*CFI_REL_OFFSET ss,0*/
+	movl 	%eax,%eax		/* zero extension */
 	pushq_cfi %rbp
 	CFI_REL_OFFSET rsp,0
 	pushfq_cfi
 	/*CFI_REL_OFFSET rflags,0*/
-	movl	TI_sysenter_return+THREAD_INFO(%rsp,3*8),%r10d
-	CFI_REGISTER rip,r10
-	pushq_cfi $__USER32_CS
-	/*CFI_REL_OFFSET cs,0*/
-	movl	%eax, %eax
-	/* Store thread_info->sysenter_return in rip stack slot */
-	pushq_cfi %r10
-	CFI_REL_OFFSET rip,0
-	/* Store orig_ax */
-	pushq_cfi %rax
-	/* Construct the rest of "struct pt_regs" */
 	cld
-	ALLOC_PTREGS_ON_STACK
+	/* Construct the rest of pt_regs */
+	ALLOC_PTREGS_ON_STACK 3*8 /* 3*8: space for orig_ax,rip,cs */
+	movl	TI_sysenter_return+THREAD_INFO(%rsp,SIZEOF_PTREGS),%r10d
 	SAVE_C_REGS_EXCEPT_R891011
+	movq	%rax,ORIG_RAX(%rsp)
+	/* pt_regs->ip = thread_info->sysenter_return */
+	movq	%r10,RIP(%rsp)
+	CFI_REL_OFFSET rip,RIP
+	movq	$__USER32_CS,CS(%rsp)
 	/*
 	 * no need to do an access_ok check here because rbp has been
 	 * 32bit zero extended
@@ -171,10 +168,12 @@ sysenter_dispatch:
 	jnz	sysexit_audit
 sysexit_from_sys_call:
 	andl    $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS)
-	/* clear IF, that popfq doesn't enable interrupts early */
+	/* clear IF, so that popfq won't enable interrupts early */
 	andl	$~0x200,EFLAGS(%rsp)
 	movl	RIP(%rsp),%edx		/* User %eip */
 	CFI_REGISTER rip,rdx
+	movl	RSP(%rsp),%ecx		/* User %esp */
+	CFI_REGISTER rsp,rcx
 	RESTORE_RSI_RDI
 	/* pop everything except ss,rsp,rflags slots */
 	REMOVE_PTREGS_FROM_STACK 8*3
@@ -184,8 +183,6 @@ sysexit_from_sys_call:
 	xorq	%r11,%r11
 	popfq_cfi
 	/*CFI_RESTORE rflags*/
-	popq_cfi %rcx				/* User %esp */
-	CFI_REGISTER rsp,rcx
 	TRACE_IRQS_ON
 	/*
 	 * 32bit SYSEXIT restores eip from edx, esp from ecx.
@@ -307,21 +304,21 @@ ENTRY(ia32_cstar_target)
 	 * disabled irqs and here we enable it straight after entry:
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
-	ALLOC_PTREGS_ON_STACK 6*8	/* space for orig_ax and iret frame */
-	SAVE_C_REGS_EXCEPT_RCX_R891011
-	movl 	%eax,%eax	/* zero extension */
+	movl	%eax,%eax	/* zero extension */
+	ALLOC_PTREGS_ON_STACK 6*8 /* 6*8: space for orig_ax and iret frame */
 	movq	%rax,ORIG_RAX(%rsp)
 	movq	%rcx,RIP(%rsp)
 	CFI_REL_OFFSET rip,RIP
-	movq	%rbp,RCX(%rsp) /* this lies slightly to ptrace */
-	movl	%ebp,%ecx
 	movq	$__USER32_CS,CS(%rsp)
-	movq	$__USER32_DS,SS(%rsp)
 	movq	%r11,EFLAGS(%rsp)
 	/*CFI_REL_OFFSET rflags,EFLAGS*/
 	movq	%r8,RSP(%rsp)
 	CFI_REL_OFFSET rsp,RSP
-	/* iret stack frame is complete now */
+	movq	$__USER32_DS,SS(%rsp)
+	/* iret frame is complete now */
+	SAVE_C_REGS_EXCEPT_RCX_R891011
+	movq	%rbp,RCX(%rsp)	/* this lies slightly to ptrace */
+	movl	%ebp,%ecx
 	/*
 	 * no need to do an access_ok check here because r8 has been
 	 * 32bit zero extended
@@ -447,11 +444,11 @@ ENTRY(ia32_syscall)
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	movl %eax,%eax
-	pushq_cfi %rax		/* store orig_ax */
 	cld
 	/* note the registers are not zero extended to the sf.
 	   this could be a problem. */
-	ALLOC_PTREGS_ON_STACK
+	ALLOC_PTREGS_ON_STACK 1*8 /* 1*8: space for orig_ax */
+	movq %rax,ORIG_RAX(%rsp)
 	SAVE_C_REGS_EXCEPT_R891011
 	orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 15/17] x86: code shrink in paranoid_exit
  2014-08-08 17:44 [PATCH v4 0/17] x86: entry.S optimizations Denys Vlasenko
                   ` (13 preceding siblings ...)
  2014-08-08 17:44 ` [PATCH 14/17] x86: ia32entry.S: use mov instead of push/pop where possible Denys Vlasenko
@ 2014-08-08 17:44 ` Denys Vlasenko
  2014-08-08 17:44 ` [PATCH 16/17] x86: entry_64.S: trivial optimization for ENOSYS Denys Vlasenko
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 28+ messages in thread
From: Denys Vlasenko @ 2014-08-08 17:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Andy Lutomirski, Frederic Weisbecker, X86 ML, Alexei Starovoitov,
	Will Drewry, Kees Cook

RESTORE_EXTRA_REGS + RESTORE_C_REGS looks small, but it expands to
a lot of instructions (fourteen). Let's reuse the sequence instead
of duplicating it.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Oleg Nesterov <oleg@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: X86 ML <x86@kernel.org>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/entry_64.S | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index f652ac6..7bc8c24 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1288,12 +1288,10 @@ ENTRY(paranoid_exit)
 paranoid_swapgs:
 	TRACE_IRQS_IRETQ 0
 	SWAPGS_UNSAFE_STACK
-	RESTORE_EXTRA_REGS
-	RESTORE_C_REGS
-	REMOVE_PTREGS_FROM_STACK 8
-	jmp irq_return
+	jmp paranoid_restore1
 paranoid_restore:
 	TRACE_IRQS_IRETQ_DEBUG 0
+paranoid_restore1:
 	RESTORE_EXTRA_REGS
 	RESTORE_C_REGS
 	REMOVE_PTREGS_FROM_STACK 8
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 16/17] x86: entry_64.S: trivial optimization for ENOSYS
  2014-08-08 17:44 [PATCH v4 0/17] x86: entry.S optimizations Denys Vlasenko
                   ` (14 preceding siblings ...)
  2014-08-08 17:44 ` [PATCH 15/17] x86: code shrink in paranoid_exit Denys Vlasenko
@ 2014-08-08 17:44 ` Denys Vlasenko
  2014-08-08 22:48   ` Andy Lutomirski
  2014-08-08 17:44 ` [PATCH 17/17] x86: simplify iret stack handling on SYSCALL64 fastpath Denys Vlasenko
  2014-08-09  0:27 ` [PATCH v4 0/17] x86: entry.S optimizations H. Peter Anvin
  17 siblings, 1 reply; 28+ messages in thread
From: Denys Vlasenko @ 2014-08-08 17:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Andy Lutomirski, Frederic Weisbecker, X86 ML, Alexei Starovoitov,
	Will Drewry, Kees Cook

badsys now loads -ENOSYS into %rax and reuses the common store to
RAX(%rsp): the register form of the mov is a slightly shorter insn
than storing the immediate directly to memory.
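
The two encodings compared (a sketch; byte counts assume -ENOSYS fits
in a 32-bit immediate and that RAX(%rsp) uses an 8-bit displacement):

	movq	$-ENOSYS,RAX(%rsp)	# mov $imm32,disp8(%rsp): 9 bytes
	movq	$-ENOSYS,%rax		# mov $imm32,%rax: 7 bytes

The store to RAX(%rsp) still happens, but now through the from_badsys
path shared with the fast path.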

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Oleg Nesterov <oleg@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: X86 ML <x86@kernel.org>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/entry_64.S | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 7bc8c24..5d639a6 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -272,6 +272,7 @@ system_call_fastpath:
 	ja badsys
 	movq %r10,%rcx
 	call *sys_call_table(,%rax,8)  # XXX:	 rip relative
+from_badsys:
 	movq %rax,RAX(%rsp)
 /*
  * Syscall return path ending with SYSRET (fast path)
@@ -334,8 +335,8 @@ sysret_signal:
 	jmp int_check_syscall_exit_work
 
 badsys:
-	movq $-ENOSYS,RAX(%rsp)
-	jmp ret_from_sys_call
+	movq	$-ENOSYS,%rax
+	jmp	from_badsys
 
 #ifdef CONFIG_AUDITSYSCALL
 	/*
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 17/17] x86: simplify iret stack handling on SYSCALL64 fastpath
  2014-08-08 17:44 [PATCH v4 0/17] x86: entry.S optimizations Denys Vlasenko
                   ` (15 preceding siblings ...)
  2014-08-08 17:44 ` [PATCH 16/17] x86: entry_64.S: trivial optimization for ENOSYS Denys Vlasenko
@ 2014-08-08 17:44 ` Denys Vlasenko
  2014-08-08 22:59   ` Andy Lutomirski
  2014-08-10 18:47   ` [PATCH 17/17 v2] " Denys Vlasenko
  2014-08-09  0:27 ` [PATCH v4 0/17] x86: entry.S optimizations H. Peter Anvin
  17 siblings, 2 replies; 28+ messages in thread
From: Denys Vlasenko @ 2014-08-08 17:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Andy Lutomirski, Frederic Weisbecker, X86 ML, Alexei Starovoitov,
	Will Drewry, Kees Cook

Before this patch, rcx and r11 were saved in pt_regs->rcx
and pt_regs->r11. That looks natural, but it requires messy
shuffling to/from the iret stack whenever ptrace or e.g. iopl
wants to modify the return address or flags - because that is
how these registers are used by SYSCALL/SYSRET.

This patch saves rcx and r11 in pt_regs->rip and pt_regs->flags,
and uses these values for SYSRET64 insn. Shuffling is eliminated.

On the slow path, the values are saved in both locations - thus,
ptrace can modify rcx, and on syscall exit rcx will differ from the
return address. It is not clear why that would be useful, but it
works.

The lazy store of pt_regs->cs and pt_regs->ss is retained -
tests have shown that these stores do take ~2 cycles on the fast path.

FIXUP_TOP_OF_STACK and RESTORE_TOP_OF_STACK macros are replaced
by a simpler single macro, STORE_IRET_FRAME_CS_SS.

stub_iopl is no longer needed: pt_regs->flags needs no fixing up.

PER_CPU(old_rsp) usage is simplified - now it is used only
as temp storage, and the userspace stack pointer is immediately
stored in pt_regs->sp on syscall entry, instead of being used
later, on syscall exit. This allows us to get rid of
thread_struct::usersp.

Testing shows that syscall fast path is ~54.3 ns before
and after the patch (on 2.7 GHz Sandy Bridge CPU).
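
A minimal userspace illustration of the SYSCALL/SYSRET convention
relied on above (a sketch, not part of the patch; build with e.g.
"as demo.s -o demo.o && ld demo.o -o demo"). On the SYSRET fast path
the program exits 0; if the kernel returned via iret instead (e.g.
run it under strace), %rcx need not match and it exits 1:

		.globl	_start
	_start:
		movl	$39, %eax		# __NR_getpid
		syscall				# clobbers %rcx (rip) and %r11 (rflags)
	1:	leaq	1b(%rip), %rdi		# address of the insn after SYSCALL
		cmpq	%rdi, %rcx		# fast path: %rcx == return address
		movl	$0, %edi		# exit status 0 on match
		je	2f
		movl	$1, %edi		# mismatch: returned via iret path
	2:	movl	$60, %eax		# __NR_exit
		syscall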

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Oleg Nesterov <oleg@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: X86 ML <x86@kernel.org>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/calling.h   |  18 +++--
 arch/x86/include/asm/compat.h    |   2 +-
 arch/x86/include/asm/processor.h |   1 -
 arch/x86/include/asm/ptrace.h    |   8 +--
 arch/x86/kernel/entry_64.S       | 140 +++++++++++++++++----------------------
 arch/x86/kernel/process_64.c     |   8 +--
 arch/x86/syscalls/syscall_64.tbl |   2 +-
 arch/x86/um/sys_call_table_64.c  |   2 +-
 8 files changed, 78 insertions(+), 103 deletions(-)

diff --git a/arch/x86/include/asm/calling.h b/arch/x86/include/asm/calling.h
index aa9113e..7afbcea 100644
--- a/arch/x86/include/asm/calling.h
+++ b/arch/x86/include/asm/calling.h
@@ -95,7 +95,7 @@ For 32-bit we have the following conventions - kernel is built with
 	CFI_ADJUST_CFA_OFFSET 15*8+\addskip
 	.endm
 
-	.macro SAVE_C_REGS_HELPER offset=0 rcx=1 r8plus=1
+	.macro SAVE_C_REGS_HELPER offset=0 rcx=1 r8910=1 r11=1
 	movq_cfi rdi, 14*8+\offset
 	movq_cfi rsi, 13*8+\offset
 	movq_cfi rdx, 12*8+\offset
@@ -103,21 +103,26 @@ For 32-bit we have the following conventions - kernel is built with
 	movq_cfi rcx, 11*8+\offset
 	.endif
 	movq_cfi rax, 10*8+\offset
-	.if \r8plus
+	.if \r8910
 	movq_cfi r8,  9*8+\offset
 	movq_cfi r9,  8*8+\offset
 	movq_cfi r10, 7*8+\offset
+	.endif
+	.if \r11
 	movq_cfi r11, 6*8+\offset
 	.endif
 	.endm
 	.macro SAVE_C_REGS offset=0
-	SAVE_C_REGS_HELPER \offset, 1, 1
+	SAVE_C_REGS_HELPER \offset, 1, 1, 1
 	.endm
 	.macro SAVE_C_REGS_EXCEPT_R891011
-	SAVE_C_REGS_HELPER 0, 1, 0
+	SAVE_C_REGS_HELPER 0, 1, 0, 0
 	.endm
 	.macro SAVE_C_REGS_EXCEPT_RCX_R891011
-	SAVE_C_REGS_HELPER 0, 0, 0
+	SAVE_C_REGS_HELPER 0, 0, 0, 0
+	.endm
+	.macro SAVE_C_REGS_EXCEPT_RCX_R11
+	SAVE_C_REGS_HELPER 0, 0, 1, 0
 	.endm
 
 	.macro SAVE_EXTRA_REGS offset=0
@@ -171,6 +176,9 @@ For 32-bit we have the following conventions - kernel is built with
 	.macro RESTORE_C_REGS_EXCEPT_RCX
 	RESTORE_C_REGS_HELPER 1,0,1,1,1
 	.endm
+	.macro RESTORE_C_REGS_EXCEPT_RCX_R11
+	RESTORE_C_REGS_HELPER 1,0,0,1,1
+	.endm
 	.macro RESTORE_RSI_RDI
 	RESTORE_C_REGS_HELPER 0,0,0,0,0
 	.endm
diff --git a/arch/x86/include/asm/compat.h b/arch/x86/include/asm/compat.h
index 59c6c40..acdee09 100644
--- a/arch/x86/include/asm/compat.h
+++ b/arch/x86/include/asm/compat.h
@@ -301,7 +301,7 @@ static inline void __user *arch_compat_alloc_user_space(long len)
 		sp = task_pt_regs(current)->sp;
 	} else {
 		/* -128 for the x32 ABI redzone */
-		sp = this_cpu_read(old_rsp) - 128;
+		sp = task_pt_regs(current)->sp - 128;
 	}
 
 	return (void __user *)round_down(sp - len, 16);
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index a4ea023..33c86c6 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -474,7 +474,6 @@ struct thread_struct {
 #ifdef CONFIG_X86_32
 	unsigned long		sysenter_cs;
 #else
-	unsigned long		usersp;	/* Copy from PDA */
 	unsigned short		es;
 	unsigned short		ds;
 	unsigned short		fsindex;
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index c822b35..271c779 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -140,12 +140,8 @@ static inline bool user_64bit_mode(struct pt_regs *regs)
 #endif
 }
 
-#define current_user_stack_pointer()	this_cpu_read(old_rsp)
-/* ia32 vs. x32 difference */
-#define compat_user_stack_pointer()	\
-	(test_thread_flag(TIF_IA32) 	\
-	 ? current_pt_regs()->sp 	\
-	 : this_cpu_read(old_rsp))
+#define current_user_stack_pointer()	current_pt_regs()->sp
+#define compat_user_stack_pointer()	current_pt_regs()->sp
 #endif
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 5d639a6..efe9780 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -24,8 +24,6 @@
  * - CFI macros are used to generate dwarf2 unwind information for better
  * backtraces. They don't change any code.
  * - ENTRY/END Define functions in the symbol table.
- * - FIXUP_TOP_OF_STACK/RESTORE_TOP_OF_STACK - Fix up the hardware stack
- * frame that is otherwise undefined after a SYSCALL
  * - TRACE_IRQ_* - Trace hard interrupt state for lock debugging.
  * - idtentry - Define exception entry points.
  */
@@ -121,29 +119,13 @@ ENDPROC(native_usergs_sysret64)
 #endif
 
 /*
- * C code is not supposed to know about undefined top of stack. Every time
- * a C function with an pt_regs argument is called from the SYSCALL based
- * fast path FIXUP_TOP_OF_STACK is needed.
- * RESTORE_TOP_OF_STACK syncs the syscall state after any possible ptregs
- * manipulation.
+ * struct pt_regs is not fully populated on the fast path
+ * (rcx, r11, cs and ss are not filled in).
+ * This macro populates segment registers in iret frame.
  */
-
-	/* %rsp:at FRAMEEND */
-	.macro FIXUP_TOP_OF_STACK tmp offset=0
-	movq PER_CPU_VAR(old_rsp),\tmp
-	movq \tmp,RSP+\offset(%rsp)
-	movq $__USER_DS,SS+\offset(%rsp)
-	movq $__USER_CS,CS+\offset(%rsp)
-	movq $-1,RCX+\offset(%rsp)
-	movq R11+\offset(%rsp),\tmp  /* get eflags */
-	movq \tmp,EFLAGS+\offset(%rsp)
-	.endm
-
-	.macro RESTORE_TOP_OF_STACK tmp offset=0
-	movq RSP+\offset(%rsp),\tmp
-	movq \tmp,PER_CPU_VAR(old_rsp)
-	movq EFLAGS+\offset(%rsp),\tmp
-	movq \tmp,R11+\offset(%rsp)
+	.macro STORE_IRET_FRAME_CS_SS offset=0
+	movq	$__USER_CS,CS+\offset(%rsp)
+	movq	$__USER_DS,SS+\offset(%rsp)
 	.endm
 
 /*
@@ -226,12 +208,16 @@ ENDPROC(native_usergs_sysret64)
  * Interrupts are off on entry.
  * Only called from user space.
  *
- * XXX	if we had a free scratch register we could save the RSP into the stack frame
- *      and report it properly in ps. Unfortunately we haven't.
- *
  * When user can change the frames always force IRET. That is because
  * it deals with uncanonical addresses better. SYSRET has trouble
  * with them due to bugs in both AMD and Intel CPUs.
+ *
+ * When returning through fast path, userspace sees rcx = return address,
+ * r11 = rflags. When returning through iret (e.g. if audit is active),
+ * these registers may contain garbage.
+ * For ptrace we manage to avoid that: when we hit slow path on entry,
+ * we do save rcx and r11 in pt_regs, so ptrace on exit also sees them.
+ * If slow path is entered only on exit, there will be garbage.
  */
 
 ENTRY(system_call)
@@ -256,9 +242,16 @@ GLOBAL(system_call_after_swapgs)
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	ALLOC_PTREGS_ON_STACK 6*8	/* space for orig_ax and iret frame */
-	SAVE_C_REGS
-	movq  %rax,ORIG_RAX(%rsp)
-	movq  %rcx,RIP(%rsp)
+	SAVE_C_REGS_EXCEPT_RCX_R11
+	/* Fill orig_ax and, partially, iret frame (rip,flags,rsp) */
+	movq	%rax,ORIG_RAX(%rsp)
+	movq	%rcx,RIP(%rsp)
+	movq	PER_CPU_VAR(old_rsp),%rcx
+	/*movq	$__USER_CS,CS(%rsp) - moved out of fast path */
+	movq	%r11,EFLAGS(%rsp)
+	movq	%rcx,RSP(%rsp)
+	/*movq	$__USER_DS,SS(%rsp) - moved out of fast path */
+
 	CFI_REL_OFFSET rip,RIP
 	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	jnz tracesys
@@ -276,7 +269,7 @@ from_badsys:
 	movq %rax,RAX(%rsp)
 /*
  * Syscall return path ending with SYSRET (fast path)
- * Has incomplete stack frame and undefined top of stack.
+ * Has incompletely filled pt_regs and iret frame's ss and cs.
  */
 ret_from_sys_call:
 	movl $_TIF_ALLWORK_MASK,%edi
@@ -293,11 +286,12 @@ sysret_check:
 	 * sysretq will re-enable interrupts:
 	 */
 	TRACE_IRQS_ON
-	RESTORE_C_REGS_EXCEPT_RCX
-	movq RIP(%rsp),%rcx
+	RESTORE_C_REGS_EXCEPT_RCX_R11
+	movq	RIP(%rsp),%rcx
 	CFI_REGISTER	rip,rcx
+	movq	EFLAGS(%rsp),%r11
 	/*CFI_REGISTER	rflags,r11*/
-	movq	PER_CPU_VAR(old_rsp), %rsp
+	movq	RSP(%rsp),%rsp
 	/*
 	 * 64bit SYSRET restores rip from rcx,
 	 * rflags from r11 (but RF and VM bits are forced to 0),
@@ -309,6 +303,7 @@ sysret_check:
 	/* Handle reschedules */
 	/* edx:	work, edi: workmask */
 sysret_careful:
+	STORE_IRET_FRAME_CS_SS
 	bt $TIF_NEED_RESCHED,%edx
 	jnc sysret_signal
 	TRACE_IRQS_ON
@@ -331,7 +326,6 @@ sysret_signal:
 	 * These all wind up with the iret return path anyway,
 	 * so just join that path right now.
 	 */
-	FIXUP_TOP_OF_STACK %r11
 	jmp int_check_syscall_exit_work
 
 badsys:
@@ -370,15 +364,18 @@ sysret_audit:
 	jmp sysret_check
 #endif	/* CONFIG_AUDITSYSCALL */
 
-	/* Do syscall tracing */
+	/* Do entry syscall tracing */
 tracesys:
+	movq	RIP(%rsp),%rcx
+	STORE_IRET_FRAME_CS_SS /* fill the rest of iret frame */
+	movq	%r11,R11(%rsp) /* fill the rest of pt_regs */
+	movq	%rcx,RCX(%rsp) /* ditto */
 #ifdef CONFIG_AUDITSYSCALL
 	testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	jz auditsys
 #endif
 	SAVE_EXTRA_REGS
 	movq $-ENOSYS,RAX(%rsp) /* ptrace can change this for a bad syscall */
-	FIXUP_TOP_OF_STACK %rdi
 	movq %rsp,%rdi
 	call syscall_trace_enter
 	/*
@@ -402,7 +399,7 @@ tracesys:
 
 /*
  * Syscall return path ending with IRET.
- * Has correct top of stack, but partial stack frame.
+ * Has correct top of stack, but possibly only partially filled pt_regs.
  */
 GLOBAL(int_ret_from_sys_call)
 	DISABLE_INTERRUPTS(CLBR_NONE)
@@ -469,43 +466,15 @@ ENTRY(stub_\func)
 	CFI_STARTPROC
 	DEFAULT_FRAME 0, 8		/* offset 8: return address */
 	SAVE_EXTRA_REGS 8
-	FIXUP_TOP_OF_STACK %r11, 8
-	call sys_\func
-	RESTORE_TOP_OF_STACK %r11, 8
-	ret
+	STORE_IRET_FRAME_CS_SS 8	/* not sure we need this. paranoia */
+	jmp sys_\func
 	CFI_ENDPROC
 END(stub_\func)
 	.endm
 
-	.macro FIXED_FRAME label,func
-ENTRY(\label)
-	CFI_STARTPROC
-	DEFAULT_FRAME 0, 8		/* offset 8: return address */
-	FIXUP_TOP_OF_STACK %r11, 8
-	call \func
-	RESTORE_TOP_OF_STACK %r11, 8
-	ret
-	CFI_ENDPROC
-END(\label)
-	.endm
-
 	FORK_LIKE  clone
 	FORK_LIKE  fork
 	FORK_LIKE  vfork
-	FIXED_FRAME stub_iopl, sys_iopl
-
-ENTRY(stub_execve)
-	CFI_STARTPROC
-	addq $8, %rsp
-	DEFAULT_FRAME 0
-	SAVE_EXTRA_REGS
-	FIXUP_TOP_OF_STACK %r11
-	call sys_execve
-	movq %rax,RAX(%rsp)
-	RESTORE_EXTRA_REGS
-	jmp int_ret_from_sys_call
-	CFI_ENDPROC
-END(stub_execve)
 
 /*
  * sigreturn is special because it needs to restore all registers on return.
@@ -516,25 +485,35 @@ ENTRY(stub_rt_sigreturn)
 	addq $8, %rsp
 	DEFAULT_FRAME 0
 	SAVE_EXTRA_REGS
-	FIXUP_TOP_OF_STACK %r11
+	STORE_IRET_FRAME_CS_SS
 	call sys_rt_sigreturn
-	movq %rax,RAX(%rsp) # fixme, this could be done at the higher layer
+stub_return:
+	movq %rax,RAX(%rsp)
 	RESTORE_EXTRA_REGS
 	jmp int_ret_from_sys_call
 	CFI_ENDPROC
 END(stub_rt_sigreturn)
 
+ENTRY(stub_execve)
+	CFI_STARTPROC
+	addq $8, %rsp
+	DEFAULT_FRAME 0
+	SAVE_EXTRA_REGS
+	STORE_IRET_FRAME_CS_SS
+	call sys_execve
+	jmp stub_return
+	CFI_ENDPROC
+END(stub_execve)
+
 #ifdef CONFIG_X86_X32_ABI
 ENTRY(stub_x32_rt_sigreturn)
 	CFI_STARTPROC
 	addq $8, %rsp
 	DEFAULT_FRAME 0
 	SAVE_EXTRA_REGS
-	FIXUP_TOP_OF_STACK %r11
+	STORE_IRET_FRAME_CS_SS
 	call sys32_x32_rt_sigreturn
-	movq %rax,RAX(%rsp) # fixme, this could be done at the higher layer
-	RESTORE_EXTRA_REGS
-	jmp int_ret_from_sys_call
+	jmp stub_return
 	CFI_ENDPROC
 END(stub_x32_rt_sigreturn)
 
@@ -543,15 +522,11 @@ ENTRY(stub_x32_execve)
 	addq $8, %rsp
 	DEFAULT_FRAME 0
 	SAVE_EXTRA_REGS
-	FIXUP_TOP_OF_STACK %r11
+	STORE_IRET_FRAME_CS_SS
 	call compat_sys_execve
-	RESTORE_TOP_OF_STACK %r11
-	movq %rax,RAX(%rsp)
-	RESTORE_EXTRA_REGS
-	jmp int_ret_from_sys_call
+	jmp stub_return
 	CFI_ENDPROC
 END(stub_x32_execve)
-
 #endif
 
 /*
@@ -576,10 +551,13 @@ ENTRY(ret_from_fork)
 	testl $3, CS(%rsp)			# from kernel_thread?
 	jz   1f
 
+	/*
+	 * TODO: can we instead check CS(%rsp) == __USER32_CS below?
+	 * We won't need GET_THREAD_INFO(%rcx) then...
+	 */
 	testl $_TIF_IA32, TI_flags(%rcx)	# 32-bit compat task needs IRET
 	jnz  int_ret_from_sys_call
 
-	RESTORE_TOP_OF_STACK %rdi
 	jmp ret_from_sys_call			# go to the SYSRET fastpath
 
 1:
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 2105487e..5a7ba74 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -161,7 +161,6 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 	p->thread.sp0 = (unsigned long)task_stack_page(p) + THREAD_SIZE;
 	childregs = task_pt_regs(p);
 	p->thread.sp = (unsigned long) childregs;
-	p->thread.usersp = me->thread.usersp;
 	set_tsk_thread_flag(p, TIF_FORK);
 	p->thread.fpu_counter = 0;
 	p->thread.io_bitmap_ptr = NULL;
@@ -238,10 +237,8 @@ start_thread_common(struct pt_regs *regs, unsigned long new_ip,
 	loadsegment(es, _ds);
 	loadsegment(ds, _ds);
 	load_gs_index(0);
-	current->thread.usersp	= new_sp;
 	regs->ip		= new_ip;
 	regs->sp		= new_sp;
-	this_cpu_write(old_rsp, new_sp);
 	regs->cs		= _cs;
 	regs->ss		= _ss;
 	regs->flags		= X86_EFLAGS_IF;
@@ -359,8 +356,6 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	/*
 	 * Switch the PDA and FPU contexts.
 	 */
-	prev->usersp = this_cpu_read(old_rsp);
-	this_cpu_write(old_rsp, next->usersp);
 	this_cpu_write(current_task, next_p);
 
 	/*
@@ -559,6 +554,5 @@ long sys_arch_prctl(int code, unsigned long addr)
 
 unsigned long KSTK_ESP(struct task_struct *task)
 {
-	return (test_tsk_thread_flag(task, TIF_IA32)) ?
-			(task_pt_regs(task)->sp) : ((task)->thread.usersp);
+	return task_pt_regs(task)->sp;
 }
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index ec255a1..8b745ae 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -178,7 +178,7 @@
 169	common	reboot			sys_reboot
 170	common	sethostname		sys_sethostname
 171	common	setdomainname		sys_setdomainname
-172	common	iopl			stub_iopl
+172	common	iopl			sys_iopl
 173	common	ioperm			sys_ioperm
 174	64	create_module
 175	common	init_module		sys_init_module
diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c
index f2f0723..5c81ed2 100644
--- a/arch/x86/um/sys_call_table_64.c
+++ b/arch/x86/um/sys_call_table_64.c
@@ -16,7 +16,7 @@
  */
 
 /* Not going to be implemented by UML, since we have no hardware. */
-#define stub_iopl sys_ni_syscall
+#define sys_iopl sys_ni_syscall
 #define sys_ioperm sys_ni_syscall
 
 /*
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 16/17] x86: entry_64.S: trivial optimization for ENOSYS
  2014-08-08 17:44 ` [PATCH 16/17] x86: entry_64.S: trivial optimization for ENOSYS Denys Vlasenko
@ 2014-08-08 22:48   ` Andy Lutomirski
  0 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2014-08-08 22:48 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: linux-kernel, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Frederic Weisbecker, X86 ML, Alexei Starovoitov, Will Drewry,
	Kees Cook

On Sat, Aug 9, 2014 at 2:44 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> This code uses a slightly shorter insn.

See this for an alternative approach:

http://thread.gmane.org/gmane.linux.kernel.cross-arch/24167

--Andy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 17/17] x86: simplify iret stack handling on SYSCALL64 fastpath
  2014-08-08 17:44 ` [PATCH 17/17] x86: simplify iret stack handling on SYSCALL64 fastpath Denys Vlasenko
@ 2014-08-08 22:59   ` Andy Lutomirski
  2014-08-10 15:00     ` Denys Vlasenko
  2014-08-10 18:47   ` [PATCH 17/17 v2] " Denys Vlasenko
  1 sibling, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2014-08-08 22:59 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: linux-kernel, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Frederic Weisbecker, X86 ML, Alexei Starovoitov, Will Drewry,
	Kees Cook

On Sat, Aug 9, 2014 at 2:44 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> Before this patch, rcx and r11 were saved in pt_regs->rcx
> and pt_regs->r11. Which looks natural, but requires messy
> shuffling to/from iret stack whenever ptrace or e.g. iopl
> wants to modify return address or flags - because that's
> how these registers are used by SYSCALL/SYSRET.
>
> This patch saves rcx and r11 in pt_regs->rip and pt_regs->flags,
> and uses these values for SYSRET64 insn. Shuffling is eliminated.
>
> On slow path, the values are saved in both locations - thus, ptrace
> can modify rcx, and on syscall exit rcx will be different from
> return address... don't see why that can be useful but anyway,
> it works.
>
> The lazy store of pt_regs->cs and pt_regs->ss is retained -
> tests have shown that these insns do take ~2 cycles on fast path.
>
> FIXUP_TOP_OF_STACK and RESTORE_TOP_OF_STACK macros are replaced
> by a simpler single macro, STORE_IRET_FRAME_CS_SS.
>
> stub_iopl is no longer needed: pt_regs->flags needs no fixing up.
>
> PER_CPU(old_rsp) usage is simplified - now it is used only
> as temp storage, and userspace stack pointer is immediately stored
> in pt_regs->sp on syscall entry, instead of being used later,
> on syscall exit. This allows to get rid of thread_struct::usersp.
>
> Testing shows that syscall fast path is ~54.3 ns before
> and after the patch (on 2.7 GHz Sandy Bridge CPU).
>
> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
> CC: Linus Torvalds <torvalds@linux-foundation.org>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: "H. Peter Anvin" <hpa@zytor.com>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Frederic Weisbecker <fweisbec@gmail.com>
> CC: X86 ML <x86@kernel.org>
> CC: Alexei Starovoitov <ast@plumgrid.com>
> CC: Will Drewry <wad@chromium.org>
> CC: Kees Cook <keescook@chromium.org>
> CC: linux-kernel@vger.kernel.org
> ---
>  arch/x86/include/asm/calling.h   |  18 +++--
>  arch/x86/include/asm/compat.h    |   2 +-
>  arch/x86/include/asm/processor.h |   1 -
>  arch/x86/include/asm/ptrace.h    |   8 +--
>  arch/x86/kernel/entry_64.S       | 140 +++++++++++++++++----------------------
>  arch/x86/kernel/process_64.c     |   8 +--
>  arch/x86/syscalls/syscall_64.tbl |   2 +-
>  arch/x86/um/sys_call_table_64.c  |   2 +-
>  8 files changed, 78 insertions(+), 103 deletions(-)
>
> diff --git a/arch/x86/include/asm/calling.h b/arch/x86/include/asm/calling.h
> index aa9113e..7afbcea 100644
> --- a/arch/x86/include/asm/calling.h
> +++ b/arch/x86/include/asm/calling.h
> @@ -95,7 +95,7 @@ For 32-bit we have the following conventions - kernel is built with
>         CFI_ADJUST_CFA_OFFSET 15*8+\addskip
>         .endm
>
> -       .macro SAVE_C_REGS_HELPER offset=0 rcx=1 r8plus=1
> +       .macro SAVE_C_REGS_HELPER offset=0 rcx=1 r8910=1 r11=1
>         movq_cfi rdi, 14*8+\offset
>         movq_cfi rsi, 13*8+\offset
>         movq_cfi rdx, 12*8+\offset
> @@ -103,21 +103,26 @@ For 32-bit we have the following conventions - kernel is built with
>         movq_cfi rcx, 11*8+\offset
>         .endif
>         movq_cfi rax, 10*8+\offset
> -       .if \r8plus
> +       .if \r8910
>         movq_cfi r8,  9*8+\offset
>         movq_cfi r9,  8*8+\offset
>         movq_cfi r10, 7*8+\offset
> +       .endif
> +       .if \r11
>         movq_cfi r11, 6*8+\offset
>         .endif
>         .endm
>         .macro SAVE_C_REGS offset=0
> -       SAVE_C_REGS_HELPER \offset, 1, 1
> +       SAVE_C_REGS_HELPER \offset, 1, 1, 1
>         .endm
>         .macro SAVE_C_REGS_EXCEPT_R891011
> -       SAVE_C_REGS_HELPER 0, 1, 0
> +       SAVE_C_REGS_HELPER 0, 1, 0, 0
>         .endm
>         .macro SAVE_C_REGS_EXCEPT_RCX_R891011
> -       SAVE_C_REGS_HELPER 0, 0, 0
> +       SAVE_C_REGS_HELPER 0, 0, 0, 0
> +       .endm
> +       .macro SAVE_C_REGS_EXCEPT_RCX_R11
> +       SAVE_C_REGS_HELPER 0, 0, 1, 0
>         .endm
>
>         .macro SAVE_EXTRA_REGS offset=0
> @@ -171,6 +176,9 @@ For 32-bit we have the following conventions - kernel is built with
>         .macro RESTORE_C_REGS_EXCEPT_RCX
>         RESTORE_C_REGS_HELPER 1,0,1,1,1
>         .endm
> +       .macro RESTORE_C_REGS_EXCEPT_RCX_R11
> +       RESTORE_C_REGS_HELPER 1,0,0,1,1
> +       .endm
>         .macro RESTORE_RSI_RDI
>         RESTORE_C_REGS_HELPER 0,0,0,0,0
>         .endm
> diff --git a/arch/x86/include/asm/compat.h b/arch/x86/include/asm/compat.h
> index 59c6c40..acdee09 100644
> --- a/arch/x86/include/asm/compat.h
> +++ b/arch/x86/include/asm/compat.h
> @@ -301,7 +301,7 @@ static inline void __user *arch_compat_alloc_user_space(long len)
>                 sp = task_pt_regs(current)->sp;
>         } else {
>                 /* -128 for the x32 ABI redzone */
> -               sp = this_cpu_read(old_rsp) - 128;
> +               sp = task_pt_regs(current)->sp - 128;
>         }
>
>         return (void __user *)round_down(sp - len, 16);
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index a4ea023..33c86c6 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -474,7 +474,6 @@ struct thread_struct {
>  #ifdef CONFIG_X86_32
>         unsigned long           sysenter_cs;
>  #else
> -       unsigned long           usersp; /* Copy from PDA */
>         unsigned short          es;
>         unsigned short          ds;
>         unsigned short          fsindex;
> diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
> index c822b35..271c779 100644
> --- a/arch/x86/include/asm/ptrace.h
> +++ b/arch/x86/include/asm/ptrace.h
> @@ -140,12 +140,8 @@ static inline bool user_64bit_mode(struct pt_regs *regs)
>  #endif
>  }
>
> -#define current_user_stack_pointer()   this_cpu_read(old_rsp)
> -/* ia32 vs. x32 difference */
> -#define compat_user_stack_pointer()    \
> -       (test_thread_flag(TIF_IA32)     \
> -        ? current_pt_regs()->sp        \
> -        : this_cpu_read(old_rsp))
> +#define current_user_stack_pointer()   current_pt_regs()->sp
> +#define compat_user_stack_pointer()    current_pt_regs()->sp
>  #endif
>
>  #ifdef CONFIG_X86_32
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 5d639a6..efe9780 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -24,8 +24,6 @@
>   * - CFI macros are used to generate dwarf2 unwind information for better
>   * backtraces. They don't change any code.
>   * - ENTRY/END Define functions in the symbol table.
> - * - FIXUP_TOP_OF_STACK/RESTORE_TOP_OF_STACK - Fix up the hardware stack
> - * frame that is otherwise undefined after a SYSCALL
>   * - TRACE_IRQ_* - Trace hard interrupt state for lock debugging.
>   * - idtentry - Define exception entry points.
>   */
> @@ -121,29 +119,13 @@ ENDPROC(native_usergs_sysret64)
>  #endif
>
>  /*
> - * C code is not supposed to know about undefined top of stack. Every time
> - * a C function with an pt_regs argument is called from the SYSCALL based
> - * fast path FIXUP_TOP_OF_STACK is needed.
> - * RESTORE_TOP_OF_STACK syncs the syscall state after any possible ptregs
> - * manipulation.
> + * struct pt_regs is not fully populated on the fast path
> + * (rcx, r11, cs and ss are not filled in).
> + * This macro populates segment registers in iret frame.
>   */
> -
> -       /* %rsp:at FRAMEEND */
> -       .macro FIXUP_TOP_OF_STACK tmp offset=0
> -       movq PER_CPU_VAR(old_rsp),\tmp
> -       movq \tmp,RSP+\offset(%rsp)
> -       movq $__USER_DS,SS+\offset(%rsp)
> -       movq $__USER_CS,CS+\offset(%rsp)
> -       movq $-1,RCX+\offset(%rsp)
> -       movq R11+\offset(%rsp),\tmp  /* get eflags */
> -       movq \tmp,EFLAGS+\offset(%rsp)
> -       .endm
> -
> -       .macro RESTORE_TOP_OF_STACK tmp offset=0
> -       movq RSP+\offset(%rsp),\tmp
> -       movq \tmp,PER_CPU_VAR(old_rsp)
> -       movq EFLAGS+\offset(%rsp),\tmp
> -       movq \tmp,R11+\offset(%rsp)
> +       .macro STORE_IRET_FRAME_CS_SS offset=0
> +       movq    $__USER_CS,CS+\offset(%rsp)
> +       movq    $__USER_DS,SS+\offset(%rsp)
>         .endm
>
>  /*
> @@ -226,12 +208,16 @@ ENDPROC(native_usergs_sysret64)
>   * Interrupts are off on entry.
>   * Only called from user space.
>   *
> - * XXX if we had a free scratch register we could save the RSP into the stack frame
> - *      and report it properly in ps. Unfortunately we haven't.
> - *
>   * When user can change the frames always force IRET. That is because
>   * it deals with uncanonical addresses better. SYSRET has trouble
>   * with them due to bugs in both AMD and Intel CPUs.
> + *
> + * When returning through fast path, userspace sees rcx = return address,
> + * r11 = rflags. When returning through iret (e.g. if audit is active),
> + * these registers may contain garbage.
> + * For ptrace we manage to avoid that: when we hit slow path on entry,
> + * we do save rcx and r11 in pt_regs, so ptrace on exit also sees them.
> + * If slow path is entered only on exit, there will be garbage.

I don't like this.  At least the current code puts something
deterministic in there (-1) for the slow path, even though it's wrong
and makes the slow path behave visibly differently from the fast path.

Leaking uninitialized data here is extra bad, though.  Keep in mind
that, when a syscall entry is interrupted before fully setting up
pt_regs, the interrupt frame overlaps task_pt_regs, so it's possible,
depending on the stack slot ordering, for a kernel secret
(kernel_stack?) to end up somewhere in task_pt_regs.

--Andy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 0/17] x86: entry.S optimizations
  2014-08-08 17:44 [PATCH v4 0/17] x86: entry.S optimizations Denys Vlasenko
                   ` (16 preceding siblings ...)
  2014-08-08 17:44 ` [PATCH 17/17] x86: simplify iret stack handling on SYSCALL64 fastpath Denys Vlasenko
@ 2014-08-09  0:27 ` H. Peter Anvin
  17 siblings, 0 replies; 28+ messages in thread
From: H. Peter Anvin @ 2014-08-09  0:27 UTC (permalink / raw)
  To: Denys Vlasenko, linux-kernel
  Cc: Linus Torvalds, Oleg Nesterov, Andy Lutomirski,
	Frederic Weisbecker, X86 ML, Alexei Starovoitov, Will Drewry,
	Kees Cook

On 08/08/2014 10:44 AM, Denys Vlasenko wrote:
> Version 4 of the patchset.
> 
> Please consider applying at least two first patches, they are definitely safe,
> and the second one fixes a latent bug.
> 

Please note that the 3.17 merge window is currently open.  This is not
3.17 material at this point, so I'm not going to apply anything until
after Kernel Summit.

In other words, a few weeks to work out the kinks.

	-hpa



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 17/17] x86: simplify iret stack handling on SYSCALL64 fastpath
  2014-08-08 22:59   ` Andy Lutomirski
@ 2014-08-10 15:00     ` Denys Vlasenko
  2014-08-10 22:42       ` Andy Lutomirski
  0 siblings, 1 reply; 28+ messages in thread
From: Denys Vlasenko @ 2014-08-10 15:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Frederic Weisbecker, X86 ML, Alexei Starovoitov, Will Drewry,
	Kees Cook

On 08/09/2014 12:59 AM, Andy Lutomirski wrote:
> On Sat, Aug 9, 2014 at 2:44 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>> Before this patch, rcx and r11 were saved in pt_regs->rcx
>> and pt_regs->r11. Which looks natural, but requires messy
>> shuffling to/from iret stack whenever ptrace or e.g. iopl
>> wants to modify return address or flags - because that's
>> how these registers are used by SYSCALL/SYSRET.
>>
>> This patch saves rcx and r11 in pt_regs->rip and pt_regs->flags,
>> and uses these values for SYSRET64 insn. Shuffling is eliminated.
>>
>> On slow path, the values are saved in both locations - thus, ptrace
>> can modify rcx, and on syscall exit rcx will be different from
>> return address... don't see why that can be useful but anyway,
>> it works.
>>
>> The lazy store of pt_regs->cs and pt_regs->ss is retained -
>> tests have shown that these insns do take ~2 cycles on fast path.
>>
>> FIXUP_TOP_OF_STACK and RESTORE_TOP_OF_STACK macros are replaced
>> by a simpler single macro, STORE_IRET_FRAME_CS_SS.
>>
>> stub_iopl is no longer needed: pt_regs->flags needs no fixing up.
>>
>> PER_CPU(old_rsp) usage is simplified - now it is used only
>> as temp storage, and userspace stack pointer is immediately stored
>> in pt_regs->sp on syscall entry, instead of being used later,
>> on syscall exit. This allows to get rid of thread_struct::usersp.
>>
>> Testing shows that syscall fast path is ~54.3 ns before
>> and after the patch (on 2.7 GHz Sandy Bridge CPU).
>>
>> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
>> CC: Linus Torvalds <torvalds@linux-foundation.org>
>> CC: Oleg Nesterov <oleg@redhat.com>
>> CC: "H. Peter Anvin" <hpa@zytor.com>
>> CC: Andy Lutomirski <luto@amacapital.net>
>> CC: Frederic Weisbecker <fweisbec@gmail.com>
>> CC: X86 ML <x86@kernel.org>
>> CC: Alexei Starovoitov <ast@plumgrid.com>
>> CC: Will Drewry <wad@chromium.org>
>> CC: Kees Cook <keescook@chromium.org>
>> CC: linux-kernel@vger.kernel.org
>> ---
>>  arch/x86/include/asm/calling.h   |  18 +++--
>>  arch/x86/include/asm/compat.h    |   2 +-
>>  arch/x86/include/asm/processor.h |   1 -
>>  arch/x86/include/asm/ptrace.h    |   8 +--
>>  arch/x86/kernel/entry_64.S       | 140 +++++++++++++++++----------------------
>>  arch/x86/kernel/process_64.c     |   8 +--
>>  arch/x86/syscalls/syscall_64.tbl |   2 +-
>>  arch/x86/um/sys_call_table_64.c  |   2 +-
>>  8 files changed, 78 insertions(+), 103 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/calling.h b/arch/x86/include/asm/calling.h
>> index aa9113e..7afbcea 100644
>> --- a/arch/x86/include/asm/calling.h
>> +++ b/arch/x86/include/asm/calling.h
>> @@ -95,7 +95,7 @@ For 32-bit we have the following conventions - kernel is built with
>>         CFI_ADJUST_CFA_OFFSET 15*8+\addskip
>>         .endm
>>
>> -       .macro SAVE_C_REGS_HELPER offset=0 rcx=1 r8plus=1
>> +       .macro SAVE_C_REGS_HELPER offset=0 rcx=1 r8910=1 r11=1
>>         movq_cfi rdi, 14*8+\offset
>>         movq_cfi rsi, 13*8+\offset
>>         movq_cfi rdx, 12*8+\offset
>> @@ -103,21 +103,26 @@ For 32-bit we have the following conventions - kernel is built with
>>         movq_cfi rcx, 11*8+\offset
>>         .endif
>>         movq_cfi rax, 10*8+\offset
>> -       .if \r8plus
>> +       .if \r8910
>>         movq_cfi r8,  9*8+\offset
>>         movq_cfi r9,  8*8+\offset
>>         movq_cfi r10, 7*8+\offset
>> +       .endif
>> +       .if \r11
>>         movq_cfi r11, 6*8+\offset
>>         .endif
>>         .endm
>>         .macro SAVE_C_REGS offset=0
>> -       SAVE_C_REGS_HELPER \offset, 1, 1
>> +       SAVE_C_REGS_HELPER \offset, 1, 1, 1
>>         .endm
>>         .macro SAVE_C_REGS_EXCEPT_R891011
>> -       SAVE_C_REGS_HELPER 0, 1, 0
>> +       SAVE_C_REGS_HELPER 0, 1, 0, 0
>>         .endm
>>         .macro SAVE_C_REGS_EXCEPT_RCX_R891011
>> -       SAVE_C_REGS_HELPER 0, 0, 0
>> +       SAVE_C_REGS_HELPER 0, 0, 0, 0
>> +       .endm
>> +       .macro SAVE_C_REGS_EXCEPT_RCX_R11
>> +       SAVE_C_REGS_HELPER 0, 0, 1, 0
>>         .endm
>>
>>         .macro SAVE_EXTRA_REGS offset=0
>> @@ -171,6 +176,9 @@ For 32-bit we have the following conventions - kernel is built with
>>         .macro RESTORE_C_REGS_EXCEPT_RCX
>>         RESTORE_C_REGS_HELPER 1,0,1,1,1
>>         .endm
>> +       .macro RESTORE_C_REGS_EXCEPT_RCX_R11
>> +       RESTORE_C_REGS_HELPER 1,0,0,1,1
>> +       .endm
>>         .macro RESTORE_RSI_RDI
>>         RESTORE_C_REGS_HELPER 0,0,0,0,0
>>         .endm
>> diff --git a/arch/x86/include/asm/compat.h b/arch/x86/include/asm/compat.h
>> index 59c6c40..acdee09 100644
>> --- a/arch/x86/include/asm/compat.h
>> +++ b/arch/x86/include/asm/compat.h
>> @@ -301,7 +301,7 @@ static inline void __user *arch_compat_alloc_user_space(long len)
>>                 sp = task_pt_regs(current)->sp;
>>         } else {
>>                 /* -128 for the x32 ABI redzone */
>> -               sp = this_cpu_read(old_rsp) - 128;
>> +               sp = task_pt_regs(current)->sp - 128;
>>         }
>>
>>         return (void __user *)round_down(sp - len, 16);
>> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
>> index a4ea023..33c86c6 100644
>> --- a/arch/x86/include/asm/processor.h
>> +++ b/arch/x86/include/asm/processor.h
>> @@ -474,7 +474,6 @@ struct thread_struct {
>>  #ifdef CONFIG_X86_32
>>         unsigned long           sysenter_cs;
>>  #else
>> -       unsigned long           usersp; /* Copy from PDA */
>>         unsigned short          es;
>>         unsigned short          ds;
>>         unsigned short          fsindex;
>> diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
>> index c822b35..271c779 100644
>> --- a/arch/x86/include/asm/ptrace.h
>> +++ b/arch/x86/include/asm/ptrace.h
>> @@ -140,12 +140,8 @@ static inline bool user_64bit_mode(struct pt_regs *regs)
>>  #endif
>>  }
>>
>> -#define current_user_stack_pointer()   this_cpu_read(old_rsp)
>> -/* ia32 vs. x32 difference */
>> -#define compat_user_stack_pointer()    \
>> -       (test_thread_flag(TIF_IA32)     \
>> -        ? current_pt_regs()->sp        \
>> -        : this_cpu_read(old_rsp))
>> +#define current_user_stack_pointer()   current_pt_regs()->sp
>> +#define compat_user_stack_pointer()    current_pt_regs()->sp
>>  #endif
>>
>>  #ifdef CONFIG_X86_32
>> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> index 5d639a6..efe9780 100644
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -24,8 +24,6 @@
>>   * - CFI macros are used to generate dwarf2 unwind information for better
>>   * backtraces. They don't change any code.
>>   * - ENTRY/END Define functions in the symbol table.
>> - * - FIXUP_TOP_OF_STACK/RESTORE_TOP_OF_STACK - Fix up the hardware stack
>> - * frame that is otherwise undefined after a SYSCALL
>>   * - TRACE_IRQ_* - Trace hard interrupt state for lock debugging.
>>   * - idtentry - Define exception entry points.
>>   */
>> @@ -121,29 +119,13 @@ ENDPROC(native_usergs_sysret64)
>>  #endif
>>
>>  /*
>> - * C code is not supposed to know about undefined top of stack. Every time
>> - * a C function with an pt_regs argument is called from the SYSCALL based
>> - * fast path FIXUP_TOP_OF_STACK is needed.
>> - * RESTORE_TOP_OF_STACK syncs the syscall state after any possible ptregs
>> - * manipulation.
>> + * struct pt_regs is not fully populated on the fast path
>> + * (rcx, r11, cs and ss are not filled in).
>> + * This macro populates segment registers in iret frame.
>>   */
>> -
>> -       /* %rsp:at FRAMEEND */
>> -       .macro FIXUP_TOP_OF_STACK tmp offset=0
>> -       movq PER_CPU_VAR(old_rsp),\tmp
>> -       movq \tmp,RSP+\offset(%rsp)
>> -       movq $__USER_DS,SS+\offset(%rsp)
>> -       movq $__USER_CS,CS+\offset(%rsp)
>> -       movq $-1,RCX+\offset(%rsp)
>> -       movq R11+\offset(%rsp),\tmp  /* get eflags */
>> -       movq \tmp,EFLAGS+\offset(%rsp)
>> -       .endm
>> -
>> -       .macro RESTORE_TOP_OF_STACK tmp offset=0
>> -       movq RSP+\offset(%rsp),\tmp
>> -       movq \tmp,PER_CPU_VAR(old_rsp)
>> -       movq EFLAGS+\offset(%rsp),\tmp
>> -       movq \tmp,R11+\offset(%rsp)
>> +       .macro STORE_IRET_FRAME_CS_SS offset=0
>> +       movq    $__USER_CS,CS+\offset(%rsp)
>> +       movq    $__USER_DS,SS+\offset(%rsp)
>>         .endm
>>
>>  /*
>> @@ -226,12 +208,16 @@ ENDPROC(native_usergs_sysret64)
>>   * Interrupts are off on entry.
>>   * Only called from user space.
>>   *
>> - * XXX if we had a free scratch register we could save the RSP into the stack frame
>> - *      and report it properly in ps. Unfortunately we haven't.
>> - *
>>   * When user can change the frames always force IRET. That is because
>>   * it deals with uncanonical addresses better. SYSRET has trouble
>>   * with them due to bugs in both AMD and Intel CPUs.
>> + *
>> + * When returning through fast path, userspace sees rcx = return address,
>> + * r11 = rflags. When returning through iret (e.g. if audit is active),
>> + * these registers may contain garbage.
>> + * For ptrace we manage to avoid that: when we hit slow path on entry,
>> + * we do save rcx and r11 in pt_regs, so ptrace on exit also sees them.
>> + * If slow path is entered only on exit, there will be garbage.
> 
> I don't like this.  At least the current code puts something
> deterministic in there (-1) for the slow path, even though it's wrong
> and makes the slow path behave visibly differently from the fast path.
> 
> Leaking uninitialized data here is extra bad, though.  Keep in mind
> that, when a syscall entry is interrupted before fully setting up
> pt_regs, the interrupt frame overlaps task_pt_regs, so it's possible,
> depending on the stack slot ordering, for a kernel secret
> (kernel_stack?) to end up somewhere in task_pt_regs.

It's easy to fix. We jump off the fast path to the slow path here:

        movl TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS),%edx
        andl %edi,%edx
        jnz  sysret_careful

This is the only use of the "sysret_careful" label.
Therefore, there is no need to consider any scenario other than
"we are returning from a syscall here".

My patch only fixes up segment regs:

 	/* Handle reschedules */
 	/* edx:	work, edi: workmask */
 sysret_careful:
+	STORE_IRET_FRAME_CS_SS
 	bt $TIF_NEED_RESCHED,%edx

pt_regs->rcx and ->r11 aren't initialized.

We can fix that e.g. by adding here four more insns:

        movq    RIP(%rsp),%rcx
        movq    %rcx,RCX(%rsp)
        movq    EFLAGS(%rsp),%rcx
        movq    %rcx,R11(%rsp)

similar to what my patch did on slow path on entry (see changes at
"tracesys" label).

Or we can just bite the bullet and save rcx and r11 on the fast path
(there it would require just two insns), at a cost of ~2 cycles on
most CPUs.
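
For completeness, that last alternative is just (a sketch; the two
stores would sit on the entry path, while %rcx/%r11 still hold the
user rip/rflags):

	movq	%rcx,RCX(%rsp)
	movq	%r11,R11(%rsp)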


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 17/17 v2] x86: simplify iret stack handling on SYSCALL64 fastpath
  2014-08-08 17:44 ` [PATCH 17/17] x86: simplify iret stack handling on SYSCALL64 fastpath Denys Vlasenko
  2014-08-08 22:59   ` Andy Lutomirski
@ 2014-08-10 18:47   ` Denys Vlasenko
  1 sibling, 0 replies; 28+ messages in thread
From: Denys Vlasenko @ 2014-08-10 18:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denys Vlasenko, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Andy Lutomirski, Frederic Weisbecker, X86 ML, Alexei Starovoitov,
	Will Drewry, Kees Cook

Before this patch, rcx and r11 were saved in pt_regs->rcx
and pt_regs->r11. That looks natural, but it requires messy
shuffling to/from the iret frame whenever ptrace or e.g. sys_iopl
wants to modify the return address or flags - because that is
how these registers are used by SYSCALL/SYSRET.

This patch saves rcx and r11 in pt_regs->rip and pt_regs->flags,
and uses these values for SYSRET64 insn. Shuffling is eliminated.

On the slow path, the values are saved in both locations - thus,
ptrace can modify rcx, and on syscall exit rcx will differ from the
return address. It is not clear why that would be useful, but it
works.

The lazy store of pt_regs->cs and pt_regs->ss is retained -
tests have shown that these stores do take ~2 cycles on the fast path.

FIXUP_TOP_OF_STACK and RESTORE_TOP_OF_STACK macros are replaced
by a simpler single macro, STORE_IRET_FRAME_CS_SS.

stub_iopl is no longer needed: pt_regs->flags needs no fixing up.

PER_CPU(old_rsp) usage is simplified - now it is used only
as temp storage, and the userspace stack pointer is immediately
stored in pt_regs->sp on syscall entry, instead of being used
later, on syscall exit. This allows us to get rid of
thread_struct::usersp.

Testing shows that syscall fast path is ~54.3 ns before
and after the patch (on 2.7 GHz Sandy Bridge CPU).

Changes since v1:
= added code to fill pt_regs->rcx and ->r11 on the syscall exit slow path.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Oleg Nesterov <oleg@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: X86 ML <x86@kernel.org>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/calling.h   |  18 +++--
 arch/x86/include/asm/compat.h    |   2 +-
 arch/x86/include/asm/processor.h |   1 -
 arch/x86/include/asm/ptrace.h    |   8 +--
 arch/x86/kernel/entry_64.S       | 139 ++++++++++++++++-----------------------
 arch/x86/kernel/process_64.c     |   8 +--
 arch/x86/syscalls/syscall_64.tbl |   2 +-
 arch/x86/um/sys_call_table_64.c  |   2 +-
 8 files changed, 76 insertions(+), 104 deletions(-)

diff --git a/arch/x86/include/asm/calling.h b/arch/x86/include/asm/calling.h
index aa9113e..7afbcea 100644
--- a/arch/x86/include/asm/calling.h
+++ b/arch/x86/include/asm/calling.h
@@ -95,7 +95,7 @@ For 32-bit we have the following conventions - kernel is built with
 	CFI_ADJUST_CFA_OFFSET 15*8+\addskip
 	.endm
 
-	.macro SAVE_C_REGS_HELPER offset=0 rcx=1 r8plus=1
+	.macro SAVE_C_REGS_HELPER offset=0 rcx=1 r8910=1 r11=1
 	movq_cfi rdi, 14*8+\offset
 	movq_cfi rsi, 13*8+\offset
 	movq_cfi rdx, 12*8+\offset
@@ -103,21 +103,26 @@ For 32-bit we have the following conventions - kernel is built with
 	movq_cfi rcx, 11*8+\offset
 	.endif
 	movq_cfi rax, 10*8+\offset
-	.if \r8plus
+	.if \r8910
 	movq_cfi r8,  9*8+\offset
 	movq_cfi r9,  8*8+\offset
 	movq_cfi r10, 7*8+\offset
+	.endif
+	.if \r11
 	movq_cfi r11, 6*8+\offset
 	.endif
 	.endm
 	.macro SAVE_C_REGS offset=0
-	SAVE_C_REGS_HELPER \offset, 1, 1
+	SAVE_C_REGS_HELPER \offset, 1, 1, 1
 	.endm
 	.macro SAVE_C_REGS_EXCEPT_R891011
-	SAVE_C_REGS_HELPER 0, 1, 0
+	SAVE_C_REGS_HELPER 0, 1, 0, 0
 	.endm
 	.macro SAVE_C_REGS_EXCEPT_RCX_R891011
-	SAVE_C_REGS_HELPER 0, 0, 0
+	SAVE_C_REGS_HELPER 0, 0, 0, 0
+	.endm
+	.macro SAVE_C_REGS_EXCEPT_RCX_R11
+	SAVE_C_REGS_HELPER 0, 0, 1, 0
 	.endm
 
 	.macro SAVE_EXTRA_REGS offset=0
@@ -171,6 +176,9 @@ For 32-bit we have the following conventions - kernel is built with
 	.macro RESTORE_C_REGS_EXCEPT_RCX
 	RESTORE_C_REGS_HELPER 1,0,1,1,1
 	.endm
+	.macro RESTORE_C_REGS_EXCEPT_RCX_R11
+	RESTORE_C_REGS_HELPER 1,0,0,1,1
+	.endm
 	.macro RESTORE_RSI_RDI
 	RESTORE_C_REGS_HELPER 0,0,0,0,0
 	.endm
diff --git a/arch/x86/include/asm/compat.h b/arch/x86/include/asm/compat.h
index 59c6c40..acdee09 100644
--- a/arch/x86/include/asm/compat.h
+++ b/arch/x86/include/asm/compat.h
@@ -301,7 +301,7 @@ static inline void __user *arch_compat_alloc_user_space(long len)
 		sp = task_pt_regs(current)->sp;
 	} else {
 		/* -128 for the x32 ABI redzone */
-		sp = this_cpu_read(old_rsp) - 128;
+		sp = task_pt_regs(current)->sp - 128;
 	}
 
 	return (void __user *)round_down(sp - len, 16);
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index a4ea023..33c86c6 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -474,7 +474,6 @@ struct thread_struct {
 #ifdef CONFIG_X86_32
 	unsigned long		sysenter_cs;
 #else
-	unsigned long		usersp;	/* Copy from PDA */
 	unsigned short		es;
 	unsigned short		ds;
 	unsigned short		fsindex;
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index c822b35..271c779 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -140,12 +140,8 @@ static inline bool user_64bit_mode(struct pt_regs *regs)
 #endif
 }
 
-#define current_user_stack_pointer()	this_cpu_read(old_rsp)
-/* ia32 vs. x32 difference */
-#define compat_user_stack_pointer()	\
-	(test_thread_flag(TIF_IA32) 	\
-	 ? current_pt_regs()->sp 	\
-	 : this_cpu_read(old_rsp))
+#define current_user_stack_pointer()	current_pt_regs()->sp
+#define compat_user_stack_pointer()	current_pt_regs()->sp
 #endif
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 5d639a6..45e0797 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -24,8 +24,6 @@
  * - CFI macros are used to generate dwarf2 unwind information for better
  * backtraces. They don't change any code.
  * - ENTRY/END Define functions in the symbol table.
- * - FIXUP_TOP_OF_STACK/RESTORE_TOP_OF_STACK - Fix up the hardware stack
- * frame that is otherwise undefined after a SYSCALL
  * - TRACE_IRQ_* - Trace hard interrupt state for lock debugging.
  * - idtentry - Define exception entry points.
  */
@@ -121,29 +119,13 @@ ENDPROC(native_usergs_sysret64)
 #endif
 
 /*
- * C code is not supposed to know about undefined top of stack. Every time
- * a C function with an pt_regs argument is called from the SYSCALL based
- * fast path FIXUP_TOP_OF_STACK is needed.
- * RESTORE_TOP_OF_STACK syncs the syscall state after any possible ptregs
- * manipulation.
+ * struct pt_regs is not fully populated on the fast path
+ * (rcx, r11, cs and ss are not filled in).
+ * This macro populates segment registers in iret frame.
  */
-
-	/* %rsp:at FRAMEEND */
-	.macro FIXUP_TOP_OF_STACK tmp offset=0
-	movq PER_CPU_VAR(old_rsp),\tmp
-	movq \tmp,RSP+\offset(%rsp)
-	movq $__USER_DS,SS+\offset(%rsp)
-	movq $__USER_CS,CS+\offset(%rsp)
-	movq $-1,RCX+\offset(%rsp)
-	movq R11+\offset(%rsp),\tmp  /* get eflags */
-	movq \tmp,EFLAGS+\offset(%rsp)
-	.endm
-
-	.macro RESTORE_TOP_OF_STACK tmp offset=0
-	movq RSP+\offset(%rsp),\tmp
-	movq \tmp,PER_CPU_VAR(old_rsp)
-	movq EFLAGS+\offset(%rsp),\tmp
-	movq \tmp,R11+\offset(%rsp)
+	.macro STORE_IRET_FRAME_CS_SS offset=0
+	movq	$__USER_CS,CS+\offset(%rsp)
+	movq	$__USER_DS,SS+\offset(%rsp)
 	.endm
 
 /*
@@ -226,9 +208,6 @@ ENDPROC(native_usergs_sysret64)
  * Interrupts are off on entry.
  * Only called from user space.
  *
- * XXX	if we had a free scratch register we could save the RSP into the stack frame
- *      and report it properly in ps. Unfortunately we haven't.
- *
  * When user can change the frames always force IRET. That is because
  * it deals with uncanonical addresses better. SYSRET has trouble
  * with them due to bugs in both AMD and Intel CPUs.
@@ -255,11 +234,18 @@ GLOBAL(system_call_after_swapgs)
 	 * and short:
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
-	ALLOC_PTREGS_ON_STACK 6*8	/* space for orig_ax and iret frame */
-	SAVE_C_REGS
-	movq  %rax,ORIG_RAX(%rsp)
-	movq  %rcx,RIP(%rsp)
+	ALLOC_PTREGS_ON_STACK 6*8 /* 6*8: space for orig_ax and iret frame */
+	SAVE_C_REGS_EXCEPT_RCX_R11
+	/* Fill orig_ax and, partially, iret frame (rip,flags,rsp) */
+	movq	%rax,ORIG_RAX(%rsp)
+	movq	%rcx,RIP(%rsp)
 	CFI_REL_OFFSET rip,RIP
+	movq	PER_CPU_VAR(old_rsp),%rcx
+	/*movq	$__USER_CS,CS(%rsp) - moved out of fast path */
+	movq	%r11,EFLAGS(%rsp)
+	movq	%rcx,RSP(%rsp)
+	/*movq	$__USER_DS,SS(%rsp) - moved out of fast path */
+
 	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	jnz tracesys
 system_call_fastpath:
@@ -276,7 +262,7 @@ from_badsys:
 	movq %rax,RAX(%rsp)
 /*
  * Syscall return path ending with SYSRET (fast path)
- * Has incomplete stack frame and undefined top of stack.
+ * Has incompletely filled pt_regs and iret frame's ss and cs.
  */
 ret_from_sys_call:
 	movl $_TIF_ALLWORK_MASK,%edi
@@ -293,11 +279,12 @@ sysret_check:
 	 * sysretq will re-enable interrupts:
 	 */
 	TRACE_IRQS_ON
-	RESTORE_C_REGS_EXCEPT_RCX
-	movq RIP(%rsp),%rcx
+	RESTORE_C_REGS_EXCEPT_RCX_R11
+	movq	RIP(%rsp),%rcx
 	CFI_REGISTER	rip,rcx
+	movq	EFLAGS(%rsp),%r11
 	/*CFI_REGISTER	rflags,r11*/
-	movq	PER_CPU_VAR(old_rsp), %rsp
+	movq	RSP(%rsp),%rsp
 	/*
 	 * 64bit SYSRET restores rip from rcx,
 	 * rflags from r11 (but RF and VM bits are forced to 0),
@@ -309,6 +296,11 @@ sysret_check:
 	/* Handle reschedules */
 	/* edx:	work, edi: workmask */
 sysret_careful:
+	movq	RIP(%rsp),%rcx
+	movq	EFLAGS(%rsp),%r11
+	STORE_IRET_FRAME_CS_SS /* fill the rest of iret frame */
+	movq	%rcx,RCX(%rsp) /* fill the rest of pt_regs */
+	movq	%r11,R11(%rsp) /* ditto */
 	bt $TIF_NEED_RESCHED,%edx
 	jnc sysret_signal
 	TRACE_IRQS_ON
@@ -331,7 +323,6 @@ sysret_signal:
 	 * These all wind up with the iret return path anyway,
 	 * so just join that path right now.
 	 */
-	FIXUP_TOP_OF_STACK %r11
 	jmp int_check_syscall_exit_work
 
 badsys:
@@ -370,15 +361,18 @@ sysret_audit:
 	jmp sysret_check
 #endif	/* CONFIG_AUDITSYSCALL */
 
-	/* Do syscall tracing */
+	/* Do syscall entry tracing */
 tracesys:
+	movq	RIP(%rsp),%rcx
+	STORE_IRET_FRAME_CS_SS /* fill the rest of iret frame */
+	movq	%r11,R11(%rsp) /* fill the rest of pt_regs */
+	movq	%rcx,RCX(%rsp) /* ditto */
 #ifdef CONFIG_AUDITSYSCALL
 	testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	jz auditsys
 #endif
 	SAVE_EXTRA_REGS
 	movq $-ENOSYS,RAX(%rsp) /* ptrace can change this for a bad syscall */
-	FIXUP_TOP_OF_STACK %rdi
 	movq %rsp,%rdi
 	call syscall_trace_enter
 	/*
@@ -402,7 +396,7 @@ tracesys:
 
 /*
  * Syscall return path ending with IRET.
- * Has correct top of stack, but partial stack frame.
+ * Has correct top of stack, but possibly only partially filled pt_regs.
  */
 GLOBAL(int_ret_from_sys_call)
 	DISABLE_INTERRUPTS(CLBR_NONE)
@@ -469,43 +463,15 @@ ENTRY(stub_\func)
 	CFI_STARTPROC
 	DEFAULT_FRAME 0, 8		/* offset 8: return address */
 	SAVE_EXTRA_REGS 8
-	FIXUP_TOP_OF_STACK %r11, 8
-	call sys_\func
-	RESTORE_TOP_OF_STACK %r11, 8
-	ret
+	STORE_IRET_FRAME_CS_SS 8	/* not sure we need this. paranoia */
+	jmp sys_\func
 	CFI_ENDPROC
 END(stub_\func)
 	.endm
 
-	.macro FIXED_FRAME label,func
-ENTRY(\label)
-	CFI_STARTPROC
-	DEFAULT_FRAME 0, 8		/* offset 8: return address */
-	FIXUP_TOP_OF_STACK %r11, 8
-	call \func
-	RESTORE_TOP_OF_STACK %r11, 8
-	ret
-	CFI_ENDPROC
-END(\label)
-	.endm
-
 	FORK_LIKE  clone
 	FORK_LIKE  fork
 	FORK_LIKE  vfork
-	FIXED_FRAME stub_iopl, sys_iopl
-
-ENTRY(stub_execve)
-	CFI_STARTPROC
-	addq $8, %rsp
-	DEFAULT_FRAME 0
-	SAVE_EXTRA_REGS
-	FIXUP_TOP_OF_STACK %r11
-	call sys_execve
-	movq %rax,RAX(%rsp)
-	RESTORE_EXTRA_REGS
-	jmp int_ret_from_sys_call
-	CFI_ENDPROC
-END(stub_execve)
 
 /*
  * sigreturn is special because it needs to restore all registers on return.
@@ -516,25 +482,35 @@ ENTRY(stub_rt_sigreturn)
 	addq $8, %rsp
 	DEFAULT_FRAME 0
 	SAVE_EXTRA_REGS
-	FIXUP_TOP_OF_STACK %r11
+	STORE_IRET_FRAME_CS_SS
 	call sys_rt_sigreturn
-	movq %rax,RAX(%rsp) # fixme, this could be done at the higher layer
+stub_return:
+	movq %rax,RAX(%rsp)
 	RESTORE_EXTRA_REGS
 	jmp int_ret_from_sys_call
 	CFI_ENDPROC
 END(stub_rt_sigreturn)
 
+ENTRY(stub_execve)
+	CFI_STARTPROC
+	addq $8, %rsp
+	DEFAULT_FRAME 0
+	SAVE_EXTRA_REGS
+	STORE_IRET_FRAME_CS_SS
+	call sys_execve
+	jmp stub_return
+	CFI_ENDPROC
+END(stub_execve)
+
 #ifdef CONFIG_X86_X32_ABI
 ENTRY(stub_x32_rt_sigreturn)
 	CFI_STARTPROC
 	addq $8, %rsp
 	DEFAULT_FRAME 0
 	SAVE_EXTRA_REGS
-	FIXUP_TOP_OF_STACK %r11
+	STORE_IRET_FRAME_CS_SS
 	call sys32_x32_rt_sigreturn
-	movq %rax,RAX(%rsp) # fixme, this could be done at the higher layer
-	RESTORE_EXTRA_REGS
-	jmp int_ret_from_sys_call
+	jmp stub_return
 	CFI_ENDPROC
 END(stub_x32_rt_sigreturn)
 
@@ -543,15 +519,11 @@ ENTRY(stub_x32_execve)
 	addq $8, %rsp
 	DEFAULT_FRAME 0
 	SAVE_EXTRA_REGS
-	FIXUP_TOP_OF_STACK %r11
+	STORE_IRET_FRAME_CS_SS
 	call compat_sys_execve
-	RESTORE_TOP_OF_STACK %r11
-	movq %rax,RAX(%rsp)
-	RESTORE_EXTRA_REGS
-	jmp int_ret_from_sys_call
+	jmp stub_return
 	CFI_ENDPROC
 END(stub_x32_execve)
-
 #endif
 
 /*
@@ -576,10 +548,13 @@ ENTRY(ret_from_fork)
 	testl $3, CS(%rsp)			# from kernel_thread?
 	jz   1f
 
+	/*
+	 * TODO: can we instead check CS(%rsp) == __USER32_CS below?
+	 * We won't need GET_THREAD_INFO(%rcx) then...
+	 */
 	testl $_TIF_IA32, TI_flags(%rcx)	# 32-bit compat task needs IRET
 	jnz  int_ret_from_sys_call
 
-	RESTORE_TOP_OF_STACK %rdi
 	jmp ret_from_sys_call			# go to the SYSRET fastpath
 
 1:
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 2105487e..5a7ba74 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -161,7 +161,6 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
 	p->thread.sp0 = (unsigned long)task_stack_page(p) + THREAD_SIZE;
 	childregs = task_pt_regs(p);
 	p->thread.sp = (unsigned long) childregs;
-	p->thread.usersp = me->thread.usersp;
 	set_tsk_thread_flag(p, TIF_FORK);
 	p->thread.fpu_counter = 0;
 	p->thread.io_bitmap_ptr = NULL;
@@ -238,10 +237,8 @@ start_thread_common(struct pt_regs *regs, unsigned long new_ip,
 	loadsegment(es, _ds);
 	loadsegment(ds, _ds);
 	load_gs_index(0);
-	current->thread.usersp	= new_sp;
 	regs->ip		= new_ip;
 	regs->sp		= new_sp;
-	this_cpu_write(old_rsp, new_sp);
 	regs->cs		= _cs;
 	regs->ss		= _ss;
 	regs->flags		= X86_EFLAGS_IF;
@@ -359,8 +356,6 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	/*
 	 * Switch the PDA and FPU contexts.
 	 */
-	prev->usersp = this_cpu_read(old_rsp);
-	this_cpu_write(old_rsp, next->usersp);
 	this_cpu_write(current_task, next_p);
 
 	/*
@@ -559,6 +554,5 @@ long sys_arch_prctl(int code, unsigned long addr)
 
 unsigned long KSTK_ESP(struct task_struct *task)
 {
-	return (test_tsk_thread_flag(task, TIF_IA32)) ?
-			(task_pt_regs(task)->sp) : ((task)->thread.usersp);
+	return task_pt_regs(task)->sp;
 }
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index ec255a1..8b745ae 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -178,7 +178,7 @@
 169	common	reboot			sys_reboot
 170	common	sethostname		sys_sethostname
 171	common	setdomainname		sys_setdomainname
-172	common	iopl			stub_iopl
+172	common	iopl			sys_iopl
 173	common	ioperm			sys_ioperm
 174	64	create_module
 175	common	init_module		sys_init_module
diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c
index f2f0723..5c81ed2 100644
--- a/arch/x86/um/sys_call_table_64.c
+++ b/arch/x86/um/sys_call_table_64.c
@@ -16,7 +16,7 @@
  */
 
 /* Not going to be implemented by UML, since we have no hardware. */
-#define stub_iopl sys_ni_syscall
+#define sys_iopl sys_ni_syscall
 #define sys_ioperm sys_ni_syscall
 
 /*
-- 
1.8.1.4

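To make the new fast-path stack layout easier to follow, here is a
C-style sketch of what the SYSCALL64 entry sequence above leaves on
the kernel stack. The struct name and annotations are illustrative,
inferred from the hunks above; field order follows struct pt_regs in
arch/x86/include/asm/ptrace.h:

/*
 * Sketch: kernel stack after ALLOC_PTREGS_ON_STACK +
 * SAVE_C_REGS_EXCEPT_RCX_R11 on the SYSCALL64 fast path.
 * Lower addresses first, as in struct pt_regs.
 */
struct syscall64_fastpath_frame {
	/* r15..r12, rbp, rbx: not saved on the fast path */
	unsigned long r11;	/* NOT filled: user rflags still live in %r11 */
	unsigned long r10, r9, r8;
	unsigned long rax;	/* syscall number on entry, return value on exit */
	unsigned long rcx;	/* NOT filled: return rip still lives in %rcx */
	unsigned long rdx, rsi, rdi;
	unsigned long orig_rax;	/* filled from %rax on entry */
	/* hardware iret frame part of pt_regs: */
	unsigned long rip;	/* filled from %rcx on entry */
	unsigned long cs;	/* deferred to STORE_IRET_FRAME_CS_SS on slow paths */
	unsigned long eflags;	/* filled from %r11 on entry */
	unsigned long rsp;	/* filled from PER_CPU old_rsp on entry */
	unsigned long ss;	/* deferred to STORE_IRET_FRAME_CS_SS on slow paths */
};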

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 17/17] x86: simplify iret stack handling on SYSCALL64 fastpath
  2014-08-10 15:00     ` Denys Vlasenko
@ 2014-08-10 22:42       ` Andy Lutomirski
  2014-08-11 12:24         ` Denys Vlasenko
  0 siblings, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2014-08-10 22:42 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: linux-kernel, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Frederic Weisbecker, X86 ML, Alexei Starovoitov, Will Drewry,
	Kees Cook

On Mon, Aug 11, 2014 at 12:00 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> On 08/09/2014 12:59 AM, Andy Lutomirski wrote:
>>> + * When returning through fast path, userspace sees rcx = return address,
>>> + * r11 = rflags. When returning through iret (e.g. if audit is active),
>>> + * these registers may contain garbage.
>>> + * For ptrace we manage to avoid that: when we hit slow path on entry,
>>> + * we do save rcx and r11 in pt_regs, so ptrace on exit also sees them.
>>> + * If slow path is entered only on exit, there will be garbage.
>>
>> I don't like this.  At least the current code puts something
>> deterministic in there (-1) for the slow path, even though it's wrong
>> and makes the slow path behave visibly differently from the fast path.
>>
>> Leaking uninitialized data here is extra bad, though.  Keep in mind
>> that, when a syscall entry is interrupted before fully setting up
>> pt_regs, the interrupt frame overlaps task_pt_regs, so it's possible,
>> depending on the stack slot ordering, for a kernel secret
>> (kernel_stack?) to end up somewhere in task_pt_regs.
>
> It's easy to fix. We jump off fast path to slow path here:
>
>         movl TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS),%edx
>         andl %edi,%edx
>         jnz  sysret_careful
>
> This is the only use of "sysret_careful" label.
> Therefore, there we don't need to think about any other scenarios
> besides "we are returning from syscall here".

I may be missing something here (on vacation, not really testing
things, no big monitor, etc), but how is this compatible with things
like rt_sigreturn?  rt_sigreturn is called from the fastpath, via a
stub, and it returns through int_ret_from_sys_call.  The C part needs
to modify all the regs, and those regs need to stick, so fixing up rcx
and r11 after rt_sigreturn can't work.

On the flip side, fork probably needs to *read* those regs, so the
fixup needs to happen before fork.

I'm sure it's possible to get this right, but it's complicated and
error-prone, and I'm not at all convinced it's worth it to save two
stores on the fast path.

(I'm pretty sure that current kernels have all kinds of bugs in this area.)

>
> My patch only fixes up segment regs:
>
>         /* Handle reschedules */
>         /* edx: work, edi: workmask */
>  sysret_careful:
> +       STORE_IRET_FRAME_CS_SS
>         bt $TIF_NEED_RESCHED,%edx
>
> pt_regs->rcx and ->r11 aren't initialized.
>
> We can fix that e.g. by adding here four more insns:
>
>         movq    RIP(%rsp),%rcx
>         movq    %rcx,RCX(%rsp)
>         movq    EFLAGS(%rsp),%rcx
>         movq    %rcx,R11(%rsp)
>
> similar to what my patch did on slow path on entry (see changes at
> "tracesys" label).
>
> Or we can just bite the bullet and save rcx and r11 on fast path
> (there it will require just two insns). +2 cycles on most CPUs.
>


--Andy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 17/17] x86: simplify iret stack handling on SYSCALL64 fastpath
  2014-08-10 22:42       ` Andy Lutomirski
@ 2014-08-11 12:24         ` Denys Vlasenko
  2014-08-11 20:06           ` Andy Lutomirski
  0 siblings, 1 reply; 28+ messages in thread
From: Denys Vlasenko @ 2014-08-11 12:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Frederic Weisbecker, X86 ML, Alexei Starovoitov, Will Drewry,
	Kees Cook

On 08/11/2014 12:42 AM, Andy Lutomirski wrote:
> On Mon, Aug 11, 2014 at 12:00 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>> On 08/09/2014 12:59 AM, Andy Lutomirski wrote:
>>>> + * When returning through fast path, userspace sees rcx = return address,
>>>> + * r11 = rflags. When returning through iret (e.g. if audit is active),
>>>> + * these registers may contain garbage.
>>>> + * For ptrace we manage to avoid that: when we hit slow path on entry,
>>>> + * we do save rcx and r11 in pt_regs, so ptrace on exit also sees them.
>>>> + * If slow path is entered only on exit, there will be garbage.
>>>
>>> I don't like this.  At least the current code puts something
>>> deterministic in there (-1) for the slow path, even though it's wrong
>>> and makes the slow path behave visibly differently from the fast path.
>>>
>>> Leaking uninitialized data here is extra bad, though.  Keep in mind
>>> that, when a syscall entry is interrupted before fully setting up
>>> pt_regs, the interrupt frame overlaps task_pt_regs, so it's possible,
>>> depending on the stack slot ordering, for a kernel secret
>>> (kernel_stack?) to end up somewhere in task_pt_regs.
>>
>> It's easy to fix. We jump off fast path to slow path here:
>>
>>         movl TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS),%edx
>>         andl %edi,%edx
>>         jnz  sysret_careful
>>
>> This is the only use of "sysret_careful" label.
>> Therefore, there we don't need to think about any other scenarios
>> besides "we are returning from syscall here".
> 
> I may be missing something here (on vacation, not really testing
> things, no big monitor, etc), but how is this compatible with things
> like rt_sigreturn?  rt_sigreturn is called from the fastpath, via a
> stub, and it returns through int_ret_from_sys_call.  The C part needs
> to modify all the regs, and those regs need to stick, so fixing up rcx
> and r11 after rt_sigreturn can't work.

Code at the "sysret_careful" label is only reachable
on the fast-path return.
We don't go down this code path after rt_sigreturn;
stub_rt_sigreturn manually steers into the iret code path instead:

ENTRY(stub_rt_sigreturn)
        CFI_STARTPROC
        addq $8, %rsp
        DEFAULT_FRAME 0
        SAVE_EXTRA_REGS
        STORE_IRET_FRAME_CS_SS
        call sys_rt_sigreturn
        movq %rax,RAX(%rsp)
        RESTORE_EXTRA_REGS
        jmp int_ret_from_sys_call   <==== NOTE THIS

So, we don't do any rcx/r11 fixups after sys_rt_sigreturn.

BTW, it looks like "SAVE_EXTRA_REGS" here is not necessary:
sys_rt_sigreturn overwrites all registers anyway.
I have its removal on my TODO list.
(It was there before my patch set because the stack needed
*resizing to full pt_regs size*; now it does not need that.)

> 
> On the flip side, fork probably needs to *read* those regs, so the
> fixup needs to happen before fork.
> 
> I'm sure it's possible to get this right, but it's complicated and
> error-prone, and I'm not at all convinced it's worth it to save two
> stores on the fast path.
> 
> (I'm pretty sure that current kernels have all kinds of bugs in this area.)
> 
>>
>> My patch only fixes up segment regs:
>>
>>         /* Handle reschedules */
>>         /* edx: work, edi: workmask */
>>  sysret_careful:
>> +       STORE_IRET_FRAME_CS_SS
>>         bt $TIF_NEED_RESCHED,%edx
>>
>> pt_regs->rcx and ->r11 aren't initialized.
>>
>> We can fix that e.g. by adding here four more insns:
>>
>>         movq    RIP(%rsp),%rcx
>>         movq    %rcx,RCX(%rsp)
>>         movq    EFLAGS(%rsp),%rcx
>>         movq    %rcx,R11(%rsp)
>>
>> similar to what my patch did on slow path on entry (see changes at
>> "tracesys" label).
>>
>> Or we can just bite the bullet and save rcx and r11 on fast path
>> (there it will require just two insns). +2 cycles on most CPUs.
>>
> 
> 
> --Andy
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 17/17] x86: simplify iret stack handling on SYSCALL64 fastpath
  2014-08-11 12:24         ` Denys Vlasenko
@ 2014-08-11 20:06           ` Andy Lutomirski
  2014-08-12  9:21             ` Denys Vlasenko
  0 siblings, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2014-08-11 20:06 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: linux-kernel, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Frederic Weisbecker, X86 ML, Alexei Starovoitov, Will Drewry,
	Kees Cook

On Mon, Aug 11, 2014 at 5:24 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> On 08/11/2014 12:42 AM, Andy Lutomirski wrote:
>> On Mon, Aug 11, 2014 at 12:00 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>> On 08/09/2014 12:59 AM, Andy Lutomirski wrote:
>>>>> + * When returning through fast path, userspace sees rcx = return address,
>>>>> + * r11 = rflags. When returning through iret (e.g. if audit is active),
>>>>> + * these registers may contain garbage.
>>>>> + * For ptrace we manage to avoid that: when we hit slow path on entry,
>>>>> + * we do save rcx and r11 in pt_regs, so ptrace on exit also sees them.
>>>>> + * If slow path is entered only on exit, there will be garbage.
>>>>
>>>> I don't like this.  At least the current code puts something
>>>> deterministic in there (-1) for the slow path, even though it's wrong
>>>> and makes the slow path behave visibly differently from the fast path.
>>>>
>>>> Leaking uninitialized data here is extra bad, though.  Keep in mind
>>>> that, when a syscall entry is interrupted before fully setting up
>>>> pt_regs, the interrupt frame overlaps task_pt_regs, so it's possible,
>>>> depending on the stack slot ordering, for a kernel secret
>>>> (kernel_stack?) to end up somewhere in task_pt_regs.
>>>
>>> It's easy to fix. We jump off fast path to slow path here:
>>>
>>>         movl TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS),%edx
>>>         andl %edi,%edx
>>>         jnz  sysret_careful
>>>
>>> This is the only use of "sysret_careful" label.
>>> Therefore, there we don't need to think about any other scenarios
>>> besides "we are returning from syscall here".
>>
>> I may be missing something here (on vacation, not really testing
>> things, no big monitor, etc), but how is this compatible with things
>> like rt_sigreturn?  rt_sigreturn is called from the fastpath, via a
>> stub, and it returns through int_ret_from_sys_call.  The C part needs
>> to modify all the regs, and those regs need to stick, so fixing up rcx
>> and r11 after rt_sigreturn can't work.
>
> Code at the "sysret_careful" label is only reachable
> on the fast-path return.
> We don't go down this code path after rt_sigreturn;
> stub_rt_sigreturn manually steers into the iret code path instead:
>
> ENTRY(stub_rt_sigreturn)
>         CFI_STARTPROC
>         addq $8, %rsp
>         DEFAULT_FRAME 0
>         SAVE_EXTRA_REGS
>         STORE_IRET_FRAME_CS_SS
>         call sys_rt_sigreturn
>         movq %rax,RAX(%rsp)
>         RESTORE_EXTRA_REGS
>         jmp int_ret_from_sys_call   <==== NOTE THIS
>
> So, we don't do any rcx/r11 fixups after sys_rt_sigreturn.

Oh, right.  rt_sigreturn overwrites all regs, so it doesn't need a
fixup in advance.

That still leaves fork and everything that calls ptrace_event, though.

--Andy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 17/17] x86: simplify iret stack handling on SYSCALL64 fastpath
  2014-08-11 20:06           ` Andy Lutomirski
@ 2014-08-12  9:21             ` Denys Vlasenko
  2014-08-13  1:02               ` Andy Lutomirski
  0 siblings, 1 reply; 28+ messages in thread
From: Denys Vlasenko @ 2014-08-12  9:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Frederic Weisbecker, X86 ML, Alexei Starovoitov, Will Drewry,
	Kees Cook

On 08/11/2014 10:06 PM, Andy Lutomirski wrote:
> On Mon, Aug 11, 2014 at 5:24 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>> On 08/11/2014 12:42 AM, Andy Lutomirski wrote:
>>> On Mon, Aug 11, 2014 at 12:00 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>>> On 08/09/2014 12:59 AM, Andy Lutomirski wrote:
>>>>>> + * When returning through fast path, userspace sees rcx = return address,
>>>>>> + * r11 = rflags. When returning through iret (e.g. if audit is active),
>>>>>> + * these registers may contain garbage.
>>>>>> + * For ptrace we manage to avoid that: when we hit slow path on entry,
>>>>>> + * we do save rcx and r11 in pt_regs, so ptrace on exit also sees them.
>>>>>> + * If slow path is entered only on exit, there will be garbage.
>>>>>
>>>>> I don't like this.  At least the current code puts something
>>>>> deterministic in there (-1) for the slow path, even though it's wrong
>>>>> and makes the slow path behave visibly differently from the fast path.
>>>>>
>>>>> Leaking uninitialized data here is extra bad, though.  Keep in mind
>>>>> that, when a syscall entry is interrupted before fully setting up
>>>>> pt_regs, the interrupt frame overlaps task_pt_regs, so it's possible,
>>>>> depending on the stack slot ordering, for a kernel secret
>>>>> (kernel_stack?) to end up somewhere in task_pt_regs.
>>>>
>>>> It's easy to fix. We jump off fast path to slow path here:
>>>>
>>>>         movl TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS),%edx
>>>>         andl %edi,%edx
>>>>         jnz  sysret_careful
>>>>
>>>> This is the only use of "sysret_careful" label.
>>>> Therefore, there we don't need to think about any other scenarios
>>>> besides "we are returning from syscall here".
>>>
>>> I may be missing something here (on vacation, not really testing
>>> things, no big monitor, etc), but how is this compatible with things
>>> like rt_sigreturn?  rt_sigreturn is called from the fastpath, via a
>>> stub, and it returns through int_ret_from_sys_call.  The C part needs
>>> to modify all the regs, and those regs need to stick, so fixing up rcx
>>> and r11 after rt_sigreturn can't work.
>>
>> Code at the "sysret_careful" label is only reachable
>> on the fast-path return.
>> We don't go down this code path after rt_sigreturn;
>> stub_rt_sigreturn manually steers into the iret code path instead:
>>
>> ENTRY(stub_rt_sigreturn)
>>         CFI_STARTPROC
>>         addq $8, %rsp
>>         DEFAULT_FRAME 0
>>         SAVE_EXTRA_REGS
>>         STORE_IRET_FRAME_CS_SS
>>         call sys_rt_sigreturn
>>         movq %rax,RAX(%rsp)
>>         RESTORE_EXTRA_REGS
>>         jmp int_ret_from_sys_call   <==== NOTE THIS
>>
>> So, we don't do any rcx/r11 fixups after sys_rt_sigreturn.
> 
> Oh, right.  rt_sigreturn overwrites all regs, so it doesn't need a
> fixup in advance.
> 
> That still leaves fork and everything that calls ptrace_event, though.

I think I have it covered:

[v]fork and clone have fully populated pt_regs.

Syscall entry/exit ptrace stops are on slow path and therefore
also have fully populated pt_regs.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 17/17] x86: simplify iret stack handling on SYSCALL64 fastpath
  2014-08-12  9:21             ` Denys Vlasenko
@ 2014-08-13  1:02               ` Andy Lutomirski
  0 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2014-08-13  1:02 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: linux-kernel, Linus Torvalds, Oleg Nesterov, H. Peter Anvin,
	Frederic Weisbecker, X86 ML, Alexei Starovoitov, Will Drewry,
	Kees Cook

On Tue, Aug 12, 2014 at 2:21 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> On 08/11/2014 10:06 PM, Andy Lutomirski wrote:
>> On Mon, Aug 11, 2014 at 5:24 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>> On 08/11/2014 12:42 AM, Andy Lutomirski wrote:
>>>> On Mon, Aug 11, 2014 at 12:00 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>>>> On 08/09/2014 12:59 AM, Andy Lutomirski wrote:
>>>>>>> + * When returning through fast path, userspace sees rcx = return address,
>>>>>>> + * r11 = rflags. When returning through iret (e.g. if audit is active),
>>>>>>> + * these registers may contain garbage.
>>>>>>> + * For ptrace we manage to avoid that: when we hit slow path on entry,
>>>>>>> + * we do save rcx and r11 in pt_regs, so ptrace on exit also sees them.
>>>>>>> + * If slow path is entered only on exit, there will be garbage.
>>>>>>
>>>>>> I don't like this.  At least the current code puts something
>>>>>> deterministic in there (-1) for the slow path, even though it's wrong
>>>>>> and makes the slow path behave visibly differently from the fast path.
>>>>>>
>>>>>> Leaking uninitialized data here is extra bad, though.  Keep in mind
>>>>>> that, when a syscall entry is interrupted before fully setting up
>>>>>> pt_regs, the interrupt frame overlaps task_pt_regs, so it's possible,
>>>>>> depending on the stack slot ordering, for a kernel secret
>>>>>> (kernel_stack?) to end up somewhere in task_pt_regs.
>>>>>
>>>>> It's easy to fix. We jump off fast path to slow path here:
>>>>>
>>>>>         movl TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS),%edx
>>>>>         andl %edi,%edx
>>>>>         jnz  sysret_careful
>>>>>
>>>>> This is the only use of "sysret_careful" label.
>>>>> Therefore, there we don't need to think about any other scenarios
>>>>> besides "we are returning from syscall here".
>>>>
>>>> I may be missing something here (on vacation, not really testing
>>>> things, no big monitor, etc), but how is this compatible with things
>>>> like rt_sigreturn?  rt_sigreturn is called from the fastpath, via a
>>>> stub, and it returns through int_ret_from_sys_call.  The C part needs
>>>> to modify all the regs, and those regs need to stick, so fixing up rcx
>>>> and r11 after rt_sigreturn can't work.
>>>
>>> Code at the "sysret_careful" label is only reachable
>>> on the fast-path return.
>>> We don't go down this code path after rt_sigreturn;
>>> stub_rt_sigreturn manually steers into the iret code path instead:
>>>
>>> ENTRY(stub_rt_sigreturn)
>>>         CFI_STARTPROC
>>>         addq $8, %rsp
>>>         DEFAULT_FRAME 0
>>>         SAVE_EXTRA_REGS
>>>         STORE_IRET_FRAME_CS_SS
>>>         call sys_rt_sigreturn
>>>         movq %rax,RAX(%rsp)
>>>         RESTORE_EXTRA_REGS
>>>         jmp int_ret_from_sys_call   <==== NOTE THIS
>>>
>>> So, we don't do any rcx/r11 fixups after sys_rt_sigreturn.
>>
>> Oh, right.  rt_sigreturn overwrites all regs, so it doesn't need a
>> fixup in advance.
>>
>> That still leaves fork and everything that calls ptrace_event, though.
>
> I think I have it covered:
>
> [v]fork and clone have fully populated pt_regs.
>
> Syscall entry/exit ptrace stops are on the slow path and therefore
> also have fully populated pt_regs.

Heh.  I hope so, but CVE-2014-4699 was an exception to that rule...

Anyway, I'll see if I can beef up my test cases that are relevant to this stuff.
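
(A minimal sketch of such a test -- hypothetical code, not from an
actual test suite; the syscall-stop bookkeeping after raise() is
simplified, and a robust test would identify each stop by checking
regs.orig_rax:)

#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	struct user_regs_struct regs;
	pid_t child = fork();

	if (child == 0) {
		ptrace(PTRACE_TRACEME, 0, NULL, NULL);
		raise(SIGSTOP);
		getppid();		/* syscall under test */
		_exit(0);
	}
	waitpid(child, NULL, 0);	/* signal-delivery stop for SIGSTOP */
	ptrace(PTRACE_SYSCALL, child, NULL, NULL);	/* to syscall entry */
	waitpid(child, NULL, 0);
	ptrace(PTRACE_SYSCALL, child, NULL, NULL);	/* to syscall exit */
	waitpid(child, NULL, 0);

	ptrace(PTRACE_GETREGS, child, NULL, &regs);
	/* At a syscall-exit stop, rcx/r11 must be deterministic (on the
	 * fast path they mirror rip/rflags), never leftover kernel data. */
	printf("rcx=%llx r11=%llx rip=%llx\n",
	       (unsigned long long)regs.rcx,
	       (unsigned long long)regs.r11,
	       (unsigned long long)regs.rip);

	ptrace(PTRACE_CONT, child, NULL, NULL);
	return 0;
}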

--Andy

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2014-08-13  1:03 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-08 17:44 [PATCH v4 0/17] x86: entry.S optimizations Denys Vlasenko
2014-08-08 17:44 ` [PATCH 01/17] x86: entry_64.S: delete unused code Denys Vlasenko
2014-08-08 17:44 ` [PATCH 02/17] x86: ia32entry.S: fix wrong symbolic constant usage: R11->ARGOFFSET Denys Vlasenko
2014-08-08 17:44 ` [PATCH 03/17] x86: open-code register save/restore in trace_hardirqs thunks Denys Vlasenko
2014-08-08 17:44 ` [PATCH 04/17] x86: entry_64.S: fold SAVE_ARGS_IRQ macro into its sole user Denys Vlasenko
2014-08-08 17:44 ` [PATCH 05/17] x86: entry_64.S: always allocate complete "struct pt_regs" Denys Vlasenko
2014-08-08 17:44 ` [PATCH 06/17] x86: mass removal of ARGOFFSET Denys Vlasenko
2014-08-08 17:44 ` [PATCH 07/17] x86: rename some macros and labels, no code changes Denys Vlasenko
2014-08-08 17:44 ` [PATCH 08/17] x86: add comments about various syscall instructions, " Denys Vlasenko
2014-08-08 17:44 ` [PATCH 09/17] x86: entry_64.S: move save_paranoid and ret_from_fork closer to their users Denys Vlasenko
2014-08-08 17:44 ` [PATCH 10/17] x86: entry_64.S: rename save_paranoid to paranoid_entry, no code changes Denys Vlasenko
2014-08-08 17:44 ` [PATCH 11/17] x86: entry_64.S: fold test_in_nmi macro into its only user Denys Vlasenko
2014-08-08 17:44 ` [PATCH 12/17] x86: get rid of KERNEL_STACK_OFFSET Denys Vlasenko
2014-08-08 17:44 ` [PATCH 13/17] x86: ia32entry.S: fold IA32_ARG_FIXUP macro into its callers Denys Vlasenko
2014-08-08 17:44 ` [PATCH 14/17] x86: ia32entry.S: use mov instead of push/pop where possible Denys Vlasenko
2014-08-08 17:44 ` [PATCH 15/17] x86: code shrink in paranoid_exit Denys Vlasenko
2014-08-08 17:44 ` [PATCH 16/17] x86: entry_64.S: trivial optimization for ENOSYS Denys Vlasenko
2014-08-08 22:48   ` Andy Lutomirski
2014-08-08 17:44 ` [PATCH 17/17] x86: simplify iret stack handling on SYSCALL64 fastpath Denys Vlasenko
2014-08-08 22:59   ` Andy Lutomirski
2014-08-10 15:00     ` Denys Vlasenko
2014-08-10 22:42       ` Andy Lutomirski
2014-08-11 12:24         ` Denys Vlasenko
2014-08-11 20:06           ` Andy Lutomirski
2014-08-12  9:21             ` Denys Vlasenko
2014-08-13  1:02               ` Andy Lutomirski
2014-08-10 18:47   ` [PATCH 17/17 v2] " Denys Vlasenko
2014-08-09  0:27 ` [PATCH v4 0/17] x86: entry.S optimizations H. Peter Anvin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.