* [PATCH 0/4 v2] x86: entry_64.S: steps towards simpler iret frame handling
@ 2015-03-09 18:39 Denys Vlasenko
  2015-03-09 18:39 ` [PATCH 1/4] x86: save r11 into pt_regs->flags on SYSCALL64 fastpath Denys Vlasenko
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Denys Vlasenko @ 2015-03-09 18:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	x86, linux-kernel

These changes make the SYSCALL64 code path save the flags and the user's
stack pointer in pt_regs->flags and pt_regs->sp, where they belong.

As a result, we can drop stub_iopl() and thread_struct::usersp.

Usage of PER_CPU(old_rsp) is reduced to the bare minimum.

The FIXUP/RESTORE_TOP_OF_STACK macros are on a diet too.

Changes since v1: on Ingo's request, split into 4 patches instead of 2.

CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ingo Molnar <mingo@kernel.org>
CC: Borislav Petkov <bp@alien8.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: x86@kernel.org
CC: linux-kernel@vger.kernel.org

Denys Vlasenko (4):
  x86: save r11 into pt_regs->flags on SYSCALL64 fastpath
  Remove stub_iopl
  x86: save user %rsp in pt_regs->sp on SYSCALL64 fastpath
  Remove unused stuff

 arch/x86/include/asm/calling.h   | 20 +++++++++-----
 arch/x86/include/asm/compat.h    |  2 +-
 arch/x86/include/asm/processor.h |  6 -----
 arch/x86/include/asm/ptrace.h    |  8 ++----
 arch/x86/kernel/entry_64.S       | 57 ++++++++++++++--------------------------
 arch/x86/kernel/perf_regs.c      |  2 +-
 arch/x86/kernel/process_64.c     |  8 +-----
 arch/x86/syscalls/syscall_64.tbl |  2 +-
 arch/x86/um/sys_call_table_64.c  |  2 +-
 9 files changed, 40 insertions(+), 67 deletions(-)

-- 
1.8.1.4



* [PATCH 1/4] x86: save r11 into pt_regs->flags on SYSCALL64 fastpath
  2015-03-09 18:39 [PATCH 0/4 v2] x86: entry_64.S: steps towards simpler iret frame handling Denys Vlasenko
@ 2015-03-09 18:39 ` Denys Vlasenko
  2015-03-09 20:02   ` Andy Lutomirski
  2015-03-16 12:04   ` [tip:x86/asm] x86/asm/entry/64: Save R11 into pt_regs->flags " tip-bot for Denys Vlasenko
  2015-03-09 18:39 ` [PATCH 3/4] x86: save user rsp in pt_regs->sp " Denys Vlasenko
  2015-03-10  6:00 ` [PATCH 0/4 v2] x86: entry_64.S: steps towards simpler iret frame handling Ingo Molnar
  2 siblings, 2 replies; 23+ messages in thread
From: Denys Vlasenko @ 2015-03-09 18:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	x86, linux-kernel

Before this patch, r11 was saved in pt_regs->r11.
This looks natural, but it requires messy shuffling to/from the iret frame
whenever ptrace or e.g. sys_iopl wants to modify the flags - because
that's how this register is used by SYSCALL/SYSRET.

This patch saves r11 in pt_regs->flags,
and uses that value for the SYSRET64 insn. Shuffling is eliminated.

FIXUP/RESTORE_TOP_OF_STACK are simplified.

stub_iopl is no longer needed: pt_regs->flags needs no fixing up.

Testing shows that syscall fast path is ~54.3 ns before
and after the patch (on 2.7 GHz Sandy Bridge CPU).
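
For reference, a compact sketch of the hardware convention this relies on
(not part of the patch; register usage as described in the entry_64.S
comments):

	# SYSCALL:  %rcx <- return RIP, %r11 <- RFLAGS
	# SYSRET64: RIP <- %rcx, RFLAGS <- %r11 (RF and VM forced to 0)

With r11 stored in pt_regs->flags, the value is already in the slot that
ptrace and sys_iopl read and modify, so no shuffling is needed.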

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ingo Molnar <mingo@kernel.org>
CC: Borislav Petkov <bp@alien8.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: x86@kernel.org
CC: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/calling.h | 20 ++++++++++++++------
 arch/x86/kernel/entry_64.S     | 24 +++++++++++-------------
 2 files changed, 25 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/calling.h b/arch/x86/include/asm/calling.h
index f1a962f..4b5f7bf 100644
--- a/arch/x86/include/asm/calling.h
+++ b/arch/x86/include/asm/calling.h
@@ -95,9 +95,11 @@ For 32-bit we have the following conventions - kernel is built with
 	CFI_ADJUST_CFA_OFFSET 15*8+\addskip
 	.endm
 
-	.macro SAVE_C_REGS_HELPER offset=0 rax=1 rcx=1 r8plus=1
-	.if \r8plus
+	.macro SAVE_C_REGS_HELPER offset=0 rax=1 rcx=1 r8910=1 r11=1
+	.if \r11
 	movq_cfi r11, 6*8+\offset
+	.endif
+	.if \r8910
 	movq_cfi r10, 7*8+\offset
 	movq_cfi r9,  8*8+\offset
 	movq_cfi r8,  9*8+\offset
@@ -113,16 +115,19 @@ For 32-bit we have the following conventions - kernel is built with
 	movq_cfi rdi, 14*8+\offset
 	.endm
 	.macro SAVE_C_REGS offset=0
-	SAVE_C_REGS_HELPER \offset, 1, 1, 1
+	SAVE_C_REGS_HELPER \offset, 1, 1, 1, 1
 	.endm
 	.macro SAVE_C_REGS_EXCEPT_RAX_RCX offset=0
-	SAVE_C_REGS_HELPER \offset, 0, 0, 1
+	SAVE_C_REGS_HELPER \offset, 0, 0, 1, 1
 	.endm
 	.macro SAVE_C_REGS_EXCEPT_R891011
-	SAVE_C_REGS_HELPER 0, 1, 1, 0
+	SAVE_C_REGS_HELPER 0, 1, 1, 0, 0
 	.endm
 	.macro SAVE_C_REGS_EXCEPT_RCX_R891011
-	SAVE_C_REGS_HELPER 0, 1, 0, 0
+	SAVE_C_REGS_HELPER 0, 1, 0, 0, 0
+	.endm
+	.macro SAVE_C_REGS_EXCEPT_RAX_RCX_R11
+	SAVE_C_REGS_HELPER 0, 0, 0, 1, 0
 	.endm
 
 	.macro SAVE_EXTRA_REGS offset=0
@@ -179,6 +184,9 @@ For 32-bit we have the following conventions - kernel is built with
 	.macro RESTORE_C_REGS_EXCEPT_R11
 	RESTORE_C_REGS_HELPER 1,1,0,1,1
 	.endm
+	.macro RESTORE_C_REGS_EXCEPT_RCX_R11
+	RESTORE_C_REGS_HELPER 1,0,0,1,1
+	.endm
 	.macro RESTORE_RSI_RDI
 	RESTORE_C_REGS_HELPER 0,0,0,0,0
 	.endm
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 5117a2b..324200a 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -121,14 +121,12 @@ ENDPROC(native_usergs_sysret64)
 #endif
 
 /*
- * C code is not supposed to know about undefined top of stack. Every time
- * a C function with an pt_regs argument is called from the SYSCALL based
- * fast path FIXUP_TOP_OF_STACK is needed.
+ * C code is not supposed to know that the iret frame is not populated.
+ * Every time a C function with an pt_regs argument is called from
+ * the SYSCALL based fast path FIXUP_TOP_OF_STACK is needed.
  * RESTORE_TOP_OF_STACK syncs the syscall state after any possible ptregs
  * manipulation.
  */
-
-	/* %rsp:at FRAMEEND */
 	.macro FIXUP_TOP_OF_STACK tmp offset=0
 	movq PER_CPU_VAR(old_rsp),\tmp
 	movq \tmp,RSP+\offset(%rsp)
@@ -136,15 +134,13 @@ ENDPROC(native_usergs_sysret64)
 	movq $__USER_CS,CS+\offset(%rsp)
 	movq RIP+\offset(%rsp),\tmp  /* get rip */
 	movq \tmp,RCX+\offset(%rsp)  /* copy it to rcx as sysret would do */
-	movq R11+\offset(%rsp),\tmp  /* get eflags */
-	movq \tmp,EFLAGS+\offset(%rsp)
+	movq EFLAGS+\offset(%rsp),\tmp /* ditto for rflags->r11 */
+	movq \tmp,R11+\offset(%rsp)
 	.endm
 
 	.macro RESTORE_TOP_OF_STACK tmp offset=0
 	movq RSP+\offset(%rsp),\tmp
 	movq \tmp,PER_CPU_VAR(old_rsp)
-	movq EFLAGS+\offset(%rsp),\tmp
-	movq \tmp,R11+\offset(%rsp)
 	.endm
 
 /*
@@ -257,9 +253,10 @@ GLOBAL(system_call_after_swapgs)
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	ALLOC_PT_GPREGS_ON_STACK 8		/* +8: space for orig_ax */
-	SAVE_C_REGS_EXCEPT_RAX_RCX
+	SAVE_C_REGS_EXCEPT_RAX_RCX_R11
 	movq	$-ENOSYS,RAX(%rsp)
 	movq_cfi rax,ORIG_RAX
+	movq	%r11,EFLAGS(%rsp)
 	movq	%rcx,RIP(%rsp)
 	CFI_REL_OFFSET rip,RIP
 	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
@@ -277,7 +274,7 @@ system_call_fastpath:
 	movq %rax,RAX(%rsp)
 /*
  * Syscall return path ending with SYSRET (fast path)
- * Has incomplete stack frame and undefined top of stack.
+ * Has incompletely filled pt_regs, iret frame is also incomplete.
  */
 ret_from_sys_call:
 	testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
@@ -291,9 +288,10 @@ ret_from_sys_call:
 	 * sysretq will re-enable interrupts:
 	 */
 	TRACE_IRQS_ON
-	RESTORE_C_REGS_EXCEPT_RCX
-	movq RIP(%rsp),%rcx
+	RESTORE_C_REGS_EXCEPT_RCX_R11
+	movq	RIP(%rsp),%rcx
 	CFI_REGISTER	rip,rcx
+	movq	EFLAGS(%rsp),%r11
 	/*CFI_REGISTER	rflags,r11*/
 	movq	PER_CPU_VAR(old_rsp), %rsp
 	/*
-- 
1.8.1.4



* [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
  2015-03-09 18:39 [PATCH 0/4 v2] x86: entry_64.S: steps towards simpler iret frame handling Denys Vlasenko
  2015-03-09 18:39 ` [PATCH 1/4] x86: save r11 into pt_regs->flags on SYSCALL64 fastpath Denys Vlasenko
@ 2015-03-09 18:39 ` Denys Vlasenko
  2015-03-09 20:11   ` Andy Lutomirski
                     ` (2 more replies)
  2015-03-10  6:00 ` [PATCH 0/4 v2] x86: entry_64.S: steps towards simpler iret frame handling Ingo Molnar
  2 siblings, 3 replies; 23+ messages in thread
From: Denys Vlasenko @ 2015-03-09 18:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	x86, linux-kernel

PER_CPU(old_rsp) usage is simplified - now it is used only
as temp storage, and userspace stack pointer is immediately stored
in pt_regs->sp on syscall entry, instead of being used later,
on syscall exit.

Instead of PER_CPU(old_rsp) and task->thread.usersp, C code
uses pt_regs->sp now.

FIXUP/RESTORE_TOP_OF_STACK are simplified.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ingo Molnar <mingo@kernel.org>
CC: Borislav Petkov <bp@alien8.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: x86@kernel.org
CC: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/compat.h |  2 +-
 arch/x86/include/asm/ptrace.h |  8 ++------
 arch/x86/kernel/entry_64.S    | 18 +++++++-----------
 arch/x86/kernel/perf_regs.c   |  2 +-
 arch/x86/kernel/process_64.c  |  3 +--
 5 files changed, 12 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/compat.h b/arch/x86/include/asm/compat.h
index 59c6c40..acdee09 100644
--- a/arch/x86/include/asm/compat.h
+++ b/arch/x86/include/asm/compat.h
@@ -301,7 +301,7 @@ static inline void __user *arch_compat_alloc_user_space(long len)
 		sp = task_pt_regs(current)->sp;
 	} else {
 		/* -128 for the x32 ABI redzone */
-		sp = this_cpu_read(old_rsp) - 128;
+		sp = task_pt_regs(current)->sp - 128;
 	}
 
 	return (void __user *)round_down(sp - len, 16);
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 4077d96..74bb2e0 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -145,12 +145,8 @@ static inline bool user_64bit_mode(struct pt_regs *regs)
 #endif
 }
 
-#define current_user_stack_pointer()	this_cpu_read(old_rsp)
-/* ia32 vs. x32 difference */
-#define compat_user_stack_pointer()	\
-	(test_thread_flag(TIF_IA32) 	\
-	 ? current_pt_regs()->sp 	\
-	 : this_cpu_read(old_rsp))
+#define current_user_stack_pointer()	current_pt_regs()->sp
+#define compat_user_stack_pointer()	current_pt_regs()->sp
 #endif
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 703ced0..d86788c 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
  * manipulation.
  */
 	.macro FIXUP_TOP_OF_STACK tmp offset=0
-	movq PER_CPU_VAR(old_rsp),\tmp
-	movq \tmp,RSP+\offset(%rsp)
 	movq $__USER_DS,SS+\offset(%rsp)
 	movq $__USER_CS,CS+\offset(%rsp)
 	movq RIP+\offset(%rsp),\tmp  /* get rip */
@@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64)
 	.endm
 
 	.macro RESTORE_TOP_OF_STACK tmp offset=0
-	movq RSP+\offset(%rsp),\tmp
-	movq \tmp,PER_CPU_VAR(old_rsp)
+	/* nothing to do */
 	.endm
 
 /*
@@ -222,9 +219,6 @@ ENDPROC(native_usergs_sysret64)
  * Interrupts are off on entry.
  * Only called from user space.
  *
- * XXX	if we had a free scratch register we could save the RSP into the stack frame
- *      and report it properly in ps. Unfortunately we haven't.
- *
  * When user can change the frames always force IRET. That is because
  * it deals with uncanonical addresses better. SYSRET has trouble
  * with them due to bugs in both AMD and Intel CPUs.
@@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	ALLOC_PT_GPREGS_ON_STACK 8		/* +8: space for orig_ax */
+	movq	%rcx,RIP(%rsp)
+	movq	PER_CPU_VAR(old_rsp),%rcx
+	movq	%r11,EFLAGS(%rsp)
+	movq	%rcx,RSP(%rsp)
+	movq_cfi rax,ORIG_RAX
 	SAVE_C_REGS_EXCEPT_RAX_RCX_R11
 	movq	$-ENOSYS,RAX(%rsp)
-	movq_cfi rax,ORIG_RAX
-	movq	%r11,EFLAGS(%rsp)
-	movq	%rcx,RIP(%rsp)
 	CFI_REL_OFFSET rip,RIP
 	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
 	jnz tracesys
@@ -293,7 +289,7 @@ ret_from_sys_call:
 	CFI_REGISTER	rip,rcx
 	movq	EFLAGS(%rsp),%r11
 	/*CFI_REGISTER	rflags,r11*/
-	movq	PER_CPU_VAR(old_rsp), %rsp
+	movq	RSP(%rsp),%rsp
 	/*
 	 * 64bit SYSRET restores rip from rcx,
 	 * rflags from r11 (but RF and VM bits are forced to 0),
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 781861c..02a8720 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -177,7 +177,7 @@ void perf_get_regs_user(struct perf_regs *regs_user,
 		 * than just blindly copying user_regs.
 		 */
 		regs_user->abi = PERF_SAMPLE_REGS_ABI_64;
-		regs_user_copy->sp = this_cpu_read(old_rsp);
+		regs_user_copy->sp = user_regs->sp;
 		regs_user_copy->cs = __USER_CS;
 		regs_user_copy->ss = __USER_DS;
 		regs_user_copy->cx = -1;  /* usually contains garbage */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 1e393d2..e8c124a 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -602,6 +602,5 @@ long sys_arch_prctl(int code, unsigned long addr)
 
 unsigned long KSTK_ESP(struct task_struct *task)
 {
-	return (test_tsk_thread_flag(task, TIF_IA32)) ?
-			(task_pt_regs(task)->sp) : ((task)->thread.usersp);
+	return task_pt_regs(task)->sp;
 }
-- 
1.8.1.4



* Re: [PATCH 1/4] x86: save r11 into pt_regs->flags on SYSCALL64 fastpath
  2015-03-09 18:39 ` [PATCH 1/4] x86: save r11 into pt_regs->flags on SYSCALL64 fastpath Denys Vlasenko
@ 2015-03-09 20:02   ` Andy Lutomirski
  2015-03-16 12:04   ` [tip:x86/asm] x86/asm/entry/64: Save R11 into pt_regs-> flags " tip-bot for Denys Vlasenko
  1 sibling, 0 replies; 23+ messages in thread
From: Andy Lutomirski @ 2015-03-09 20:02 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Linus Torvalds, Steven Rostedt, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On Mon, Mar 9, 2015 at 11:39 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> Before this patch, r11 was saved in pt_regs->r11.
> This looks natural, but it requires messy shuffling to/from the iret frame
> whenever ptrace or e.g. sys_iopl wants to modify the flags - because
> that's how this register is used by SYSCALL/SYSRET.
>
> This patch saves r11 in pt_regs->flags,
> and uses that value for the SYSRET64 insn. Shuffling is eliminated.
>
> FIXUP/RESTORE_TOP_OF_STACK are simplified.
>
> stub_iopl is no longer needed: pt_regs->flags needs no fixing up.
>
> Testing shows that syscall fast path is ~54.3 ns before
> and after the patch (on 2.7 GHz Sandy Bridge CPU).

Acked-by: Andy Lutomirski <luto@amacapital.net>

>
> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
> CC: Linus Torvalds <torvalds@linux-foundation.org>
> CC: Steven Rostedt <rostedt@goodmis.org>
> CC: Ingo Molnar <mingo@kernel.org>
> CC: Borislav Petkov <bp@alien8.de>
> CC: "H. Peter Anvin" <hpa@zytor.com>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Frederic Weisbecker <fweisbec@gmail.com>
> CC: Alexei Starovoitov <ast@plumgrid.com>
> CC: Will Drewry <wad@chromium.org>
> CC: Kees Cook <keescook@chromium.org>
> CC: x86@kernel.org
> CC: linux-kernel@vger.kernel.org
> ---
>  arch/x86/include/asm/calling.h | 20 ++++++++++++++------
>  arch/x86/kernel/entry_64.S     | 24 +++++++++++-------------
>  2 files changed, 25 insertions(+), 19 deletions(-)
>
> diff --git a/arch/x86/include/asm/calling.h b/arch/x86/include/asm/calling.h
> index f1a962f..4b5f7bf 100644
> --- a/arch/x86/include/asm/calling.h
> +++ b/arch/x86/include/asm/calling.h
> @@ -95,9 +95,11 @@ For 32-bit we have the following conventions - kernel is built with
>         CFI_ADJUST_CFA_OFFSET 15*8+\addskip
>         .endm
>
> -       .macro SAVE_C_REGS_HELPER offset=0 rax=1 rcx=1 r8plus=1
> -       .if \r8plus
> +       .macro SAVE_C_REGS_HELPER offset=0 rax=1 rcx=1 r8910=1 r11=1
> +       .if \r11
>         movq_cfi r11, 6*8+\offset
> +       .endif
> +       .if \r8910
>         movq_cfi r10, 7*8+\offset
>         movq_cfi r9,  8*8+\offset
>         movq_cfi r8,  9*8+\offset
> @@ -113,16 +115,19 @@ For 32-bit we have the following conventions - kernel is built with
>         movq_cfi rdi, 14*8+\offset
>         .endm
>         .macro SAVE_C_REGS offset=0
> -       SAVE_C_REGS_HELPER \offset, 1, 1, 1
> +       SAVE_C_REGS_HELPER \offset, 1, 1, 1, 1
>         .endm
>         .macro SAVE_C_REGS_EXCEPT_RAX_RCX offset=0
> -       SAVE_C_REGS_HELPER \offset, 0, 0, 1
> +       SAVE_C_REGS_HELPER \offset, 0, 0, 1, 1
>         .endm
>         .macro SAVE_C_REGS_EXCEPT_R891011
> -       SAVE_C_REGS_HELPER 0, 1, 1, 0
> +       SAVE_C_REGS_HELPER 0, 1, 1, 0, 0
>         .endm
>         .macro SAVE_C_REGS_EXCEPT_RCX_R891011
> -       SAVE_C_REGS_HELPER 0, 1, 0, 0
> +       SAVE_C_REGS_HELPER 0, 1, 0, 0, 0
> +       .endm
> +       .macro SAVE_C_REGS_EXCEPT_RAX_RCX_R11
> +       SAVE_C_REGS_HELPER 0, 0, 0, 1, 0
>         .endm
>
>         .macro SAVE_EXTRA_REGS offset=0
> @@ -179,6 +184,9 @@ For 32-bit we have the following conventions - kernel is built with
>         .macro RESTORE_C_REGS_EXCEPT_R11
>         RESTORE_C_REGS_HELPER 1,1,0,1,1
>         .endm
> +       .macro RESTORE_C_REGS_EXCEPT_RCX_R11
> +       RESTORE_C_REGS_HELPER 1,0,0,1,1
> +       .endm
>         .macro RESTORE_RSI_RDI
>         RESTORE_C_REGS_HELPER 0,0,0,0,0
>         .endm
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 5117a2b..324200a 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -121,14 +121,12 @@ ENDPROC(native_usergs_sysret64)
>  #endif
>
>  /*
> - * C code is not supposed to know about undefined top of stack. Every time
> - * a C function with an pt_regs argument is called from the SYSCALL based
> - * fast path FIXUP_TOP_OF_STACK is needed.
> + * C code is not supposed to know that the iret frame is not populated.
> + * Every time a C function with an pt_regs argument is called from
> + * the SYSCALL based fast path FIXUP_TOP_OF_STACK is needed.
>   * RESTORE_TOP_OF_STACK syncs the syscall state after any possible ptregs
>   * manipulation.
>   */
> -
> -       /* %rsp:at FRAMEEND */
>         .macro FIXUP_TOP_OF_STACK tmp offset=0
>         movq PER_CPU_VAR(old_rsp),\tmp
>         movq \tmp,RSP+\offset(%rsp)
> @@ -136,15 +134,13 @@ ENDPROC(native_usergs_sysret64)
>         movq $__USER_CS,CS+\offset(%rsp)
>         movq RIP+\offset(%rsp),\tmp  /* get rip */
>         movq \tmp,RCX+\offset(%rsp)  /* copy it to rcx as sysret would do */
> -       movq R11+\offset(%rsp),\tmp  /* get eflags */
> -       movq \tmp,EFLAGS+\offset(%rsp)
> +       movq EFLAGS+\offset(%rsp),\tmp /* ditto for rflags->r11 */
> +       movq \tmp,R11+\offset(%rsp)
>         .endm
>
>         .macro RESTORE_TOP_OF_STACK tmp offset=0
>         movq RSP+\offset(%rsp),\tmp
>         movq \tmp,PER_CPU_VAR(old_rsp)
> -       movq EFLAGS+\offset(%rsp),\tmp
> -       movq \tmp,R11+\offset(%rsp)
>         .endm
>
>  /*
> @@ -257,9 +253,10 @@ GLOBAL(system_call_after_swapgs)
>          */
>         ENABLE_INTERRUPTS(CLBR_NONE)
>         ALLOC_PT_GPREGS_ON_STACK 8              /* +8: space for orig_ax */
> -       SAVE_C_REGS_EXCEPT_RAX_RCX
> +       SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>         movq    $-ENOSYS,RAX(%rsp)
>         movq_cfi rax,ORIG_RAX
> +       movq    %r11,EFLAGS(%rsp)
>         movq    %rcx,RIP(%rsp)
>         CFI_REL_OFFSET rip,RIP
>         testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
> @@ -277,7 +274,7 @@ system_call_fastpath:
>         movq %rax,RAX(%rsp)
>  /*
>   * Syscall return path ending with SYSRET (fast path)
> - * Has incomplete stack frame and undefined top of stack.
> + * Has incompletely filled pt_regs, iret frame is also incomplete.
>   */
>  ret_from_sys_call:
>         testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
> @@ -291,9 +288,10 @@ ret_from_sys_call:
>          * sysretq will re-enable interrupts:
>          */
>         TRACE_IRQS_ON
> -       RESTORE_C_REGS_EXCEPT_RCX
> -       movq RIP(%rsp),%rcx
> +       RESTORE_C_REGS_EXCEPT_RCX_R11
> +       movq    RIP(%rsp),%rcx
>         CFI_REGISTER    rip,rcx
> +       movq    EFLAGS(%rsp),%r11
>         /*CFI_REGISTER  rflags,r11*/
>         movq    PER_CPU_VAR(old_rsp), %rsp
>         /*
> --
> 1.8.1.4
>



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
  2015-03-09 18:39 ` [PATCH 3/4] x86: save user rsp in pt_regs->sp " Denys Vlasenko
@ 2015-03-09 20:11   ` Andy Lutomirski
  2015-03-09 20:32     ` Denys Vlasenko
  2015-03-10 12:51   ` Ingo Molnar
  2015-03-16 12:05   ` [tip:x86/asm] x86/asm/entry/64: Save user RSP in pt_regs-> sp " tip-bot for Denys Vlasenko
  2 siblings, 1 reply; 23+ messages in thread
From: Andy Lutomirski @ 2015-03-09 20:11 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Linus Torvalds, Steven Rostedt, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On Mon, Mar 9, 2015 at 11:39 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> PER_CPU(old_rsp) usage is simplified - now it is used only
> as temp storage, and userspace stack pointer is immediately stored
> in pt_regs->sp on syscall entry, instead of being used later,
> on syscall exit.
>
> Instead of PER_CPU(old_rsp) and task->thread.usersp, C code
> uses pt_regs->sp now.
>
> FIXUP/RESTORE_TOP_OF_STACK are simplified.
>
> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
> CC: Linus Torvalds <torvalds@linux-foundation.org>
> CC: Steven Rostedt <rostedt@goodmis.org>
> CC: Ingo Molnar <mingo@kernel.org>
> CC: Borislav Petkov <bp@alien8.de>
> CC: "H. Peter Anvin" <hpa@zytor.com>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Frederic Weisbecker <fweisbec@gmail.com>
> CC: Alexei Starovoitov <ast@plumgrid.com>
> CC: Will Drewry <wad@chromium.org>
> CC: Kees Cook <keescook@chromium.org>
> CC: x86@kernel.org
> CC: linux-kernel@vger.kernel.org

Looks correct.

> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>          */
>         ENABLE_INTERRUPTS(CLBR_NONE)
>         ALLOC_PT_GPREGS_ON_STACK 8              /* +8: space for orig_ax */
> +       movq    %rcx,RIP(%rsp)
> +       movq    PER_CPU_VAR(old_rsp),%rcx
> +       movq    %r11,EFLAGS(%rsp)
> +       movq    %rcx,RSP(%rsp)
> +       movq_cfi rax,ORIG_RAX
>         SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>         movq    $-ENOSYS,RAX(%rsp)
> -       movq_cfi rax,ORIG_RAX
> -       movq    %r11,EFLAGS(%rsp)
> -       movq    %rcx,RIP(%rsp)

Why the reordering?

--Andy


* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
  2015-03-09 20:11   ` Andy Lutomirski
@ 2015-03-09 20:32     ` Denys Vlasenko
  2015-03-09 20:43       ` Andy Lutomirski
  0 siblings, 1 reply; 23+ messages in thread
From: Denys Vlasenko @ 2015-03-09 20:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, linux-kernel

On Mon, Mar 9, 2015 at 9:11 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>>          */
>>         ENABLE_INTERRUPTS(CLBR_NONE)
>>         ALLOC_PT_GPREGS_ON_STACK 8              /* +8: space for orig_ax */
>> +       movq    %rcx,RIP(%rsp)
>> +       movq    PER_CPU_VAR(old_rsp),%rcx
>> +       movq    %r11,EFLAGS(%rsp)
>> +       movq    %rcx,RSP(%rsp)
>> +       movq_cfi rax,ORIG_RAX
>>         SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>>         movq    $-ENOSYS,RAX(%rsp)
>> -       movq_cfi rax,ORIG_RAX
>> -       movq    %r11,EFLAGS(%rsp)
>> -       movq    %rcx,RIP(%rsp)
>
> Why the reordering?

No strong reason.

The iret stack is "above" the rest of pt_regs.

This does not matter now, but when/if we convert to PUSHes
for register saving, the pushes which build the iret frame
will have to come before the "save C-clobbered registers" part,
exactly as in this patch.
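
For illustration, a rough sketch of the layout in question (slot names as
in asm/calling.h; a comment-only sketch, not code from the patch):

	/*
	 * pt_regs on the kernel stack, higher addresses first:
	 *
	 *	SS, RSP, EFLAGS, CS, RIP   iret frame, "above" the rest
	 *	ORIG_RAX                   syscall number
	 *	RDI..RAX, R8..R11          C-clobbered registers
	 *	RBX, RBP, R12..R15         callee-saved ("extra") registers
	 *	                        <- %rsp after ALLOC_PT_GPREGS_ON_STACK
	 */

So a frame built with pushes has to start with the iret frame slots.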


* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
  2015-03-09 20:32     ` Denys Vlasenko
@ 2015-03-09 20:43       ` Andy Lutomirski
  0 siblings, 0 replies; 23+ messages in thread
From: Andy Lutomirski @ 2015-03-09 20:43 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, linux-kernel

On Mon, Mar 9, 2015 at 1:32 PM, Denys Vlasenko <vda.linux@googlemail.com> wrote:
> On Mon, Mar 9, 2015 at 9:11 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>>>          */
>>>         ENABLE_INTERRUPTS(CLBR_NONE)
>>>         ALLOC_PT_GPREGS_ON_STACK 8              /* +8: space for orig_ax */
>>> +       movq    %rcx,RIP(%rsp)
>>> +       movq    PER_CPU_VAR(old_rsp),%rcx
>>> +       movq    %r11,EFLAGS(%rsp)
>>> +       movq    %rcx,RSP(%rsp)
>>> +       movq_cfi rax,ORIG_RAX
>>>         SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>>>         movq    $-ENOSYS,RAX(%rsp)
>>> -       movq_cfi rax,ORIG_RAX
>>> -       movq    %r11,EFLAGS(%rsp)
>>> -       movq    %rcx,RIP(%rsp)
>>
>> Why the reordering?
>
> No strong reason.
>
> The iret stack is "above" the rest of pt_regs.
>
> This does not matter now, but when/if we convert to PUSHes
> for register saving, the pushes which build the iret frame
> will have to come before the "save C-clobbered registers" part,
> exactly as in this patch.

Fair enough.

Acked-by: Andy Lutomirski <luto@amacapital.net>

-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH 0/4 v2] x86: entry_64.S: steps towards simpler iret frame handling
  2015-03-09 18:39 [PATCH 0/4 v2] x86: entry_64.S: steps towards simpler iret frame handling Denys Vlasenko
  2015-03-09 18:39 ` [PATCH 1/4] x86: save r11 into pt_regs->flags on SYSCALL64 fastpath Denys Vlasenko
  2015-03-09 18:39 ` [PATCH 3/4] x86: save user rsp in pt_regs->sp " Denys Vlasenko
@ 2015-03-10  6:00 ` Ingo Molnar
  2 siblings, 0 replies; 23+ messages in thread
From: Ingo Molnar @ 2015-03-10  6:00 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andy Lutomirski, Linus Torvalds, Steven Rostedt, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, x86, linux-kernel


* Denys Vlasenko <dvlasenk@redhat.com> wrote:

> These changes make the SYSCALL64 code path save the flags and the user's
> stack pointer in pt_regs->flags and pt_regs->sp, where they belong.
> 
> As a result, we can drop stub_iopl() and thread_struct::usersp.
> 
> Usage of PER_CPU(old_rsp) is reduced to the bare minimum.
> 
> The FIXUP/RESTORE_TOP_OF_STACK macros are on a diet too.
> 
> Changes since v1: on Ingo's request, split into 4 patches instead of 2.

Unfortunately I got only 1/4 and 3/4, the other two patches are 
missing. They are not in my lkml and spam folders either.

Note that the missing mails:

> Denys Vlasenko (4):
>   x86: save r11 into pt_regs->flags on SYSCALL64 fastpath
>   Remove stub_iopl
>   x86: save user %rsp in pt_regs->sp on SYSCALL64 fastpath
>   Remove unused stuff

2/4 and 4/4 have rather short titles, so maybe some spam trap ate
them? Prefix them with 'x86: '?

Weird.

Thanks,

	Ingo


* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
  2015-03-09 18:39 ` [PATCH 3/4] x86: save user rsp in pt_regs->sp " Denys Vlasenko
  2015-03-09 20:11   ` Andy Lutomirski
@ 2015-03-10 12:51   ` Ingo Molnar
  2015-03-10 13:10     ` Andy Lutomirski
  2015-03-10 13:18     ` Denys Vlasenko
  2015-03-16 12:05   ` [tip:x86/asm] x86/asm/entry/64: Save user RSP in pt_regs-> sp " tip-bot for Denys Vlasenko
  2 siblings, 2 replies; 23+ messages in thread
From: Ingo Molnar @ 2015-03-10 12:51 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andy Lutomirski, Linus Torvalds, Steven Rostedt, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, x86, linux-kernel


* Denys Vlasenko <dvlasenk@redhat.com> wrote:

> PER_CPU(old_rsp) usage is simplified - now it is used only
> as temp storage, and userspace stack pointer is immediately stored
> in pt_regs->sp on syscall entry, instead of being used later,
> on syscall exit.
> 
> Instead of PER_CPU(old_rsp) and task->thread.usersp, C code
> uses pt_regs->sp now.
> 
> FIXUP/RESTORE_TOP_OF_STACK are simplified.

Just trying to judge the performance impact:

> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
>   * manipulation.
>   */
>  	.macro FIXUP_TOP_OF_STACK tmp offset=0
> -	movq PER_CPU_VAR(old_rsp),\tmp
> -	movq \tmp,RSP+\offset(%rsp)
>  	movq $__USER_DS,SS+\offset(%rsp)
>  	movq $__USER_CS,CS+\offset(%rsp)
>  	movq RIP+\offset(%rsp),\tmp  /* get rip */
> @@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64)
>  	.endm
>  
>  	.macro RESTORE_TOP_OF_STACK tmp offset=0
> -	movq RSP+\offset(%rsp),\tmp
> -	movq \tmp,PER_CPU_VAR(old_rsp)
> +	/* nothing to do */
>  	.endm
>  
>  /*
> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>  	 */
>  	ENABLE_INTERRUPTS(CLBR_NONE)
>  	ALLOC_PT_GPREGS_ON_STACK 8		/* +8: space for orig_ax */
> +	movq	%rcx,RIP(%rsp)
> +	movq	PER_CPU_VAR(old_rsp),%rcx
> +	movq	%r11,EFLAGS(%rsp)
> +	movq	%rcx,RSP(%rsp)
> +	movq_cfi rax,ORIG_RAX
>  	SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>  	movq	$-ENOSYS,RAX(%rsp)
> -	movq_cfi rax,ORIG_RAX
> -	movq	%r11,EFLAGS(%rsp)
> -	movq	%rcx,RIP(%rsp)
>  	CFI_REL_OFFSET rip,RIP
>  	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
>  	jnz tracesys

So there are now +2 instructions (5 instead of 3) in the system_call 
path, but there are -2 instructions in the SYSRETQ path, so combined 
it's a wash performance-wise, right?

Thanks,

	Ingo


* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
  2015-03-10 12:51   ` Ingo Molnar
@ 2015-03-10 13:10     ` Andy Lutomirski
  2015-03-10 13:18     ` Denys Vlasenko
  1 sibling, 0 replies; 23+ messages in thread
From: Andy Lutomirski @ 2015-03-10 13:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On Tue, Mar 10, 2015 at 5:51 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Denys Vlasenko <dvlasenk@redhat.com> wrote:
>
>> PER_CPU(old_rsp) usage is simplified - now it is used only
>> as temp storage, and userspace stack pointer is immediately stored
>> in pt_regs->sp on syscall entry, instead of being used later,
>> on syscall exit.
>>
>> Instead of PER_CPU(old_rsp) and task->thread.usersp, C code
>> uses pt_regs->sp now.
>>
>> FIXUP/RESTORE_TOP_OF_STACK are simplified.
>
> Just trying to judge the performance impact:
>
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
>>   * manipulation.
>>   */
>>       .macro FIXUP_TOP_OF_STACK tmp offset=0
>> -     movq PER_CPU_VAR(old_rsp),\tmp
>> -     movq \tmp,RSP+\offset(%rsp)
>>       movq $__USER_DS,SS+\offset(%rsp)
>>       movq $__USER_CS,CS+\offset(%rsp)
>>       movq RIP+\offset(%rsp),\tmp  /* get rip */
>> @@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64)
>>       .endm
>>
>>       .macro RESTORE_TOP_OF_STACK tmp offset=0
>> -     movq RSP+\offset(%rsp),\tmp
>> -     movq \tmp,PER_CPU_VAR(old_rsp)
>> +     /* nothing to do */
>>       .endm
>>
>>  /*
>> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>>        */
>>       ENABLE_INTERRUPTS(CLBR_NONE)
>>       ALLOC_PT_GPREGS_ON_STACK 8              /* +8: space for orig_ax */
>> +     movq    %rcx,RIP(%rsp)
>> +     movq    PER_CPU_VAR(old_rsp),%rcx
>> +     movq    %r11,EFLAGS(%rsp)
>> +     movq    %rcx,RSP(%rsp)
>> +     movq_cfi rax,ORIG_RAX
>>       SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>>       movq    $-ENOSYS,RAX(%rsp)
>> -     movq_cfi rax,ORIG_RAX
>> -     movq    %r11,EFLAGS(%rsp)
>> -     movq    %rcx,RIP(%rsp)
>>       CFI_REL_OFFSET rip,RIP
>>       testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
>>       jnz tracesys
>
> So there are now +2 instructions (5 instead of 3) in the system_call
> path, but there are -2 instructions in the SYSRETQ path, so combined
> it's a wash performance-wise, right?

I think that the SYSRETQ path is the same number of instructions --
RESTORE_TOP_OF_STACK is only used in somewhat unusual circumstances.

On SYSRETQ, I think we touch one fewer cache line, although that cache
line will be in L1 unless the syscall pushed it out.  The added
instructions on entry are probably nearly free, though, as one of them
is a load from a forwarded address and the other is a store to a hot
cacheline.
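
For concreteness, the two added entry-path instructions being discussed
(taken from the hunk above; the annotations are my reading, not Denys's):

	movq	PER_CPU_VAR(old_rsp),%rcx	# load; the prologue stored
						# %rsp here, so it's likely
						# store-forwarded
	movq	%rcx,RSP(%rsp)			# store into the hot pt_regs
						# cacheline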

--Andy


* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
  2015-03-10 12:51   ` Ingo Molnar
  2015-03-10 13:10     ` Andy Lutomirski
@ 2015-03-10 13:18     ` Denys Vlasenko
  2015-03-10 13:20       ` Andy Lutomirski
  2015-03-10 13:21       ` Ingo Molnar
  1 sibling, 2 replies; 23+ messages in thread
From: Denys Vlasenko @ 2015-03-10 13:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Linus Torvalds, Steven Rostedt, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, x86, linux-kernel

On 03/10/2015 01:51 PM, Ingo Molnar wrote:
> 
> * Denys Vlasenko <dvlasenk@redhat.com> wrote:
> 
>> PER_CPU(old_rsp) usage is simplified - now it is used only
>> as temp storage, and userspace stack pointer is immediately stored
>> in pt_regs->sp on syscall entry, instead of being used later,
>> on syscall exit.
>>
>> Instead of PER_CPU(old_rsp) and task->thread.usersp, C code
>> uses pt_regs->sp now.
>>
>> FIXUP/RESTORE_TOP_OF_STACK are simplified.
> 
> Just trying to judge the performance impact:
> 
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
>>   * manipulation.
>>   */
>>  	.macro FIXUP_TOP_OF_STACK tmp offset=0
>> -	movq PER_CPU_VAR(old_rsp),\tmp
>> -	movq \tmp,RSP+\offset(%rsp)
>>  	movq $__USER_DS,SS+\offset(%rsp)
>>  	movq $__USER_CS,CS+\offset(%rsp)
>>  	movq RIP+\offset(%rsp),\tmp  /* get rip */
>> @@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64)
>>  	.endm
>>  
>>  	.macro RESTORE_TOP_OF_STACK tmp offset=0
>> -	movq RSP+\offset(%rsp),\tmp
>> -	movq \tmp,PER_CPU_VAR(old_rsp)
>> +	/* nothing to do */
>>  	.endm
>>  
>>  /*
>> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>>  	 */
>>  	ENABLE_INTERRUPTS(CLBR_NONE)
>>  	ALLOC_PT_GPREGS_ON_STACK 8		/* +8: space for orig_ax */
>> +	movq	%rcx,RIP(%rsp)
>> +	movq	PER_CPU_VAR(old_rsp),%rcx
>> +	movq	%r11,EFLAGS(%rsp)
>> +	movq	%rcx,RSP(%rsp)
>> +	movq_cfi rax,ORIG_RAX
>>  	SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>>  	movq	$-ENOSYS,RAX(%rsp)
>> -	movq_cfi rax,ORIG_RAX
>> -	movq	%r11,EFLAGS(%rsp)
>> -	movq	%rcx,RIP(%rsp)
>>  	CFI_REL_OFFSET rip,RIP
>>  	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
>>  	jnz tracesys
> 
> So there are now +2 instructions (5 instead of 3) in the system_call 
> path, but there are -2 instructions in the SYSRETQ path,

Unfortunately, no. There is only this change in SYSRETQ path,
which simply changes where we get RSP from:

@@ -293,7 +289,7 @@ ret_from_sys_call:
 	CFI_REGISTER	rip,rcx
 	movq	EFLAGS(%rsp),%r11
 	/*CFI_REGISTER	rflags,r11*/
-	movq	PER_CPU_VAR(old_rsp), %rsp
+	movq	RSP(%rsp),%rsp
 	/*
 	 * 64bit SYSRET restores rip from rcx,
 	 * rflags from r11 (but RF and VM bits are forced to 0),

Most likely, no change in execution speed here.
At best, it is one cycle faster somewhere in the address generation unit,
because for PER_CPU_VAR() address evaluation the GS base is nonzero.


Since this patch does add two extra MOVs,
I did benchmark these patches. They add exactly one cycle
to the system call code path on my Sandy Bridge CPU.
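
To make that concrete, the SYSRETQ-side change is between these two forms
(a sketch; on SMP builds PER_CPU_VAR(old_rsp) expands to a %gs-relative
access):

	movq	%gs:old_rsp, %rsp	# segment override, nonzero GS base
					# feeds address generation
	movq	RSP(%rsp), %rsp		# plain base+displacement load
					# from pt_regs->sp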



* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
  2015-03-10 13:18     ` Denys Vlasenko
@ 2015-03-10 13:20       ` Andy Lutomirski
  2015-03-10 13:26         ` Ingo Molnar
  2015-03-10 13:21       ` Ingo Molnar
  1 sibling, 1 reply; 23+ messages in thread
From: Andy Lutomirski @ 2015-03-10 13:20 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Ingo Molnar, Linus Torvalds, Steven Rostedt, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On Tue, Mar 10, 2015 at 6:18 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> On 03/10/2015 01:51 PM, Ingo Molnar wrote:
>>
>> * Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>
>>> PER_CPU(old_rsp) usage is simplified - now it is used only
>>> as temp storage, and userspace stack pointer is immediately stored
>>> in pt_regs->sp on syscall entry, instead of being used later,
>>> on syscall exit.
>>>
>>> Instead of PER_CPU(old_rsp) and task->thread.usersp, C code
>>> uses pt_regs->sp now.
>>>
>>> FIXUP/RESTORE_TOP_OF_STACK are simplified.
>>
>> Just trying to judge the performance impact:
>>
>>> --- a/arch/x86/kernel/entry_64.S
>>> +++ b/arch/x86/kernel/entry_64.S
>>> @@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
>>>   * manipulation.
>>>   */
>>>      .macro FIXUP_TOP_OF_STACK tmp offset=0
>>> -    movq PER_CPU_VAR(old_rsp),\tmp
>>> -    movq \tmp,RSP+\offset(%rsp)
>>>      movq $__USER_DS,SS+\offset(%rsp)
>>>      movq $__USER_CS,CS+\offset(%rsp)
>>>      movq RIP+\offset(%rsp),\tmp  /* get rip */
>>> @@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64)
>>>      .endm
>>>
>>>      .macro RESTORE_TOP_OF_STACK tmp offset=0
>>> -    movq RSP+\offset(%rsp),\tmp
>>> -    movq \tmp,PER_CPU_VAR(old_rsp)
>>> +    /* nothing to do */
>>>      .endm
>>>
>>>  /*
>>> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>>>       */
>>>      ENABLE_INTERRUPTS(CLBR_NONE)
>>>      ALLOC_PT_GPREGS_ON_STACK 8              /* +8: space for orig_ax */
>>> +    movq    %rcx,RIP(%rsp)
>>> +    movq    PER_CPU_VAR(old_rsp),%rcx
>>> +    movq    %r11,EFLAGS(%rsp)
>>> +    movq    %rcx,RSP(%rsp)
>>> +    movq_cfi rax,ORIG_RAX
>>>      SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>>>      movq    $-ENOSYS,RAX(%rsp)
>>> -    movq_cfi rax,ORIG_RAX
>>> -    movq    %r11,EFLAGS(%rsp)
>>> -    movq    %rcx,RIP(%rsp)
>>>      CFI_REL_OFFSET rip,RIP
>>>      testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
>>>      jnz tracesys
>>
>> So there are now +2 instructions (5 instead of 3) in the system_call
>> path, but there are -2 instructions in the SYSRETQ path,
>
> Unfortunately, no. There is only this change in SYSRETQ path,
> which simply changes where we get RSP from:
>
> @@ -293,7 +289,7 @@ ret_from_sys_call:
>         CFI_REGISTER    rip,rcx
>         movq    EFLAGS(%rsp),%r11
>         /*CFI_REGISTER  rflags,r11*/
> -       movq    PER_CPU_VAR(old_rsp), %rsp
> +       movq    RSP(%rsp),%rsp
>         /*
>          * 64bit SYSRET restores rip from rcx,
>          * rflags from r11 (but RF and VM bits are forced to 0),
>
> Most likely, no change in execution speed here.
> At best, it is one cycle faster somewhere in the address generation unit,
> because for PER_CPU_VAR() address evaluation the GS base is nonzero.
>
>
> Since this patch does add two extra MOVs,
> I did benchmark these patches. They add exactly one cycle
> to the system call code path on my Sandy Bridge CPU.
>

Personally, I'm willing to pay that cycle.  It could be a bigger
savings on context switch, and the simplification it enables is pretty
good.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
  2015-03-10 13:18     ` Denys Vlasenko
  2015-03-10 13:20       ` Andy Lutomirski
@ 2015-03-10 13:21       ` Ingo Molnar
  2015-03-10 13:26         ` Andy Lutomirski
                           ` (2 more replies)
  1 sibling, 3 replies; 23+ messages in thread
From: Ingo Molnar @ 2015-03-10 13:21 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andy Lutomirski, Linus Torvalds, Steven Rostedt, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, x86, linux-kernel


* Denys Vlasenko <dvlasenk@redhat.com> wrote:

> > So there are now +2 instructions (5 instead of 3) in the 
> > system_call path, but there are -2 instructions in the SYSRETQ 
> > path,
> 
> Unfortunately, no. [...]

So I assumed that it was an equivalent transformation, given that none 
of the changelogs spelled out the increase in overhead ...

> [...] There is only this change in SYSRETQ path, which simply 
> changes where we get RSP from:
> 
> @@ -293,7 +289,7 @@ ret_from_sys_call:
>  	CFI_REGISTER	rip,rcx
>  	movq	EFLAGS(%rsp),%r11
>  	/*CFI_REGISTER	rflags,r11*/
> -	movq	PER_CPU_VAR(old_rsp), %rsp
> +	movq	RSP(%rsp),%rsp
>  	/*
>  	 * 64bit SYSRET restores rip from rcx,
>  	 * rflags from r11 (but RF and VM bits are forced to 0),
> 
> Most likely, no change in execution speed here.
> At best, it is one cycle faster somewhere in the address generation unit,
> because for PER_CPU_VAR() address evaluation the GS base is nonzero.
> 
> Since this patch does add two extra MOVs,
> I did benchmark these patches. They add exactly one cycle
> to the system call code path on my Sandy Bridge CPU.

Hm, but that's the wrong direction, we should try to make it faster, 
and to clean it up - but making it slower without really good reasons 
isn't good.

Is 'usersp' really that much of a complication?

Thanks,

	Ingo


* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
  2015-03-10 13:20       ` Andy Lutomirski
@ 2015-03-10 13:26         ` Ingo Molnar
  0 siblings, 0 replies; 23+ messages in thread
From: Ingo Molnar @ 2015-03-10 13:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel


* Andy Lutomirski <luto@amacapital.net> wrote:

> > Since this patch does add two extra MOVs,
> > I did benchmark these patches. They add exactly one cycle
> > to system call code path on my Sandy Bridge CPU.
> 
> Personally, I'm willing to pay that cycle.  It could be a bigger 
> savings on context switch, and the simplification it enables is 
> pretty good.

But, but ... context switches are a relatively slow path, compared to
system calls. And I say this with the scheduler maintainer hat on as 
well.

So this is not a good bargain IMHO, assuming it's not some _huge_ 
difference in maintainability - but having an extra percpu field
isn't really much of a problem.

I don't claim that we couldn't in some other situation decide that a 
certain type of speedup isn't worth it - but what's the big problem 
here? A bit of arithmetic shouldn't be a problem?

Thanks,

	Ingo


* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
  2015-03-10 13:21       ` Ingo Molnar
@ 2015-03-10 13:26         ` Andy Lutomirski
  2015-03-10 14:00           ` Denys Vlasenko
  2015-03-10 13:28         ` Ingo Molnar
  2015-03-10 13:50         ` Denys Vlasenko
  2 siblings, 1 reply; 23+ messages in thread
From: Andy Lutomirski @ 2015-03-10 13:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On Tue, Mar 10, 2015 at 6:21 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Denys Vlasenko <dvlasenk@redhat.com> wrote:
>
>> > So there are now +2 instructions (5 instead of 3) in the
>> > system_call path, but there are -2 instructions in the SYSRETQ
>> > path,
>>
>> Unfortunately, no. [...]
>
> So I assumed that it was an equivalent transformation, given that none
> of the changelogs spelled out the increase in overhead ...
>
>> [...] There is only this change in SYSRETQ path, which simply
>> changes where we get RSP from:
>>
>> @@ -293,7 +289,7 @@ ret_from_sys_call:
>>       CFI_REGISTER    rip,rcx
>>       movq    EFLAGS(%rsp),%r11
>>       /*CFI_REGISTER  rflags,r11*/
>> -     movq    PER_CPU_VAR(old_rsp), %rsp
>> +     movq    RSP(%rsp),%rsp
>>       /*
>>        * 64bit SYSRET restores rip from rcx,
>>        * rflags from r11 (but RF and VM bits are forced to 0),
>>
>> Most likely, no change in execution speed here.
>> At best, it is one cycle faster somewhere in address generation unit
>> because for PER_CPU_VAR() address evaluation, GS base is nonzero.
>>
>> Since this patch does add two extra MOVs,
>> I did benchmark these patches. They add exactly one cycle
>> to system call code path on my Sandy Bridge CPU.
>
> Hm, but that's the wrong direction, we should try to make it faster,
> and to clean it up - but making it slower without really good reasons
> isn't good.
>
> Is 'usersp' really that much of a complication?

usersp is IMO tolerable.  The nasty thing is the FIXUP_TOP_OF_STACK /
RESTORE_TOP_OF_STACK garbage, and this patch is the main step toward
killing that off completely.  I've still never convinced myself that
there aren't ptrace-related info leaks in there.

Denys, did you ever benchmark what happens if we use push instead of
mov?  I bet that we get that cycle back and more, not to mention much
less icache usage.

The reason I think that is that this patch changes us from writing
nothing to the RIP slot to writing something there.  If we push our
stack frame instead of moving it into place, we must write to every
slot, and writing the right value is probably no more expensive than
writing, say, zero (the load / forwarding units are unlikely to be
very busy at that point).  So this change could actually be a step
toward getting faster.
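
For illustration, a sketch of what the push-built frame could look like
(not actual entry_64.S code; the slot comments mark where each push lands):

	pushq	$__USER_DS		/* pt_regs->ss */
	pushq	PER_CPU_VAR(old_rsp)	/* pt_regs->sp */
	pushq	%r11			/* pt_regs->flags */
	pushq	$__USER_CS		/* pt_regs->cs */
	pushq	%rcx			/* pt_regs->ip */
	pushq	%rax			/* pt_regs->orig_ax */

Every slot gets written as the frame is built, unlike the mov-based code,
which allocates the frame first and fills only selected slots.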

--Andy

>
> Thanks,
>
>         Ingo



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
  2015-03-10 13:21       ` Ingo Molnar
  2015-03-10 13:26         ` Andy Lutomirski
@ 2015-03-10 13:28         ` Ingo Molnar
  2015-03-10 13:50         ` Denys Vlasenko
  2 siblings, 0 replies; 23+ messages in thread
From: Ingo Molnar @ 2015-03-10 13:28 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andy Lutomirski, Linus Torvalds, Steven Rostedt, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, x86, linux-kernel


> * Denys Vlasenko <dvlasenk@redhat.com> wrote:
> 
> > > So there are now +2 instructions (5 instead of 3) in the 
> > > system_call path, but there are -2 instructions in the SYSRETQ 
> > > path,
> > 
> > Unfortunately, no. [...]
> 
> So I assumed that it was an equivalent transformation, given that 
> none of the changelogs spelled out the increase in overhead ...

Also, for future reference, you need to spell out performance costs of 
fast path patches very clearly - and that wasn't done here.

That's a big no-no, please don't do it again ...

Thanks,

	Ingo


* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
  2015-03-10 13:21       ` Ingo Molnar
  2015-03-10 13:26         ` Andy Lutomirski
  2015-03-10 13:28         ` Ingo Molnar
@ 2015-03-10 13:50         ` Denys Vlasenko
  2015-03-16  9:44           ` Ingo Molnar
  2 siblings, 1 reply; 23+ messages in thread
From: Denys Vlasenko @ 2015-03-10 13:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Denys Vlasenko, Andy Lutomirski, Linus Torvalds, Steven Rostedt,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, Linux Kernel Mailing List

On Tue, Mar 10, 2015 at 2:21 PM, Ingo Molnar <mingo@kernel.org> wrote:
>> Since this patch does add two extra MOVs,
>> I did benchmark these patches. They add exactly one cycle
>> to system call code path on my Sandy Bridge CPU.
>
> Hm, but that's the wrong direction, we should try to make it faster,
> and to clean it up

As you know, the goals of "faster" and "cleaner" are often mutually
exclusive. C'est la vie :(

entry.S seems to lean towards "faster" to the extent where it has become
a tangled maze of obscure optimizations.

One example is the mysterious and not at all obvious 8-byte "safety
padding" at the top of the 32-bit kernel stack. Before Andy stumbled
into it, it was not even clear that it was there;
I had to write a test patch to verify it.
There is a long-standing latent bug in the code where this padding
is not accounted for.

> - but making it slower without really good reasons isn't good.

The thinking here is that cleaning up entry.S is a good reason.

We won't do anything which would slow it down by, say, 5%,
but one cycle may be considered an acceptable loss.


* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
  2015-03-10 13:26         ` Andy Lutomirski
@ 2015-03-10 14:00           ` Denys Vlasenko
  2015-03-10 14:02             ` Andy Lutomirski
  0 siblings, 1 reply; 23+ messages in thread
From: Denys Vlasenko @ 2015-03-10 14:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Denys Vlasenko, Linus Torvalds, Steven Rostedt,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, linux-kernel

On Tue, Mar 10, 2015 at 2:26 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> usersp is IMO tolerable.  The nasty thing is the FIXUP_TOP_OF_STACK /
> RESTORE_TOP_OF_STACK garbage, and this patch is the main step toward
> killing that off completely.  I've still never convinced myself that
> there aren't ptrace-related info leaks in there.
>
> Denys, did you ever benchmark what happens if we use push instead of
> mov?  I bet that we get that cycle back and more, not to mention much
> less icache usage.

Yes, I did.
Push conversion seems to perform the same as the current, MOV-based code.

The expected win there is that we lose two huge 12-byte insns
which store __USER_CS and __USER_DS in the iret frame.

MOVQ imm,ofs(%rsp) has a very unfortunate encoding in x86:
- needs a REX prefix
- no sign-extending imm8 form exists for it
- ofs in our case can't fit into 8 bits
- (%rsp) requires a SIB byte

In my tests, each such instruction adds one cycle.

Compare this to PUSH imm8, which is 2 bytes only.
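
Spelled out (byte breakdown approximate, per the encoding constraints
listed above):

	movq	$__USER_DS, SS(%rsp)	# REX.W + opcode + ModRM + SIB
					# + disp32 + imm32 = 12 bytes
	pushq	$__USER_DS		# opcode + sign-extended imm8
					# = 2 bytes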


* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
  2015-03-10 14:00           ` Denys Vlasenko
@ 2015-03-10 14:02             ` Andy Lutomirski
  2015-03-10 14:09               ` Denys Vlasenko
  0 siblings, 1 reply; 23+ messages in thread
From: Andy Lutomirski @ 2015-03-10 14:02 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Ingo Molnar, Denys Vlasenko, Linus Torvalds, Steven Rostedt,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, linux-kernel

On Tue, Mar 10, 2015 at 7:00 AM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
> On Tue, Mar 10, 2015 at 2:26 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> usersp is IMO tolerable.  The nasty thing is the FIXUP_TOP_OF_STACK /
>> RESTORE_TOP_OF_STACK garbage, and this patch is the main step toward
>> killing that off completely.  I've still never convinced myself that
>> there aren't ptrace-related info leaks in there.
>>
>> Denys, did you ever benchmark what happens if we use push instead of
>> mov?  I bet that we get that cycle back and more, not to mention much
>> less icache usage.
>
> Yes, I did.
> Push conversion seems to perform the same as the current, MOV-based code.
>
> The expected win there is that we lose two huge 12-byte insns
> which store __USER_CS and __USER_DS in the iret frame.
>
> MOVQ imm,ofs(%rsp) has a very unfortunate encoding in x86:
> - needs a REX prefix
> - no sign-extending imm8 form exists for it
> - ofs in our case can't fit into 8 bits
> - (%rsp) requires a SIB byte
>
> In my tests, each such instruction adds one cycle.
>
> Compare this to PUSH imm8, which is 2 bytes only.

Does that mean that using push on top of this patch gets us our cycle back?

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
  2015-03-10 14:02             ` Andy Lutomirski
@ 2015-03-10 14:09               ` Denys Vlasenko
  0 siblings, 0 replies; 23+ messages in thread
From: Denys Vlasenko @ 2015-03-10 14:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Denys Vlasenko, Linus Torvalds, Steven Rostedt,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, linux-kernel

On Tue, Mar 10, 2015 at 3:02 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Mar 10, 2015 at 7:00 AM, Denys Vlasenko
> <vda.linux@googlemail.com> wrote:
>> On Tue, Mar 10, 2015 at 2:26 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> usersp is IMO tolerable.  The nasty thing is the FIXUP_TOP_OF_STACK /
>>> RESTORE_TOP_OF_STACK garbage, and this patch is the main step toward
>>> killing that off completely.  I've still never convinced myself that
>>> there aren't ptrace-related info leaks in there.
>>>
>>> Denys, did you ever benchmark what happens if we use push instead of
>>> mov?  I bet that we get that cycle back and more, not to mention much
>>> less icache usage.
>>
>> Yes, I did.
>> Push conversion seems to perform the same as the current, MOV-based code.
>>
>> The expected win there is that we lose two huge 12-byte insns
>> which store __USER_CS and __USER_DS in the iret frame.
>>
>> MOVQ imm,ofs(%rsp) has a very unfortunate encoding in x86:
>> - needs a REX prefix
>> - no sign-extending imm8 form exists for it
>> - ofs in our case can't fit into 8 bits
>> - (%rsp) requires a SIB byte
>>
>> In my tests, each such instruction adds one cycle.
>>
>> Compare this to PUSH imm8, which is 2 bytes only.
>
> Does that mean that using push on top of this patch gets us our cycle back?

Maybe. I can't be sure about it.

In general I see a jitter of 1-2, sometimes 3 cycles even when I make
changes which merely change code size (e.g. replacing equivalent insns).
This may be caused by jump targets getting aligned differently
wrt cacheline boundaries. If the second/third/fourth insn after the
current one is not fetched because it did not fit into the cacheline,
then some insn decoders don't get anything to chew on.


* Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
  2015-03-10 13:50         ` Denys Vlasenko
@ 2015-03-16  9:44           ` Ingo Molnar
  0 siblings, 0 replies; 23+ messages in thread
From: Ingo Molnar @ 2015-03-16  9:44 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Denys Vlasenko, Andy Lutomirski, Linus Torvalds, Steven Rostedt,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, Linux Kernel Mailing List


* Denys Vlasenko <vda.linux@googlemail.com> wrote:

> > - but making it slower without really good reasons isn't good.
> 
> The thinking here is that cleaning up entry.S is a good reason.
> 
> We won't do anything which would slow it down by, say, 5%,
> but one cycle may be considered an acceptable loss.

Ok, so I've applied this particular cleanup, in the hope that we can 
recover those cycles on the now cleaner base.

If that doesn't work out then we can still re-introduce this 
complication and see its maintainability in isolation, on a clean 
base.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [tip:x86/asm] x86/asm/entry/64: Save R11 into pt_regs->flags on SYSCALL64 fastpath
  2015-03-09 18:39 ` [PATCH 1/4] x86: save r11 into pt_regs->flags on SYSCALL64 fastpath Denys Vlasenko
  2015-03-09 20:02   ` Andy Lutomirski
@ 2015-03-16 12:04   ` tip-bot for Denys Vlasenko
  1 sibling, 0 replies; 23+ messages in thread
From: tip-bot for Denys Vlasenko @ 2015-03-16 12:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: dvlasenk, torvalds, keescook, fweisbec, oleg, rostedt, hpa,
	mingo, linux-kernel, ast, bp, luto, tglx, wad

Commit-ID:  29722cd4ef666705b2eda1c3ba44435488e509eb
Gitweb:     http://git.kernel.org/tip/29722cd4ef666705b2eda1c3ba44435488e509eb
Author:     Denys Vlasenko <dvlasenk@redhat.com>
AuthorDate: Mon, 9 Mar 2015 19:39:21 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 10 Mar 2015 13:56:10 +0100

x86/asm/entry/64: Save R11 into pt_regs->flags on SYSCALL64 fastpath

Before this patch, R11 was saved in pt_regs->r11.

That looks natural, but it requires messy shuffling to/from the iret
frame whenever ptrace or e.g. sys_iopl() wants to modify flags -
because that's how this register is used by SYSCALL/SYSRET.

This patch saves R11 in pt_regs->flags, and uses that value for
the SYSRET64 instruction. Shuffling is eliminated.

FIXUP/RESTORE_TOP_OF_STACK are simplified.

stub_iopl is no longer needed: pt_regs->flags needs no fixing up.

Testing shows that the syscall fast path costs ~54.3 ns both before
and after the patch (on a 2.7 GHz Sandy Bridge CPU).

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1425926364-9526-2-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/calling.h | 20 ++++++++++++++------
 arch/x86/kernel/entry_64.S     | 24 +++++++++++-------------
 2 files changed, 25 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/calling.h b/arch/x86/include/asm/calling.h
index f1a962f..4b5f7bf 100644
--- a/arch/x86/include/asm/calling.h
+++ b/arch/x86/include/asm/calling.h
@@ -95,9 +95,11 @@ For 32-bit we have the following conventions - kernel is built with
 	CFI_ADJUST_CFA_OFFSET 15*8+\addskip
 	.endm
 
-	.macro SAVE_C_REGS_HELPER offset=0 rax=1 rcx=1 r8plus=1
-	.if \r8plus
+	.macro SAVE_C_REGS_HELPER offset=0 rax=1 rcx=1 r8910=1 r11=1
+	.if \r11
 	movq_cfi r11, 6*8+\offset
+	.endif
+	.if \r8910
 	movq_cfi r10, 7*8+\offset
 	movq_cfi r9,  8*8+\offset
 	movq_cfi r8,  9*8+\offset
@@ -113,16 +115,19 @@ For 32-bit we have the following conventions - kernel is built with
 	movq_cfi rdi, 14*8+\offset
 	.endm
 	.macro SAVE_C_REGS offset=0
-	SAVE_C_REGS_HELPER \offset, 1, 1, 1
+	SAVE_C_REGS_HELPER \offset, 1, 1, 1, 1
 	.endm
 	.macro SAVE_C_REGS_EXCEPT_RAX_RCX offset=0
-	SAVE_C_REGS_HELPER \offset, 0, 0, 1
+	SAVE_C_REGS_HELPER \offset, 0, 0, 1, 1
 	.endm
 	.macro SAVE_C_REGS_EXCEPT_R891011
-	SAVE_C_REGS_HELPER 0, 1, 1, 0
+	SAVE_C_REGS_HELPER 0, 1, 1, 0, 0
 	.endm
 	.macro SAVE_C_REGS_EXCEPT_RCX_R891011
-	SAVE_C_REGS_HELPER 0, 1, 0, 0
+	SAVE_C_REGS_HELPER 0, 1, 0, 0, 0
+	.endm
+	.macro SAVE_C_REGS_EXCEPT_RAX_RCX_R11
+	SAVE_C_REGS_HELPER 0, 0, 0, 1, 0
 	.endm
 
 	.macro SAVE_EXTRA_REGS offset=0
@@ -179,6 +184,9 @@ For 32-bit we have the following conventions - kernel is built with
 	.macro RESTORE_C_REGS_EXCEPT_R11
 	RESTORE_C_REGS_HELPER 1,1,0,1,1
 	.endm
+	.macro RESTORE_C_REGS_EXCEPT_RCX_R11
+	RESTORE_C_REGS_HELPER 1,0,0,1,1
+	.endm
 	.macro RESTORE_RSI_RDI
 	RESTORE_C_REGS_HELPER 0,0,0,0,0
 	.endm
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 5117a2b..324200a 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -121,14 +121,12 @@ ENDPROC(native_usergs_sysret64)
 #endif
 
 /*
- * C code is not supposed to know about undefined top of stack. Every time
- * a C function with an pt_regs argument is called from the SYSCALL based
- * fast path FIXUP_TOP_OF_STACK is needed.
+ * C code is not supposed to know that the iret frame is not populated.
+ * Every time a C function with a pt_regs argument is called from
+ * the SYSCALL based fast path FIXUP_TOP_OF_STACK is needed.
  * RESTORE_TOP_OF_STACK syncs the syscall state after any possible ptregs
  * manipulation.
  */
-
-	/* %rsp:at FRAMEEND */
 	.macro FIXUP_TOP_OF_STACK tmp offset=0
 	movq PER_CPU_VAR(old_rsp),\tmp
 	movq \tmp,RSP+\offset(%rsp)
@@ -136,15 +134,13 @@ ENDPROC(native_usergs_sysret64)
 	movq $__USER_CS,CS+\offset(%rsp)
 	movq RIP+\offset(%rsp),\tmp  /* get rip */
 	movq \tmp,RCX+\offset(%rsp)  /* copy it to rcx as sysret would do */
-	movq R11+\offset(%rsp),\tmp  /* get eflags */
-	movq \tmp,EFLAGS+\offset(%rsp)
+	movq EFLAGS+\offset(%rsp),\tmp /* ditto for rflags->r11 */
+	movq \tmp,R11+\offset(%rsp)
 	.endm
 
 	.macro RESTORE_TOP_OF_STACK tmp offset=0
 	movq RSP+\offset(%rsp),\tmp
 	movq \tmp,PER_CPU_VAR(old_rsp)
-	movq EFLAGS+\offset(%rsp),\tmp
-	movq \tmp,R11+\offset(%rsp)
 	.endm
 
 /*
@@ -257,9 +253,10 @@ GLOBAL(system_call_after_swapgs)
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	ALLOC_PT_GPREGS_ON_STACK 8		/* +8: space for orig_ax */
-	SAVE_C_REGS_EXCEPT_RAX_RCX
+	SAVE_C_REGS_EXCEPT_RAX_RCX_R11
 	movq	$-ENOSYS,RAX(%rsp)
 	movq_cfi rax,ORIG_RAX
+	movq	%r11,EFLAGS(%rsp)
 	movq	%rcx,RIP(%rsp)
 	CFI_REL_OFFSET rip,RIP
 	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
@@ -277,7 +274,7 @@ system_call_fastpath:
 	movq %rax,RAX(%rsp)
 /*
  * Syscall return path ending with SYSRET (fast path)
- * Has incomplete stack frame and undefined top of stack.
+ * Has incompletely filled pt_regs, iret frame is also incomplete.
  */
 ret_from_sys_call:
 	testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
@@ -291,9 +288,10 @@ ret_from_sys_call:
 	 * sysretq will re-enable interrupts:
 	 */
 	TRACE_IRQS_ON
-	RESTORE_C_REGS_EXCEPT_RCX
-	movq RIP(%rsp),%rcx
+	RESTORE_C_REGS_EXCEPT_RCX_R11
+	movq	RIP(%rsp),%rcx
 	CFI_REGISTER	rip,rcx
+	movq	EFLAGS(%rsp),%r11
 	/*CFI_REGISTER	rflags,r11*/
 	movq	PER_CPU_VAR(old_rsp), %rsp
 	/*

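For readers without a tree at hand, the slot names used in the macros
above (R11, EFLAGS, RIP, ...) are byte offsets into struct pt_regs; an
abridged sketch of the x86-64 layout (paraphrased from that era's
arch/x86/include/asm/ptrace.h, comments added here):

	struct pt_regs {
		unsigned long r15, r14, r13, r12;
		unsigned long bp, bx;
		unsigned long r11;	/* 6*8 from the base; not written
					   on the fast path after this patch */
		unsigned long r10, r9, r8;
		unsigned long ax, cx, dx, si, di;
		unsigned long orig_ax;
		/* below: the hardware iret frame */
		unsigned long ip;
		unsigned long cs;
		unsigned long flags;	/* SYSCALL's R11 (user rflags)
					   now lives here */
		unsigned long sp;
		unsigned long ss;
	};
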
^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [tip:x86/asm] x86/asm/entry/64: Save user RSP in pt_regs->sp on SYSCALL64 fastpath
  2015-03-09 18:39 ` [PATCH 3/4] x86: save user rsp in pt_regs->sp " Denys Vlasenko
  2015-03-09 20:11   ` Andy Lutomirski
  2015-03-10 12:51   ` Ingo Molnar
@ 2015-03-16 12:05   ` tip-bot for Denys Vlasenko
  2 siblings, 0 replies; 23+ messages in thread
From: tip-bot for Denys Vlasenko @ 2015-03-16 12:05 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, torvalds, rostedt, ast, tglx, wad, keescook, hpa,
	fweisbec, bp, mingo, dvlasenk, oleg, luto

Commit-ID:  263042e4630a85e856b4a8cd72f28dab33ef4741
Gitweb:     http://git.kernel.org/tip/263042e4630a85e856b4a8cd72f28dab33ef4741
Author:     Denys Vlasenko <dvlasenk@redhat.com>
AuthorDate: Mon, 9 Mar 2015 19:39:23 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 10 Mar 2015 13:56:10 +0100

x86/asm/entry/64: Save user RSP in pt_regs->sp on SYSCALL64 fastpath

Prepare for the removal of 'usersp' by simplifying PER_CPU(old_rsp) usage:

  - use it only as temp storage

  - store the userspace stack pointer immediately in pt_regs->sp
    on syscall entry, instead of using it later, on syscall exit.

  - change C code to use pt_regs->sp only, instead of PER_CPU(old_rsp)
    and task->thread.usersp.

FIXUP/RESTORE_TOP_OF_STACK are simplified as well.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1425926364-9526-4-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/compat.h |  2 +-
 arch/x86/include/asm/ptrace.h |  8 ++------
 arch/x86/kernel/entry_64.S    | 18 +++++++-----------
 arch/x86/kernel/perf_regs.c   |  2 +-
 arch/x86/kernel/process_64.c  |  3 +--
 5 files changed, 12 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/compat.h b/arch/x86/include/asm/compat.h
index 59c6c40..acdee09 100644
--- a/arch/x86/include/asm/compat.h
+++ b/arch/x86/include/asm/compat.h
@@ -301,7 +301,7 @@ static inline void __user *arch_compat_alloc_user_space(long len)
 		sp = task_pt_regs(current)->sp;
 	} else {
 		/* -128 for the x32 ABI redzone */
-		sp = this_cpu_read(old_rsp) - 128;
+		sp = task_pt_regs(current)->sp - 128;
 	}
 
 	return (void __user *)round_down(sp - len, 16);
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 4077d96..74bb2e0 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -145,12 +145,8 @@ static inline bool user_64bit_mode(struct pt_regs *regs)
 #endif
 }
 
-#define current_user_stack_pointer()	this_cpu_read(old_rsp)
-/* ia32 vs. x32 difference */
-#define compat_user_stack_pointer()	\
-	(test_thread_flag(TIF_IA32) 	\
-	 ? current_pt_regs()->sp 	\
-	 : this_cpu_read(old_rsp))
+#define current_user_stack_pointer()	current_pt_regs()->sp
+#define compat_user_stack_pointer()	current_pt_regs()->sp
 #endif
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 703ced0..d86788c 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
  * manipulation.
  */
 	.macro FIXUP_TOP_OF_STACK tmp offset=0
-	movq PER_CPU_VAR(old_rsp),\tmp
-	movq \tmp,RSP+\offset(%rsp)
 	movq $__USER_DS,SS+\offset(%rsp)
 	movq $__USER_CS,CS+\offset(%rsp)
 	movq RIP+\offset(%rsp),\tmp  /* get rip */
@@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64)
 	.endm
 
 	.macro RESTORE_TOP_OF_STACK tmp offset=0
-	movq RSP+\offset(%rsp),\tmp
-	movq \tmp,PER_CPU_VAR(old_rsp)
+	/* nothing to do */
 	.endm
 
 /*
@@ -222,9 +219,6 @@ ENDPROC(native_usergs_sysret64)
  * Interrupts are off on entry.
  * Only called from user space.
  *
- * XXX	if we had a free scratch register we could save the RSP into the stack frame
- *      and report it properly in ps. Unfortunately we haven't.
- *
  * When user can change the frames always force IRET. That is because
  * it deals with uncanonical addresses better. SYSRET has trouble
  * with them due to bugs in both AMD and Intel CPUs.
@@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	ALLOC_PT_GPREGS_ON_STACK 8		/* +8: space for orig_ax */
+	movq	%rcx,RIP(%rsp)
+	movq	PER_CPU_VAR(old_rsp),%rcx
+	movq	%r11,EFLAGS(%rsp)
+	movq	%rcx,RSP(%rsp)
+	movq_cfi rax,ORIG_RAX
 	SAVE_C_REGS_EXCEPT_RAX_RCX_R11
 	movq	$-ENOSYS,RAX(%rsp)
-	movq_cfi rax,ORIG_RAX
-	movq	%r11,EFLAGS(%rsp)
-	movq	%rcx,RIP(%rsp)
 	CFI_REL_OFFSET rip,RIP
 	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
 	jnz tracesys
@@ -293,7 +289,7 @@ ret_from_sys_call:
 	CFI_REGISTER	rip,rcx
 	movq	EFLAGS(%rsp),%r11
 	/*CFI_REGISTER	rflags,r11*/
-	movq	PER_CPU_VAR(old_rsp), %rsp
+	movq	RSP(%rsp),%rsp
 	/*
 	 * 64bit SYSRET restores rip from rcx,
 	 * rflags from r11 (but RF and VM bits are forced to 0),
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 781861c..02a8720 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -177,7 +177,7 @@ void perf_get_regs_user(struct perf_regs *regs_user,
 		 * than just blindly copying user_regs.
 		 */
 		regs_user->abi = PERF_SAMPLE_REGS_ABI_64;
-		regs_user_copy->sp = this_cpu_read(old_rsp);
+		regs_user_copy->sp = user_regs->sp;
 		regs_user_copy->cs = __USER_CS;
 		regs_user_copy->ss = __USER_DS;
 		regs_user_copy->cx = -1;  /* usually contains garbage */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 1e393d2..e8c124a 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -602,6 +602,5 @@ long sys_arch_prctl(int code, unsigned long addr)
 
 unsigned long KSTK_ESP(struct task_struct *task)
 {
-	return (test_tsk_thread_flag(task, TIF_IA32)) ?
-			(task_pt_regs(task)->sp) : ((task)->thread.usersp);
+	return task_pt_regs(task)->sp;
 }

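On the benchmarking theme running through this thread, a user-space
sketch of the kind of measurement being quoted (the getppid() choice,
the iteration counts and the use of __rdtsc() are assumptions of this
sketch, not taken from the thread; pinning to one CPU and fixing the
clock frequency are left to the runner):

	#include <stdio.h>
	#include <unistd.h>
	#include <x86intrin.h>	/* __rdtsc() on GCC/Clang, x86 only */

	int main(void)
	{
		enum { WARMUP = 1000, N = 1000000 };
		unsigned long long t0, t1;
		int i;

		for (i = 0; i < WARMUP; i++)	/* warm caches, predictors */
			getppid();

		t0 = __rdtsc();
		for (i = 0; i < N; i++)
			getppid();	/* cheap syscall: mostly entry/exit cost */
		t1 = __rdtsc();

		printf("%.1f cycles/syscall\n", (double)(t1 - t0) / N);
		return 0;
	}
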
^ permalink raw reply related	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2015-03-16 12:06 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-09 18:39 [PATCH 0/4 v2] x86: entry_64.S: steps towards simpler iret frame handling Denys Vlasenko
2015-03-09 18:39 ` [PATCH 1/4] x86: save r11 into pt_regs->flags on SYSCALL64 fastpath Denys Vlasenko
2015-03-09 20:02   ` Andy Lutomirski
2015-03-16 12:04   ` [tip:x86/asm] x86/asm/entry/64: Save R11 into pt_regs->flags " tip-bot for Denys Vlasenko
2015-03-09 18:39 ` [PATCH 3/4] x86: save user rsp in pt_regs->sp " Denys Vlasenko
2015-03-09 20:11   ` Andy Lutomirski
2015-03-09 20:32     ` Denys Vlasenko
2015-03-09 20:43       ` Andy Lutomirski
2015-03-10 12:51   ` Ingo Molnar
2015-03-10 13:10     ` Andy Lutomirski
2015-03-10 13:18     ` Denys Vlasenko
2015-03-10 13:20       ` Andy Lutomirski
2015-03-10 13:26         ` Ingo Molnar
2015-03-10 13:21       ` Ingo Molnar
2015-03-10 13:26         ` Andy Lutomirski
2015-03-10 14:00           ` Denys Vlasenko
2015-03-10 14:02             ` Andy Lutomirski
2015-03-10 14:09               ` Denys Vlasenko
2015-03-10 13:28         ` Ingo Molnar
2015-03-10 13:50         ` Denys Vlasenko
2015-03-16  9:44           ` Ingo Molnar
2015-03-16 12:05   ` [tip:x86/asm] x86/asm/entry/64: Save user RSP in pt_regs->sp " tip-bot for Denys Vlasenko
2015-03-10  6:00 ` [PATCH 0/4 v2] x86: entry_64.S: steps towards simpler iret frame handling Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).