* [PATCH 1/3] x86: get rid of KERNEL_STACK_OFFSET
@ 2015-03-18 19:47 Denys Vlasenko
  2015-03-18 19:47 ` [PATCH 2/3] x86: entry_64.S: use PUSH insns to build pt_regs on stack Denys Vlasenko
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Denys Vlasenko @ 2015-03-18 19:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	x86, linux-kernel

PER_CPU_VAR(kernel_stack) was set up so that it points
five stack slots below the top of the stack.

Presumably, this was done to avoid one "sub $5*8,%rsp"
in the syscall/sysenter code paths, where the iret frame needs
to be built by hand.

Ironically, none of those paths benefits from this optimization,
since all of them need to allocate additional data on the stack
(struct pt_regs), so they still have to perform a subtraction.

This patch eliminates KERNEL_STACK_OFFSET.

PER_CPU_VAR(kernel_stack) now points directly to the top of the stack.
pt_regs allocations are adjusted to allocate the iret frame as well.
Hopefully we can merge it later with the 32-bit-specific
PER_CPU_VAR(cpu_current_top_of_stack) variable...

The semi-mysterious expression THREAD_INFO(%rsp,RIP) - "why RIP??" -
is now replaced by the more logical THREAD_INFO(%rsp,SIZEOF_PTREGS) -
"calculate thread_info's address using the fact that
%rsp is SIZEOF_PTREGS bytes below the stack top".

The net result in generated code is that constants in several insns
change.

This change is a prerequisite for switching struct pt_regs creation
in the SYSCALL64 code path from MOV to PUSH instructions.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ingo Molnar <mingo@kernel.org>
CC: Borislav Petkov <bp@alien8.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: x86@kernel.org
CC: linux-kernel@vger.kernel.org
---
 arch/x86/ia32/ia32entry.S          | 34 +++++++++++++++++-----------------
 arch/x86/include/asm/thread_info.h |  5 ++---
 arch/x86/kernel/cpu/common.c       |  2 +-
 arch/x86/kernel/entry_64.S         |  9 ++++-----
 arch/x86/kernel/process_32.c       |  2 +-
 arch/x86/kernel/process_64.c       |  3 +--
 arch/x86/kernel/smpboot.c          |  3 +--
 arch/x86/xen/smp.c                 |  3 +--
 8 files changed, 28 insertions(+), 33 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index ad9efef..acbff3f 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -127,7 +127,7 @@ ENTRY(ia32_sysenter_target)
 	CFI_REL_OFFSET rsp,0
 	pushfq_cfi
 	/*CFI_REL_OFFSET rflags,0*/
-	movl	TI_sysenter_return+THREAD_INFO(%rsp,3*8-KERNEL_STACK_OFFSET),%r10d
+	movl	TI_sysenter_return+THREAD_INFO(%rsp,3*8),%r10d
 	CFI_REGISTER rip,r10
 	pushq_cfi $__USER32_CS
 	/*CFI_REL_OFFSET cs,0*/
@@ -159,8 +159,8 @@ ENTRY(ia32_sysenter_target)
 	jnz sysenter_fix_flags
 sysenter_flags_fixed:
 
-	orl     $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
-	testl   $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
+	orl     $TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS)
+	testl   $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	CFI_REMEMBER_STATE
 	jnz  sysenter_tracesys
 	cmpq	$(IA32_NR_syscalls-1),%rax
@@ -177,10 +177,10 @@ sysenter_dispatch:
 	movq	%rax,RAX(%rsp)
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
-	testl	$_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
+	testl	$_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	jnz	sysexit_audit
 sysexit_from_sys_call:
-	andl    $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
+	andl    $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	/* clear IF, that popfq doesn't enable interrupts early */
 	andl	$~0x200,EFLAGS(%rsp)
 	movl	RIP(%rsp),%edx		/* User %eip */
@@ -225,7 +225,7 @@ sysexit_from_sys_call:
 	.endm
 
 	.macro auditsys_exit exit
-	testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP)
+	testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	jnz ia32_ret_from_sys_call
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
@@ -240,7 +240,7 @@ sysexit_from_sys_call:
 	movl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),%edi
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
-	testl %edi,TI_flags+THREAD_INFO(%rsp,RIP)
+	testl %edi,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	jz \exit
 	CLEAR_RREGS
 	jmp int_with_check
@@ -262,7 +262,7 @@ sysenter_fix_flags:
 
 sysenter_tracesys:
 #ifdef CONFIG_AUDITSYSCALL
-	testl	$(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP)
+	testl	$(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	jz	sysenter_auditsys
 #endif
 	SAVE_EXTRA_REGS
@@ -311,7 +311,7 @@ ENDPROC(ia32_sysenter_target)
 ENTRY(ia32_cstar_target)
 	CFI_STARTPROC32	simple
 	CFI_SIGNAL_FRAME
-	CFI_DEF_CFA	rsp,KERNEL_STACK_OFFSET
+	CFI_DEF_CFA	rsp,0
 	CFI_REGISTER	rip,rcx
 	/*CFI_REGISTER	rflags,r11*/
 	SWAPGS_UNSAFE_STACK
@@ -323,7 +323,7 @@ ENTRY(ia32_cstar_target)
 	 * disabled irqs and here we enable it straight after entry:
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
-	ALLOC_PT_GPREGS_ON_STACK 8	/* +8: space for orig_ax */
+	ALLOC_PT_GPREGS_ON_STACK 6*8 /* 6*8: space for orig_ax and iret frame */
 	SAVE_C_REGS_EXCEPT_RCX_R891011
 	movl 	%eax,%eax	/* zero extension */
 	movq	%rax,ORIG_RAX(%rsp)
@@ -346,8 +346,8 @@ ENTRY(ia32_cstar_target)
 1:	movl	(%r8),%r9d
 	_ASM_EXTABLE(1b,ia32_badarg)
 	ASM_CLAC
-	orl     $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
-	testl   $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
+	orl     $TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS)
+	testl   $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	CFI_REMEMBER_STATE
 	jnz   cstar_tracesys
 	cmpq $IA32_NR_syscalls-1,%rax
@@ -364,10 +364,10 @@ cstar_dispatch:
 	movq %rax,RAX(%rsp)
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
-	testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
+	testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	jnz sysretl_audit
 sysretl_from_sys_call:
-	andl $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
+	andl $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	RESTORE_RSI_RDI_RDX
 	movl RIP(%rsp),%ecx
 	CFI_REGISTER rip,rcx
@@ -402,7 +402,7 @@ sysretl_audit:
 
 cstar_tracesys:
 #ifdef CONFIG_AUDITSYSCALL
-	testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP)
+	testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	jz cstar_auditsys
 #endif
 	xchgl %r9d,%ebp
@@ -469,8 +469,8 @@ ENTRY(ia32_syscall)
 	   this could be a problem. */
 	ALLOC_PT_GPREGS_ON_STACK
 	SAVE_C_REGS_EXCEPT_R891011
-	orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
-	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
+	orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS)
+	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	jnz ia32_tracesys
 	cmpq $(IA32_NR_syscalls-1),%rax
 	ja ia32_badsys
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index ba115eb..5381cb9 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -172,7 +172,6 @@ struct thread_info {
 #define _TIF_WORK_CTXSW_NEXT (_TIF_WORK_CTXSW)
 
 #define STACK_WARN		(THREAD_SIZE/8)
-#define KERNEL_STACK_OFFSET	(5*(BITS_PER_LONG/8))
 
 /*
  * macros/functions for gaining access to the thread information structure
@@ -204,13 +203,13 @@ static inline unsigned long current_stack_pointer(void)
 /* how to get the thread information struct from ASM */
 #define GET_THREAD_INFO(reg) \
 	_ASM_MOV PER_CPU_VAR(kernel_stack),reg ; \
-	_ASM_SUB $(THREAD_SIZE-KERNEL_STACK_OFFSET),reg ;
+	_ASM_SUB $(THREAD_SIZE),reg ;
 
 /*
  * Same if PER_CPU_VAR(kernel_stack) is, perhaps with some offset, already in
  * a certain register (to be used in assembler memory operands).
  */
-#define THREAD_INFO(reg, off) KERNEL_STACK_OFFSET+(off)-THREAD_SIZE(reg)
+#define THREAD_INFO(reg, off) (off)-THREAD_SIZE(reg)
 
 #endif
 
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index e45ba5e..34719f3 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1173,7 +1173,7 @@ static __init int setup_disablecpuid(char *arg)
 __setup("clearcpuid=", setup_disablecpuid);
 
 DEFINE_PER_CPU(unsigned long, kernel_stack) =
-	(unsigned long)&init_thread_union - KERNEL_STACK_OFFSET + THREAD_SIZE;
+	(unsigned long)&init_thread_union + THREAD_SIZE;
 EXPORT_PER_CPU_SYMBOL(kernel_stack);
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 0c91256..ecb68d8 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -225,7 +225,7 @@ ENDPROC(native_usergs_sysret64)
 ENTRY(system_call)
 	CFI_STARTPROC	simple
 	CFI_SIGNAL_FRAME
-	CFI_DEF_CFA	rsp,KERNEL_STACK_OFFSET
+	CFI_DEF_CFA	rsp,0
 	CFI_REGISTER	rip,rcx
 	/*CFI_REGISTER	rflags,r11*/
 	SWAPGS_UNSAFE_STACK
@@ -242,9 +242,8 @@ GLOBAL(system_call_after_swapgs)
 	 * so we can enable interrupts only after we're done with using rsp_scratch:
 	 */
 	movq	%rsp,PER_CPU_VAR(rsp_scratch)
-	/* kernel_stack is set so that 5 slots (iret frame) are preallocated */
 	movq	PER_CPU_VAR(kernel_stack),%rsp
-	ALLOC_PT_GPREGS_ON_STACK 8		/* +8: space for orig_ax */
+	ALLOC_PT_GPREGS_ON_STACK 6*8 /* 6*8: space for orig_ax and iret frame */
 	movq	%rcx,RIP(%rsp)
 	movq	PER_CPU_VAR(rsp_scratch),%rcx
 	movq	%r11,EFLAGS(%rsp)
@@ -258,7 +257,7 @@ GLOBAL(system_call_after_swapgs)
 	SAVE_C_REGS_EXCEPT_RAX_RCX_R11
 	movq	$-ENOSYS,RAX(%rsp)
 	CFI_REL_OFFSET rip,RIP
-	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
+	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	jnz tracesys
 system_call_fastpath:
 #if __SYSCALL_MASK == ~0
@@ -276,7 +275,7 @@ system_call_fastpath:
  * Has incompletely filled pt_regs, iret frame is also incomplete.
  */
 ret_from_sys_call:
-	testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
+	testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	jnz int_ret_from_sys_call_fixup	/* Go the the slow path */
 
 	LOCKDEP_SYS_EXIT
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 1b9963f..5a4c2f8 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -312,7 +312,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	load_sp0(tss, next);
 	this_cpu_write(kernel_stack,
 		       (unsigned long)task_stack_page(next_p) +
-		       THREAD_SIZE - KERNEL_STACK_OFFSET);
+		       THREAD_SIZE);
 	this_cpu_write(cpu_current_top_of_stack,
 		       (unsigned long)task_stack_page(next_p) +
 		       THREAD_SIZE);
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 97f5658..db49063 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -409,8 +409,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	load_sp0(tss, next);
 
 	this_cpu_write(kernel_stack,
-		  (unsigned long)task_stack_page(next_p) +
-		  THREAD_SIZE - KERNEL_STACK_OFFSET);
+		(unsigned long)task_stack_page(next_p) + THREAD_SIZE);
 
 	/*
 	 * Now maybe reload the debug registers and handle I/O bitmaps
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 759388c..7b20ffd 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -813,8 +813,7 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
 	initial_gs = per_cpu_offset(cpu);
 #endif
 	per_cpu(kernel_stack, cpu) =
-		(unsigned long)task_stack_page(idle) -
-		KERNEL_STACK_OFFSET + THREAD_SIZE;
+		(unsigned long)task_stack_page(idle) + THREAD_SIZE;
 	early_gdt_descr.address = (unsigned long)get_cpu_gdt_table(cpu);
 	initial_code = (unsigned long)start_secondary;
 	stack_start  = idle->thread.sp;
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 08e8489..765b768 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -452,8 +452,7 @@ static int xen_cpu_up(unsigned int cpu, struct task_struct *idle)
 	clear_tsk_thread_flag(idle, TIF_FORK);
 #endif
 	per_cpu(kernel_stack, cpu) =
-		(unsigned long)task_stack_page(idle) -
-		KERNEL_STACK_OFFSET + THREAD_SIZE;
+		(unsigned long)task_stack_page(idle) + THREAD_SIZE;
 
 	xen_setup_runstate_info(cpu);
 	xen_setup_timer(cpu);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/3] x86: entry_64.S: use PUSH insns to build pt_regs on stack
  2015-03-18 19:47 [PATCH 1/3] x86: get rid of KERNEL_STACK_OFFSET Denys Vlasenko
@ 2015-03-18 19:47 ` Denys Vlasenko
  2015-03-18 21:01   ` Andy Lutomirski
  2015-03-18 19:47 ` [PATCH 3/3] x86: get rid of FIXUP_TOP_OF_STACK/RESTORE_TOP_OF_STACK Denys Vlasenko
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Denys Vlasenko @ 2015-03-18 19:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	x86, linux-kernel

We lose a number of large insns there:

    text    data     bss     dec     hex filename
    9863       0       0    9863    2687 entry_64_before.o
    9671       0       0    9671    25c7 entry_64.o

More importantly, we convert two "MOVQ $imm,off(%rsp)" insns to
"PUSH $imm" (the ones which fill pt_regs->cs and ->ss).

Before this patch, placing them on the fast path slowed it down by two
cycles: this form of MOV is very large, 12 bytes, which probably throttles
decode bandwidth to one insn per cycle when the decoder meets them.
Therefore they lived in FIXUP_TOP_OF_STACK instead (away from the hot path).

"PUSH $imm" is a small 2-byte insn; moving it to the fast path does not
slow it down in my measurements.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ingo Molnar <mingo@kernel.org>
CC: Borislav Petkov <bp@alien8.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: x86@kernel.org
CC: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/entry_64.S | 46 +++++++++++++++++++++++++---------------------
 1 file changed, 25 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index ecb68d8..4647c1d 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -126,11 +126,8 @@ ENDPROC(native_usergs_sysret64)
  * manipulation.
  */
 	.macro FIXUP_TOP_OF_STACK tmp offset=0
-	movq $__USER_DS,SS+\offset(%rsp)
-	movq $__USER_CS,CS+\offset(%rsp)
-	movq RIP+\offset(%rsp),\tmp  /* get rip */
-	movq \tmp,RCX+\offset(%rsp)  /* copy it to rcx as sysret would do */
-	movq EFLAGS+\offset(%rsp),\tmp /* ditto for rflags->r11 */
+	/* copy flags to r11 as sysret would do */
+	movq EFLAGS+\offset(%rsp),\tmp
 	movq \tmp,R11+\offset(%rsp)
 	.endm
 
@@ -236,27 +233,34 @@ ENTRY(system_call)
 	 */
 GLOBAL(system_call_after_swapgs)
 
-	/*
-	 * We use 'rsp_scratch' as a scratch register, hence this block must execute
-	 * atomically in the face of possible interrupt-driven task preemption,
-	 * so we can enable interrupts only after we're done with using rsp_scratch:
-	 */
 	movq	%rsp,PER_CPU_VAR(rsp_scratch)
 	movq	PER_CPU_VAR(kernel_stack),%rsp
-	ALLOC_PT_GPREGS_ON_STACK 6*8 /* 6*8: space for orig_ax and iret frame */
-	movq	%rcx,RIP(%rsp)
-	movq	PER_CPU_VAR(rsp_scratch),%rcx
-	movq	%r11,EFLAGS(%rsp)
-	movq	%rcx,RSP(%rsp)
+
+	/* Construct struct pt_regs on stack */
+	pushq_cfi $__USER_DS			/* pt_regs->ss */
+	pushq_cfi PER_CPU_VAR(rsp_scratch)	/* pt_regs->sp */
 	/*
-	 * No need to follow this irqs off/on section - it's straight
-	 * and short:
+	 * We use 'rsp_scratch' as a scratch register, hence the block above
+	 * must execute atomically in the face of possible interrupt-driven
+	 * task preemption. We must enable interrupts only after we're done
+	 * with using rsp_scratch:
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
-	movq_cfi rax,ORIG_RAX
-	SAVE_C_REGS_EXCEPT_RAX_RCX_R11
-	movq	$-ENOSYS,RAX(%rsp)
-	CFI_REL_OFFSET rip,RIP
+	pushq_cfi	%r11			/* pt_regs->flags */
+	pushq_cfi	$__USER_CS		/* pt_regs->cs */
+	pushq_cfi	%rcx			/* pt_regs->ip */
+	CFI_REL_OFFSET rip,0
+	pushq_cfi_reg	rax			/* pt_regs->orig_ax */
+	pushq_cfi_reg	rdi			/* pt_regs->di */
+	pushq_cfi_reg	rsi			/* pt_regs->si */
+	pushq_cfi_reg	rdx			/* pt_regs->dx */
+	pushq_cfi_reg	rcx			/* pt_regs->cx */
+	pushq_cfi	$-ENOSYS		/* pt_regs->ax */
+	pushq_cfi_reg	r8			/* pt_regs->r8 */
+	pushq_cfi_reg	r9			/* pt_regs->r9 */
+	pushq_cfi_reg	r10			/* pt_regs->r10 */
+	sub	$(7*8),%rsp /* pt_regs->r11,bp,bx,r12-15 not saved */
+
 	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	jnz tracesys
 system_call_fastpath:
-- 
1.8.1.4



* [PATCH 3/3] x86: get rid of FIXUP_TOP_OF_STACK/RESTORE_TOP_OF_STACK
  2015-03-18 19:47 [PATCH 1/3] x86: get rid of KERNEL_STACK_OFFSET Denys Vlasenko
  2015-03-18 19:47 ` [PATCH 2/3] x86: entry_64.S: use PUSH insns to build pt_regs on stack Denys Vlasenko
@ 2015-03-18 19:47 ` Denys Vlasenko
  2015-03-18 21:02   ` Andy Lutomirski
  2015-03-18 20:47 ` [PATCH 1/3] x86: get rid of KERNEL_STACK_OFFSET Borislav Petkov
  2015-03-18 20:54 ` Andy Lutomirski
  3 siblings, 1 reply; 14+ messages in thread
From: Denys Vlasenko @ 2015-03-18 19:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	x86, linux-kernel

FIXUP_TOP_OF_STACK is only necessary because we don't save %r11
to pt_regs->r11 on the SYSCALL64 fast path, but we want ptrace to see
it populated.

Bite the bullet, add a single additional PUSH insn, and remove
FIXUP_TOP_OF_STACK.

RESTORE_TOP_OF_STACK is already a no-op. Remove it too.

On a Sandy Bridge CPU this does not get slower: I measured 54.22 ns
per getpid syscall both before and after the last two changes, on a
defconfig kernel.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ingo Molnar <mingo@kernel.org>
CC: Borislav Petkov <bp@alien8.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: x86@kernel.org
CC: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/entry_64.S | 35 ++---------------------------------
 1 file changed, 2 insertions(+), 33 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 4647c1d..a0a3a6e 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -22,8 +22,6 @@
  * - CFI macros are used to generate dwarf2 unwind information for better
  * backtraces. They don't change any code.
  * - ENTRY/END Define functions in the symbol table.
- * - FIXUP_TOP_OF_STACK/RESTORE_TOP_OF_STACK - Fix up the hardware stack
- * frame that is otherwise undefined after a SYSCALL
  * - TRACE_IRQ_* - Trace hard interrupt state for lock debugging.
  * - idtentry - Define exception entry points.
  */
@@ -119,23 +117,6 @@ ENDPROC(native_usergs_sysret64)
 #endif
 
 /*
- * C code is not supposed to know that the iret frame is not populated.
- * Every time a C function with an pt_regs argument is called from
- * the SYSCALL based fast path FIXUP_TOP_OF_STACK is needed.
- * RESTORE_TOP_OF_STACK syncs the syscall state after any possible ptregs
- * manipulation.
- */
-	.macro FIXUP_TOP_OF_STACK tmp offset=0
-	/* copy flags to r11 as sysret would do */
-	movq EFLAGS+\offset(%rsp),\tmp
-	movq \tmp,R11+\offset(%rsp)
-	.endm
-
-	.macro RESTORE_TOP_OF_STACK tmp offset=0
-	/* nothing to do */
-	.endm
-
-/*
  * empty frame
  */
 	.macro EMPTY_FRAME start=1 offset=0
@@ -259,7 +240,8 @@ GLOBAL(system_call_after_swapgs)
 	pushq_cfi_reg	r8			/* pt_regs->r8 */
 	pushq_cfi_reg	r9			/* pt_regs->r9 */
 	pushq_cfi_reg	r10			/* pt_regs->r10 */
-	sub	$(7*8),%rsp /* pt_regs->r11,bp,bx,r12-15 not saved */
+	pushq_cfi_reg	r11			/* pt_regs->r11 */
+	sub	$(6*8),%rsp /* pt_regs->bp,bx,r12-15 not saved */
 
 	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
 	jnz tracesys
@@ -306,7 +288,6 @@ ret_from_sys_call:
 	CFI_RESTORE_STATE
 
 int_ret_from_sys_call_fixup:
-	FIXUP_TOP_OF_STACK %r11
 	jmp int_ret_from_sys_call
 
 	/* Do syscall entry tracing */
@@ -322,7 +303,6 @@ tracesys:
 
 tracesys_phase2:
 	SAVE_EXTRA_REGS
-	FIXUP_TOP_OF_STACK %rdi
 	movq %rsp, %rdi
 	movq $AUDIT_ARCH_X86_64, %rsi
 	movq %rax,%rdx
@@ -415,9 +395,7 @@ ENTRY(stub_\func)
 	CFI_STARTPROC
 	DEFAULT_FRAME 0, 8		/* offset 8: return address */
 	SAVE_EXTRA_REGS 8
-	FIXUP_TOP_OF_STACK %r11, 8
 	call sys_\func
-	RESTORE_TOP_OF_STACK %r11, 8
 	ret
 	CFI_ENDPROC
 END(stub_\func)
@@ -432,7 +410,6 @@ ENTRY(stub_execve)
 	addq $8, %rsp
 	DEFAULT_FRAME 0
 	SAVE_EXTRA_REGS
-	FIXUP_TOP_OF_STACK %r11
 	call sys_execve
 	movq %rax,RAX(%rsp)
 	RESTORE_EXTRA_REGS
@@ -445,9 +422,7 @@ ENTRY(stub_execveat)
 	addq $8, %rsp
 	DEFAULT_FRAME 0
 	SAVE_EXTRA_REGS
-	FIXUP_TOP_OF_STACK %r11
 	call sys_execveat
-	RESTORE_TOP_OF_STACK %r11
 	movq %rax,RAX(%rsp)
 	RESTORE_EXTRA_REGS
 	jmp int_ret_from_sys_call
@@ -463,7 +438,6 @@ ENTRY(stub_rt_sigreturn)
 	addq $8, %rsp
 	DEFAULT_FRAME 0
 	SAVE_EXTRA_REGS
-	FIXUP_TOP_OF_STACK %r11
 	call sys_rt_sigreturn
 	movq %rax,RAX(%rsp) # fixme, this could be done at the higher layer
 	RESTORE_EXTRA_REGS
@@ -477,7 +451,6 @@ ENTRY(stub_x32_rt_sigreturn)
 	addq $8, %rsp
 	DEFAULT_FRAME 0
 	SAVE_EXTRA_REGS
-	FIXUP_TOP_OF_STACK %r11
 	call sys32_x32_rt_sigreturn
 	movq %rax,RAX(%rsp) # fixme, this could be done at the higher layer
 	RESTORE_EXTRA_REGS
@@ -490,9 +463,7 @@ ENTRY(stub_x32_execve)
 	addq $8, %rsp
 	DEFAULT_FRAME 0
 	SAVE_EXTRA_REGS
-	FIXUP_TOP_OF_STACK %r11
 	call compat_sys_execve
-	RESTORE_TOP_OF_STACK %r11
 	movq %rax,RAX(%rsp)
 	RESTORE_EXTRA_REGS
 	jmp int_ret_from_sys_call
@@ -504,9 +475,7 @@ ENTRY(stub_x32_execveat)
 	addq $8, %rsp
 	DEFAULT_FRAME 0
 	SAVE_EXTRA_REGS
-	FIXUP_TOP_OF_STACK %r11
 	call compat_sys_execveat
-	RESTORE_TOP_OF_STACK %r11
 	movq %rax,RAX(%rsp)
 	RESTORE_EXTRA_REGS
 	jmp int_ret_from_sys_call
-- 
1.8.1.4



* Re: [PATCH 1/3] x86: get rid of KERNEL_STACK_OFFSET
  2015-03-18 19:47 [PATCH 1/3] x86: get rid of KERNEL_STACK_OFFSET Denys Vlasenko
  2015-03-18 19:47 ` [PATCH 2/3] x86: entry_64.S: use PUSH insns to build pt_regs on stack Denys Vlasenko
  2015-03-18 19:47 ` [PATCH 3/3] x86: get rid of FIXUP_TOP_OF_STACK/RESTORE_TOP_OF_STACK Denys Vlasenko
@ 2015-03-18 20:47 ` Borislav Petkov
  2015-03-18 20:54 ` Andy Lutomirski
  3 siblings, 0 replies; 14+ messages in thread
From: Borislav Petkov @ 2015-03-18 20:47 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andy Lutomirski, Linus Torvalds, Steven Rostedt, Ingo Molnar,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, x86, linux-kernel

On Wed, Mar 18, 2015 at 08:47:34PM +0100, Denys Vlasenko wrote:
> PER_CPU_VAR(kernel_stack) was set up so that it points
> five stack slots below the top of the stack.
> 
> Presumably, this was done to avoid one "sub $5*8,%rsp"
> in the syscall/sysenter code paths, where the iret frame needs
> to be built by hand.
> 
> Ironically, none of those paths benefits from this optimization,
> since all of them need to allocate additional data on the stack
> (struct pt_regs), so they still have to perform a subtraction.
> 
> This patch eliminates KERNEL_STACK_OFFSET.
> 
> PER_CPU_VAR(kernel_stack) now points directly to the top of the stack.
> pt_regs allocations are adjusted to allocate the iret frame as well.
> Hopefully we can merge it later with the 32-bit-specific
> PER_CPU_VAR(cpu_current_top_of_stack) variable...
> 
> The semi-mysterious expression THREAD_INFO(%rsp,RIP) - "why RIP??" -
> is now replaced by the more logical THREAD_INFO(%rsp,SIZEOF_PTREGS) -
> "calculate thread_info's address using the fact that
> %rsp is SIZEOF_PTREGS bytes below the stack top".

Can we please add those last two lines to the definition of THREAD_INFO
in the header? It is good to have.

Thanks.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--


* Re: [PATCH 1/3] x86: get rid of KERNEL_STACK_OFFSET
  2015-03-18 19:47 [PATCH 1/3] x86: get rid of KERNEL_STACK_OFFSET Denys Vlasenko
                   ` (2 preceding siblings ...)
  2015-03-18 20:47 ` [PATCH 1/3] x86: get rid of KERNEL_STACK_OFFSET Borislav Petkov
@ 2015-03-18 20:54 ` Andy Lutomirski
  2015-03-19 15:28   ` Denys Vlasenko
  3 siblings, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2015-03-18 20:54 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Linus Torvalds, Steven Rostedt, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On Wed, Mar 18, 2015 at 12:47 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> PER_CPU_VAR(kernel_stack) was set up so that it points
> five stack slots below the top of the stack.
>
> Presumably, this was done to avoid one "sub $5*8,%rsp"
> in the syscall/sysenter code paths, where the iret frame needs
> to be built by hand.
>
> Ironically, none of those paths benefits from this optimization,
> since all of them need to allocate additional data on the stack
> (struct pt_regs), so they still have to perform a subtraction.
>
> This patch eliminates KERNEL_STACK_OFFSET.
>
> PER_CPU_VAR(kernel_stack) now points directly to the top of the stack.
> pt_regs allocations are adjusted to allocate the iret frame as well.
> Hopefully we can merge it later with the 32-bit-specific
> PER_CPU_VAR(cpu_current_top_of_stack) variable...
>
> The semi-mysterious expression THREAD_INFO(%rsp,RIP) - "why RIP??" -
> is now replaced by the more logical THREAD_INFO(%rsp,SIZEOF_PTREGS) -
> "calculate thread_info's address using the fact that
> %rsp is SIZEOF_PTREGS bytes below the stack top".
>
> The net result in generated code is that constants in several insns
> change.
>
> This change is a prerequisite for switching struct pt_regs creation
> in the SYSCALL64 code path from MOV to PUSH instructions.
>
>

Would it be reasonable to break this up into two pieces: first, remove
KERNEL_STACK_OFFSET from THREAD_INFO and related macros (i.e. make
them relative to current_top_of_stack instead) and second, remove
KERNEL_STACK_OFFSET?

Arguably the second half would be better accomplished by switching
everything to use sp0 or its x86_32 equivalent and then eliminating
kernel_stack entirely.

--Andy


* Re: [PATCH 2/3] x86: entry_64.S: use PUSH insns to build pt_regs on stack
  2015-03-18 19:47 ` [PATCH 2/3] x86: entry_64.S: use PUSH insns to build pt_regs on stack Denys Vlasenko
@ 2015-03-18 21:01   ` Andy Lutomirski
  2015-03-18 21:12     ` Denys Vlasenko
  0 siblings, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2015-03-18 21:01 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Linus Torvalds, Steven Rostedt, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On Wed, Mar 18, 2015 at 12:47 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> We lose a number of large insns there:
>
>     text    data     bss     dec     hex filename
>     9863       0       0    9863    2687 entry_64_before.o
>     9671       0       0    9671    25c7 entry_64.o
>
> More importantly, we convert two "MOVQ $imm,off(%rsp)" insns to
> "PUSH $imm" (the ones which fill pt_regs->cs and ->ss).
>
> Before this patch, placing them on the fast path slowed it down by two
> cycles: this form of MOV is very large, 12 bytes, which probably throttles
> decode bandwidth to one insn per cycle when the decoder meets them.
> Therefore they lived in FIXUP_TOP_OF_STACK instead (away from the hot path).

Does that mean that this has zero performance impact, or is it
actually a speedup?

Also, I would add something to the description explaining that this
affects specifically the syscall path, that we're now populating
SS, CS, and RCX unconditionally, and that we therefore don't need to
populate them in FIXUP_TOP_OF_STACK.

>
> "PUSH $imm" is a small 2-byte insn; moving it to the fast path does not
> slow it down in my measurements.
>
> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
> CC: Linus Torvalds <torvalds@linux-foundation.org>
> CC: Steven Rostedt <rostedt@goodmis.org>
> CC: Ingo Molnar <mingo@kernel.org>
> CC: Borislav Petkov <bp@alien8.de>
> CC: "H. Peter Anvin" <hpa@zytor.com>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Frederic Weisbecker <fweisbec@gmail.com>
> CC: Alexei Starovoitov <ast@plumgrid.com>
> CC: Will Drewry <wad@chromium.org>
> CC: Kees Cook <keescook@chromium.org>
> CC: x86@kernel.org
> CC: linux-kernel@vger.kernel.org
> ---
>  arch/x86/kernel/entry_64.S | 46 +++++++++++++++++++++++++---------------------
>  1 file changed, 25 insertions(+), 21 deletions(-)
>
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index ecb68d8..4647c1d 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -126,11 +126,8 @@ ENDPROC(native_usergs_sysret64)
>   * manipulation.
>   */
>         .macro FIXUP_TOP_OF_STACK tmp offset=0
> -       movq $__USER_DS,SS+\offset(%rsp)
> -       movq $__USER_CS,CS+\offset(%rsp)
> -       movq RIP+\offset(%rsp),\tmp  /* get rip */
> -       movq \tmp,RCX+\offset(%rsp)  /* copy it to rcx as sysret would do */
> -       movq EFLAGS+\offset(%rsp),\tmp /* ditto for rflags->r11 */
> +       /* copy flags to r11 as sysret would do */
> +       movq EFLAGS+\offset(%rsp),\tmp
>         movq \tmp,R11+\offset(%rsp)
>         .endm
>
> @@ -236,27 +233,34 @@ ENTRY(system_call)
>          */
>  GLOBAL(system_call_after_swapgs)
>
> -       /*
> -        * We use 'rsp_scratch' as a scratch register, hence this block must execute
> -        * atomically in the face of possible interrupt-driven task preemption,
> -        * so we can enable interrupts only after we're done with using rsp_scratch:
> -        */
>         movq    %rsp,PER_CPU_VAR(rsp_scratch)
>         movq    PER_CPU_VAR(kernel_stack),%rsp
> -       ALLOC_PT_GPREGS_ON_STACK 6*8 /* 6*8: space for orig_ax and iret frame */
> -       movq    %rcx,RIP(%rsp)
> -       movq    PER_CPU_VAR(rsp_scratch),%rcx
> -       movq    %r11,EFLAGS(%rsp)
> -       movq    %rcx,RSP(%rsp)
> +
> +       /* Construct struct pt_regs on stack */
> +       pushq_cfi $__USER_DS                    /* pt_regs->ss */
> +       pushq_cfi PER_CPU_VAR(rsp_scratch)      /* pt_regs->sp */
>         /*
> -        * No need to follow this irqs off/on section - it's straight
> -        * and short:
> +        * We use 'rsp_scratch' as a scratch register, hence the block above
> +        * must execute atomically in the face of possible interrupt-driven
> +        * task preemption. We must enable interrupts only after we're done
> +        * with using rsp_scratch:
>          */

I think that you lost information here, but only because it was really
poorly written.  If you mentally replace "follow" with "trace", then
it should make sense.  That is, I would change your comment to:

+        * We use 'rsp_scratch' as a scratch register, hence the block above
+        * must execute atomically in the face of possible interrupt-driven
+        * task preemption. We must enable interrupts only after we're done
+        * with using rsp_scratch.  Don't bother tracing the interrupt re-enable,
+        * because the IRQ off code is very short and we didn't trace the fact
+        * that syscall disabled interrupts in the first place.


>         ENABLE_INTERRUPTS(CLBR_NONE)

--Andy
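[For readers following along: the hunk above builds the iret frame with PUSH insns. Each PUSH decrements %rsp by 8 and stores its operand, so successive pushes lay out the tail of pt_regs from the top down. A minimal sketch of that mechanic — the ss/sp pushes appear in the hunk above; the order of the remaining fields is assumed from the usual x86-64 pt_regs layout, not shown here:]

```python
# Model the stack as a dict of slot -> field name; the stack grows down
# and each PUSH decrements %rsp by 8 before storing its operand.
SLOT = 8
fields = ["ss", "sp", "flags", "cs", "ip", "orig_ax"]

stack = {}
rsp = 0                      # pretend the top of the kernel stack is at 0
for f in fields:
    rsp -= SLOT              # PUSH: decrement %rsp, then store
    stack[rsp] = f

# After six pushes %rsp points at orig_ax, with ss five slots above it --
# the frame layout IRET expects, plus orig_ax below it.
assert stack[-8] == "ss" and stack[-48] == "orig_ax"
```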

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] x86: get rid of FIXUP_TOP_OF_STACK/RESTORE_TOP_OF_STACK
  2015-03-18 19:47 ` [PATCH 3/3] x86: get rid of FIXUP_TOP_OF_STACK/RESTORE_TOP_OF_STACK Denys Vlasenko
@ 2015-03-18 21:02   ` Andy Lutomirski
  0 siblings, 0 replies; 14+ messages in thread
From: Andy Lutomirski @ 2015-03-18 21:02 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Linus Torvalds, Steven Rostedt, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On Wed, Mar 18, 2015 at 12:47 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> FIXUP_TOP_OF_STACK is only necessary because we don't save %r11
> to pt_regs->r11 on SYSCALL64 fast path, but we want ptrace to see
> it populated.
>
> Bite the bullet, add a single additional PUSH insn, and remove
> FIXUP_TOP_OF_STACK.
>
> RESTORE_TOP_OF_STACK is already a nop. Remove it too.
>
> On SandyBridge CPU, it does not get slower:
> measured 54.22 ns per getpid syscall before and after last two changes
> on defconfig kernel.

Hallelujah!

--Andy


* Re: [PATCH 2/3] x86: entry_64.S: use PUSH insns to build pt_regs on stack
  2015-03-18 21:01   ` Andy Lutomirski
@ 2015-03-18 21:12     ` Denys Vlasenko
  2015-03-18 21:22       ` Andy Lutomirski
  0 siblings, 1 reply; 14+ messages in thread
From: Denys Vlasenko @ 2015-03-18 21:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Steven Rostedt, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On 03/18/2015 10:01 PM, Andy Lutomirski wrote:
> On Wed, Mar 18, 2015 at 12:47 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>> We lose a number of large insns there:
>>
>>     text    data     bss     dec     hex filename
>>     9863       0       0    9863    2687 entry_64_before.o
>>     9671       0       0    9671    25c7 entry_64.o
>>
>> What's more important, we convert two "MOVQ $imm,off(%rsp)" to "PUSH $imm"
>> (the ones which fill pt_regs->cs,ss).
>>
>> Before this patch, placing them on fast path was slowing it down by two cycles:
>> this form of MOV is very large, 12 bytes, and this probably reduces decode bandwidth
>> to one insn per cycle when it meets them.
>> Therefore they were living in FIXUP_TOP_OF_STACK instead (away from hot path).
> 
> Does that mean that this has zero performance impact, or is it
> actually a speedup?


No, it's not a speedup, because those big bad instructions weren't
on the hot path to begin with.

We do want them there, though.

Inserting them into the hot path in the form of MOVs (say, in order
to eliminate FIXUP_TOP_OF_STACK) *would be* a slowdown.

But once we switch to the PUSH method, inserting them _as PUSHes_
seems to be a wash.
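[For readers following along: the size argument can be made concrete. The byte sequences below are the standard x86-64 encodings of the two instruction forms being compared; the displacement and immediate values are illustrative, not taken from the patch.]

```python
# movq $0x2b, 0xa0(%rsp)   -- REX.W + C7 /0 with SIB and disp32
mov_imm_to_stack = bytes([
    0x48, 0xC7, 0x84, 0x24,   # REX.W prefix, opcode, ModRM, SIB for (%rsp)
    0xA0, 0x00, 0x00, 0x00,   # disp32
    0x2B, 0x00, 0x00, 0x00,   # imm32 (sign-extended to 64 bits)
])

# pushq $0x2b              -- 6A ib
push_imm = bytes([0x6A, 0x2B])

assert len(mov_imm_to_stack) == 12   # the "very large" 12-byte MOV
assert len(push_imm) == 2            # the PUSH replacement
```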




* Re: [PATCH 2/3] x86: entry_64.S: use PUSH insns to build pt_regs on stack
  2015-03-18 21:12     ` Denys Vlasenko
@ 2015-03-18 21:22       ` Andy Lutomirski
  2015-03-18 21:32         ` Denys Vlasenko
  0 siblings, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2015-03-18 21:22 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Linus Torvalds, Steven Rostedt, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On Wed, Mar 18, 2015 at 2:12 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> On 03/18/2015 10:01 PM, Andy Lutomirski wrote:
>> On Wed, Mar 18, 2015 at 12:47 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>> We lose a number of large insns there:
>>>
>>>     text    data     bss     dec     hex filename
>>>     9863       0       0    9863    2687 entry_64_before.o
>>>     9671       0       0    9671    25c7 entry_64.o
>>>
>>> What's more important, we convert two "MOVQ $imm,off(%rsp)" to "PUSH $imm"
>>> (the ones which fill pt_regs->cs,ss).
>>>
>>> Before this patch, placing them on fast path was slowing it down by two cycles:
>>> this form of MOV is very large, 12 bytes, and this probably reduces decode bandwidth
>>> to one insn per cycle when it meets them.
>>> Therefore they were living in FIXUP_TOP_OF_STACK instead (away from hot path).
>>
>> Does that mean that this has zero performance impact, or is it
>> actually a speedup?
>
>
> No, it's not a speedup because those big bad instructions weren't
> on hot path to begin with.
>
> We want them to be there.
>
> Inserting them in a form of MOVs into hot path (say, in order
> to eliminate FIXUP_TOP_OF_STACK) *would be* a slowdown.
>
> But we switch to PUSH method, and then inserting them _as PUSHes_
> seems to be a wash.
>

Sorry, what I meant was: what was the performance impact of this patch
on fast-path syscalls?

--Andy


* Re: [PATCH 2/3] x86: entry_64.S: use PUSH insns to build pt_regs on stack
  2015-03-18 21:22       ` Andy Lutomirski
@ 2015-03-18 21:32         ` Denys Vlasenko
  2015-03-18 21:42           ` Andy Lutomirski
  0 siblings, 1 reply; 14+ messages in thread
From: Denys Vlasenko @ 2015-03-18 21:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Steven Rostedt, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On 03/18/2015 10:22 PM, Andy Lutomirski wrote:
> On Wed, Mar 18, 2015 at 2:12 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>> On 03/18/2015 10:01 PM, Andy Lutomirski wrote:
>>> On Wed, Mar 18, 2015 at 12:47 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>>> We lose a number of large insns there:
>>>>
>>>>     text    data     bss     dec     hex filename
>>>>     9863       0       0    9863    2687 entry_64_before.o
>>>>     9671       0       0    9671    25c7 entry_64.o
>>>>
>>>> What's more important, we convert two "MOVQ $imm,off(%rsp)" to "PUSH $imm"
>>>> (the ones which fill pt_regs->cs,ss).
>>>>
>>>> Before this patch, placing them on fast path was slowing it down by two cycles:
>>>> this form of MOV is very large, 12 bytes, and this probably reduces decode bandwidth
>>>> to one insn per cycle when it meets them.
>>>> Therefore they were living in FIXUP_TOP_OF_STACK instead (away from hot path).
>>>
>>> Does that mean that this has zero performance impact, or is it
>>> actually a speedup?
>>
>>
>> No, it's not a speedup because those big bad instructions weren't
>> on hot path to begin with.
>>
>> We want them to be there.
>>
>> Inserting them in a form of MOVs into hot path (say, in order
>> to eliminate FIXUP_TOP_OF_STACK) *would be* a slowdown.
>>
>> But we switch to PUSH method, and then inserting them _as PUSHes_
>> seems to be a wash.
>>
> 
> Sorry, what I meant was: what was the performance impact of this patch
> on fast-path syscalls?

I measured the next patch (which added one additional push)
and it was a wash compared to timings before both patches.
See comment there.

I did not measure this patch in isolation this time around;
on the previous iteration of this patch it was a single-cycle speedup.





* Re: [PATCH 2/3] x86: entry_64.S: use PUSH insns to build pt_regs on stack
  2015-03-18 21:32         ` Denys Vlasenko
@ 2015-03-18 21:42           ` Andy Lutomirski
  0 siblings, 0 replies; 14+ messages in thread
From: Andy Lutomirski @ 2015-03-18 21:42 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Linus Torvalds, Steven Rostedt, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On Wed, Mar 18, 2015 at 2:32 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> On 03/18/2015 10:22 PM, Andy Lutomirski wrote:
>> On Wed, Mar 18, 2015 at 2:12 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>> On 03/18/2015 10:01 PM, Andy Lutomirski wrote:
>>>> On Wed, Mar 18, 2015 at 12:47 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>>>> We lose a number of large insns there:
>>>>>
>>>>>     text    data     bss     dec     hex filename
>>>>>     9863       0       0    9863    2687 entry_64_before.o
>>>>>     9671       0       0    9671    25c7 entry_64.o
>>>>>
>>>>> What's more important, we convert two "MOVQ $imm,off(%rsp)" to "PUSH $imm"
>>>>> (the ones which fill pt_regs->cs,ss).
>>>>>
>>>>> Before this patch, placing them on fast path was slowing it down by two cycles:
>>>>> this form of MOV is very large, 12 bytes, and this probably reduces decode bandwidth
>>>>> to one insn per cycle when it meets them.
>>>>> Therefore they were living in FIXUP_TOP_OF_STACK instead (away from hot path).
>>>>
>>>> Does that mean that this has zero performance impact, or is it
>>>> actually a speedup?
>>>
>>>
>>> No, it's not a speedup because those big bad instructions weren't
>>> on hot path to begin with.
>>>
>>> We want them to be there.
>>>
>>> Inserting them in a form of MOVs into hot path (say, in order
>>> to eliminate FIXUP_TOP_OF_STACK) *would be* a slowdown.
>>>
>>> But we switch to PUSH method, and then inserting them _as PUSHes_
>>> seems to be a wash.
>>>
>>
>> Sorry, what I meant was: what was the performance impact of this patch
>> on fast-path syscalls?
>
> I measured the next patch (which added one additional push)
> and it was a wash compared to timings before both patches.
> See comment there.

Oh, I misunderstood and assumed that the comment there was about that
patch in isolation.

--Andy


* Re: [PATCH 1/3] x86: get rid of KERNEL_STACK_OFFSET
  2015-03-18 20:54 ` Andy Lutomirski
@ 2015-03-19 15:28   ` Denys Vlasenko
  2015-03-19 15:43     ` Andy Lutomirski
  0 siblings, 1 reply; 14+ messages in thread
From: Denys Vlasenko @ 2015-03-19 15:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Steven Rostedt, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On 03/18/2015 09:54 PM, Andy Lutomirski wrote:
> On Wed, Mar 18, 2015 at 12:47 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>> PER_CPU_VAR(kernel_stack) was set up in a way where it points
>> five stack slots below the top of stack.
>>
>> Presumably, it was done to avoid one "sub $5*8,%rsp"
>> in syscall/sysenter code paths, where iret frame needs to be
>> created by hand.
>>
>> Ironically, none of them benefits from this optimization,
>> since all of them need to allocate additional data on stack
>> (struct pt_regs), so they still have to perform subtraction.
>>
>> This patch eliminates KERNEL_STACK_OFFSET.
>>
>> PER_CPU_VAR(kernel_stack) now points directly to top of stack.
>> pt_regs allocations are adjusted to allocate iret frame as well.
>> Hopefully we can merge it later with 32-bit specific
>> PER_CPU_VAR(cpu_current_top_of_stack) variable...
>>
>> Semi-mysterious expressions THREAD_INFO(%rsp,RIP) - "why RIP??"
>> are now replaced by more logical THREAD_INFO(%rsp,SIZEOF_PTREGS) -
>> "calculate thread_info's address using information that
>> rsp is SIZEOF_PTREGS bytes below the stack top".
>>
>> Net result in generated code is that constants in several insns
>> are changed.
>>
>> This change is necessary for changing struct pt_regs creation
>> in SYSCALL64 code path from MOV to PUSH instructions.
>>
> 
> Would it be reasonable to break this up into two pieces: first, remove
> KERNEL_STACK_OFFSET from THREAD_INFO and related macros (i.e. make
> them relative to current_top_of_stack instead)

PER_CPU(cpu_current_top_of_stack) exists only in 32 bits.
Can't use it in 64-bit code.

PER_CPU(cpu_current_top_of_stack) becomes equal
to PER_CPU(kernel_stack) only after this patch.

I plan to clean up PER_CPU(cpu_current_top_of_stack)/
PER_CPU(kernel_stack)/PER_CPU(tss->sp0) mess
on top of my patches.

First: this seems to be the easier way.

Second (why do this *after* this patch set, not in it?):
the removal of KERNEL_STACK_OFFSET has no strong reason behind it
unless we do the MOV -> PUSH conversion - in which case it is mandatory:
we must have RSP start at the top of the stack in 64-bit mode.

I want to post a minimal patch set which results in the PUSH conversion,
and I want to avoid polluting it with the tangential goal of removing
the duplicated percpu variable.
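[For readers following along: the THREAD_INFO(%rsp,SIZEOF_PTREGS) arithmetic from the commit message can be sketched numerically. The constants here are assumptions — typical x86-64 values of that era (4-page kernel stacks, 21 eight-byte pt_regs slots); a real kernel takes them from asm-offsets.]

```python
THREAD_SIZE = 4 * 4096       # 16 KiB kernel stack (assumed typical value)
SIZEOF_PTREGS = 21 * 8       # 168 bytes: 15 GPRs + orig_ax + iret frame

def thread_info_addr(rsp):
    # THREAD_INFO(%rsp, SIZEOF_PTREGS): %rsp is SIZEOF_PTREGS bytes below
    # the stack top, and thread_info sits at the bottom of the stack.
    top_of_stack = rsp + SIZEOF_PTREGS
    return top_of_stack - THREAD_SIZE

top = 0x10000                          # pretend top-of-stack address
rsp_after_alloc = top - SIZEOF_PTREGS  # after allocating pt_regs
assert thread_info_addr(rsp_after_alloc) == top - THREAD_SIZE
```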



* Re: [PATCH 1/3] x86: get rid of KERNEL_STACK_OFFSET
  2015-03-19 15:28   ` Denys Vlasenko
@ 2015-03-19 15:43     ` Andy Lutomirski
  2015-03-19 15:47       ` Denys Vlasenko
  0 siblings, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2015-03-19 15:43 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Linus Torvalds, Steven Rostedt, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On Thu, Mar 19, 2015 at 8:28 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> On 03/18/2015 09:54 PM, Andy Lutomirski wrote:
>> On Wed, Mar 18, 2015 at 12:47 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>> PER_CPU_VAR(kernel_stack) was set up in a way where it points
>>> five stack slots below the top of stack.
>>>
>>> Presumably, it was done to avoid one "sub $5*8,%rsp"
>>> in syscall/sysenter code paths, where iret frame needs to be
>>> created by hand.
>>>
>>> Ironically, none of them benefits from this optimization,
>>> since all of them need to allocate additional data on stack
>>> (struct pt_regs), so they still have to perform subtraction.
>>>
>>> This patch eliminates KERNEL_STACK_OFFSET.
>>>
>>> PER_CPU_VAR(kernel_stack) now points directly to top of stack.
>>> pt_regs allocations are adjusted to allocate iret frame as well.
>>> Hopefully we can merge it later with 32-bit specific
>>> PER_CPU_VAR(cpu_current_top_of_stack) variable...
>>>
>>> Semi-mysterious expressions THREAD_INFO(%rsp,RIP) - "why RIP??"
>>> are now replaced by more logical THREAD_INFO(%rsp,SIZEOF_PTREGS) -
>>> "calculate thread_info's address using information that
>>> rsp is SIZEOF_PTREGS bytes below the stack top".
>>>
>>> Net result in generated code is that constants in several insns
>>> are changed.
>>>
>>> This change is necessary for changing struct pt_regs creation
>>> in SYSCALL64 code path from MOV to PUSH instructions.
>>>
>>
>> Would it be reasonable to break this up into two pieces: first, remove
>> KERNEL_STACK_OFFSET from THREAD_INFO and related macros (i.e. make
>> them relative to current_top_of_stack instead)
>
> PER_CPU(cpu_current_top_of_stack) exists only in 32 bits.
> Can't use it in 64-bit code.
>
> PER_CPU(cpu_current_top_of_stack) becomes equal
> to PER_CPU(kernel_stack) only after this patch.

You could have a #define that gives the top of the stack from asm, though.

>
> I plan to clean up PER_CPU(cpu_current_top_of_stack)/
> PER_CPU(kernel_stack)/PER_CPU(tss->sp0) mess
> on top of my patches.
>
> First: this seems to be the easier way.

It's probably easier, but it's harder to review since you're making
two changes at once (changing the rsp loaded during syscall and
changing the way that GET_THREAD_INFO works).

--Andy

>
> Second (why do this *after* this patch set, not in it?):
> the removal of KERNEL_STACK_OFFSET has no strong reason behind it
> unless we do MOV -> PUSH conversion - in which case it is mandatory:
> we must have RSP start at top of stack in 64-bit mode.
>
> I want to post a minimal patch set which results in PUSH conversion.
> Want to avoid polluting it with tangential goal of removing
> duplicated percpu variable.
>



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH 1/3] x86: get rid of KERNEL_STACK_OFFSET
  2015-03-19 15:43     ` Andy Lutomirski
@ 2015-03-19 15:47       ` Denys Vlasenko
  0 siblings, 0 replies; 14+ messages in thread
From: Denys Vlasenko @ 2015-03-19 15:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Steven Rostedt, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On 03/19/2015 04:43 PM, Andy Lutomirski wrote:
> On Thu, Mar 19, 2015 at 8:28 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>> On 03/18/2015 09:54 PM, Andy Lutomirski wrote:
>>> On Wed, Mar 18, 2015 at 12:47 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>>> PER_CPU_VAR(kernel_stack) was set up in a way where it points
>>>> five stack slots below the top of stack.
>>>>
>>>> Presumably, it was done to avoid one "sub $5*8,%rsp"
>>>> in syscall/sysenter code paths, where iret frame needs to be
>>>> created by hand.
>>>>
>>>> Ironically, none of them benefits from this optimization,
>>>> since all of them need to allocate additional data on stack
>>>> (struct pt_regs), so they still have to perform subtraction.
>>>>
>>>> This patch eliminates KERNEL_STACK_OFFSET.
>>>>
>>>> PER_CPU_VAR(kernel_stack) now points directly to top of stack.
>>>> pt_regs allocations are adjusted to allocate iret frame as well.
>>>> Hopefully we can merge it later with 32-bit specific
>>>> PER_CPU_VAR(cpu_current_top_of_stack) variable...
>>>>
>>>> Semi-mysterious expressions THREAD_INFO(%rsp,RIP) - "why RIP??"
>>>> are now replaced by more logical THREAD_INFO(%rsp,SIZEOF_PTREGS) -
>>>> "calculate thread_info's address using information that
>>>> rsp is SIZEOF_PTREGS bytes below the stack top".
>>>>
>>>> Net result in generated code is that constants in several insns
>>>> are changed.
>>>>
>>>> This change is necessary for changing struct pt_regs creation
>>>> in SYSCALL64 code path from MOV to PUSH instructions.
>>>>
>>>
>>> Would it be reasonable to break this up into two pieces: first, remove
>>> KERNEL_STACK_OFFSET from THREAD_INFO and related macros (i.e. make
>>> them relative to current_top_of_stack instead)
>>
>> PER_CPU(cpu_current_top_of_stack) exists only in 32 bits.
>> Can't use it in 64-bit code.
>>
>> PER_CPU(cpu_current_top_of_stack) becomes equal
>> to PER_CPU(kernel_stack) only after this patch.
> 
> You could have a #define that gives the top of the stack from asm, though.
> 
>>
>> I plan to clean up PER_CPU(cpu_current_top_of_stack)/
>> PER_CPU(kernel_stack)/PER_CPU(tss->sp0) mess
>> on top of my patches.
>>
>> First: this seems to be the easier way.
> 
> It's probably easier, but it's harder to review since you're making
> two changes at once (changing the rsp loaded during syscall and
> changing the way that GET_THREAD_INFO works).

I will split off the THREAD_INFO() change.




end of thread, other threads:[~2015-03-19 15:47 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-18 19:47 [PATCH 1/3] x86: get rid of KERNEL_STACK_OFFSET Denys Vlasenko
2015-03-18 19:47 ` [PATCH 2/3] x86: entry_64.S: use PUSH insns to build pt_regs on stack Denys Vlasenko
2015-03-18 21:01   ` Andy Lutomirski
2015-03-18 21:12     ` Denys Vlasenko
2015-03-18 21:22       ` Andy Lutomirski
2015-03-18 21:32         ` Denys Vlasenko
2015-03-18 21:42           ` Andy Lutomirski
2015-03-18 19:47 ` [PATCH 3/3] x86: get rid of FIXUP_TOP_OF_STACK/RESTORE_TOP_OF_STACK Denys Vlasenko
2015-03-18 21:02   ` Andy Lutomirski
2015-03-18 20:47 ` [PATCH 1/3] x86: get rid of KERNEL_STACK_OFFSET Borislav Petkov
2015-03-18 20:54 ` Andy Lutomirski
2015-03-19 15:28   ` Denys Vlasenko
2015-03-19 15:43     ` Andy Lutomirski
2015-03-19 15:47       ` Denys Vlasenko
