linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/4] x86: Rewrite switch_to()
@ 2016-05-21 16:04 Brian Gerst
  2016-05-21 16:04 ` [PATCH 1/4] x86: Save return value from kernel_thread Brian Gerst
                   ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: Brian Gerst @ 2016-05-21 16:04 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Ingo Molnar, H. Peter Anvin, Denys Vlasenko, Andy Lutomirski,
	Borislav Petkov, Thomas Gleixner

This patch set simplifies the switch_to() code by moving the stack switch
code out of line into an asm stub that runs before __switch_to() is called.
This ends up being more readable, and using the C calling convention instead
of clobbering all registers improves code generation.  It also allows a
special stack frame to be constructed for newly forked processes, so that
they flow seamlessly into ret_from_fork without an extra test and branch or
an unbalanced call/ret.
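
As a quick picture of where the series ends up, the switch_to() macro reduces
to an ordinary C call into the new stub (condensed from patch 3):

/* <asm/switch_to.h> after patch 3 (condensed) */
struct task_struct *__switch_to_asm(struct task_struct *prev,
                                    struct task_struct *next);

#define switch_to(prev, next, last)                                     \
do {                                                                    \
        ((last) = __switch_to_asm((prev), (next)));                     \
} while (0)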

[PATCH 1/4] x86: Save return value from kernel_thread
[PATCH 2/4] x86-32, kgdb: Don't use thread.ip in
[PATCH 3/4] x86: Rewrite switch_to() code
[PATCH 4/4] x86: Pass kernel thread parameters in fork_frame

 arch/x86/entry/entry_32.S          |  68 +++++++++++++-----
 arch/x86/entry/entry_64.S          |  72 +++++++++++++------
 arch/x86/include/asm/processor.h   |   3 -
 arch/x86/include/asm/switch_to.h   | 137 ++++++-------------------------------
 arch/x86/include/asm/thread_info.h |   2 -
 arch/x86/kernel/asm-offsets.c      |   6 ++
 arch/x86/kernel/asm-offsets_32.c   |   5 ++
 arch/x86/kernel/asm-offsets_64.c   |   5 ++
 arch/x86/kernel/kgdb.c             |   3 +-
 arch/x86/kernel/process_32.c       |  19 ++---
 arch/x86/kernel/process_64.c       |  17 +++--
 arch/x86/kernel/smpboot.c          |   1 -
 12 files changed, 153 insertions(+), 185 deletions(-)


* [PATCH 1/4] x86: Save return value from kernel_thread
  2016-05-21 16:04 [PATCH 0/4] x86: Rewrite switch_to() Brian Gerst
@ 2016-05-21 16:04 ` Brian Gerst
  2016-05-22  1:44   ` Andy Lutomirski
  2016-05-21 16:04 ` [PATCH 2/4] x86-32, kgdb: Don't use thread.ip in sleeping_thread_to_gdb_regs() Brian Gerst
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 31+ messages in thread
From: Brian Gerst @ 2016-05-21 16:04 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Ingo Molnar, H. Peter Anvin, Denys Vlasenko, Andy Lutomirski,
	Borislav Petkov, Thomas Gleixner

Kernel threads should always return zero on success after calling do_execve().  The
two existing cases in the kernel (kernel_init() and call_usermodehelper_exec_async())
correctly do this.  Save a few bytes by storing EAX/RAX instead of an immediate zero.
Also fix the 64-bit case, which should save the full 64 bits.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
---
 arch/x86/entry/entry_32.S | 2 +-
 arch/x86/entry/entry_64.S | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 983e5d3..ee6fea0 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -221,7 +221,7 @@ ENTRY(ret_from_kernel_thread)
 	popl	%eax
 	movl	PT_EBP(%esp), %eax
 	call	*PT_EBX(%esp)
-	movl	$0, PT_EAX(%esp)
+	movl	%eax, PT_EAX(%esp)
 
 	/*
 	 * Kernel threads return to userspace as if returning from a syscall.
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9ee0da1..ab9f8c8 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -387,7 +387,7 @@ ENTRY(ret_from_fork)
 	 */
 	movq	RBP(%rsp), %rdi
 	call	*RBX(%rsp)
-	movl	$0, RAX(%rsp)
+	movq	%rax, RAX(%rsp)
 
 	/*
 	 * Fall through as though we're exiting a syscall.  This makes a
-- 
2.5.5
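
The flow this changelog relies on, as an illustrative sketch only (the helper
name below is made up, not a real kernel API):

/*
 * A kernel thread that execs a user binary.  With this patch, whatever the
 * function returns is stored into the pt_regs ax slot, so on a successful
 * exec the new user process starts with 0 in EAX/RAX.
 */
static int example_kthread(void *unused)
{
        if (exec_user_helper())         /* assumed wrapper around do_execve() */
                return -1;              /* exec failed */
        return 0;                       /* success: becomes the new process's EAX/RAX */
}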


* [PATCH 2/4] x86-32, kgdb: Don't use thread.ip in sleeping_thread_to_gdb_regs()
  2016-05-21 16:04 [PATCH 0/4] x86: Rewrite switch_to() Brian Gerst
  2016-05-21 16:04 ` [PATCH 1/4] x86: Save return value from kernel_thread Brian Gerst
@ 2016-05-21 16:04 ` Brian Gerst
  2016-05-23 17:05   ` Andy Lutomirski
  2016-05-21 16:04 ` [PATCH 3/4] x86: Rewrite switch_to() code Brian Gerst
  2016-05-21 16:04 ` [PATCH 4/4] x86: Pass kernel thread parameters in fork_frame Brian Gerst
  3 siblings, 1 reply; 31+ messages in thread
From: Brian Gerst @ 2016-05-21 16:04 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Ingo Molnar, H. Peter Anvin, Denys Vlasenko, Andy Lutomirski,
	Borislav Petkov, Thomas Gleixner

Match the 64-bit behavior and set gdb_regs[GDB_PC] to zero.  thread.ip
always points to the same place in the scheduler (except for newly forked
processes), and it will be removed in a future patch.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
---
 arch/x86/kernel/kgdb.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
index 04cde52..fe649a5 100644
--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -172,7 +172,6 @@ void sleeping_thread_to_gdb_regs(unsigned long *gdb_regs, struct task_struct *p)
 	gdb_regs[GDB_ES]	= __KERNEL_DS;
 	gdb_regs[GDB_PS]	= 0;
 	gdb_regs[GDB_CS]	= __KERNEL_CS;
-	gdb_regs[GDB_PC]	= p->thread.ip;
 	gdb_regs[GDB_SS]	= __KERNEL_DS;
 	gdb_regs[GDB_FS]	= 0xFFFF;
 	gdb_regs[GDB_GS]	= 0xFFFF;
@@ -180,7 +179,6 @@ void sleeping_thread_to_gdb_regs(unsigned long *gdb_regs, struct task_struct *p)
 	gdb_regs32[GDB_PS]	= *(unsigned long *)(p->thread.sp + 8);
 	gdb_regs32[GDB_CS]	= __KERNEL_CS;
 	gdb_regs32[GDB_SS]	= __KERNEL_DS;
-	gdb_regs[GDB_PC]	= 0;
 	gdb_regs[GDB_R8]	= 0;
 	gdb_regs[GDB_R9]	= 0;
 	gdb_regs[GDB_R10]	= 0;
@@ -190,6 +188,7 @@ void sleeping_thread_to_gdb_regs(unsigned long *gdb_regs, struct task_struct *p)
 	gdb_regs[GDB_R14]	= 0;
 	gdb_regs[GDB_R15]	= 0;
 #endif
+	gdb_regs[GDB_PC]	= 0;
 	gdb_regs[GDB_SP]	= p->thread.sp;
 }
 
-- 
2.5.5


* [PATCH 3/4] x86: Rewrite switch_to() code
  2016-05-21 16:04 [PATCH 0/4] x86: Rewrite switch_to() Brian Gerst
  2016-05-21 16:04 ` [PATCH 1/4] x86: Save return value from kernel_thread Brian Gerst
  2016-05-21 16:04 ` [PATCH 2/4] x86-32, kgdb: Don't use thread.ip in sleeping_thread_to_gdb_regs() Brian Gerst
@ 2016-05-21 16:04 ` Brian Gerst
  2016-05-22 17:59   ` Andy Lutomirski
  2016-06-15  1:31   ` Andy Lutomirski
  2016-05-21 16:04 ` [PATCH 4/4] x86: Pass kernel thread parameters in fork_frame Brian Gerst
  3 siblings, 2 replies; 31+ messages in thread
From: Brian Gerst @ 2016-05-21 16:04 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Ingo Molnar, H. Peter Anvin, Denys Vlasenko, Andy Lutomirski,
	Borislav Petkov, Thomas Gleixner

Move the low-level context switch code to an out-of-line asm stub instead of
using complex inline asm.  This allows constructing a new stack frame for the
child process to make it seamlessly flow to ret_from_fork without an extra
test and branch in __switch_to().  It also improves code generation for
__schedule() by using the C calling convention instead of clobbering all
registers.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
---
 arch/x86/entry/entry_32.S          |  38 ++++++++++
 arch/x86/entry/entry_64.S          |  42 +++++++++++-
 arch/x86/include/asm/processor.h   |   3 -
 arch/x86/include/asm/switch_to.h   | 137 ++++++-------------------------------
 arch/x86/include/asm/thread_info.h |   2 -
 arch/x86/kernel/asm-offsets.c      |   6 ++
 arch/x86/kernel/asm-offsets_32.c   |   5 ++
 arch/x86/kernel/asm-offsets_64.c   |   5 ++
 arch/x86/kernel/process_32.c       |   8 ++-
 arch/x86/kernel/process_64.c       |   7 +-
 arch/x86/kernel/smpboot.c          |   1 -
 11 files changed, 124 insertions(+), 130 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index ee6fea0..05e5340 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -204,6 +204,44 @@
 	POP_GS_EX
 .endm
 
+/*
+ * %eax: prev task
+ * %edx: next task
+ */
+ENTRY(__switch_to_asm)
+	/*
+	 * Save callee-saved registers
+	 * This must match the order in struct fork_frame
+	 * Frame pointer must be last for get_wchan
+	 */
+	pushl	%ebx
+	pushl	%edi
+	pushl	%esi
+	pushl	%ebp
+
+	/* switch stack */
+	movl	%esp, TASK_threadsp(%eax)
+	movl	TASK_threadsp(%edx), %esp
+
+#ifdef CONFIG_CC_STACKPROTECTOR
+	movl	TASK_stack_canary(%edx), %ebx
+	movl	%ebx, PER_CPU_VAR(stack_canary)+stack_canary_offset
+#endif
+
+	/* restore callee-saved registers */
+	popl	%ebp
+	popl	%esi
+	popl	%edi
+	popl	%ebx
+
+	jmp	__switch_to
+END(__switch_to_asm)
+
+/*
+ * A newly forked process directly context switches into this address.
+ *
+ * eax: prev task we switched from
+ */
 ENTRY(ret_from_fork)
 	pushl	%eax
 	call	schedule_tail
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ab9f8c8..0542ad1 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -365,13 +365,49 @@ END(ptregs_\func)
 #include <asm/syscalls_64.h>
 
 /*
+ * %rdi: prev task
+ * %rsi: next task
+ */
+ENTRY(__switch_to_asm)
+	/*
+	 * Save callee-saved registers
+	 * This must match the order in struct fork_frame
+	 * Frame pointer must be last for get_wchan
+	 */
+	pushq	%rbx
+	pushq	%r12
+	pushq	%r13
+	pushq	%r14
+	pushq	%r15
+	pushq	%rbp
+
+	/* switch stack */
+	movq	%rsp, TASK_threadsp(%rdi)
+	movq	TASK_threadsp(%rsi), %rsp
+
+#ifdef CONFIG_CC_STACKPROTECTOR
+	movq	TASK_stack_canary(%rsi), %rbx
+	movq	%rbx, PER_CPU_VAR(irq_stack_union)+stack_canary_offset
+#endif
+
+	/* restore callee-saved registers */
+	popq	%rbp
+	popq	%r15
+	popq	%r14
+	popq	%r13
+	popq	%r12
+	popq	%rbx
+
+	jmp	__switch_to
+END(__switch_to_asm)
+
+/*
  * A newly forked process directly context switches into this address.
  *
- * rdi: prev task we switched from
+ * rax: prev task we switched from
  */
 ENTRY(ret_from_fork)
-	LOCK ; btr $TIF_FORK, TI_flags(%r8)
-
+	movq	%rax, %rdi
 	call	schedule_tail			/* rdi: 'prev' task parameter */
 
 	testb	$3, CS(%rsp)			/* from kernel_thread? */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 62c6cc3..d3c2598 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -384,9 +384,6 @@ struct thread_struct {
 	unsigned short		fsindex;
 	unsigned short		gsindex;
 #endif
-#ifdef CONFIG_X86_32
-	unsigned long		ip;
-#endif
 #ifdef CONFIG_X86_64
 	unsigned long		fsbase;
 	unsigned long		gsbase;
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 8f321a1..b6c9e0c 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -2,130 +2,35 @@
 #define _ASM_X86_SWITCH_TO_H
 
 struct task_struct; /* one of the stranger aspects of C forward declarations */
+
+struct task_struct *__switch_to_asm(struct task_struct *prev,
+				    struct task_struct *next);
+
 __visible struct task_struct *__switch_to(struct task_struct *prev,
-					   struct task_struct *next);
+					  struct task_struct *next);
 struct tss_struct;
 void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
 		      struct tss_struct *tss);
 
-#ifdef CONFIG_X86_32
-
-#ifdef CONFIG_CC_STACKPROTECTOR
-#define __switch_canary							\
-	"movl %P[task_canary](%[next]), %%ebx\n\t"			\
-	"movl %%ebx, "__percpu_arg([stack_canary])"\n\t"
-#define __switch_canary_oparam						\
-	, [stack_canary] "=m" (stack_canary.canary)
-#define __switch_canary_iparam						\
-	, [task_canary] "i" (offsetof(struct task_struct, stack_canary))
-#else	/* CC_STACKPROTECTOR */
-#define __switch_canary
-#define __switch_canary_oparam
-#define __switch_canary_iparam
-#endif	/* CC_STACKPROTECTOR */
+struct fork_frame {
+	unsigned long bp;
+#ifdef CONFIG_X86_64
+	unsigned long r15;
+	unsigned long r14;
+	unsigned long r13;
+	unsigned long r12;
+#else
+	unsigned long si;
+	unsigned long di;
+#endif
+	unsigned long bx;
+	unsigned long ret_addr;
+	struct pt_regs regs;
+};
 
-/*
- * Saving eflags is important. It switches not only IOPL between tasks,
- * it also protects other tasks from NT leaking through sysenter etc.
- */
 #define switch_to(prev, next, last)					\
 do {									\
-	/*								\
-	 * Context-switching clobbers all registers, so we clobber	\
-	 * them explicitly, via unused output variables.		\
-	 * (EAX and EBP is not listed because EBP is saved/restored	\
-	 * explicitly for wchan access and EAX is the return value of	\
-	 * __switch_to())						\
-	 */								\
-	unsigned long ebx, ecx, edx, esi, edi;				\
-									\
-	asm volatile("pushl %%ebp\n\t"		/* save    EBP   */	\
-		     "movl %%esp,%[prev_sp]\n\t"	/* save    ESP   */ \
-		     "movl %[next_sp],%%esp\n\t"	/* restore ESP   */ \
-		     "movl $1f,%[prev_ip]\n\t"	/* save    EIP   */	\
-		     "pushl %[next_ip]\n\t"	/* restore EIP   */	\
-		     __switch_canary					\
-		     "jmp __switch_to\n"	/* regparm call  */	\
-		     "1:\t"						\
-		     "popl %%ebp\n\t"		/* restore EBP   */	\
-									\
-		     /* output parameters */				\
-		     : [prev_sp] "=m" (prev->thread.sp),		\
-		       [prev_ip] "=m" (prev->thread.ip),		\
-		       "=a" (last),					\
-									\
-		       /* clobbered output registers: */		\
-		       "=b" (ebx), "=c" (ecx), "=d" (edx),		\
-		       "=S" (esi), "=D" (edi)				\
-		       							\
-		       __switch_canary_oparam				\
-									\
-		       /* input parameters: */				\
-		     : [next_sp]  "m" (next->thread.sp),		\
-		       [next_ip]  "m" (next->thread.ip),		\
-		       							\
-		       /* regparm parameters for __switch_to(): */	\
-		       [prev]     "a" (prev),				\
-		       [next]     "d" (next)				\
-									\
-		       __switch_canary_iparam				\
-									\
-		     : /* reloaded segment registers */			\
-			"memory");					\
+	((last) = __switch_to_asm((prev), (next)));			\
 } while (0)
 
-#else /* CONFIG_X86_32 */
-
-/* frame pointer must be last for get_wchan */
-#define SAVE_CONTEXT    "pushq %%rbp ; movq %%rsi,%%rbp\n\t"
-#define RESTORE_CONTEXT "movq %%rbp,%%rsi ; popq %%rbp\t"
-
-#define __EXTRA_CLOBBER  \
-	, "rcx", "rbx", "rdx", "r8", "r9", "r10", "r11", \
-	  "r12", "r13", "r14", "r15", "flags"
-
-#ifdef CONFIG_CC_STACKPROTECTOR
-#define __switch_canary							  \
-	"movq %P[task_canary](%%rsi),%%r8\n\t"				  \
-	"movq %%r8,"__percpu_arg([gs_canary])"\n\t"
-#define __switch_canary_oparam						  \
-	, [gs_canary] "=m" (irq_stack_union.stack_canary)
-#define __switch_canary_iparam						  \
-	, [task_canary] "i" (offsetof(struct task_struct, stack_canary))
-#else	/* CC_STACKPROTECTOR */
-#define __switch_canary
-#define __switch_canary_oparam
-#define __switch_canary_iparam
-#endif	/* CC_STACKPROTECTOR */
-
-/*
- * There is no need to save or restore flags, because flags are always
- * clean in kernel mode, with the possible exception of IOPL.  Kernel IOPL
- * has no effect.
- */
-#define switch_to(prev, next, last) \
-	asm volatile(SAVE_CONTEXT					  \
-	     "movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */	  \
-	     "movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */	  \
-	     "call __switch_to\n\t"					  \
-	     "movq "__percpu_arg([current_task])",%%rsi\n\t"		  \
-	     __switch_canary						  \
-	     "movq %P[thread_info](%%rsi),%%r8\n\t"			  \
-	     "movq %%rax,%%rdi\n\t" 					  \
-	     "testl  %[_tif_fork],%P[ti_flags](%%r8)\n\t"		  \
-	     "jnz   ret_from_fork\n\t"					  \
-	     RESTORE_CONTEXT						  \
-	     : "=a" (last)					  	  \
-	       __switch_canary_oparam					  \
-	     : [next] "S" (next), [prev] "D" (prev),			  \
-	       [threadrsp] "i" (offsetof(struct task_struct, thread.sp)), \
-	       [ti_flags] "i" (offsetof(struct thread_info, flags)),	  \
-	       [_tif_fork] "i" (_TIF_FORK),			  	  \
-	       [thread_info] "i" (offsetof(struct task_struct, stack)),   \
-	       [current_task] "m" (current_task)			  \
-	       __switch_canary_iparam					  \
-	     : "memory", "cc" __EXTRA_CLOBBER)
-
-#endif /* CONFIG_X86_32 */
-
 #endif /* _ASM_X86_SWITCH_TO_H */
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 30c133a..20d56ec 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -99,7 +99,6 @@ struct thread_info {
 #define TIF_UPROBE		12	/* breakpointed or singlestepping */
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* IA32 compatibility process */
-#define TIF_FORK		18	/* ret_from_fork */
 #define TIF_NOHZ		19	/* in adaptive nohz mode */
 #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
@@ -123,7 +122,6 @@ struct thread_info {
 #define _TIF_UPROBE		(1 << TIF_UPROBE)
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
-#define _TIF_FORK		(1 << TIF_FORK)
 #define _TIF_NOHZ		(1 << TIF_NOHZ)
 #define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
 #define _TIF_IO_BITMAP		(1 << TIF_IO_BITMAP)
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 674134e..ec41c79 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -29,6 +29,12 @@
 
 void common(void) {
 	BLANK();
+	OFFSET(TASK_threadsp, task_struct, thread.sp);
+#ifdef CONFIG_CC_STACKPROTECTOR
+	OFFSET(TASK_stack_canary, task_struct, stack_canary);
+#endif
+
+	BLANK();
 	OFFSET(TI_flags, thread_info, flags);
 	OFFSET(TI_status, thread_info, status);
 	OFFSET(TI_addr_limit, thread_info, addr_limit);
diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index ecdc1d2..880aa09 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -57,6 +57,11 @@ void foo(void)
 	/* Size of SYSENTER_stack */
 	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
 
+#ifdef CONFIG_CC_STACKPROTECTOR
+	BLANK();
+	OFFSET(stack_canary_offset, stack_canary, canary);
+#endif
+
 #if defined(CONFIG_LGUEST) || defined(CONFIG_LGUEST_GUEST) || defined(CONFIG_LGUEST_MODULE)
 	BLANK();
 	OFFSET(LGUEST_DATA_irq_enabled, lguest_data, irq_enabled);
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index d875f97..210927e 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -56,6 +56,11 @@ int main(void)
 	OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
 	BLANK();
 
+#ifdef CONFIG_CC_STACKPROTECTOR
+	DEFINE(stack_canary_offset, offsetof(union irq_stack_union, stack_canary));
+	BLANK();
+#endif
+
 	DEFINE(__NR_syscall_max, sizeof(syscalls_64) - 1);
 	DEFINE(NR_syscalls, sizeof(syscalls_64));
 
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 9f95091..0ba6fdf 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -133,17 +133,19 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
 	unsigned long arg, struct task_struct *p, unsigned long tls)
 {
 	struct pt_regs *childregs = task_pt_regs(p);
+	struct fork_frame *frame = container_of(childregs, struct fork_frame, regs);
 	struct task_struct *tsk;
 	int err;
 
-	p->thread.sp = (unsigned long) childregs;
+	frame->bp = 0;
+	p->thread.sp = (unsigned long) frame;
 	p->thread.sp0 = (unsigned long) (childregs+1);
 	memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
 
 	if (unlikely(p->flags & PF_KTHREAD)) {
 		/* kernel thread */
 		memset(childregs, 0, sizeof(struct pt_regs));
-		p->thread.ip = (unsigned long) ret_from_kernel_thread;
+		frame->ret_addr = (unsigned long) ret_from_kernel_thread;
 		task_user_gs(p) = __KERNEL_STACK_CANARY;
 		childregs->ds = __USER_DS;
 		childregs->es = __USER_DS;
@@ -161,7 +163,7 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
 	if (sp)
 		childregs->sp = sp;
 
-	p->thread.ip = (unsigned long) ret_from_fork;
+	frame->ret_addr = (unsigned long) ret_from_fork;
 	task_user_gs(p) = get_user_gs(current_pt_regs());
 
 	p->thread.io_bitmap_ptr = NULL;
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 6e789ca..9fab915 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -142,11 +142,14 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
 	int err;
 	struct pt_regs *childregs;
 	struct task_struct *me = current;
+	struct fork_frame *frame;
 
 	p->thread.sp0 = (unsigned long)task_stack_page(p) + THREAD_SIZE;
 	childregs = task_pt_regs(p);
-	p->thread.sp = (unsigned long) childregs;
-	set_tsk_thread_flag(p, TIF_FORK);
+	frame = container_of(childregs, struct fork_frame, regs);
+	frame->bp = 0;
+	frame->ret_addr = (unsigned long) ret_from_fork;
+	p->thread.sp = (unsigned long) frame;
 	p->thread.io_bitmap_ptr = NULL;
 
 	savesegment(gs, p->thread.gsindex);
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index fafe8b9..8feb392 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -928,7 +928,6 @@ void common_cpu_up(unsigned int cpu, struct task_struct *idle)
 	per_cpu(cpu_current_top_of_stack, cpu) =
 		(unsigned long)task_stack_page(idle) + THREAD_SIZE;
 #else
-	clear_tsk_thread_flag(idle, TIF_FORK);
 	initial_gs = per_cpu_offset(cpu);
 #endif
 }
-- 
2.5.5
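
Condensed from the copy_thread_tls() hunks above: the child's initial stack is
wired so that the first __switch_to_asm into it pops the dummy callee-saved
slots and the ret at the end of __switch_to() lands directly in ret_from_fork
with the stack pointer at pt_regs:

/* 64-bit copy_thread_tls(), condensed from the diff above */
childregs = task_pt_regs(p);
frame = container_of(childregs, struct fork_frame, regs);
frame->bp = 0;                                   /* terminates frame-pointer walks */
frame->ret_addr = (unsigned long) ret_from_fork; /* consumed by __switch_to()'s ret */
p->thread.sp = (unsigned long) frame;            /* __switch_to_asm switches to this stack */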


* [PATCH 4/4] x86: Pass kernel thread parameters in fork_frame
  2016-05-21 16:04 [PATCH 0/4] x86: Rewrite switch_to() Brian Gerst
                   ` (2 preceding siblings ...)
  2016-05-21 16:04 ` [PATCH 3/4] x86: Rewrite switch_to() code Brian Gerst
@ 2016-05-21 16:04 ` Brian Gerst
  2016-05-22 18:01   ` Andy Lutomirski
  2016-05-23 15:23   ` Josh Poimboeuf
  3 siblings, 2 replies; 31+ messages in thread
From: Brian Gerst @ 2016-05-21 16:04 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Ingo Molnar, H. Peter Anvin, Denys Vlasenko, Andy Lutomirski,
	Borislav Petkov, Thomas Gleixner

Instead of setting up a fake pt_regs context, put the kernel thread
function pointer and arg into the unused callee-saved registers
of struct fork_frame.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
---
 arch/x86/entry/entry_32.S    | 28 +++++++++++-----------------
 arch/x86/entry/entry_64.S    | 30 +++++++++++-------------------
 arch/x86/kernel/process_32.c | 15 ++++-----------
 arch/x86/kernel/process_64.c | 10 +++-------
 4 files changed, 29 insertions(+), 54 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 05e5340..424e8d3 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -241,35 +241,29 @@ END(__switch_to_asm)
  * A newly forked process directly context switches into this address.
  *
  * eax: prev task we switched from
+ * ebx: kernel thread func
+ * edi: kernel thread arg
  */
 ENTRY(ret_from_fork)
 	pushl	%eax
 	call	schedule_tail
 	popl	%eax
 
+	testl	%ebx, %ebx
+	jnz	1f
+
+2:
 	/* When we fork, we trace the syscall return in the child, too. */
 	movl    %esp, %eax
 	call    syscall_return_slowpath
 	jmp     restore_all
-END(ret_from_fork)
 
-ENTRY(ret_from_kernel_thread)
-	pushl	%eax
-	call	schedule_tail
-	popl	%eax
-	movl	PT_EBP(%esp), %eax
-	call	*PT_EBX(%esp)
+	/* kernel thread */
+1:	movl	%edi, %eax
+	call	*%ebx
 	movl	%eax, PT_EAX(%esp)
-
-	/*
-	 * Kernel threads return to userspace as if returning from a syscall.
-	 * We should check whether anything actually uses this path and, if so,
-	 * consider switching it over to ret_from_fork.
-	 */
-	movl    %esp, %eax
-	call    syscall_return_slowpath
-	jmp     restore_all
-ENDPROC(ret_from_kernel_thread)
+	jmp	2b
+END(ret_from_fork)
 
 /*
  * Return to user mode is not as complex as all this looks,
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 0542ad1..cc8a0ba 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -405,37 +405,29 @@ END(__switch_to_asm)
  * A newly forked process directly context switches into this address.
  *
  * rax: prev task we switched from
+ * rbx: kernel thread func
+ * r12: kernel thread arg
  */
 ENTRY(ret_from_fork)
 	movq	%rax, %rdi
 	call	schedule_tail			/* rdi: 'prev' task parameter */
 
-	testb	$3, CS(%rsp)			/* from kernel_thread? */
+	testq	%rbx, %rbx			/* from kernel_thread? */
 	jnz	1f
 
-	/*
-	 * We came from kernel_thread.  This code path is quite twisted, and
-	 * someone should clean it up.
-	 *
-	 * copy_thread_tls stashes the function pointer in RBX and the
-	 * parameter to be passed in RBP.  The called function is permitted
-	 * to call do_execve and thereby jump to user mode.
-	 */
-	movq	RBP(%rsp), %rdi
-	call	*RBX(%rsp)
-	movq	%rax, RAX(%rsp)
-
-	/*
-	 * Fall through as though we're exiting a syscall.  This makes a
-	 * twisted sort of sense if we just called do_execve.
-	 */
-
-1:
+2:
 	movq	%rsp, %rdi
 	call	syscall_return_slowpath	/* returns with IRQs disabled */
 	TRACE_IRQS_ON			/* user mode is traced as IRQS on */
 	SWAPGS
 	jmp	restore_regs_and_iret
+
+1:
+	/* kernel thread */
+	movq	%r12, %rdi
+	call	*%rbx
+	movq	%rax, RAX(%rsp)
+	jmp	2b
 END(ret_from_fork)
 
 /*
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 0ba6fdf..bba563f 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -138,6 +138,7 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
 	int err;
 
 	frame->bp = 0;
+	frame->ret_addr = (unsigned long) ret_from_fork;
 	p->thread.sp = (unsigned long) frame;
 	p->thread.sp0 = (unsigned long) (childregs+1);
 	memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
@@ -145,25 +146,17 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
 	if (unlikely(p->flags & PF_KTHREAD)) {
 		/* kernel thread */
 		memset(childregs, 0, sizeof(struct pt_regs));
-		frame->ret_addr = (unsigned long) ret_from_kernel_thread;
-		task_user_gs(p) = __KERNEL_STACK_CANARY;
-		childregs->ds = __USER_DS;
-		childregs->es = __USER_DS;
-		childregs->fs = __KERNEL_PERCPU;
-		childregs->bx = sp;	/* function */
-		childregs->bp = arg;
-		childregs->orig_ax = -1;
-		childregs->cs = __KERNEL_CS | get_kernel_rpl();
-		childregs->flags = X86_EFLAGS_IF | X86_EFLAGS_FIXED;
+		frame->bx = sp;		/* function */
+		frame->di = arg;
 		p->thread.io_bitmap_ptr = NULL;
 		return 0;
 	}
+	frame->bx = 0;
 	*childregs = *current_pt_regs();
 	childregs->ax = 0;
 	if (sp)
 		childregs->sp = sp;
 
-	frame->ret_addr = (unsigned long) ret_from_fork;
 	task_user_gs(p) = get_user_gs(current_pt_regs());
 
 	p->thread.io_bitmap_ptr = NULL;
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 9fab915..421646a 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -163,15 +163,11 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
 	if (unlikely(p->flags & PF_KTHREAD)) {
 		/* kernel thread */
 		memset(childregs, 0, sizeof(struct pt_regs));
-		childregs->sp = (unsigned long)childregs;
-		childregs->ss = __KERNEL_DS;
-		childregs->bx = sp; /* function */
-		childregs->bp = arg;
-		childregs->orig_ax = -1;
-		childregs->cs = __KERNEL_CS | get_kernel_rpl();
-		childregs->flags = X86_EFLAGS_IF | X86_EFLAGS_FIXED;
+		frame->bx = sp;		/* function */
+		frame->r12 = arg;
 		return 0;
 	}
+	frame->bx = 0;
 	*childregs = *current_pt_regs();
 
 	childregs->ax = 0;
-- 
2.5.5
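
Condensed from the hunks above, the 64-bit kernel-thread path now works like
this:

/* copy_thread_tls(), kernel-thread case (64-bit, condensed from the diff above) */
frame->bx  = sp;        /* thread function; popped into %rbx by __switch_to_asm */
frame->r12 = arg;       /* its argument; popped into %r12 by __switch_to_asm */

/*
 * ret_from_fork then sees a non-zero %rbx, does "movq %r12, %rdi; call *%rbx",
 * stores the return value into RAX(%rsp), and jumps back into the normal
 * syscall-return path, replacing the old ret_from_kernel_thread.
 */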


* Re: [PATCH 1/4] x86: Save return value from kernel_thread
  2016-05-21 16:04 ` [PATCH 1/4] x86: Save return value from kernel_thread Brian Gerst
@ 2016-05-22  1:44   ` Andy Lutomirski
  2016-05-22  2:19     ` Brian Gerst
  0 siblings, 1 reply; 31+ messages in thread
From: Andy Lutomirski @ 2016-05-22  1:44 UTC (permalink / raw)
  To: Brian Gerst
  Cc: X86 ML, linux-kernel, Ingo Molnar, H. Peter Anvin,
	Denys Vlasenko, Borislav Petkov, Thomas Gleixner

On Sat, May 21, 2016 at 9:04 AM, Brian Gerst <brgerst@gmail.com> wrote:
> Kernel threads should always return zero on success after calling do_execve().  The
> two existing cases in the kernel (kernel_init() and call_usermodehelper_exec_async())
> correctly do this.  Save a few bytes by storing EAX/RAX instead of an immediate zero.
> Also fix the 64-bit case, which should save the full 64 bits.

Does this have any additional motivation beyond cleanup and fixing an
inconsequential bug?  I.e. does the rest of your series need this?

--Andy


* Re: [PATCH 1/4] x86: Save return value from kernel_thread
  2016-05-22  1:44   ` Andy Lutomirski
@ 2016-05-22  2:19     ` Brian Gerst
  0 siblings, 0 replies; 31+ messages in thread
From: Brian Gerst @ 2016-05-22  2:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Ingo Molnar, H. Peter Anvin,
	Denys Vlasenko, Borislav Petkov, Thomas Gleixner

On Sat, May 21, 2016 at 9:44 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Sat, May 21, 2016 at 9:04 AM, Brian Gerst <brgerst@gmail.com> wrote:
>> Kernel threads should always return zero on success after calling do_execve().  The
>> two existing cases in the kernel (kernel_init() and call_usermodehelper_exec_async())
>> correctly do this.  Save a few bytes by storing EAX/RAX instead of an immediate zero.
>> Also fix the 64-bit case, which should save the full 64 bits.
>
> Does this have any additional motivation beyond cleanup and fixing an
> inconsequential bug?  I.e. does the rest of your series need this?
>
> --Andy

It's just a minor cleanup, not necessary.

--
Brian Gerst


* Re: [PATCH 3/4] x86: Rewrite switch_to() code
  2016-05-21 16:04 ` [PATCH 3/4] x86: Rewrite switch_to() code Brian Gerst
@ 2016-05-22 17:59   ` Andy Lutomirski
  2016-05-22 19:31     ` Brian Gerst
  2016-05-23  2:34     ` Josh Poimboeuf
  2016-06-15  1:31   ` Andy Lutomirski
  1 sibling, 2 replies; 31+ messages in thread
From: Andy Lutomirski @ 2016-05-22 17:59 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Thomas Gleixner, Denys Vlasenko, Ingo Molnar, linux-kernel,
	Borislav Petkov, H. Peter Anvin, X86 ML, Josh Poimboeuf

cc: Josh Poimboeuf: do you care about the exact stack layout of the
bottom of the stack of an inactive task?

On May 21, 2016 9:05 AM, "Brian Gerst" <brgerst@gmail.com> wrote:
>
> Move the low-level context switch code to an out-of-line asm stub instead of
> using complex inline asm.  This allows constructing a new stack frame for the
> child process to make it seamlessly flow to ret_from_fork without an extra
> test and branch in __switch_to().  It also improves code generation for
> __schedule() by using the C calling convention instead of clobbering all
> registers.

I like the concept a lot.

>
> Signed-off-by: Brian Gerst <brgerst@gmail.com>
> ---
>  arch/x86/entry/entry_32.S          |  38 ++++++++++
>  arch/x86/entry/entry_64.S          |  42 +++++++++++-
>  arch/x86/include/asm/processor.h   |   3 -
>  arch/x86/include/asm/switch_to.h   | 137 ++++++-------------------------------
>  arch/x86/include/asm/thread_info.h |   2 -
>  arch/x86/kernel/asm-offsets.c      |   6 ++
>  arch/x86/kernel/asm-offsets_32.c   |   5 ++
>  arch/x86/kernel/asm-offsets_64.c   |   5 ++
>  arch/x86/kernel/process_32.c       |   8 ++-
>  arch/x86/kernel/process_64.c       |   7 +-
>  arch/x86/kernel/smpboot.c          |   1 -
>  11 files changed, 124 insertions(+), 130 deletions(-)
>
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index ee6fea0..05e5340 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -204,6 +204,44 @@
>         POP_GS_EX
>  .endm
>
> +/*
> + * %eax: prev task
> + * %edx: next task
> + */
> +ENTRY(__switch_to_asm)
> +       /*
> +        * Save callee-saved registers
> +        * This must match the order in struct fork_frame
> +        * Frame pointer must be last for get_wchan
> +        */
> +       pushl   %ebx
> +       pushl   %edi
> +       pushl   %esi
> +       pushl   %ebp
> +
> +       /* switch stack */
> +       movl    %esp, TASK_threadsp(%eax)
> +       movl    TASK_threadsp(%edx), %esp
> +
> +#ifdef CONFIG_CC_STACKPROTECTOR
> +       movl    TASK_stack_canary(%edx), %ebx
> +       movl    %ebx, PER_CPU_VAR(stack_canary)+stack_canary_offset
> +#endif
> +
> +       /* restore callee-saved registers */
> +       popl    %ebp
> +       popl    %esi
> +       popl    %edi
> +       popl    %ebx

This is highly, highly magical.  eax and edx are prev and next, and:

> +
> +       jmp     __switch_to

leaves prev in eax.  This works, but it might be worth a comment.

> +END(__switch_to_asm)

>  /*
> + * %rdi: prev task
> + * %rsi: next task
> + */
> +ENTRY(__switch_to_asm)
> +       /*
> +        * Save callee-saved registers
> +        * This must match the order in struct fork_frame
> +        * Frame pointer must be last for get_wchan
> +        */
> +       pushq   %rbx
> +       pushq   %r12
> +       pushq   %r13
> +       pushq   %r14
> +       pushq   %r15
> +       pushq   %rbp
> +
> +       /* switch stack */
> +       movq    %rsp, TASK_threadsp(%rdi)
> +       movq    TASK_threadsp(%rsi), %rsp
> +
> +#ifdef CONFIG_CC_STACKPROTECTOR
> +       movq    TASK_stack_canary(%rsi), %rbx
> +       movq    %rbx, PER_CPU_VAR(irq_stack_union)+stack_canary_offset
> +#endif
> +
> +       /* restore callee-saved registers */
> +       popq    %rbp
> +       popq    %r15
> +       popq    %r14
> +       popq    %r13
> +       popq    %r12
> +       popq    %rbx
> +
> +       jmp     __switch_to

Ditto with the magic here.

> +struct fork_frame {
> +       unsigned long bp;
> +#ifdef CONFIG_X86_64
> +       unsigned long r15;
> +       unsigned long r14;
> +       unsigned long r13;
> +       unsigned long r12;
> +#else
> +       unsigned long si;
> +       unsigned long di;
> +#endif
> +       unsigned long bx;
> +       unsigned long ret_addr;
> +       struct pt_regs regs;
> +};

This, like the old implementation, is very much geared to the current
implementation of fork.  Can you split it up:

struct inactive_task_frame {
    unsigned long bp;
    ...
    unsigned long ret_addr;
};

/* fork works by setting up the child stack so that switch_to will
land at ret_from_fork with sp pointing at pt_regs */
struct fork_frame {
    struct inactive_task_frame switch_frame;
    struct pt_regs regs;
};

Then, if and when someone wants to fork into a different type of
context, they can reuse this.  Also, a future improved unwinder can
use inactive_task_frame directly to kick off its unwind.

--Andy


* Re: [PATCH 4/4] x86: Pass kernel thread parameters in fork_frame
  2016-05-21 16:04 ` [PATCH 4/4] x86: Pass kernel thread parameters in fork_frame Brian Gerst
@ 2016-05-22 18:01   ` Andy Lutomirski
  2016-05-22 19:21     ` Brian Gerst
  2016-05-23 15:23   ` Josh Poimboeuf
  1 sibling, 1 reply; 31+ messages in thread
From: Andy Lutomirski @ 2016-05-22 18:01 UTC (permalink / raw)
  To: Brian Gerst
  Cc: X86 ML, linux-kernel, Ingo Molnar, H. Peter Anvin,
	Denys Vlasenko, Borislav Petkov, Thomas Gleixner

On Sat, May 21, 2016 at 9:04 AM, Brian Gerst <brgerst@gmail.com> wrote:
> Instead of setting up a fake pt_regs context, put the kernel thread
> function pointer and arg into the unused callee-saved registers
> of struct fork_frame.
>
> Signed-off-by: Brian Gerst <brgerst@gmail.com>

Seems reasonable.  Can you make sure you explicitly test with
something that uses the usermodehelper mechanism?

--Andy


* Re: [PATCH 4/4] x86: Pass kernel thread parameters in fork_frame
  2016-05-22 18:01   ` Andy Lutomirski
@ 2016-05-22 19:21     ` Brian Gerst
  0 siblings, 0 replies; 31+ messages in thread
From: Brian Gerst @ 2016-05-22 19:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Ingo Molnar, H. Peter Anvin,
	Denys Vlasenko, Borislav Petkov, Thomas Gleixner

On Sun, May 22, 2016 at 2:01 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Sat, May 21, 2016 at 9:04 AM, Brian Gerst <brgerst@gmail.com> wrote:
>> Instead of setting up a fake pt_regs context, put the kernel thread
>> function pointer and arg into the unused callee-saved registers
>> of struct fork_frame.
>>
>> Signed-off-by: Brian Gerst <brgerst@gmail.com>
>
> Seems reasonable.  Can you make sure you explicitly test with
> something that uses the usermodehelper mechanism?
>
> --Andy

kernel_init() does the same thing to spawn init, so it wouldn't boot otherwise.

--
Brian Gerst


* Re: [PATCH 3/4] x86: Rewrite switch_to() code
  2016-05-22 17:59   ` Andy Lutomirski
@ 2016-05-22 19:31     ` Brian Gerst
  2016-05-22 21:07       ` Andy Lutomirski
  2016-05-23  2:34     ` Josh Poimboeuf
  1 sibling, 1 reply; 31+ messages in thread
From: Brian Gerst @ 2016-05-22 19:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Denys Vlasenko, Ingo Molnar, linux-kernel,
	Borislav Petkov, H. Peter Anvin, X86 ML, Josh Poimboeuf

On Sun, May 22, 2016 at 1:59 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> cc: Josh Poimboeuf: do you care about the exact stack layout of the
> bottom of the stack of an inactive task?
>
> On May 21, 2016 9:05 AM, "Brian Gerst" <brgerst@gmail.com> wrote:
>>
>> Move the low-level context switch code to an out-of-line asm stub instead of
>> using complex inline asm.  This allows constructing a new stack frame for the
>> child process to make it seamlessly flow to ret_from_fork without an extra
>> test and branch in __switch_to().  It also improves code generation for
>> __schedule() by using the C calling convention instead of clobbering all
>> registers.
>
> I like the concept a lot.
>
>>
>> Signed-off-by: Brian Gerst <brgerst@gmail.com>
>> ---
>>  arch/x86/entry/entry_32.S          |  38 ++++++++++
>>  arch/x86/entry/entry_64.S          |  42 +++++++++++-
>>  arch/x86/include/asm/processor.h   |   3 -
>>  arch/x86/include/asm/switch_to.h   | 137 ++++++-------------------------------
>>  arch/x86/include/asm/thread_info.h |   2 -
>>  arch/x86/kernel/asm-offsets.c      |   6 ++
>>  arch/x86/kernel/asm-offsets_32.c   |   5 ++
>>  arch/x86/kernel/asm-offsets_64.c   |   5 ++
>>  arch/x86/kernel/process_32.c       |   8 ++-
>>  arch/x86/kernel/process_64.c       |   7 +-
>>  arch/x86/kernel/smpboot.c          |   1 -
>>  11 files changed, 124 insertions(+), 130 deletions(-)
>>
>> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
>> index ee6fea0..05e5340 100644
>> --- a/arch/x86/entry/entry_32.S
>> +++ b/arch/x86/entry/entry_32.S
>> @@ -204,6 +204,44 @@
>>         POP_GS_EX
>>  .endm
>>
>> +/*
>> + * %eax: prev task
>> + * %edx: next task
>> + */
>> +ENTRY(__switch_to_asm)
>> +       /*
>> +        * Save callee-saved registers
>> +        * This must match the order in struct fork_frame
>> +        * Frame pointer must be last for get_wchan
>> +        */
>> +       pushl   %ebx
>> +       pushl   %edi
>> +       pushl   %esi
>> +       pushl   %ebp
>> +
>> +       /* switch stack */
>> +       movl    %esp, TASK_threadsp(%eax)
>> +       movl    TASK_threadsp(%edx), %esp
>> +
>> +#ifdef CONFIG_CC_STACKPROTECTOR
>> +       movl    TASK_stack_canary(%edx), %ebx
>> +       movl    %ebx, PER_CPU_VAR(stack_canary)+stack_canary_offset
>> +#endif
>> +
>> +       /* restore callee-saved registers */
>> +       popl    %ebp
>> +       popl    %esi
>> +       popl    %edi
>> +       popl    %ebx
>
> This is highly, highly magical.  eax and edx are prev and next, and:

What is so magical about the standard C calling convention (regparm(3)
in the 32-bit case)?  This just passes them right through to
__switch_to().

>> +
>> +       jmp     __switch_to
>
> leaves prev in eax.  This works, but it might be worth a comment.

Not quite: __switch_to() returns 'last', not 'prev'.  The previous
task when this is called is not the same task when the thread wakes
up.

>> +END(__switch_to_asm)
>
>>  /*
>> + * %rdi: prev task
>> + * %rsi: next task
>> + */
>> +ENTRY(__switch_to_asm)
>> +       /*
>> +        * Save callee-saved registers
>> +        * This must match the order in struct fork_frame
>> +        * Frame pointer must be last for get_wchan
>> +        */
>> +       pushq   %rbx
>> +       pushq   %r12
>> +       pushq   %r13
>> +       pushq   %r14
>> +       pushq   %r15
>> +       pushq   %rbp
>> +
>> +       /* switch stack */
>> +       movq    %rsp, TASK_threadsp(%rdi)
>> +       movq    TASK_threadsp(%rsi), %rsp
>> +
>> +#ifdef CONFIG_CC_STACKPROTECTOR
>> +       movq    TASK_stack_canary(%rsi), %rbx
>> +       movq    %rbx, PER_CPU_VAR(irq_stack_union)+stack_canary_offset
>> +#endif
>> +
>> +       /* restore callee-saved registers */
>> +       popq    %rbp
>> +       popq    %r15
>> +       popq    %r14
>> +       popq    %r13
>> +       popq    %r12
>> +       popq    %rbx
>> +
>> +       jmp     __switch_to
>
> Ditto with the magic here.
>
>> +struct fork_frame {
>> +       unsigned long bp;
>> +#ifdef CONFIG_X86_64
>> +       unsigned long r15;
>> +       unsigned long r14;
>> +       unsigned long r13;
>> +       unsigned long r12;
>> +#else
>> +       unsigned long si;
>> +       unsigned long di;
>> +#endif
>> +       unsigned long bx;
>> +       unsigned long ret_addr;
>> +       struct pt_regs regs;
>> +};
>
> This, like the old implementation, is very much geared to the current
> implementation of fork.  Can you split it up:
>
> struct inactive_task_frame {
>     unsigned long bp;
>     ...
>     unsigned long ret_addr;
> };
>
> /* fork works by setting up the child stack so that switch_to will
> land at ret_from_fork with sp pointing at pt_regs */
> struct fork_frame {
>     struct inactive_task_frame switch_frame;
>     struct pt_regs regs;
> };
>
> Then, if and when someone wants to fork into a different type of
> context, they can reuse this.  Also, a future improved unwinder can
> use inactive_task_frame directly to kick off its unwind.

Sounds reasonable.

--
Brian Gerst


* Re: [PATCH 3/4] x86: Rewrite switch_to() code
  2016-05-22 19:31     ` Brian Gerst
@ 2016-05-22 21:07       ` Andy Lutomirski
  0 siblings, 0 replies; 31+ messages in thread
From: Andy Lutomirski @ 2016-05-22 21:07 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Thomas Gleixner, Denys Vlasenko, Ingo Molnar, linux-kernel,
	Borislav Petkov, H. Peter Anvin, X86 ML, Josh Poimboeuf

On Sun, May 22, 2016 at 12:31 PM, Brian Gerst <brgerst@gmail.com> wrote:
> On Sun, May 22, 2016 at 1:59 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> cc: Josh Poimboeuf: do you care about the exact stack layout of the
>> bottom of the stack of an inactive task?
>>
>> On May 21, 2016 9:05 AM, "Brian Gerst" <brgerst@gmail.com> wrote:
>>>
>>> Move the low-level context switch code to an out-of-line asm stub instead of
>>> using complex inline asm.  This allows constructing a new stack frame for the
>>> child process to make it seamlessly flow to ret_from_fork without an extra
>>> test and branch in __switch_to().  It also improves code generation for
>>> __schedule() by using the C calling convention instead of clobbering all
>>> registers.
>>
>> I like the concept a lot.
>>
>>>
>>> Signed-off-by: Brian Gerst <brgerst@gmail.com>
>>> ---
>>>  arch/x86/entry/entry_32.S          |  38 ++++++++++
>>>  arch/x86/entry/entry_64.S          |  42 +++++++++++-
>>>  arch/x86/include/asm/processor.h   |   3 -
>>>  arch/x86/include/asm/switch_to.h   | 137 ++++++-------------------------------
>>>  arch/x86/include/asm/thread_info.h |   2 -
>>>  arch/x86/kernel/asm-offsets.c      |   6 ++
>>>  arch/x86/kernel/asm-offsets_32.c   |   5 ++
>>>  arch/x86/kernel/asm-offsets_64.c   |   5 ++
>>>  arch/x86/kernel/process_32.c       |   8 ++-
>>>  arch/x86/kernel/process_64.c       |   7 +-
>>>  arch/x86/kernel/smpboot.c          |   1 -
>>>  11 files changed, 124 insertions(+), 130 deletions(-)
>>>
>>> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
>>> index ee6fea0..05e5340 100644
>>> --- a/arch/x86/entry/entry_32.S
>>> +++ b/arch/x86/entry/entry_32.S
>>> @@ -204,6 +204,44 @@
>>>         POP_GS_EX
>>>  .endm
>>>
>>> +/*
>>> + * %eax: prev task
>>> + * %edx: next task
>>> + */
>>> +ENTRY(__switch_to_asm)
>>> +       /*
>>> +        * Save callee-saved registers
>>> +        * This must match the order in struct fork_frame
>>> +        * Frame pointer must be last for get_wchan
>>> +        */
>>> +       pushl   %ebx
>>> +       pushl   %edi
>>> +       pushl   %esi
>>> +       pushl   %ebp
>>> +
>>> +       /* switch stack */
>>> +       movl    %esp, TASK_threadsp(%eax)
>>> +       movl    TASK_threadsp(%edx), %esp
>>> +
>>> +#ifdef CONFIG_CC_STACKPROTECTOR
>>> +       movl    TASK_stack_canary(%edx), %ebx
>>> +       movl    %ebx, PER_CPU_VAR(stack_canary)+stack_canary_offset
>>> +#endif
>>> +
>>> +       /* restore callee-saved registers */
>>> +       popl    %ebp
>>> +       popl    %esi
>>> +       popl    %edi
>>> +       popl    %ebx
>>
>> This is highly, highly magical.  eax and edx are prev and next, and:
>
> What is so magical about the standard C calling convention (regparm(3)
> in the 32-bit case)?  This just passes them right through to
> __switch_to().
>

I guess it's not highly magical, just a bit different from what I
expected.  I guess it's okay.

--Andy

>>> +
>>> +       jmp     __switch_to
>>
>> leaves prev in eax.  This works, but it might be worth a comment.
>
> Not quite: __switch_to() returns 'last', not 'prev'.  The previous
> task when this is called is not the same task when the thread wakes
> up.

Right.

I wish switch_to were a normal function instead of a silly macro.

--Andy


* Re: [PATCH 3/4] x86: Rewrite switch_to() code
  2016-05-22 17:59   ` Andy Lutomirski
  2016-05-22 19:31     ` Brian Gerst
@ 2016-05-23  2:34     ` Josh Poimboeuf
  2016-05-23  4:47       ` Andy Lutomirski
  2016-05-23 11:14       ` Brian Gerst
  1 sibling, 2 replies; 31+ messages in thread
From: Josh Poimboeuf @ 2016-05-23  2:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Brian Gerst, Thomas Gleixner, Denys Vlasenko, Ingo Molnar,
	linux-kernel, Borislav Petkov, H. Peter Anvin, X86 ML

On Sun, May 22, 2016 at 10:59:38AM -0700, Andy Lutomirski wrote:
> cc: Josh Poimboeuf: do you care about the exact stack layout of the
> bottom of the stack of an inactive task?

So there's one minor issue with this patch, relating to unwinding the
stack of a newly forked task.  For detecting reliable stacks, the
unwinder needs to unwind all the way to the syscall pt_regs to make sure
the stack is sane.  But for newly forked tasks, that won't be possible
here because the unwinding will stop at the fork_frame instead.

So from an unwinder standpoint it might be nice for copy_thread_tls() to
place a frame pointer on the stack next to the ret_from_fork return
address, so that it would resemble an actual stack frame.  The frame
pointer could probably just be hard-coded to zero.  And then the first
bp in fork_frame would need to be a pointer to it instead of zero.  That
would make it nicely resemble the stack of any other task.

Alternatively I could teach the unwinder that if the unwinding starts at
the fork_frame offset from the end of the stack page, and the saved rbp
is zero, it can assume that it's a newly forked task.  But that seems a
little more brittle to me, as it requires the unwinder to understand
more of the internal workings of the fork code.

But overall I think this patch is a really nice cleanup, and other than
the above minor issue it should be fine with my reliable unwinder, since
rbp is still at the top of the stack.

-- 
Josh
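
A rough sketch of the first option above, for illustration only (the extra
field is hypothetical; nothing like this was settled in the thread):

/*
 * In copy_thread_tls(): make the bottom of the child's stack look like a
 * real call frame so a frame-pointer unwinder terminates cleanly.
 * 'fake_bp' is a hypothetical slot sitting next to ret_addr in fork_frame.
 */
frame->fake_bp  = 0;                               /* hard-coded outermost frame pointer */
frame->ret_addr = (unsigned long) ret_from_fork;
frame->bp       = (unsigned long) &frame->fake_bp; /* saved bp points at the fake frame */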


* Re: [PATCH 3/4] x86: Rewrite switch_to() code
  2016-05-23  2:34     ` Josh Poimboeuf
@ 2016-05-23  4:47       ` Andy Lutomirski
  2016-05-23 11:40         ` Josh Poimboeuf
  2016-05-23 11:14       ` Brian Gerst
  1 sibling, 1 reply; 31+ messages in thread
From: Andy Lutomirski @ 2016-05-23  4:47 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Brian Gerst, Thomas Gleixner, Denys Vlasenko, Ingo Molnar,
	linux-kernel, Borislav Petkov, H. Peter Anvin, X86 ML

On Sun, May 22, 2016 at 7:34 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Sun, May 22, 2016 at 10:59:38AM -0700, Andy Lutomirski wrote:
>> cc: Josh Poimboeuf: do you care about the exact stack layout of the
>> bottom of the stack of an inactive task?
>
> So there's one minor issue with this patch, relating to unwinding the
> stack of a newly forked task.  For detecting reliable stacks, the
> unwinder needs to unwind all the way to the syscall pt_regs to make sure
> the stack is sane.  But for newly forked tasks, that won't be possible
> here because the unwinding will stop at the fork_frame instead.
>
> So from an unwinder standpoint it might be nice for copy_thread_tls() to
> place a frame pointer on the stack next to the ret_from_fork return
> address, so that it would resemble an actual stack frame.  The frame
> pointer could probably just be hard-coded to zero.  And then the first
> bp in fork_frame would need to be a pointer to it instead of zero.  That
> would make it nicely resemble the stack of any other task.
>
> Alternatively I could teach the unwinder that if the unwinding starts at
> the fork_frame offset from the end of the stack page, and the saved rbp
> is zero, it can assume that it's a newly forked task.  But that seems a
> little more brittle to me, as it requires the unwinder to understand
> more of the internal workings of the fork code.
>
> But overall I think this patch is a really nice cleanup, and other than
> the above minor issue it should be fine with my reliable unwinder, since
> rbp is still at the top of the stack.

Is this a regression or is there some reason that it works right
without the patch?

In any event, whatever we settle on for general pt_regs unwinding
should work for this, too.

>
> --
> Josh



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH 3/4] x86: Rewrite switch_to() code
  2016-05-23  2:34     ` Josh Poimboeuf
  2016-05-23  4:47       ` Andy Lutomirski
@ 2016-05-23 11:14       ` Brian Gerst
  2016-05-23 11:47         ` Josh Poimboeuf
  1 sibling, 1 reply; 31+ messages in thread
From: Brian Gerst @ 2016-05-23 11:14 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, Thomas Gleixner, Denys Vlasenko, Ingo Molnar,
	linux-kernel, Borislav Petkov, H. Peter Anvin, X86 ML

On Sun, May 22, 2016 at 10:34 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Sun, May 22, 2016 at 10:59:38AM -0700, Andy Lutomirski wrote:
>> cc: Josh Poimboeuf: do you care about the exact stack layout of the
>> bottom of the stack of an inactive task?
>
> So there's one minor issue with this patch, relating to unwinding the
> stack of a newly forked task.  For detecting reliable stacks, the
> unwinder needs to unwind all the way to the syscall pt_regs to make sure
> the stack is sane.  But for newly forked tasks, that won't be possible
> here because the unwinding will stop at the fork_frame instead.
>
> So from an unwinder standpoint it might be nice for copy_thread_tls() to
> place a frame pointer on the stack next to the ret_from_fork return
> address, so that it would resemble an actual stack frame.  The frame
> pointer could probably just be hard-coded to zero.  And then the first
> bp in fork_frame would need to be a pointer to it instead of zero.  That
> would make it nicely resemble the stack of any other task.
>
> Alternatively I could teach the unwinder that if the unwinding starts at
> the fork_frame offset from the end of the stack page, and the saved rbp
> is zero, it can assume that it's a newly forked task.  But that seems a
> little more brittle to me, as it requires the unwinder to understand
> more of the internal workings of the fork code.
>
> But overall I think this patch is a really nice cleanup, and other than
> the above minor issue it should be fine with my reliable unwinder, since
> rbp is still at the top of the stack.

Ok, how about if it pushes RBP first, and then we teach get_wchan() to add
the fixed offset from thread.sp to get bp?  That way it doesn't have to
push it twice.

--
Brian Gerst
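
A rough sketch of that suggestion, for illustration only (it assumes the
struct inactive_task_frame split suggested earlier in the thread, reordered
so the saved bp sits at a fixed, known offset):

/* get_wchan(), illustrative only: recover the frame pointer via the saved
 * stack pointer instead of expecting it at the very top of the stack. */
struct inactive_task_frame *frame = (struct inactive_task_frame *)p->thread.sp;
unsigned long bp = frame->bp;   /* bp now lives at a fixed offset within the frame */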


* Re: [PATCH 3/4] x86: Rewrite switch_to() code
  2016-05-23  4:47       ` Andy Lutomirski
@ 2016-05-23 11:40         ` Josh Poimboeuf
  2016-05-23 11:49           ` Brian Gerst
  0 siblings, 1 reply; 31+ messages in thread
From: Josh Poimboeuf @ 2016-05-23 11:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Brian Gerst, Thomas Gleixner, Denys Vlasenko, Ingo Molnar,
	linux-kernel, Borislav Petkov, H. Peter Anvin, X86 ML

On Sun, May 22, 2016 at 09:47:22PM -0700, Andy Lutomirski wrote:
> On Sun, May 22, 2016 at 7:34 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Sun, May 22, 2016 at 10:59:38AM -0700, Andy Lutomirski wrote:
> >> cc: Josh Poimboeuf: do you care about the exact stack layout of the
> >> bottom of the stack of an inactive task?
> >
> > So there's one minor issue with this patch, relating to unwinding the
> > stack of a newly forked task.  For detecting reliable stacks, the
> > unwinder needs to unwind all the way to the syscall pt_regs to make sure
> > the stack is sane.  But for newly forked tasks, that won't be possible
> > here because the unwinding will stop at the fork_frame instead.
> >
> > So from an unwinder standpoint it might be nice for copy_thread_tls() to
> > place a frame pointer on the stack next to the ret_from_fork return
> > address, so that it would resemble an actual stack frame.  The frame
> > pointer could probably just be hard-coded to zero.  And then the first
> > bp in fork_frame would need to be a pointer to it instead of zero.  That
> > would make it nicely resemble the stack of any other task.
> >
> > Alternatively I could teach the unwinder that if the unwinding starts at
> > the fork_frame offset from the end of the stack page, and the saved rbp
> > is zero, it can assume that it's a newly forked task.  But that seems a
> > little more brittle to me, as it requires the unwinder to understand
> > more of the internal workings of the fork code.
> >
> > But overall I think this patch is a really nice cleanup, and other than
> > the above minor issue it should be fine with my reliable unwinder, since
> > rbp is still at the top of the stack.
> 
> Is this a regression or is there some reason that it works right
> without the patch?

Without the patch, it uses TIF_FORK to determine the stack is empty.

> In any event, whatever we settle on for general pt_regs unwinding
> should work for this, too.

Yeah, agreed.

-- 
Josh


* Re: [PATCH 3/4] x86: Rewrite switch_to() code
  2016-05-23 11:14       ` Brian Gerst
@ 2016-05-23 11:47         ` Josh Poimboeuf
  2016-05-23 11:49           ` Josh Poimboeuf
  0 siblings, 1 reply; 31+ messages in thread
From: Josh Poimboeuf @ 2016-05-23 11:47 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Andy Lutomirski, Thomas Gleixner, Denys Vlasenko, Ingo Molnar,
	linux-kernel, Borislav Petkov, H. Peter Anvin, X86 ML

On Mon, May 23, 2016 at 07:14:14AM -0400, Brian Gerst wrote:
> On Sun, May 22, 2016 at 10:34 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Sun, May 22, 2016 at 10:59:38AM -0700, Andy Lutomirski wrote:
> >> cc: Josh Poimboeuf: do you care about the exact stack layout of the
> >> bottom of the stack of an inactive task?
> >
> > So there's one minor issue with this patch, relating to unwinding the
> > stack of a newly forked task.  For detecting reliable stacks, the
> > unwinder needs to unwind all the way to the syscall pt_regs to make sure
> > the stack is sane.  But for newly forked tasks, that won't be possible
> > here because the unwinding will stop at the fork_frame instead.
> >
> > So from an unwinder standpoint it might be nice for copy_thread_tls() to
> > place a frame pointer on the stack next to the ret_from_fork return
> > address, so that it would resemble an actual stack frame.  The frame
> > pointer could probably just be hard-coded to zero.  And then the first
> > bp in fork_frame would need to be a pointer to it instead of zero.  That
> > would make it nicely resemble the stack of any other task.
> >
> > Alternatively I could teach the unwinder that if the unwinding starts at
> > the fork_frame offset from the end of the stack page, and the saved rbp
> > is zero, it can assume that it's a newly forked task.  But that seems a
> > little more brittle to me, as it requires the unwinder to understand
> > more of the internal workings of the fork code.
> >
> > But overall I think this patch is a really nice cleanup, and other than
> > the above minor issue it should be fine with my reliable unwinder, since
> > rbp is still at the top of the stack.
> 
> Ok, how about if it pushed RBP first, then we teach get_wchan() to add
> the fixed offset from thread.sp to get bp?  That way it doesn't have to
> push it twice.

In theory I like the idea, and it would work: the unwinder could just
use the inactive_task_frame struct (as Andy suggested) to find the frame
pointer.

But I suspect it would break all existing unwinders, both in-tree and
out-of-tree.  The only out-of-tree one I know of is crash, not sure if
there are more out there.
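For illustration, the fixed-offset idea might look roughly like this -- the
register set, struct name and helper below are assumptions, not settled
code:

struct inactive_task_frame {
	unsigned long r15;
	unsigned long r14;
	unsigned long r13;
	unsigned long r12;
	unsigned long bx;
	unsigned long bp;	/* saved frame pointer, fixed offset from thread.sp */
	unsigned long ret_addr;	/* ret_from_fork for a task that has never run */
};

/* get_wchan() or an unwinder could then derive bp instead of having the
 * fork code store it separately: */
static unsigned long task_saved_bp(struct task_struct *task)
{
	struct inactive_task_frame *frame =
		(struct inactive_task_frame *)task->thread.sp;

	return frame->bp;
}

Nothing gets stored twice that way, but as said above, every tool that
walks a sleeping task's stack would have to learn the new layout.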

-- 
Josh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/4] x86: Rewrite switch_to() code
  2016-05-23 11:47         ` Josh Poimboeuf
@ 2016-05-23 11:49           ` Josh Poimboeuf
  2016-05-23 16:46             ` Josh Poimboeuf
  0 siblings, 1 reply; 31+ messages in thread
From: Josh Poimboeuf @ 2016-05-23 11:49 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Andy Lutomirski, Thomas Gleixner, Denys Vlasenko, Ingo Molnar,
	linux-kernel, Borislav Petkov, H. Peter Anvin, X86 ML

On Mon, May 23, 2016 at 06:47:22AM -0500, Josh Poimboeuf wrote:
> On Mon, May 23, 2016 at 07:14:14AM -0400, Brian Gerst wrote:
> > On Sun, May 22, 2016 at 10:34 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > > On Sun, May 22, 2016 at 10:59:38AM -0700, Andy Lutomirski wrote:
> > >> cc: Josh Poimboeuf: do you care about the exact stack layout of the
> > >> bottom of the stack of an inactive task?
> > >
> > > So there's one minor issue with this patch, relating to unwinding the
> > > stack of a newly forked task.  For detecting reliable stacks, the
> > > unwinder needs to unwind all the way to the syscall pt_regs to make sure
> > > the stack is sane.  But for newly forked tasks, that won't be possible
> > > here because the unwinding will stop at the fork_frame instead.
> > >
> > > So from an unwinder standpoint it might be nice for copy_thread_tls() to
> > > place a frame pointer on the stack next to the ret_from_fork return
> > > address, so that it would resemble an actual stack frame.  The frame
> > > pointer could probably just be hard-coded to zero.  And then the first
> > > bp in fork_frame would need to be a pointer to it instead of zero.  That
> > > would make it nicely resemble the stack of any other task.
> > >
> > > Alternatively I could teach the unwinder that if the unwinding starts at
> > > the fork_frame offset from the end of the stack page, and the saved rbp
> > > is zero, it can assume that it's a newly forked task.  But that seems a
> > > little more brittle to me, as it requires the unwinder to understand
> > > more of the internal workings of the fork code.
> > >
> > > But overall I think this patch is a really nice cleanup, and other than
> > > the above minor issue it should be fine with my reliable unwinder, since
> > > rbp is still at the top of the stack.
> > 
> > Ok, how about if it pushed RBP first, then we teach get_wchan() to add
> > the fixed offset from thread.sp to get bp?  That way it doesn't have to
> > push it twice.
> 
> In theory I like the idea, and it would work: the unwinder could just
> use the inactive_task_frame struct (as Andy suggested) to find the frame
> pointer.
> 
> But I suspect it would break all existing unwinders, both in-tree and
> out-of-tree.  The only out-of-tree one I know of is crash, not sure if
> there are more out there.

I should mention it would only affect those unwinders which know how to
do sleeping kernel tasks.  So generic tools like gdb wouldn't be
affected.

-- 
Josh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/4] x86: Rewrite switch_to() code
  2016-05-23 11:40         ` Josh Poimboeuf
@ 2016-05-23 11:49           ` Brian Gerst
  2016-05-23 12:05             ` Josh Poimboeuf
  0 siblings, 1 reply; 31+ messages in thread
From: Brian Gerst @ 2016-05-23 11:49 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, Thomas Gleixner, Denys Vlasenko, Ingo Molnar,
	linux-kernel, Borislav Petkov, H. Peter Anvin, X86 ML

On Mon, May 23, 2016 at 7:40 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Sun, May 22, 2016 at 09:47:22PM -0700, Andy Lutomirski wrote:
>> On Sun, May 22, 2016 at 7:34 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > On Sun, May 22, 2016 at 10:59:38AM -0700, Andy Lutomirski wrote:
>> >> cc: Josh Poimboeuf: do you care about the exact stack layout of the
>> >> bottom of the stack of an inactive task?
>> >
>> > So there's one minor issue with this patch, relating to unwinding the
>> > stack of a newly forked task.  For detecting reliable stacks, the
>> > unwinder needs to unwind all the way to the syscall pt_regs to make sure
>> > the stack is sane.  But for newly forked tasks, that won't be possible
>> > here because the unwinding will stop at the fork_frame instead.
>> >
>> > So from an unwinder standpoint it might be nice for copy_thread_tls() to
>> > place a frame pointer on the stack next to the ret_from_fork return
>> > address, so that it would resemble an actual stack frame.  The frame
>> > pointer could probably just be hard-coded to zero.  And then the first
>> > bp in fork_frame would need to be a pointer to it instead of zero.  That
>> > would make it nicely resemble the stack of any other task.
>> >
>> > Alternatively I could teach the unwinder that if the unwinding starts at
>> > the fork_frame offset from the end of the stack page, and the saved rbp
>> > is zero, it can assume that it's a newly forked task.  But that seems a
>> > little more brittle to me, as it requires the unwinder to understand
>> > more of the internal workings of the fork code.
>> >
>> > But overall I think this patch is a really nice cleanup, and other than
>> > the above minor issue it should be fine with my reliable unwinder, since
>> > rbp is still at the top of the stack.
>>
>> Is this a regression or is there some reason that it works right
>> without the patch?
>
> Without the patch, it uses TIF_FORK to determine the stack is empty.

Where is this code?  I don't see it in the mainline kernel.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/4] x86: Rewrite switch_to() code
  2016-05-23 11:49           ` Brian Gerst
@ 2016-05-23 12:05             ` Josh Poimboeuf
  0 siblings, 0 replies; 31+ messages in thread
From: Josh Poimboeuf @ 2016-05-23 12:05 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Andy Lutomirski, Thomas Gleixner, Denys Vlasenko, Ingo Molnar,
	linux-kernel, Borislav Petkov, H. Peter Anvin, X86 ML

On Mon, May 23, 2016 at 07:49:37AM -0400, Brian Gerst wrote:
> On Mon, May 23, 2016 at 7:40 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Sun, May 22, 2016 at 09:47:22PM -0700, Andy Lutomirski wrote:
> >> On Sun, May 22, 2016 at 7:34 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > On Sun, May 22, 2016 at 10:59:38AM -0700, Andy Lutomirski wrote:
> >> >> cc: Josh Poimboeuf: do you care about the exact stack layout of the
> >> >> bottom of the stack of an inactive task?
> >> >
> >> > So there's one minor issue with this patch, relating to unwinding the
> >> > stack of a newly forked task.  For detecting reliable stacks, the
> >> > unwinder needs to unwind all the way to the syscall pt_regs to make sure
> >> > the stack is sane.  But for newly forked tasks, that won't be possible
> >> > here because the unwinding will stop at the fork_frame instead.
> >> >
> >> > So from an unwinder standpoint it might be nice for copy_thread_tls() to
> >> > place a frame pointer on the stack next to the ret_from_fork return
> >> > address, so that it would resemble an actual stack frame.  The frame
> >> > pointer could probably just be hard-coded to zero.  And then the first
> >> > bp in fork_frame would need to be a pointer to it instead of zero.  That
> >> > would make it nicely resemble the stack of any other task.
> >> >
> >> > Alternatively I could teach the unwinder that if the unwinding starts at
> >> > the fork_frame offset from the end of the stack page, and the saved rbp
> >> > is zero, it can assume that it's a newly forked task.  But that seems a
> >> > little more brittle to me, as it requires the unwinder to understand
> >> > more of the internal workings of the fork code.
> >> >
> >> > But overall I think this patch is a really nice cleanup, and other than
> >> > the above minor issue it should be fine with my reliable unwinder, since
> >> > rbp is still at the top of the stack.
> >>
> >> Is this a regression or is there some reason that it works right
> >> without the patch?
> >
> > Without the patch, it uses TIF_FORK to determine the stack is empty.
> 
> Where is this code?  I don't see it in the mainline kernel.

Yeah, it hasn't been merged.  Here's the last version:

  https://lkml.kernel.org/r/4d34d452bf8f85c7d6d5f93db1d3eeb4cba335c7.1461875890.git.jpoimboe@redhat.com

-- 
Josh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] x86: Pass kernel thread parameters in fork_frame
  2016-05-21 16:04 ` [PATCH 4/4] x86: Pass kernel thread parameters in fork_frame Brian Gerst
  2016-05-22 18:01   ` Andy Lutomirski
@ 2016-05-23 15:23   ` Josh Poimboeuf
  2016-05-23 15:36     ` Andy Lutomirski
  1 sibling, 1 reply; 31+ messages in thread
From: Josh Poimboeuf @ 2016-05-23 15:23 UTC (permalink / raw)
  To: Brian Gerst
  Cc: x86, linux-kernel, Ingo Molnar, H. Peter Anvin, Denys Vlasenko,
	Andy Lutomirski, Borislav Petkov, Thomas Gleixner

On Sat, May 21, 2016 at 12:04:51PM -0400, Brian Gerst wrote:
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -405,37 +405,29 @@ END(__switch_to_asm)
>   * A newly forked process directly context switches into this address.
>   *
>   * rax: prev task we switched from
> + * rbx: kernel thread func
> + * r12: kernel thread arg
>   */
>  ENTRY(ret_from_fork)
>  	movq	%rax, %rdi
>  	call	schedule_tail			/* rdi: 'prev' task parameter */
>  
> -	testb	$3, CS(%rsp)			/* from kernel_thread? */
> +	testq	%rbx, %rbx			/* from kernel_thread? */
>  	jnz	1f
>  
> -	/*
> -	 * We came from kernel_thread.  This code path is quite twisted, and
> -	 * someone should clean it up.
> -	 *
> -	 * copy_thread_tls stashes the function pointer in RBX and the
> -	 * parameter to be passed in RBP.  The called function is permitted
> -	 * to call do_execve and thereby jump to user mode.
> -	 */
> -	movq	RBP(%rsp), %rdi
> -	call	*RBX(%rsp)
> -	movq	%rax, RAX(%rsp)
> -
> -	/*
> -	 * Fall through as though we're exiting a syscall.  This makes a
> -	 * twisted sort of sense if we just called do_execve.
> -	 */
> -
> -1:
> +2:
>  	movq	%rsp, %rdi
>  	call	syscall_return_slowpath	/* returns with IRQs disabled */
>  	TRACE_IRQS_ON			/* user mode is traced as IRQS on */
>  	SWAPGS
>  	jmp	restore_regs_and_iret
> +
> +1:
> +	/* kernel thread */
> +	movq	%r12, %rdi
> +	call	*%rbx
> +	movq	%rax, RAX(%rsp)
> +	jmp	2b
>  END(ret_from_fork)

It seems really surprising that a kernel thread would be returning to
user space.  It would probably be a good idea to preserve the existing
comments about that.

-- 
Josh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] x86: Pass kernel thread parameters in fork_frame
  2016-05-23 15:23   ` Josh Poimboeuf
@ 2016-05-23 15:36     ` Andy Lutomirski
  2016-05-23 21:04       ` Brian Gerst
  0 siblings, 1 reply; 31+ messages in thread
From: Andy Lutomirski @ 2016-05-23 15:36 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Brian Gerst, X86 ML, linux-kernel, Ingo Molnar, H. Peter Anvin,
	Denys Vlasenko, Borislav Petkov, Thomas Gleixner

On Mon, May 23, 2016 at 8:23 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Sat, May 21, 2016 at 12:04:51PM -0400, Brian Gerst wrote:
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -405,37 +405,29 @@ END(__switch_to_asm)
>>   * A newly forked process directly context switches into this address.
>>   *
>>   * rax: prev task we switched from
>> + * rbx: kernel thread func
>> + * r12: kernel thread arg
>>   */
>>  ENTRY(ret_from_fork)
>>       movq    %rax, %rdi
>>       call    schedule_tail                   /* rdi: 'prev' task parameter */
>>
>> -     testb   $3, CS(%rsp)                    /* from kernel_thread? */
>> +     testq   %rbx, %rbx                      /* from kernel_thread? */
>>       jnz     1f
>>
>> -     /*
>> -      * We came from kernel_thread.  This code path is quite twisted, and
>> -      * someone should clean it up.
>> -      *
>> -      * copy_thread_tls stashes the function pointer in RBX and the
>> -      * parameter to be passed in RBP.  The called function is permitted
>> -      * to call do_execve and thereby jump to user mode.
>> -      */
>> -     movq    RBP(%rsp), %rdi
>> -     call    *RBX(%rsp)
>> -     movq    %rax, RAX(%rsp)
>> -
>> -     /*
>> -      * Fall through as though we're exiting a syscall.  This makes a
>> -      * twisted sort of sense if we just called do_execve.
>> -      */
>> -
>> -1:
>> +2:
>>       movq    %rsp, %rdi
>>       call    syscall_return_slowpath /* returns with IRQs disabled */
>>       TRACE_IRQS_ON                   /* user mode is traced as IRQS on */
>>       SWAPGS
>>       jmp     restore_regs_and_iret
>> +
>> +1:
>> +     /* kernel thread */
>> +     movq    %r12, %rdi
>> +     call    *%rbx
>> +     movq    %rax, RAX(%rsp)
>> +     jmp     2b
>>  END(ret_from_fork)
>
> It seems really surprising that a kernel thread would be returning to
> user space.  It would probably be a good idea to preserve the existing
> comments about that.
>

Agreed.

Which reminds me: at some point, on top of this series, we should
consider either having multiple variants of ret_from_fork or otherwise
generalizing the code.  If and when we implement CPL3 for *kernel*
code (SGX and UEFI come to mind as possible use cases), we probably
won't want to go through syscall_return_slowpath.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/4] x86: Rewrite switch_to() code
  2016-05-23 11:49           ` Josh Poimboeuf
@ 2016-05-23 16:46             ` Josh Poimboeuf
  2016-05-23 17:03               ` Andy Lutomirski
  0 siblings, 1 reply; 31+ messages in thread
From: Josh Poimboeuf @ 2016-05-23 16:46 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Andy Lutomirski, Thomas Gleixner, Denys Vlasenko, Ingo Molnar,
	linux-kernel, Borislav Petkov, H. Peter Anvin, X86 ML,
	Dave Anderson

On Mon, May 23, 2016 at 06:49:03AM -0500, Josh Poimboeuf wrote:
> On Mon, May 23, 2016 at 06:47:22AM -0500, Josh Poimboeuf wrote:
> > On Mon, May 23, 2016 at 07:14:14AM -0400, Brian Gerst wrote:
> > > On Sun, May 22, 2016 at 10:34 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > > > On Sun, May 22, 2016 at 10:59:38AM -0700, Andy Lutomirski wrote:
> > > >> cc: Josh Poimboeuf: do you care about the exact stack layout of the
> > > >> bottom of the stack of an inactive task?
> > > >
> > > > So there's one minor issue with this patch, relating to unwinding the
> > > > stack of a newly forked task.  For detecting reliable stacks, the
> > > > unwinder needs to unwind all the way to the syscall pt_regs to make sure
> > > > the stack is sane.  But for newly forked tasks, that won't be possible
> > > > here because the unwinding will stop at the fork_frame instead.
> > > >
> > > > So from an unwinder standpoint it might be nice for copy_thread_tls() to
> > > > place a frame pointer on the stack next to the ret_from_fork return
> > > > address, so that it would resemble an actual stack frame.  The frame
> > > > pointer could probably just be hard-coded to zero.  And then the first
> > > > bp in fork_frame would need to be a pointer to it instead of zero.  That
> > > > would make it nicely resemble the stack of any other task.
> > > >
> > > > Alternatively I could teach the unwinder that if the unwinding starts at
> > > > the fork_frame offset from the end of the stack page, and the saved rbp
> > > > is zero, it can assume that it's a newly forked task.  But that seems a
> > > > little more brittle to me, as it requires the unwinder to understand
> > > > more of the internal workings of the fork code.
> > > >
> > > > But overall I think this patch is a really nice cleanup, and other than
> > > > the above minor issue it should be fine with my reliable unwinder, since
> > > > rbp is still at the top of the stack.
> > > 
> > > Ok, how about if it pushed RBP first, then we teach get_wchan() to add
> > > the fixed offset from thread.sp to get bp?  That way it doesn't have to
> > > push it twice.
> > 
> > In theory I like the idea, and it would work: the unwinder could just
> > use the inactive_task_frame struct (as Andy suggested) to find the frame
> > pointer.
> > 
> > But I suspect it would break all existing unwinders, both in-tree and
> > out-of-tree.  The only out-of-tree one I know of is crash, not sure if
> > there are more out there.
> 
> I should mention it would only affect those unwinders which know how to
> do sleeping kernel tasks.  So generic tools like gdb wouldn't be
> affected.

[continuing my conversation with myself...]

To clarify, I still think we should do it.  The stack format of a
sleeping task isn't exactly an ABI, and I wouldn't expect many tools to
rely on it.  I can help with the fixing of in-tree unwinders if needed.

Or I could even do the moving of the frame pointer as a separate patch
on top of this one, since it might cause breakage elsewhere.

Adding Dave Anderson (crash maintainer) to cc, as an FYI.

-- 
Josh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/4] x86: Rewrite switch_to() code
  2016-05-23 16:46             ` Josh Poimboeuf
@ 2016-05-23 17:03               ` Andy Lutomirski
  2016-05-23 18:44                 ` Josh Poimboeuf
  2016-07-12 14:16                 ` Josh Poimboeuf
  0 siblings, 2 replies; 31+ messages in thread
From: Andy Lutomirski @ 2016-05-23 17:03 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Brian Gerst, Thomas Gleixner, Denys Vlasenko, Ingo Molnar,
	linux-kernel, Borislav Petkov, H. Peter Anvin, X86 ML,
	Dave Anderson

On Mon, May 23, 2016 at 9:46 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Mon, May 23, 2016 at 06:49:03AM -0500, Josh Poimboeuf wrote:
>> On Mon, May 23, 2016 at 06:47:22AM -0500, Josh Poimboeuf wrote:
>> > On Mon, May 23, 2016 at 07:14:14AM -0400, Brian Gerst wrote:
>> > > On Sun, May 22, 2016 at 10:34 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > > > On Sun, May 22, 2016 at 10:59:38AM -0700, Andy Lutomirski wrote:
>> > > >> cc: Josh Poimboeuf: do you care about the exact stack layout of the
>> > > >> bottom of the stack of an inactive task?
>> > > >
>> > > > So there's one minor issue with this patch, relating to unwinding the
>> > > > stack of a newly forked task.  For detecting reliable stacks, the
>> > > > unwinder needs to unwind all the way to the syscall pt_regs to make sure
>> > > > the stack is sane.  But for newly forked tasks, that won't be possible
>> > > > here because the unwinding will stop at the fork_frame instead.
>> > > >
>> > > > So from an unwinder standpoint it might be nice for copy_thread_tls() to
>> > > > place a frame pointer on the stack next to the ret_from_fork return
>> > > > address, so that it would resemble an actual stack frame.  The frame
>> > > > pointer could probably just be hard-coded to zero.  And then the first
>> > > > bp in fork_frame would need to be a pointer to it instead of zero.  That
>> > > > would make it nicely resemble the stack of any other task.
>> > > >
>> > > > Alternatively I could teach the unwinder that if the unwinding starts at
>> > > > the fork_frame offset from the end of the stack page, and the saved rbp
>> > > > is zero, it can assume that it's a newly forked task.  But that seems a
>> > > > little more brittle to me, as it requires the unwinder to understand
>> > > > more of the internal workings of the fork code.
>> > > >
>> > > > But overall I think this patch is a really nice cleanup, and other than
>> > > > the above minor issue it should be fine with my reliable unwinder, since
>> > > > rbp is still at the top of the stack.
>> > >
>> > > Ok, how about if it pushed RBP first, then we teach get_wchan() to add
>> > > the fixed offset from thread.sp to get bp?  That way it doesn't have to
>> > > push it twice.
>> >
>> > In theory I like the idea, and it would work: the unwinder could just
>> > use the inactive_task_frame struct (as Andy suggested) to find the frame
>> > pointer.
>> >
>> > But I suspect it would break all existing unwinders, both in-tree and
>> > out-of-tree.  The only out-of-tree one I know of is crash, not sure if
>> > there are more out there.
>>
>> I should mention it would only affect those unwinders which know how to
>> do sleeping kernel tasks.  So generic tools like gdb wouldn't be
>> affected.
>
> [continuing my conversation with myself...]
>
> To clarify, I still think we should do it.  The stack format of a
> sleeping task isn't exactly an ABI, and I wouldn't expect many tools to
> rely on it.  I can help with the fixing of in-tree unwinders if needed.
>
> Or I could even do the moving of the frame pointer as a separate patch
> on top of this one, since it might cause breakage elsewhere.

Do you have any understanding of why there are so many unwinder
implementations?  Your reliable unwinder seems to be yet another copy
of more or less the same code.

I'd like to see a single, high-quality unwinder implemented as a state
machine, along the lines of:

struct unwind_state state;
unwind_start_inactive_task(&state, ...); or
unwind_start_pt_regs(&state, regs); or whatever.
unwind_next_frame(&state);

where, after unwind_next_frame, state encodes whatever registers are
known (at least bp and ip, but all the GPRs would be nice and are
probably mandatory for DWARF) and an indication of whether this is a
real frame or a guessed frame (the things that currently show up as
'?').
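For concreteness, a rough sketch of that shape -- the struct contents and
the termination convention are assumptions here, and the start/next
implementations are elided:

struct unwind_state {
	struct task_struct *task;
	unsigned long ip;	/* return address of the current frame */
	unsigned long *bp;	/* frame pointer, when known */
	bool reliable;		/* real frame vs. guessed ('?') address */
};

void unwind_start_inactive_task(struct unwind_state *state,
				struct task_struct *task);
void unwind_start_pt_regs(struct unwind_state *state, struct pt_regs *regs);
bool unwind_next_frame(struct unwind_state *state);	/* false when done */

/* A caller would then just loop: */
static void print_task_stack(struct task_struct *task)
{
	struct unwind_state state;

	unwind_start_inactive_task(&state, task);
	do {
		printk("%s%pS\n", state.reliable ? " " : " ? ",
		       (void *)state.ip);
	} while (unwind_next_frame(&state));
}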

--Andy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/4] x86-32, kgdb: Don't use thread.ip in sleeping_thread_to_gdb_regs()
  2016-05-21 16:04 ` [PATCH 2/4] x86-32, kgdb: Don't use thread.ip in sleeping_thread_to_gdb_regs() Brian Gerst
@ 2016-05-23 17:05   ` Andy Lutomirski
  0 siblings, 0 replies; 31+ messages in thread
From: Andy Lutomirski @ 2016-05-23 17:05 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Thomas Gleixner, Denys Vlasenko, Ingo Molnar, linux-kernel,
	Borislav Petkov, H. Peter Anvin, X86 ML

On May 21, 2016 9:04 AM, "Brian Gerst" <brgerst@gmail.com> wrote:
>
> Match 64-bit and set gdb_regs[GDB_PC] to zero.  thread.ip is always the
> same point in the scheduler (except for newly forked processes), and will
> be removed in a future patch.

Seems reasonable to me.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/4] x86: Rewrite switch_to() code
  2016-05-23 17:03               ` Andy Lutomirski
@ 2016-05-23 18:44                 ` Josh Poimboeuf
  2016-07-12 14:16                 ` Josh Poimboeuf
  1 sibling, 0 replies; 31+ messages in thread
From: Josh Poimboeuf @ 2016-05-23 18:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Brian Gerst, Thomas Gleixner, Denys Vlasenko, Ingo Molnar,
	linux-kernel, Borislav Petkov, H. Peter Anvin, X86 ML,
	Dave Anderson

On Mon, May 23, 2016 at 10:03:54AM -0700, Andy Lutomirski wrote:
> On Mon, May 23, 2016 at 9:46 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Mon, May 23, 2016 at 06:49:03AM -0500, Josh Poimboeuf wrote:
> >> On Mon, May 23, 2016 at 06:47:22AM -0500, Josh Poimboeuf wrote:
> >> > On Mon, May 23, 2016 at 07:14:14AM -0400, Brian Gerst wrote:
> >> > > On Sun, May 22, 2016 at 10:34 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > > > On Sun, May 22, 2016 at 10:59:38AM -0700, Andy Lutomirski wrote:
> >> > > >> cc: Josh Poimboeuf: do you care about the exact stack layout of the
> >> > > >> bottom of the stack of an inactive task?
> >> > > >
> >> > > > So there's one minor issue with this patch, relating to unwinding the
> >> > > > stack of a newly forked task.  For detecting reliable stacks, the
> >> > > > unwinder needs to unwind all the way to the syscall pt_regs to make sure
> >> > > > the stack is sane.  But for newly forked tasks, that won't be possible
> >> > > > here because the unwinding will stop at the fork_frame instead.
> >> > > >
> >> > > > So from an unwinder standpoint it might be nice for copy_thread_tls() to
> >> > > > place a frame pointer on the stack next to the ret_from_fork return
> >> > > > address, so that it would resemble an actual stack frame.  The frame
> >> > > > pointer could probably just be hard-coded to zero.  And then the first
> >> > > > bp in fork_frame would need to be a pointer to it instead of zero.  That
> >> > > > would make it nicely resemble the stack of any other task.
> >> > > >
> >> > > > Alternatively I could teach the unwinder that if the unwinding starts at
> >> > > > the fork_frame offset from the end of the stack page, and the saved rbp
> >> > > > is zero, it can assume that it's a newly forked task.  But that seems a
> >> > > > little more brittle to me, as it requires the unwinder to understand
> >> > > > more of the internal workings of the fork code.
> >> > > >
> >> > > > But overall I think this patch is a really nice cleanup, and other than
> >> > > > the above minor issue it should be fine with my reliable unwinder, since
> >> > > > rbp is still at the top of the stack.
> >> > >
> >> > > Ok, how about if it pushed RBP first, then we teach get_wchan() to add
> >> > > the fixed offset from thread.sp to get bp?  That way it doesn't have to
> >> > > push it twice.
> >> >
> >> > In theory I like the idea, and it would work: the unwinder could just
> >> > use the inactive_task_frame struct (as Andy suggested) to find the frame
> >> > pointer.
> >> >
> >> > But I suspect it would break all existing unwinders, both in-tree and
> >> > out-of-tree.  The only out-of-tree one I know of is crash, not sure if
> >> > there are more out there.
> >>
> >> I should mention it would only affect those unwinders which know how to
> >> do sleeping kernel tasks.  So generic tools like gdb wouldn't be
> >> affected.
> >
> > [continuing my conversation with myself...]
> >
> > To clarify, I still think we should do it.  The stack format of a
> > sleeping task isn't exactly an ABI, and I wouldn't expect many tools to
> > rely on it.  I can help with the fixing of in-tree unwinders if needed.
> >
> > Or I could even do the moving of the frame pointer as a separate patch
> > on top of this one, since it might cause breakage elsewhere.
> 
> Do you have any understanding of why there are so many unwinder
> implementations?  Your reliable unwinder seems to be yet another copy
> of more or less the same code.

Yeah, there are way too many instantiations of stacktrace_ops and there's
definitely a lot of room for consolidation and simplification.  There
are different requirements needed by all the different codes relying on
dump_trace():

- starting with a given pt_regs
- starting with a given task
- whether to skip sched code functions
- whether to skip the random '?' ktext addresses found on the stack
- whether frame pointers are enabled
- whether the stack is reliable
- output to an arch-independent "struct stack_trace" array

So everybody implements their own callbacks for dump_trace().  It's kind
of a big mess.

> I'd like to see a single, high-quality unwinder implemented as a state
> machine, along the lines of:
> 
> struct unwind_state state;
> unwind_start_inactive_task(&state, ...); or
> unwind_start_pt_regs(&state, regs); or whatever.
> unwind_next_frame(&state);
> 
> where, after unwind_next_frame, state encodes whatever registers are
> known (at least bp and ip, but all the GPRs would be nice and are
> probably mandatory for DWARF) and an indication of whether this is a
> real frame or a guessed frame (the things that currently show up as
> '?').

I like the idea of a state machine.  I'll probably end up doing
something like that before introducing the DWARF unwinder.

-- 
Josh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] x86: Pass kernel thread parameters in fork_frame
  2016-05-23 15:36     ` Andy Lutomirski
@ 2016-05-23 21:04       ` Brian Gerst
  0 siblings, 0 replies; 31+ messages in thread
From: Brian Gerst @ 2016-05-23 21:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Poimboeuf, X86 ML, linux-kernel, Ingo Molnar,
	H. Peter Anvin, Denys Vlasenko, Borislav Petkov, Thomas Gleixner

On Mon, May 23, 2016 at 11:36 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, May 23, 2016 at 8:23 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> On Sat, May 21, 2016 at 12:04:51PM -0400, Brian Gerst wrote:
>>> --- a/arch/x86/entry/entry_64.S
>>> +++ b/arch/x86/entry/entry_64.S
>>> @@ -405,37 +405,29 @@ END(__switch_to_asm)
>>>   * A newly forked process directly context switches into this address.
>>>   *
>>>   * rax: prev task we switched from
>>> + * rbx: kernel thread func
>>> + * r12: kernel thread arg
>>>   */
>>>  ENTRY(ret_from_fork)
>>>       movq    %rax, %rdi
>>>       call    schedule_tail                   /* rdi: 'prev' task parameter */
>>>
>>> -     testb   $3, CS(%rsp)                    /* from kernel_thread? */
>>> +     testq   %rbx, %rbx                      /* from kernel_thread? */
>>>       jnz     1f
>>>
>>> -     /*
>>> -      * We came from kernel_thread.  This code path is quite twisted, and
>>> -      * someone should clean it up.
>>> -      *
>>> -      * copy_thread_tls stashes the function pointer in RBX and the
>>> -      * parameter to be passed in RBP.  The called function is permitted
>>> -      * to call do_execve and thereby jump to user mode.
>>> -      */
>>> -     movq    RBP(%rsp), %rdi
>>> -     call    *RBX(%rsp)
>>> -     movq    %rax, RAX(%rsp)
>>> -
>>> -     /*
>>> -      * Fall through as though we're exiting a syscall.  This makes a
>>> -      * twisted sort of sense if we just called do_execve.
>>> -      */
>>> -
>>> -1:
>>> +2:
>>>       movq    %rsp, %rdi
>>>       call    syscall_return_slowpath /* returns with IRQs disabled */
>>>       TRACE_IRQS_ON                   /* user mode is traced as IRQS on */
>>>       SWAPGS
>>>       jmp     restore_regs_and_iret
>>> +
>>> +1:
>>> +     /* kernel thread */
>>> +     movq    %r12, %rdi
>>> +     call    *%rbx
>>> +     movq    %rax, RAX(%rsp)
>>> +     jmp     2b
>>>  END(ret_from_fork)
>>
>> It seems really surprising that a kernel thread would be returning to
>> user space.  It would probably be a good idea to preserve the existing
>> comments about that.
>>
>
> Agreed.
>
> Which reminds me: at some point, on top of this series, we should
> consider either having multiple variants of ret_from_fork or otherwise
> generalizing the code.  If and when we implement CPL3 for *kernel*
> code (SGX and UEFI come to mind as possible use cases), we probably
> won't want to go through syscall_return_slowpath.

I don't understand what you mean by CPL3 kernel code.  Do you mean
something like the VDSO where the kernel maps the code into userspace?
Why would you want to do this?

--
Brian Gerst

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/4] x86: Rewrite switch_to() code
  2016-05-21 16:04 ` [PATCH 3/4] x86: Rewrite switch_to() code Brian Gerst
  2016-05-22 17:59   ` Andy Lutomirski
@ 2016-06-15  1:31   ` Andy Lutomirski
  2016-06-15  8:03     ` Ingo Molnar
  1 sibling, 1 reply; 31+ messages in thread
From: Andy Lutomirski @ 2016-06-15  1:31 UTC (permalink / raw)
  To: Brian Gerst
  Cc: X86 ML, linux-kernel, Ingo Molnar, H. Peter Anvin,
	Denys Vlasenko, Borislav Petkov, Thomas Gleixner

On Sat, May 21, 2016 at 9:04 AM, Brian Gerst <brgerst@gmail.com> wrote:
> Move the low-level context switch code to an out-of-line asm stub instead of
> using complex inline asm.  This allows constructing a new stack frame for the
> child process to make it seamlessly flow to ret_from_fork without an extra
> test and branch in __switch_to().  It also improves code generation for
> __schedule() by using the C calling convention instead of clobbering all
> registers.

Just a heads up: I'm writing some code that conflicts with this patch.
The conflict will be easy to resolve, and, if this patch beats mine to
-tip, I'll rebase.

--Andy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/4] x86: Rewrite switch_to() code
  2016-06-15  1:31   ` Andy Lutomirski
@ 2016-06-15  8:03     ` Ingo Molnar
  2016-06-15 11:52       ` Brian Gerst
  0 siblings, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2016-06-15  8:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Brian Gerst, X86 ML, linux-kernel, H. Peter Anvin,
	Denys Vlasenko, Borislav Petkov, Thomas Gleixner


* Andy Lutomirski <luto@amacapital.net> wrote:

> On Sat, May 21, 2016 at 9:04 AM, Brian Gerst <brgerst@gmail.com> wrote:
>
> > Move the low-level context switch code to an out-of-line asm stub instead of 
> > using complex inline asm.  This allows constructing a new stack frame for the 
> > child process to make it seamlessly flow to ret_from_fork without an extra 
> > test and branch in __switch_to().  It also improves code generation for 
> > __schedule() by using the C calling convention instead of clobbering all 
> > registers.
> 
> Just a heads up: I'm writing some code that conflicts with this patch. The 
> conflict will be easy to resolve, and, if this patch beats mine to -tip, I'll 
> rebase.

So I was expecting another iteration of this switch_to() series, but had no 
fundamental objections to the concept.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/4] x86: Rewrite switch_to() code
  2016-06-15  8:03     ` Ingo Molnar
@ 2016-06-15 11:52       ` Brian Gerst
  0 siblings, 0 replies; 31+ messages in thread
From: Brian Gerst @ 2016-06-15 11:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, X86 ML, linux-kernel, H. Peter Anvin,
	Denys Vlasenko, Borislav Petkov, Thomas Gleixner

On Wed, Jun 15, 2016 at 4:03 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
>> On Sat, May 21, 2016 at 9:04 AM, Brian Gerst <brgerst@gmail.com> wrote:
>>
>> > Move the low-level context switch code to an out-of-line asm stub instead of
>> > using complex inline asm.  This allows constructing a new stack frame for the
>> > child process to make it seamlessly flow to ret_from_fork without an extra
>> > test and branch in __switch_to().  It also improves code generation for
>> > __schedule() by using the C calling convention instead of clobbering all
>> > registers.
>>
>> Just a heads up: I'm writing some code that conflicts with this patch. The
>> conflict will be easy to resolve, and, if this patch beats mine to -tip, I'll
>> rebase.
>
> So I was expecting another iteration of this switch_to() series, but had no
> fundamental objections to the concept.

I'll have it ready soon.  I've just been busy with other things lately.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/4] x86: Rewrite switch_to() code
  2016-05-23 17:03               ` Andy Lutomirski
  2016-05-23 18:44                 ` Josh Poimboeuf
@ 2016-07-12 14:16                 ` Josh Poimboeuf
  1 sibling, 0 replies; 31+ messages in thread
From: Josh Poimboeuf @ 2016-07-12 14:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Brian Gerst, Thomas Gleixner, Denys Vlasenko, Ingo Molnar,
	linux-kernel, Borislav Petkov, H. Peter Anvin, X86 ML,
	Dave Anderson

On Mon, May 23, 2016 at 10:03:54AM -0700, Andy Lutomirski wrote:
> Do you have any understanding of why there are so many unwinder
> implementations?  Your reliable unwinder seems to be yet another copy
> of more or less the same code.
> 
> I'd like to see a single, high-quality unwinder implemented as a state
> machine, along the lines of:
> 
> struct unwind_state state;
> unwind_start_inactive_task(&state, ...); or
> unwind_start_pt_regs(&state, regs); or whatever.
> unwind_next_frame(&state);
> 
> where, after unwind_next_frame, state encodes whatever registers are
> known (at least bp and ip, but all the GPRs would be nice and are
> probably mandatory for DWARF) and an indication of whether this is a
> real frame or a guessed frame (the things that currently show up as
> '?').

FYI, I'm working on something very similar to this which replaces
dump_trace().  The frame pointer encoding patches were going to require
more changes to the unwinder than I expected, and more callback sprawl.
So it looks like it's going to be easier to just go ahead and rewrite
the unwinder first.

-- 
Josh

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2016-07-12 14:16 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-21 16:04 [PATCH 0/4] x86: Rewrite switch_to() Brian Gerst
2016-05-21 16:04 ` [PATCH 1/4] x86: Save return value from kernel_thread Brian Gerst
2016-05-22  1:44   ` Andy Lutomirski
2016-05-22  2:19     ` Brian Gerst
2016-05-21 16:04 ` [PATCH 2/4] x86-32, kgdb: Don't use thread.ip in sleeping_thread_to_gdb_regs() Brian Gerst
2016-05-23 17:05   ` Andy Lutomirski
2016-05-21 16:04 ` [PATCH 3/4] x86: Rewrite switch_to() code Brian Gerst
2016-05-22 17:59   ` Andy Lutomirski
2016-05-22 19:31     ` Brian Gerst
2016-05-22 21:07       ` Andy Lutomirski
2016-05-23  2:34     ` Josh Poimboeuf
2016-05-23  4:47       ` Andy Lutomirski
2016-05-23 11:40         ` Josh Poimboeuf
2016-05-23 11:49           ` Brian Gerst
2016-05-23 12:05             ` Josh Poimboeuf
2016-05-23 11:14       ` Brian Gerst
2016-05-23 11:47         ` Josh Poimboeuf
2016-05-23 11:49           ` Josh Poimboeuf
2016-05-23 16:46             ` Josh Poimboeuf
2016-05-23 17:03               ` Andy Lutomirski
2016-05-23 18:44                 ` Josh Poimboeuf
2016-07-12 14:16                 ` Josh Poimboeuf
2016-06-15  1:31   ` Andy Lutomirski
2016-06-15  8:03     ` Ingo Molnar
2016-06-15 11:52       ` Brian Gerst
2016-05-21 16:04 ` [PATCH 4/4] x86: Pass kernel thread parameters in fork_frame Brian Gerst
2016-05-22 18:01   ` Andy Lutomirski
2016-05-22 19:21     ` Brian Gerst
2016-05-23 15:23   ` Josh Poimboeuf
2016-05-23 15:36     ` Andy Lutomirski
2016-05-23 21:04       ` Brian Gerst

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).