All of lore.kernel.org
* [PATCH 00/12] thread_info cleanups and stack caching
@ 2016-09-13 21:29 Andy Lutomirski
  2016-09-13 21:29 ` [PATCH 01/12] x86/asm: Move 'status' from struct thread_info to struct thread_struct Andy Lutomirski
                   ` (11 more replies)
  0 siblings, 12 replies; 41+ messages in thread
From: Andy Lutomirski @ 2016-09-13 21:29 UTC (permalink / raw)
  To: x86
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Jann Horn, Andy Lutomirski

[Sorry this is late.  I apparently never hit enter on the git
 send-email command.  This is what I meant to send, except that I
 folded in the collect_syscall() fix and redid the rebase (which
 was uneventful).]

This series extensively cleans up thread_info.  thread_info has been
partially redundant with thread_struct for a long time -- both are
places for arch code to add additional per-task variables.
thread_struct is much cleaner: it's always in task_struct, and
there's nothing particularly magical about it.  So this series moves
x86's status field from thread_info to thread_struct and removes
x86's dependence on thread_info's position on the stack.  Then it
opts x86 into a new config option THREAD_INFO_IN_TASK to get rid of
arch-specific thread_info entirely and simply embed a defanged
thread_info (containing only flags) and 'int cpu' into task_struct.
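
A minimal stand-alone sketch of that end state (field names follow the
series; the mockup itself is illustrative userspace C, not kernel code):

```c
#include <assert.h>
#include <stddef.h>

/* Defanged thread_info: only the flags word survives the move. */
struct thread_info {
	unsigned int flags;		/* low level flags */
};

/* thread_info must be the FIRST member of task_struct so that a
 * task_struct pointer can double as a thread_info pointer. */
struct task_struct {
	struct thread_info thread_info;
	int cpu;			/* migrated out of the old thread_info */
	/* ... the rest of task_struct ... */
};

/* With the embedding above, the old stack-math lookup becomes a plain
 * field access (and current_thread_info() a cast of 'current'). */
#define task_thread_info(t)	(&(t)->thread_info)
```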

Once thread_info stops being magical, there's another benefit: we
can free the thread stack as soon as the task is dead (without
waiting for RCU) and then, if vmapped stacks are in use, cache the
entire stack for reuse on the same cpu.

This seems to be an overall speedup of about 0.5-1 µs per
pthread_create/join compared to the old CONFIG_VMAP_STACK=n baseline
in a simple test -- a percpu cache of vmalloced stacks appears to be
a bit faster than a high-order stack allocation, at least when the
cache hits.  (I expect that workloads with a low cache hit rate are
likely to be dominated by other effects anyway.)
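
The caching idea can be sketched in stand-alone userspace C (the real
fork.c code uses vmalloc() and a per-cpu variable; the names and sizes
below are stand-ins):

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CACHED_STACKS 2		/* "two thread stacks per cpu" */
#define THREAD_SIZE (16 * 1024)

/* One cpu's slots; the kernel version is a per-cpu array, and a dead
 * task's stack can be parked here as soon as the task is gone. */
static void *cached_stacks[NR_CACHED_STACKS];

static void *alloc_thread_stack(void)
{
	for (int i = 0; i < NR_CACHED_STACKS; i++) {
		if (cached_stacks[i]) {
			void *s = cached_stacks[i];

			cached_stacks[i] = NULL;
			return s;	/* cache hit: no allocator round trip */
		}
	}
	return malloc(THREAD_SIZE);	/* stand-in for a fresh vmalloc() */
}

static void free_thread_stack(void *stack)
{
	for (int i = 0; i < NR_CACHED_STACKS; i++) {
		if (!cached_stacks[i]) {
			cached_stacks[i] = stack;	/* keep for reuse */
			return;
		}
	}
	free(stack);			/* cache full: really release it */
}
```

A cache hit hands back a still-mapped stack, which is where the
pthread_create/join speedup would come from.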

Changes from before:
 - A bunch of the series is already in 4.8-rc.
 - Added the get_wchan() and collect_syscall() patches.
 - Rebased.

Andy Lutomirski (9):
  x86/asm: Move 'status' from struct thread_info to struct thread_struct
  sched: Allow putting thread_info into task_struct
  x86: Move thread_info into task_struct
  sched: Add try_get_task_stack() and put_task_stack()
  x86/dumpstack: Pin the target stack in save_stack_trace_tsk()
  x86/process: Pin the target stack in get_wchan()
  lib/syscall: Pin the task stack in collect_syscall()
  sched: Free the stack early if CONFIG_THREAD_INFO_IN_TASK
  fork: Cache two thread stacks per cpu if CONFIG_VMAP_STACK is set

Linus Torvalds (2):
  x86/entry: Get rid of pt_regs_to_thread_info()
  um: Stop conflating task_struct::stack with thread_info

Oleg Nesterov (1):
  kthread: to_live_kthread() needs try_get_task_stack()

 arch/x86/Kconfig                   |  1 +
 arch/x86/entry/common.c            | 24 ++++------
 arch/x86/entry/entry_64.S          |  7 ++-
 arch/x86/include/asm/processor.h   | 12 +++++
 arch/x86/include/asm/syscall.h     | 20 ++------
 arch/x86/include/asm/thread_info.h | 69 ++-------------------------
 arch/x86/kernel/asm-offsets.c      |  5 +-
 arch/x86/kernel/fpu/init.c         |  1 -
 arch/x86/kernel/irq_64.c           |  3 +-
 arch/x86/kernel/process.c          | 28 ++++++-----
 arch/x86/kernel/process_64.c       |  4 +-
 arch/x86/kernel/ptrace.c           |  2 +-
 arch/x86/kernel/signal.c           |  2 +-
 arch/x86/kernel/stacktrace.c       |  5 ++
 arch/x86/um/ptrace_32.c            |  8 ++--
 include/linux/init_task.h          | 11 +++++
 include/linux/sched.h              | 66 +++++++++++++++++++++++++-
 include/linux/thread_info.h        | 15 ++++++
 init/Kconfig                       | 10 ++++
 init/init_task.c                   |  7 ++-
 kernel/fork.c                      | 97 ++++++++++++++++++++++++++++++++++----
 kernel/kthread.c                   |  8 +++-
 kernel/sched/core.c                |  4 ++
 kernel/sched/sched.h               |  4 ++
 lib/syscall.c                      | 15 +++++-
 25 files changed, 286 insertions(+), 142 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 01/12] x86/asm: Move 'status' from struct thread_info to struct thread_struct
  2016-09-13 21:29 [PATCH 00/12] thread_info cleanups and stack caching Andy Lutomirski
@ 2016-09-13 21:29 ` Andy Lutomirski
  2016-09-15 10:41   ` [tip:x86/asm] x86/asm: Move the thread_info::status field to thread_struct tip-bot for Andy Lutomirski
  2016-09-13 21:29 ` [PATCH 02/12] x86/entry: Get rid of pt_regs_to_thread_info() Andy Lutomirski
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 41+ messages in thread
From: Andy Lutomirski @ 2016-09-13 21:29 UTC (permalink / raw)
  To: x86
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Jann Horn, Andy Lutomirski

Because sched.h and thread_info.h are a tangled mess, I turned
in_compat_syscall() into a macro.  If we had current_thread_struct()
or similar and we could use it from thread_info.h, then this would
be a bit cleaner.
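
The header-ordering constraint can be mocked up stand-alone (the layout
below is illustrative, not the kernel's actual headers):

```c
#include <assert.h>

/* thread_info.h side: task_struct is only a forward declaration here,
 * so a static inline function could not dereference it.  A macro works
 * because it is expanded at each use site, after the full definitions
 * below are in scope. */
struct task_struct;
extern struct task_struct *current_task;	/* stand-in for 'current' */

#define TS_COMPAT 0x0002		/* 32bit syscall active */
#define in_ia32_syscall() \
	((current_task->thread.status & TS_COMPAT) != 0)

/* processor.h / sched.h side, which a .c file pulls in afterwards: */
struct thread_struct {
	unsigned int status;		/* thread synchronous flags */
};

struct task_struct {
	struct thread_struct thread;
};
```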

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/common.c            |  4 ++--
 arch/x86/include/asm/processor.h   | 12 ++++++++++++
 arch/x86/include/asm/syscall.h     | 20 +++++---------------
 arch/x86/include/asm/thread_info.h | 23 ++++-------------------
 arch/x86/kernel/asm-offsets.c      |  1 -
 arch/x86/kernel/fpu/init.c         |  1 -
 arch/x86/kernel/process_64.c       |  4 ++--
 arch/x86/kernel/ptrace.c           |  2 +-
 arch/x86/kernel/signal.c           |  2 +-
 9 files changed, 27 insertions(+), 42 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 1433f6b4607d..871bbf975d4c 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -209,7 +209,7 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 	 * special case only applies after poking regs and before the
 	 * very next return to user mode.
 	 */
-	ti->status &= ~(TS_COMPAT|TS_I386_REGS_POKED);
+	current->thread.status &= ~(TS_COMPAT|TS_I386_REGS_POKED);
 #endif
 
 	user_enter_irqoff();
@@ -307,7 +307,7 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
 	unsigned int nr = (unsigned int)regs->orig_ax;
 
 #ifdef CONFIG_IA32_EMULATION
-	ti->status |= TS_COMPAT;
+	current->thread.status |= TS_COMPAT;
 #endif
 
 	if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY) {
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index b22fb5a4ff3c..984a7bf17f6a 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -389,6 +389,9 @@ struct thread_struct {
 	unsigned short		fsindex;
 	unsigned short		gsindex;
 #endif
+
+	u32			status;		/* thread synchronous flags */
+
 #ifdef CONFIG_X86_64
 	unsigned long		fsbase;
 	unsigned long		gsbase;
@@ -435,6 +438,15 @@ struct thread_struct {
 };
 
 /*
+ * Thread-synchronous status.
+ *
+ * This is different from the flags in that nobody else
+ * ever touches our thread-synchronous status, so we don't
+ * have to worry about atomic accesses.
+ */
+#define TS_COMPAT		0x0002	/* 32bit syscall active (64BIT)*/
+
+/*
  * Set IOPL bits in EFLAGS from given mask
  */
 static inline void native_set_iopl_mask(unsigned mask)
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index 4e23dd15c661..e3c95e8e61c5 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -60,7 +60,7 @@ static inline long syscall_get_error(struct task_struct *task,
 	 * TS_COMPAT is set for 32-bit syscall entries and then
 	 * remains set until we return to user mode.
 	 */
-	if (task_thread_info(task)->status & (TS_COMPAT|TS_I386_REGS_POKED))
+	if (task->thread.status & (TS_COMPAT|TS_I386_REGS_POKED))
 		/*
 		 * Sign-extend the value so (int)-EFOO becomes (long)-EFOO
 		 * and will match correctly in comparisons.
@@ -116,7 +116,7 @@ static inline void syscall_get_arguments(struct task_struct *task,
 					 unsigned long *args)
 {
 # ifdef CONFIG_IA32_EMULATION
-	if (task_thread_info(task)->status & TS_COMPAT)
+	if (task->thread.status & TS_COMPAT)
 		switch (i) {
 		case 0:
 			if (!n--) break;
@@ -177,7 +177,7 @@ static inline void syscall_set_arguments(struct task_struct *task,
 					 const unsigned long *args)
 {
 # ifdef CONFIG_IA32_EMULATION
-	if (task_thread_info(task)->status & TS_COMPAT)
+	if (task->thread.status & TS_COMPAT)
 		switch (i) {
 		case 0:
 			if (!n--) break;
@@ -234,18 +234,8 @@ static inline void syscall_set_arguments(struct task_struct *task,
 
 static inline int syscall_get_arch(void)
 {
-#ifdef CONFIG_IA32_EMULATION
-	/*
-	 * TS_COMPAT is set for 32-bit syscall entry and then
-	 * remains set until we return to user mode.
-	 *
-	 * x32 tasks should be considered AUDIT_ARCH_X86_64.
-	 */
-	if (task_thread_info(current)->status & TS_COMPAT)
-		return AUDIT_ARCH_I386;
-#endif
-	/* Both x32 and x86_64 are considered "64-bit". */
-	return AUDIT_ARCH_X86_64;
+	/* x32 tasks should be considered AUDIT_ARCH_X86_64. */
+	return in_ia32_syscall() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
 }
 #endif	/* CONFIG_X86_32 */
 
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 494c4b5ada34..c9dcfe7c7e4b 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -55,7 +55,6 @@ struct task_struct;
 struct thread_info {
 	struct task_struct	*task;		/* main task structure */
 	__u32			flags;		/* low level flags */
-	__u32			status;		/* thread synchronous flags */
 	__u32			cpu;		/* current CPU */
 };
 
@@ -253,31 +252,17 @@ static inline int arch_within_stack_frames(const void * const stack,
 
 #endif
 
-/*
- * Thread-synchronous status.
- *
- * This is different from the flags in that nobody else
- * ever touches our thread-synchronous status, so we don't
- * have to worry about atomic accesses.
- */
-#define TS_COMPAT		0x0002	/* 32bit syscall active (64BIT)*/
 #ifdef CONFIG_COMPAT
 #define TS_I386_REGS_POKED	0x0004	/* regs poked by 32-bit ptracer */
 #endif
-
 #ifndef __ASSEMBLY__
 
-static inline bool in_ia32_syscall(void)
-{
 #ifdef CONFIG_X86_32
-	return true;
-#endif
-#ifdef CONFIG_IA32_EMULATION
-	if (current_thread_info()->status & TS_COMPAT)
-		return true;
+#define in_ia32_syscall() true
+#else
+#define in_ia32_syscall() (IS_ENABLED(CONFIG_IA32_EMULATION) && \
+			   current->thread.status & TS_COMPAT)
 #endif
-	return false;
-}
 
 /*
  * Force syscall return via IRET by making it look as if there was
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index db3a0af9b9ec..add5f90b93d4 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -36,7 +36,6 @@ void common(void) {
 
 	BLANK();
 	OFFSET(TI_flags, thread_info, flags);
-	OFFSET(TI_status, thread_info, status);
 
 	BLANK();
 	OFFSET(TASK_addr_limit, task_struct, thread.addr_limit);
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index 93982aebb398..2f2b8c7ccb85 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -317,7 +317,6 @@ static void __init fpu__init_system_ctx_switch(void)
 	on_boot_cpu = 0;
 
 	WARN_ON_FPU(current->thread.fpu.fpstate_active);
-	current_thread_info()->status = 0;
 
 	if (boot_cpu_has(X86_FEATURE_XSAVEOPT) && eagerfpu != DISABLE)
 		eagerfpu = ENABLE;
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index b812cd0d7889..de9acaf2d371 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -510,7 +510,7 @@ void set_personality_ia32(bool x32)
 		current->personality &= ~READ_IMPLIES_EXEC;
 		/* in_compat_syscall() uses the presence of the x32
 		   syscall bit flag to determine compat status */
-		current_thread_info()->status &= ~TS_COMPAT;
+		current->thread.status &= ~TS_COMPAT;
 	} else {
 		set_thread_flag(TIF_IA32);
 		clear_thread_flag(TIF_X32);
@@ -518,7 +518,7 @@ void set_personality_ia32(bool x32)
 			current->mm->context.ia32_compat = TIF_IA32;
 		current->personality |= force_personality32;
 		/* Prepare the first "return" to user space */
-		current_thread_info()->status |= TS_COMPAT;
+		current->thread.status |= TS_COMPAT;
 	}
 }
 EXPORT_SYMBOL_GPL(set_personality_ia32);
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 5b88a1b26fc7..ce94c38cf4d6 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -934,7 +934,7 @@ static int putreg32(struct task_struct *child, unsigned regno, u32 value)
 		 */
 		regs->orig_ax = value;
 		if (syscall_get_nr(child, regs) >= 0)
-			task_thread_info(child)->status |= TS_I386_REGS_POKED;
+			child->thread.status |= TS_I386_REGS_POKED;
 		break;
 
 	case offsetof(struct user32, regs.eflags):
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 04cb3212db2d..da20ecb5397a 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -783,7 +783,7 @@ static inline unsigned long get_nr_restart_syscall(const struct pt_regs *regs)
 	 * than the tracee.
 	 */
 #ifdef CONFIG_IA32_EMULATION
-	if (current_thread_info()->status & (TS_COMPAT|TS_I386_REGS_POKED))
+	if (current->thread.status & (TS_COMPAT|TS_I386_REGS_POKED))
 		return __NR_ia32_restart_syscall;
 #endif
 #ifdef CONFIG_X86_X32_ABI
-- 
2.7.4


* [PATCH 02/12] x86/entry: Get rid of pt_regs_to_thread_info()
  2016-09-13 21:29 [PATCH 00/12] thread_info cleanups and stack caching Andy Lutomirski
  2016-09-13 21:29 ` [PATCH 01/12] x86/asm: Move 'status' from struct thread_info to struct thread_struct Andy Lutomirski
@ 2016-09-13 21:29 ` Andy Lutomirski
  2016-09-15  6:21   ` Ingo Molnar
  2016-09-15 10:42   ` [tip:x86/asm] " tip-bot for Linus Torvalds
  2016-09-13 21:29 ` [PATCH 03/12] um: Stop conflating task_struct::stack with thread_info Andy Lutomirski
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 41+ messages in thread
From: Andy Lutomirski @ 2016-09-13 21:29 UTC (permalink / raw)
  To: x86
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Jann Horn,
	Linus Torvalds, Andy Lutomirski

From: Linus Torvalds <torvalds@linux-foundation.org>

It was a nice optimization while it lasted, but thread_info is moving
and this optimization will no longer work.
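
For reference, the deleted trick recovered thread_info from the pt_regs
pointer by pure stack arithmetic, which is only valid while thread_info
lives at the bottom of the task's stack.  A stand-alone sketch of that
arithmetic (struct sizes and the padding value are stand-ins):

```c
#include <assert.h>

#define THREAD_SIZE (16 * 1024UL)
#define TOP_OF_KERNEL_STACK_PADDING 0	/* stand-in value */

/* Size stands in for the real register frame. */
struct pt_regs {
	unsigned long r[21];
};

/* pt_regs sits at the top of the kernel stack, so stepping past it and
 * subtracting THREAD_SIZE lands on the stack's bottom, where
 * thread_info used to live.  Once thread_info moves into task_struct,
 * this arithmetic points at nothing useful. */
static void *pt_regs_to_thread_info(struct pt_regs *regs)
{
	unsigned long top = (unsigned long)(regs + 1) +
			    TOP_OF_KERNEL_STACK_PADDING;

	return (void *)(top - THREAD_SIZE);
}
```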

Quoting Linus:

    Oh Gods, Andy. That pt_regs_to_thread_info() thing made me want
    to do unspeakable acts on a poor innocent wax figure that looked
    _exactly_ like you.

[changelog written by Andy]
Message-Id: <CA+55aFxvZhBu9U1cqpVm4frv0p5mqu=0TxsSqE-=95ft8HvCVA@mail.gmail.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/common.c | 20 ++++++--------------
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 871bbf975d4c..bdd9cc59d20f 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -31,13 +31,6 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
 
-static struct thread_info *pt_regs_to_thread_info(struct pt_regs *regs)
-{
-	unsigned long top_of_stack =
-		(unsigned long)(regs + 1) + TOP_OF_KERNEL_STACK_PADDING;
-	return (struct thread_info *)(top_of_stack - THREAD_SIZE);
-}
-
 #ifdef CONFIG_CONTEXT_TRACKING
 /* Called on entry from user mode with IRQs off. */
 __visible inline void enter_from_user_mode(void)
@@ -71,7 +64,7 @@ static long syscall_trace_enter(struct pt_regs *regs)
 {
 	u32 arch = in_ia32_syscall() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
 
-	struct thread_info *ti = pt_regs_to_thread_info(regs);
+	struct thread_info *ti = current_thread_info();
 	unsigned long ret = 0;
 	bool emulated = false;
 	u32 work;
@@ -173,18 +166,17 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		/* Disable IRQs and retry */
 		local_irq_disable();
 
-		cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
+		cached_flags = READ_ONCE(current_thread_info()->flags);
 
 		if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
 			break;
-
 	}
 }
 
 /* Called with IRQs disabled. */
 __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 {
-	struct thread_info *ti = pt_regs_to_thread_info(regs);
+	struct thread_info *ti = current_thread_info();
 	u32 cached_flags;
 
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING) && WARN_ON(!irqs_disabled()))
@@ -247,7 +239,7 @@ static void syscall_slow_exit_work(struct pt_regs *regs, u32 cached_flags)
  */
 __visible inline void syscall_return_slowpath(struct pt_regs *regs)
 {
-	struct thread_info *ti = pt_regs_to_thread_info(regs);
+	struct thread_info *ti = current_thread_info();
 	u32 cached_flags = READ_ONCE(ti->flags);
 
 	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
@@ -270,7 +262,7 @@ __visible inline void syscall_return_slowpath(struct pt_regs *regs)
 #ifdef CONFIG_X86_64
 __visible void do_syscall_64(struct pt_regs *regs)
 {
-	struct thread_info *ti = pt_regs_to_thread_info(regs);
+	struct thread_info *ti = current_thread_info();
 	unsigned long nr = regs->orig_ax;
 
 	enter_from_user_mode();
@@ -303,7 +295,7 @@ __visible void do_syscall_64(struct pt_regs *regs)
  */
 static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
 {
-	struct thread_info *ti = pt_regs_to_thread_info(regs);
+	struct thread_info *ti = current_thread_info();
 	unsigned int nr = (unsigned int)regs->orig_ax;
 
 #ifdef CONFIG_IA32_EMULATION
-- 
2.7.4


* [PATCH 03/12] um: Stop conflating task_struct::stack with thread_info
  2016-09-13 21:29 [PATCH 00/12] thread_info cleanups and stack caching Andy Lutomirski
  2016-09-13 21:29 ` [PATCH 01/12] x86/asm: Move 'status' from struct thread_info to struct thread_struct Andy Lutomirski
  2016-09-13 21:29 ` [PATCH 02/12] x86/entry: Get rid of pt_regs_to_thread_info() Andy Lutomirski
@ 2016-09-13 21:29 ` Andy Lutomirski
  2016-09-15  6:21   ` Ingo Molnar
  2016-09-15 10:42   ` [tip:x86/asm] um/Stop " tip-bot for Linus Torvalds
  2016-09-13 21:29 ` [PATCH 04/12] sched: Allow putting thread_info into task_struct Andy Lutomirski
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 41+ messages in thread
From: Andy Lutomirski @ 2016-09-13 21:29 UTC (permalink / raw)
  To: x86
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Jann Horn,
	Linus Torvalds, Andy Lutomirski

From: Linus Torvalds <torvalds@linux-foundation.org>

thread_info may move in the future, so use the accessors.

Andy Lutomirski wrote this changelog message and changed
"task_thread_info(child)->cpu" to "task_cpu(child)".

Message-Id: <CA+55aFxvZhBu9U1cqpVm4frv0p5mqu=0TxsSqE-=95ft8HvCVA@mail.gmail.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/um/ptrace_32.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/um/ptrace_32.c b/arch/x86/um/ptrace_32.c
index ebd4dd6ef73b..2a6f781ce9cf 100644
--- a/arch/x86/um/ptrace_32.c
+++ b/arch/x86/um/ptrace_32.c
@@ -191,7 +191,7 @@ int peek_user(struct task_struct *child, long addr, long data)
 
 static int get_fpregs(struct user_i387_struct __user *buf, struct task_struct *child)
 {
-	int err, n, cpu = ((struct thread_info *) child->stack)->cpu;
+	int err, n, cpu = task_cpu(child);
 	struct user_i387_struct fpregs;
 
 	err = save_i387_registers(userspace_pid[cpu],
@@ -208,7 +208,7 @@ static int get_fpregs(struct user_i387_struct __user *buf, struct task_struct *c
 
 static int set_fpregs(struct user_i387_struct __user *buf, struct task_struct *child)
 {
-	int n, cpu = ((struct thread_info *) child->stack)->cpu;
+	int n, cpu = task_cpu(child);
 	struct user_i387_struct fpregs;
 
 	n = copy_from_user(&fpregs, buf, sizeof(fpregs));
@@ -221,7 +221,7 @@ static int set_fpregs(struct user_i387_struct __user *buf, struct task_struct *c
 
 static int get_fpxregs(struct user_fxsr_struct __user *buf, struct task_struct *child)
 {
-	int err, n, cpu = ((struct thread_info *) child->stack)->cpu;
+	int err, n, cpu = task_cpu(child);
 	struct user_fxsr_struct fpregs;
 
 	err = save_fpx_registers(userspace_pid[cpu], (unsigned long *) &fpregs);
@@ -237,7 +237,7 @@ static int get_fpxregs(struct user_fxsr_struct __user *buf, struct task_struct *
 
 static int set_fpxregs(struct user_fxsr_struct __user *buf, struct task_struct *child)
 {
-	int n, cpu = ((struct thread_info *) child->stack)->cpu;
+	int n, cpu = task_cpu(child);
 	struct user_fxsr_struct fpregs;
 
 	n = copy_from_user(&fpregs, buf, sizeof(fpregs));
-- 
2.7.4


* [PATCH 04/12] sched: Allow putting thread_info into task_struct
  2016-09-13 21:29 [PATCH 00/12] thread_info cleanups and stack caching Andy Lutomirski
                   ` (2 preceding siblings ...)
  2016-09-13 21:29 ` [PATCH 03/12] um: Stop conflating task_struct::stack with thread_info Andy Lutomirski
@ 2016-09-13 21:29 ` Andy Lutomirski
  2016-09-15 10:43   ` [tip:x86/asm] sched/core: " tip-bot for Andy Lutomirski
  2016-09-13 21:29 ` [PATCH 05/12] x86: Move " Andy Lutomirski
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 41+ messages in thread
From: Andy Lutomirski @ 2016-09-13 21:29 UTC (permalink / raw)
  To: x86
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Jann Horn, Andy Lutomirski

If an arch opts in by selecting CONFIG_THREAD_INFO_IN_TASK,
then thread_info is defined as a single 'u32 flags' and is the first
entry of task_struct.  thread_info::task is removed (it serves no
purpose if thread_info is embedded in task_struct), and
thread_info::cpu gets its own slot in task_struct.

This is heavily based on a patch written by Linus.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/init_task.h   |  9 +++++++++
 include/linux/sched.h       | 36 ++++++++++++++++++++++++++++++++++--
 include/linux/thread_info.h | 15 +++++++++++++++
 init/Kconfig                |  7 +++++++
 init/init_task.c            |  7 +++++--
 kernel/sched/sched.h        |  4 ++++
 6 files changed, 74 insertions(+), 4 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index f8834f820ec2..9c04d44eeb3c 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -15,6 +15,8 @@
 #include <net/net_namespace.h>
 #include <linux/sched/rt.h>
 
+#include <asm/thread_info.h>
+
 #ifdef CONFIG_SMP
 # define INIT_PUSHABLE_TASKS(tsk)					\
 	.pushable_tasks = PLIST_NODE_INIT(tsk.pushable_tasks, MAX_PRIO),
@@ -183,12 +185,19 @@ extern struct task_group root_task_group;
 # define INIT_KASAN(tsk)
 #endif
 
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+# define INIT_TASK_TI(tsk) .thread_info = INIT_THREAD_INFO(tsk),
+#else
+# define INIT_TASK_TI(tsk)
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
  */
 #define INIT_TASK(tsk)	\
 {									\
+	INIT_TASK_TI(tsk)						\
 	.state		= 0,						\
 	.stack		= init_stack,					\
 	.usage		= ATOMIC_INIT(2),				\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 20f9f47bcfd0..a287e8b13549 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1458,6 +1458,13 @@ struct tlbflush_unmap_batch {
 };
 
 struct task_struct {
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+	/*
+	 * For reasons of header soup (see current_thread_info()), this
+	 * must be the first element of task_struct.
+	 */
+	struct thread_info thread_info;
+#endif
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
 	atomic_t usage;
@@ -1467,6 +1474,9 @@ struct task_struct {
 #ifdef CONFIG_SMP
 	struct llist_node wake_entry;
 	int on_cpu;
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+	unsigned int cpu;	/* current CPU */
+#endif
 	unsigned int wakee_flips;
 	unsigned long wakee_flip_decay_ts;
 	struct task_struct *last_wakee;
@@ -2588,7 +2598,9 @@ extern void set_curr_task(int cpu, struct task_struct *p);
 void yield(void);
 
 union thread_union {
+#ifndef CONFIG_THREAD_INFO_IN_TASK
 	struct thread_info thread_info;
+#endif
 	unsigned long stack[THREAD_SIZE/sizeof(long)];
 };
 
@@ -3076,10 +3088,26 @@ static inline void threadgroup_change_end(struct task_struct *tsk)
 	cgroup_threadgroup_change_end(tsk);
 }
 
-#ifndef __HAVE_THREAD_FUNCTIONS
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+
+static inline struct thread_info *task_thread_info(struct task_struct *task)
+{
+	return &task->thread_info;
+}
+static inline void *task_stack_page(const struct task_struct *task)
+{
+	return task->stack;
+}
+#define setup_thread_stack(new,old)	do { } while(0)
+static inline unsigned long *end_of_stack(const struct task_struct *task)
+{
+	return task->stack;
+}
+
+#elif !defined(__HAVE_THREAD_FUNCTIONS)
 
 #define task_thread_info(task)	((struct thread_info *)(task)->stack)
-#define task_stack_page(task)	((task)->stack)
+#define task_stack_page(task)	((void *)(task)->stack)
 
 static inline void setup_thread_stack(struct task_struct *p, struct task_struct *org)
 {
@@ -3379,7 +3407,11 @@ static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
 
 static inline unsigned int task_cpu(const struct task_struct *p)
 {
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+	return p->cpu;
+#else
 	return task_thread_info(p)->cpu;
+#endif
 }
 
 static inline int task_node(const struct task_struct *p)
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index cbd8990e2e77..cb0ed342d6b1 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -13,6 +13,21 @@
 struct timespec;
 struct compat_timespec;
 
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+struct thread_info {
+	u32			flags;		/* low level flags */
+};
+
+#define INIT_THREAD_INFO(tsk)			\
+{						\
+	.flags		= 0,			\
+}
+#endif
+
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+#define current_thread_info() ((struct thread_info *)current)
+#endif
+
 /*
  * System call restart block.
  */
diff --git a/init/Kconfig b/init/Kconfig
index cac3f096050d..ec8d43894b02 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -26,6 +26,13 @@ config IRQ_WORK
 config BUILDTIME_EXTABLE_SORT
 	bool
 
+config THREAD_INFO_IN_TASK
+	bool
+	help
+	  Select this to move thread_info off the stack into task_struct.  To
+	  make this work, an arch will need to remove all thread_info fields
+	  except flags and fix any runtime bugs.
+
 menu "General setup"
 
 config BROKEN
diff --git a/init/init_task.c b/init/init_task.c
index ba0a7f362d9e..11f83be1fa79 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -22,5 +22,8 @@ EXPORT_SYMBOL(init_task);
  * Initial thread structure. Alignment of this is handled by a special
  * linker map entry.
  */
-union thread_union init_thread_union __init_task_data =
-	{ INIT_THREAD_INFO(init_task) };
+union thread_union init_thread_union __init_task_data = {
+#ifndef CONFIG_THREAD_INFO_IN_TASK
+	INIT_THREAD_INFO(init_task)
+#endif
+};
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c64fc5114004..3655c9625e5b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1000,7 +1000,11 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 	 * per-task data have been completed by this moment.
 	 */
 	smp_wmb();
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+	p->cpu = cpu;
+#else
 	task_thread_info(p)->cpu = cpu;
+#endif
 	p->wake_cpu = cpu;
 #endif
 }
-- 
2.7.4


* [PATCH 05/12] x86: Move thread_info into task_struct
  2016-09-13 21:29 [PATCH 00/12] thread_info cleanups and stack caching Andy Lutomirski
                   ` (3 preceding siblings ...)
  2016-09-13 21:29 ` [PATCH 04/12] sched: Allow putting thread_info into task_struct Andy Lutomirski
@ 2016-09-13 21:29 ` Andy Lutomirski
  2016-09-15 10:43   ` [tip:x86/asm] " tip-bot for Andy Lutomirski
  2016-09-13 21:29 ` [PATCH 06/12] sched: Add try_get_task_stack() and put_task_stack() Andy Lutomirski
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 41+ messages in thread
From: Andy Lutomirski @ 2016-09-13 21:29 UTC (permalink / raw)
  To: x86
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Jann Horn, Andy Lutomirski

Now that most of the thread_info users have been cleaned up,
this is straightforward.

Most of this code was written by Linus.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/entry/entry_64.S          |  7 ++++--
 arch/x86/include/asm/thread_info.h | 46 --------------------------------------
 arch/x86/kernel/asm-offsets.c      |  4 +---
 arch/x86/kernel/irq_64.c           |  3 +--
 arch/x86/kernel/process.c          |  6 ++---
 6 files changed, 10 insertions(+), 57 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ce8860cccc34..452412cbd6c1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -158,6 +158,7 @@ config X86
 	select SPARSE_IRQ
 	select SRCU
 	select SYSCTL_EXCEPTION_TRACE
+	select THREAD_INFO_IN_TASK
 	select USER_STACKTRACE_SUPPORT
 	select VIRT_TO_BUS
 	select X86_DEV_DMA_OPS			if X86_64
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index c0373d667674..0cfc665d6716 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -179,7 +179,8 @@ GLOBAL(entry_SYSCALL_64_after_swapgs)
 	 * If we need to do entry work or if we guess we'll need to do
 	 * exit work, go straight to the slow path.
 	 */
-	testl	$_TIF_WORK_SYSCALL_ENTRY|_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
+	movq	PER_CPU_VAR(current_task), %r11
+	testl	$_TIF_WORK_SYSCALL_ENTRY|_TIF_ALLWORK_MASK, TASK_TI_flags(%r11)
 	jnz	entry_SYSCALL64_slow_path
 
 entry_SYSCALL_64_fastpath:
@@ -217,7 +218,8 @@ entry_SYSCALL_64_fastpath:
 	 */
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
-	testl	$_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
+	movq	PER_CPU_VAR(current_task), %r11
+	testl	$_TIF_ALLWORK_MASK, TASK_TI_flags(%r11)
 	jnz	1f
 
 	LOCKDEP_SYS_EXIT
@@ -370,6 +372,7 @@ END(ptregs_\func)
 /*
  * %rdi: prev task
  * %rsi: next task
+ * rsi: task we're switching to
  */
 ENTRY(__switch_to_asm)
 	/*
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index c9dcfe7c7e4b..2aaca53c0974 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -52,20 +52,6 @@ struct task_struct;
 #include <asm/cpufeature.h>
 #include <linux/atomic.h>
 
-struct thread_info {
-	struct task_struct	*task;		/* main task structure */
-	__u32			flags;		/* low level flags */
-	__u32			cpu;		/* current CPU */
-};
-
-#define INIT_THREAD_INFO(tsk)			\
-{						\
-	.task		= &tsk,			\
-	.flags		= 0,			\
-	.cpu		= 0,			\
-}
-
-#define init_thread_info	(init_thread_union.thread_info)
 #define init_stack		(init_thread_union.stack)
 
 #else /* !__ASSEMBLY__ */
@@ -157,11 +143,6 @@ struct thread_info {
  */
 #ifndef __ASSEMBLY__
 
-static inline struct thread_info *current_thread_info(void)
-{
-	return (struct thread_info *)(current_top_of_stack() - THREAD_SIZE);
-}
-
 static inline unsigned long current_stack_pointer(void)
 {
 	unsigned long sp;
@@ -223,33 +204,6 @@ static inline int arch_within_stack_frames(const void * const stack,
 # define cpu_current_top_of_stack (cpu_tss + TSS_sp0)
 #endif
 
-/*
- * ASM operand which evaluates to a 'thread_info' address of
- * the current task, if it is known that "reg" is exactly "off"
- * bytes below the top of the stack currently.
- *
- * ( The kernel stack's size is known at build time, it is usually
- *   2 or 4 pages, and the bottom  of the kernel stack contains
- *   the thread_info structure. So to access the thread_info very
- *   quickly from assembly code we can calculate down from the
- *   top of the kernel stack to the bottom, using constant,
- *   build-time calculations only. )
- *
- * For example, to fetch the current thread_info->flags value into %eax
- * on x86-64 defconfig kernels, in syscall entry code where RSP is
- * currently at exactly SIZEOF_PTREGS bytes away from the top of the
- * stack:
- *
- *      mov ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS), %eax
- *
- * will translate to:
- *
- *      8b 84 24 b8 c0 ff ff      mov    -0x3f48(%rsp), %eax
- *
- * which is below the current RSP by almost 16K.
- */
-#define ASM_THREAD_INFO(field, reg, off) ((field)+(off)-THREAD_SIZE)(reg)
-
 #endif
 
 #ifdef CONFIG_COMPAT
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index add5f90b93d4..c62e015b126c 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -35,9 +35,7 @@ void common(void) {
 #endif
 
 	BLANK();
-	OFFSET(TI_flags, thread_info, flags);
-
-	BLANK();
+	OFFSET(TASK_TI_flags, task_struct, thread_info.flags);
 	OFFSET(TASK_addr_limit, task_struct, thread.addr_limit);
 
 	BLANK();
diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
index 4a7903714065..9ebd0b0e73d9 100644
--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -40,8 +40,7 @@ static inline void stack_overflow_check(struct pt_regs *regs)
 	if (user_mode(regs))
 		return;
 
-	if (regs->sp >= curbase + sizeof(struct thread_info) +
-				  sizeof(struct pt_regs) + STACK_TOP_MARGIN &&
+	if (regs->sp >= curbase + sizeof(struct pt_regs) + STACK_TOP_MARGIN &&
 	    regs->sp <= curbase + THREAD_SIZE)
 		return;
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index c1fa790c81cd..0b9ed8ec5226 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -549,9 +549,7 @@ unsigned long get_wchan(struct task_struct *p)
 	 * PADDING
 	 * ----------- top = topmax - TOP_OF_KERNEL_STACK_PADDING
 	 * stack
-	 * ----------- bottom = start + sizeof(thread_info)
-	 * thread_info
-	 * ----------- start
+	 * ----------- bottom = start
 	 *
 	 * The task's stack pointer points at the location where the
 	 * framepointer is stored. The data on the stack is:
@@ -562,7 +560,7 @@ unsigned long get_wchan(struct task_struct *p)
 	 */
 	top = start + THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;
 	top -= 2 * sizeof(unsigned long);
-	bottom = start + sizeof(struct thread_info);
+	bottom = start;
 
 	sp = READ_ONCE(p->thread.sp);
 	if (sp < bottom || sp > top)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 06/12] sched: Add try_get_task_stack() and put_task_stack()
  2016-09-13 21:29 [PATCH 00/12] thread_info cleanups and stack caching Andy Lutomirski
                   ` (4 preceding siblings ...)
  2016-09-13 21:29 ` [PATCH 05/12] x86: Move " Andy Lutomirski
@ 2016-09-13 21:29 ` Andy Lutomirski
  2016-09-13 21:29 ` [PATCH 07/12] kthread: to_live_kthread() needs try_get_task_stack() Andy Lutomirski
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 41+ messages in thread
From: Andy Lutomirski @ 2016-09-13 21:29 UTC (permalink / raw)
  To: x86
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Jann Horn, Andy Lutomirski

There are a few places in the kernel that access stack memory
belonging to a different task.  Before we can start freeing task
stacks before the task_struct is freed, we need a way for those code
paths to pin the stack.
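The caller-side pattern this pair establishes can be sketched in plain C.
This is a userspace stand-in, not kernel code: the struct layout, the
refcount field, and inspect_stack() are illustrative, and the real
refcounting only arrives later in the series (the generic helpers added
by this patch are no-ops):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Userspace stand-in for task_struct: just a stack pointer plus the
 * refcount a later patch adds.  Zero means the stack is already gone. */
struct task {
	void *stack;
	atomic_int stack_refcount;
};

/* Pin the stack: succeed only if at least one reference is still held. */
static void *try_get_task_stack(struct task *tsk)
{
	int old = atomic_load(&tsk->stack_refcount);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&tsk->stack_refcount,
						 &old, old + 1))
			return tsk->stack;
	}
	return NULL;	/* stack already freed; caller must bail out */
}

static void put_task_stack(struct task *tsk)
{
	atomic_fetch_sub(&tsk->stack_refcount, 1);
}

/* The pattern every cross-task stack walker needs to follow. */
static int inspect_stack(struct task *tsk)
{
	void *stack = try_get_task_stack(tsk);

	if (!stack)
		return -1;	/* task died; nothing to dump */
	/* ... safely read the (pinned) stack here ... */
	put_task_stack(tsk);	/* unpin; the last put frees it */
	return 0;
}
```

Any code that walks another task's stack pins it first and bails out
cleanly if the pin fails.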

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/sched.h | 16 ++++++++++++++++
 init/Kconfig          |  3 +++
 2 files changed, 19 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a287e8b13549..a95867267e9f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3094,11 +3094,19 @@ static inline struct thread_info *task_thread_info(struct task_struct *task)
 {
 	return &task->thread_info;
 }
+
+/*
+ * When accessing the stack of a non-current task that might exit, use
+ * try_get_task_stack() instead.  task_stack_page() will return a pointer
+ * that could get freed out from under you.
+ */
 static inline void *task_stack_page(const struct task_struct *task)
 {
 	return task->stack;
 }
+
 #define setup_thread_stack(new,old)	do { } while(0)
+
 static inline unsigned long *end_of_stack(const struct task_struct *task)
 {
 	return task->stack;
@@ -3134,6 +3142,14 @@ static inline unsigned long *end_of_stack(struct task_struct *p)
 }
 
 #endif
+
+static inline void *try_get_task_stack(struct task_struct *tsk)
+{
+	return task_stack_page(tsk);
+}
+
+static inline void put_task_stack(struct task_struct *tsk) {}
+
 #define task_stack_end_corrupted(task) \
 		(*(end_of_stack(task)) != STACK_END_MAGIC)
 
diff --git a/init/Kconfig b/init/Kconfig
index ec8d43894b02..3b9a47fe843b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -33,6 +33,9 @@ config THREAD_INFO_IN_TASK
 	  make this work, an arch will need to remove all thread_info fields
 	  except flags and fix any runtime bugs.
 
+	  One subtle change that will be needed is to use try_get_task_stack()
+	  and put_task_stack() in save_stack_trace_tsk() and get_wchan().
+
 menu "General setup"
 
 config BROKEN
-- 
2.7.4


* [PATCH 07/12] kthread: to_live_kthread() needs try_get_task_stack()
  2016-09-13 21:29 [PATCH 00/12] thread_info cleanups and stack caching Andy Lutomirski
                   ` (5 preceding siblings ...)
  2016-09-13 21:29 ` [PATCH 06/12] sched: Add try_get_task_stack() and put_task_stack() Andy Lutomirski
@ 2016-09-13 21:29 ` Andy Lutomirski
  2016-09-13 21:29 ` [PATCH 08/12] x86/dumpstack: Pin the target stack in save_stack_trace_tsk() Andy Lutomirski
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 41+ messages in thread
From: Andy Lutomirski @ 2016-09-13 21:29 UTC (permalink / raw)
  To: x86
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Jann Horn,
	Oleg Nesterov, Andy Lutomirski

From: Oleg Nesterov <oleg@redhat.com>

get_task_struct(tsk) no longer pins tsk->stack, so all users of
to_live_kthread() should use try_get_task_stack()/put_task_stack() to
protect "struct kthread", which lives on the kthread's stack.

TODO: Kill to_live_kthread(); perhaps we can even kill "struct kthread" too,
and rework kthread_stop() to use task_work_add() to sync with the exiting
kernel thread.

Message-Id: <20160629180357.GA7178@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 kernel/kthread.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 9ff173dca1ae..4ab4c3766a80 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -64,7 +64,7 @@ static inline struct kthread *to_kthread(struct task_struct *k)
 static struct kthread *to_live_kthread(struct task_struct *k)
 {
 	struct completion *vfork = ACCESS_ONCE(k->vfork_done);
-	if (likely(vfork))
+	if (likely(vfork) && try_get_task_stack(k))
 		return __to_kthread(vfork);
 	return NULL;
 }
@@ -425,8 +425,10 @@ void kthread_unpark(struct task_struct *k)
 {
 	struct kthread *kthread = to_live_kthread(k);
 
-	if (kthread)
+	if (kthread) {
 		__kthread_unpark(k, kthread);
+		put_task_stack(k);
+	}
 }
 EXPORT_SYMBOL_GPL(kthread_unpark);
 
@@ -455,6 +457,7 @@ int kthread_park(struct task_struct *k)
 				wait_for_completion(&kthread->parked);
 			}
 		}
+		put_task_stack(k);
 		ret = 0;
 	}
 	return ret;
@@ -490,6 +493,7 @@ int kthread_stop(struct task_struct *k)
 		__kthread_unpark(k, kthread);
 		wake_up_process(k);
 		wait_for_completion(&kthread->exited);
+		put_task_stack(k);
 	}
 	ret = k->exit_code;
 	put_task_struct(k);
-- 
2.7.4


* [PATCH 08/12] x86/dumpstack: Pin the target stack in save_stack_trace_tsk()
  2016-09-13 21:29 [PATCH 00/12] thread_info cleanups and stack caching Andy Lutomirski
                   ` (6 preceding siblings ...)
  2016-09-13 21:29 ` [PATCH 07/12] kthread: to_live_kthread() needs try_get_task_stack() Andy Lutomirski
@ 2016-09-13 21:29 ` Andy Lutomirski
  2016-09-14 14:55   ` Josh Poimboeuf
  2016-09-15  6:37   ` Ingo Molnar
  2016-09-13 21:29 ` [PATCH 09/12] x86/process: Pin the target stack in get_wchan() Andy Lutomirski
                   ` (3 subsequent siblings)
  11 siblings, 2 replies; 41+ messages in thread
From: Andy Lutomirski @ 2016-09-13 21:29 UTC (permalink / raw)
  To: x86
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Jann Horn, Andy Lutomirski

This will prevent a crash if the target task dies before or while
dumping its stack once we start freeing task stacks early.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/stacktrace.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index 4738f5e0f2ab..b3f32fbe3ba4 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -79,9 +79,14 @@ void save_stack_trace_regs(struct pt_regs *regs, struct stack_trace *trace)
 
 void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace)
 {
+	if (!try_get_task_stack(tsk))
+		return;
+
 	dump_trace(tsk, NULL, NULL, 0, &save_stack_ops_nosched, trace);
 	if (trace->nr_entries < trace->max_entries)
 		trace->entries[trace->nr_entries++] = ULONG_MAX;
+
+	put_task_stack(tsk);
 }
 EXPORT_SYMBOL_GPL(save_stack_trace_tsk);
 
-- 
2.7.4


* [PATCH 09/12] x86/process: Pin the target stack in get_wchan()
  2016-09-13 21:29 [PATCH 00/12] thread_info cleanups and stack caching Andy Lutomirski
                   ` (7 preceding siblings ...)
  2016-09-13 21:29 ` [PATCH 08/12] x86/dumpstack: Pin the target stack in save_stack_trace_tsk() Andy Lutomirski
@ 2016-09-13 21:29 ` Andy Lutomirski
  2016-09-17  2:00   ` Jann Horn
  2016-09-13 21:29 ` [PATCH 10/12] lib/syscall: Pin the task stack in collect_syscall() Andy Lutomirski
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 41+ messages in thread
From: Andy Lutomirski @ 2016-09-13 21:29 UTC (permalink / raw)
  To: x86
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Jann Horn, Andy Lutomirski

This will prevent a crash if get_wchan() runs after the task stack
is freed.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/process.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 0b9ed8ec5226..4002b475171c 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -532,15 +532,18 @@ unsigned long thread_saved_pc(struct task_struct *tsk)
  */
 unsigned long get_wchan(struct task_struct *p)
 {
-	unsigned long start, bottom, top, sp, fp, ip;
+	unsigned long start, bottom, top, sp, fp, ip, ret = 0;
 	int count = 0;
 
 	if (!p || p == current || p->state == TASK_RUNNING)
 		return 0;
 
+	if (!try_get_task_stack(p))
+		return 0;
+
 	start = (unsigned long)task_stack_page(p);
 	if (!start)
-		return 0;
+		goto out;
 
 	/*
 	 * Layout of the stack page:
@@ -564,16 +567,21 @@ unsigned long get_wchan(struct task_struct *p)
 
 	sp = READ_ONCE(p->thread.sp);
 	if (sp < bottom || sp > top)
-		return 0;
+		goto out;
 
 	fp = READ_ONCE_NOCHECK(((struct inactive_task_frame *)sp)->bp);
 	do {
 		if (fp < bottom || fp > top)
-			return 0;
+			goto out;
 		ip = READ_ONCE_NOCHECK(*(unsigned long *)(fp + sizeof(unsigned long)));
-		if (!in_sched_functions(ip))
-			return ip;
+		if (!in_sched_functions(ip)) {
+			ret = ip;
+			goto out;
+		}
 		fp = READ_ONCE_NOCHECK(*(unsigned long *)fp);
 	} while (count++ < 16 && p->state != TASK_RUNNING);
-	return 0;
+
+out:
+	put_task_stack(p);
+	return ret;
 }
-- 
2.7.4


* [PATCH 10/12] lib/syscall: Pin the task stack in collect_syscall()
  2016-09-13 21:29 [PATCH 00/12] thread_info cleanups and stack caching Andy Lutomirski
                   ` (8 preceding siblings ...)
  2016-09-13 21:29 ` [PATCH 09/12] x86/process: Pin the target stack in get_wchan() Andy Lutomirski
@ 2016-09-13 21:29 ` Andy Lutomirski
  2016-09-13 21:29 ` [PATCH 11/12] sched: Free the stack early if CONFIG_THREAD_INFO_IN_TASK Andy Lutomirski
  2016-09-13 21:29 ` [PATCH 12/12] fork: Cache two thread stacks per cpu if CONFIG_VMAP_STACK is set Andy Lutomirski
  11 siblings, 0 replies; 41+ messages in thread
From: Andy Lutomirski @ 2016-09-13 21:29 UTC (permalink / raw)
  To: x86
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Jann Horn, Andy Lutomirski

This will avoid a potential read-after-free if collect_syscall()
(e.g. /proc/PID/syscall) is called on an exiting task.

Reported-by: Jann Horn <jann@thejh.net>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 lib/syscall.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/lib/syscall.c b/lib/syscall.c
index e30e03932480..63239e097b13 100644
--- a/lib/syscall.c
+++ b/lib/syscall.c
@@ -7,9 +7,19 @@ static int collect_syscall(struct task_struct *target, long *callno,
 			   unsigned long args[6], unsigned int maxargs,
 			   unsigned long *sp, unsigned long *pc)
 {
-	struct pt_regs *regs = task_pt_regs(target);
-	if (unlikely(!regs))
+	struct pt_regs *regs;
+
+	if (!try_get_task_stack(target)) {
+		/* Task has no stack, so the task isn't in a syscall. */
+		*callno = -1;
+		return 0;
+	}
+
+	regs = task_pt_regs(target);
+	if (unlikely(!regs)) {
+		put_task_stack(target);
 		return -EAGAIN;
+	}
 
 	*sp = user_stack_pointer(regs);
 	*pc = instruction_pointer(regs);
@@ -18,6 +28,7 @@ static int collect_syscall(struct task_struct *target, long *callno,
 	if (*callno != -1L && maxargs > 0)
 		syscall_get_arguments(target, regs, 0, maxargs, args);
 
+	put_task_stack(target);
 	return 0;
 }
 
-- 
2.7.4


* [PATCH 11/12] sched: Free the stack early if CONFIG_THREAD_INFO_IN_TASK
  2016-09-13 21:29 [PATCH 00/12] thread_info cleanups and stack caching Andy Lutomirski
                   ` (9 preceding siblings ...)
  2016-09-13 21:29 ` [PATCH 10/12] lib/syscall: Pin the task stack in collect_syscall() Andy Lutomirski
@ 2016-09-13 21:29 ` Andy Lutomirski
  2016-09-13 21:29 ` [PATCH 12/12] fork: Cache two thread stacks per cpu if CONFIG_VMAP_STACK is set Andy Lutomirski
  11 siblings, 0 replies; 41+ messages in thread
From: Andy Lutomirski @ 2016-09-13 21:29 UTC (permalink / raw)
  To: x86
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Jann Horn,
	Andy Lutomirski, Oleg Nesterov, Peter Zijlstra

We currently keep every task's stack around until the task_struct
itself is freed.  This means that we keep the stack allocation alive
for longer than necessary and that, under load, we free stacks in
big batches whenever RCU drops the last task reference.  Neither of
these is good for reuse of cache-hot memory, and freeing in batches
prevents us from usefully caching small numbers of vmalloced stacks.

On architectures that have thread_info on the stack, we can't easily
change this, but on architectures that set THREAD_INFO_IN_TASK, we
can free it as soon as the task is dead.
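The lifetime rule -- a live task holds one stack reference, the last
put_task_stack() frees the stack, and stack walkers can delay that point
by pinning -- can be modeled in a few lines of userspace C.  All names
here are illustrative stand-ins; the kernel's release path also
un-accounts the stack and clears stack_vm_area:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct task {
	void *stack;
	atomic_int stack_refcount;	/* a live task holds one reference */
};

static int frees;			/* counts release_task_stack() calls */

static void release_task_stack(struct task *tsk)
{
	free(tsk->stack);
	tsk->stack = NULL;
	frees++;
}

/* Drop one reference; whoever drops the last one frees the stack. */
static void put_task_stack(struct task *tsk)
{
	if (atomic_fetch_sub(&tsk->stack_refcount, 1) == 1)
		release_task_stack(tsk);
}

static struct task *spawn(void)
{
	struct task *t = malloc(sizeof(*t));

	t->stack = malloc(16384);
	atomic_init(&t->stack_refcount, 1);	/* the task's own reference */
	return t;
}
```

The point is that the stack can now die as soon as the task does, while a
concurrent pin (a debugger, /proc reader, etc.) only postpones the free
rather than racing with it.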

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/init_task.h |  4 +++-
 include/linux/sched.h     | 14 ++++++++++++++
 kernel/fork.c             | 35 ++++++++++++++++++++++++++++++++++-
 kernel/sched/core.c       |  4 ++++
 4 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 9c04d44eeb3c..325f649d77ff 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -186,7 +186,9 @@ extern struct task_group root_task_group;
 #endif
 
 #ifdef CONFIG_THREAD_INFO_IN_TASK
-# define INIT_TASK_TI(tsk) .thread_info = INIT_THREAD_INFO(tsk),
+# define INIT_TASK_TI(tsk)			\
+	.thread_info = INIT_THREAD_INFO(tsk),	\
+	.stack_refcount = ATOMIC_INIT(1),
 #else
 # define INIT_TASK_TI(tsk)
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a95867267e9f..abb795afc823 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1936,6 +1936,10 @@ struct task_struct {
 #ifdef CONFIG_VMAP_STACK
 	struct vm_struct *stack_vm_area;
 #endif
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+	/* A live task holds one reference. */
+	atomic_t stack_refcount;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
@@ -3143,12 +3147,22 @@ static inline unsigned long *end_of_stack(struct task_struct *p)
 
 #endif
 
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+static inline void *try_get_task_stack(struct task_struct *tsk)
+{
+	return atomic_inc_not_zero(&tsk->stack_refcount) ?
+		task_stack_page(tsk) : NULL;
+}
+
+extern void put_task_stack(struct task_struct *tsk);
+#else
 static inline void *try_get_task_stack(struct task_struct *tsk)
 {
 	return task_stack_page(tsk);
 }
 
 static inline void put_task_stack(struct task_struct *tsk) {}
+#endif
 
 #define task_stack_end_corrupted(task) \
 		(*(end_of_stack(task)) != STACK_END_MAGIC)
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b85f6b2cdcd..e0fd5446df58 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -269,11 +269,40 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
 	}
 }
 
-void free_task(struct task_struct *tsk)
+static void release_task_stack(struct task_struct *tsk)
 {
 	account_kernel_stack(tsk, -1);
 	arch_release_thread_stack(tsk->stack);
 	free_thread_stack(tsk);
+	tsk->stack = NULL;
+#ifdef CONFIG_VMAP_STACK
+	tsk->stack_vm_area = NULL;
+#endif
+}
+
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+void put_task_stack(struct task_struct *tsk)
+{
+	if (atomic_dec_and_test(&tsk->stack_refcount))
+		release_task_stack(tsk);
+}
+#endif
+
+void free_task(struct task_struct *tsk)
+{
+#ifndef CONFIG_THREAD_INFO_IN_TASK
+	/*
+	 * The task is finally done with both the stack and thread_info,
+	 * so free both.
+	 */
+	release_task_stack(tsk);
+#else
+	/*
+	 * If the task had a separate stack allocation, it should be gone
+	 * by now.
+	 */
+	WARN_ON_ONCE(atomic_read(&tsk->stack_refcount) != 0);
+#endif
 	rt_mutex_debug_task_free(tsk);
 	ftrace_graph_exit_task(tsk);
 	put_seccomp_filter(tsk);
@@ -411,6 +440,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 #ifdef CONFIG_VMAP_STACK
 	tsk->stack_vm_area = stack_vm_area;
 #endif
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+	atomic_set(&tsk->stack_refcount, 1);
+#endif
 
 	if (err)
 		goto free_stack;
@@ -1750,6 +1782,7 @@ bad_fork_cleanup_count:
 	atomic_dec(&p->cred->user->processes);
 	exit_creds(p);
 bad_fork_free:
+	put_task_stack(p);
 	free_task(p);
 fork_out:
 	return ERR_PTR(retval);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3d91b63dd2f6..d61bfae91093 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2750,6 +2750,10 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		 * task and put them back on the free list.
 		 */
 		kprobe_flush_task(prev);
+
+		/* Task is done with its stack. */
+		put_task_stack(prev);
+
 		put_task_struct(prev);
 	}
 
-- 
2.7.4


* [PATCH 12/12] fork: Cache two thread stacks per cpu if CONFIG_VMAP_STACK is set
  2016-09-13 21:29 [PATCH 00/12] thread_info cleanups and stack caching Andy Lutomirski
                   ` (10 preceding siblings ...)
  2016-09-13 21:29 ` [PATCH 11/12] sched: Free the stack early if CONFIG_THREAD_INFO_IN_TASK Andy Lutomirski
@ 2016-09-13 21:29 ` Andy Lutomirski
  11 siblings, 0 replies; 41+ messages in thread
From: Andy Lutomirski @ 2016-09-13 21:29 UTC (permalink / raw)
  To: x86
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Jann Horn, Andy Lutomirski

vmalloc is a bit slow, and pounding vmalloc/vfree will eventually
force a global TLB flush.

To reduce pressure on them, if CONFIG_VMAP_STACK is set, cache two
thread stacks per CPU.  This will let us quickly allocate a hopefully
cache-hot, TLB-hot stack under heavy forking workloads (shell-script
style).

On my silly pthread_create benchmark, it saves about 2 µs per
pthread_create+join with CONFIG_VMAP_STACK=y.
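The cache boils down to scanning a fixed array of slots.  Here is a
single-threaded userspace sketch of just that slot logic; the kernel
version uses per-CPU variables with interrupts disabled around the scan,
and the names and sizes below are illustrative:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_CACHED_STACKS 2
#define STACK_SIZE 16384

/* Stand-in for one CPU's cache slots. */
static void *cached_stacks[NR_CACHED_STACKS];
static int slow_allocs;		/* how often we took the slow path */

static void *alloc_stack(void)
{
	for (int i = 0; i < NR_CACHED_STACKS; i++) {
		if (cached_stacks[i]) {
			void *s = cached_stacks[i];

			cached_stacks[i] = NULL;
			return s;	/* cache hit: reuse a warm stack */
		}
	}
	slow_allocs++;
	return malloc(STACK_SIZE);	/* cache miss: slow path (vmalloc) */
}

static void free_stack(void *stack)
{
	for (int i = 0; i < NR_CACHED_STACKS; i++) {
		if (!cached_stacks[i]) {
			cached_stacks[i] = stack;	/* park for next fork */
			return;
		}
	}
	free(stack);		/* both slots full: really free (vfree) */
}
```

Under a fork-heavy workload the common case becomes a cache hit, which
avoids both the allocator and, in the kernel, the eventual global TLB
flush that repeated vfree calls would force.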

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 kernel/fork.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 53 insertions(+), 9 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index e0fd5446df58..128ef350ce53 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -159,15 +159,41 @@ void __weak arch_release_thread_stack(unsigned long *stack)
  * kmemcache based allocator.
  */
 # if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
+
+#ifdef CONFIG_VMAP_STACK
+/*
+ * vmalloc is a bit slow, and calling vfree enough times will force a TLB
+ * flush.  Try to minimize the number of calls by caching stacks.
+ */
+#define NR_CACHED_STACKS 2
+static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);
+#endif
+
 static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
 {
 #ifdef CONFIG_VMAP_STACK
-	void *stack = __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE,
-					   VMALLOC_START, VMALLOC_END,
-					   THREADINFO_GFP | __GFP_HIGHMEM,
-					   PAGE_KERNEL,
-					   0, node,
-					   __builtin_return_address(0));
+	void *stack;
+	int i;
+
+	local_irq_disable();
+	for (i = 0; i < NR_CACHED_STACKS; i++) {
+		struct vm_struct *s = this_cpu_read(cached_stacks[i]);
+
+		if (!s)
+			continue;
+		this_cpu_write(cached_stacks[i], NULL);
+
+		tsk->stack_vm_area = s;
+		local_irq_enable();
+		return s->addr;
+	}
+	local_irq_enable();
+
+	stack = __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE,
+				     VMALLOC_START, VMALLOC_END,
+				     THREADINFO_GFP | __GFP_HIGHMEM,
+				     PAGE_KERNEL,
+				     0, node, __builtin_return_address(0));
 
 	/*
 	 * We can't call find_vm_area() in interrupt context, and
@@ -187,10 +213,28 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
 
 static inline void free_thread_stack(struct task_struct *tsk)
 {
-	if (task_stack_vm_area(tsk))
+#ifdef CONFIG_VMAP_STACK
+	if (task_stack_vm_area(tsk)) {
+		unsigned long flags;
+		int i;
+
+		local_irq_save(flags);
+		for (i = 0; i < NR_CACHED_STACKS; i++) {
+			if (this_cpu_read(cached_stacks[i]))
+				continue;
+
+			this_cpu_write(cached_stacks[i], tsk->stack_vm_area);
+			local_irq_restore(flags);
+			return;
+		}
+		local_irq_restore(flags);
+
 		vfree(tsk->stack);
-	else
-		__free_pages(virt_to_page(tsk->stack), THREAD_SIZE_ORDER);
+		return;
+	}
+#endif
+
+	__free_pages(virt_to_page(tsk->stack), THREAD_SIZE_ORDER);
 }
 # else
 static struct kmem_cache *thread_stack_cache;
-- 
2.7.4


* Re: [PATCH 08/12] x86/dumpstack: Pin the target stack in save_stack_trace_tsk()
  2016-09-13 21:29 ` [PATCH 08/12] x86/dumpstack: Pin the target stack in save_stack_trace_tsk() Andy Lutomirski
@ 2016-09-14 14:55   ` Josh Poimboeuf
  2016-09-14 18:22     ` Andy Lutomirski
  2016-09-15  6:37   ` Ingo Molnar
  1 sibling, 1 reply; 41+ messages in thread
From: Josh Poimboeuf @ 2016-09-14 14:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Borislav Petkov, linux-kernel, Brian Gerst, Jann Horn

On Tue, Sep 13, 2016 at 02:29:28PM -0700, Andy Lutomirski wrote:
> This will prevent a crash if the target task dies before or while
> dumping its stack once we start freeing task stacks early.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Do we need a similar patch for show_stack()?

-- 
Josh


* Re: [PATCH 08/12] x86/dumpstack: Pin the target stack in save_stack_trace_tsk()
  2016-09-14 14:55   ` Josh Poimboeuf
@ 2016-09-14 18:22     ` Andy Lutomirski
  2016-09-14 18:35       ` Josh Poimboeuf
  0 siblings, 1 reply; 41+ messages in thread
From: Andy Lutomirski @ 2016-09-14 18:22 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Jann Horn

On Wed, Sep 14, 2016 at 7:55 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Tue, Sep 13, 2016 at 02:29:28PM -0700, Andy Lutomirski wrote:
>> This will prevent a crash if the target task dies before or while
>> dumping its stack once we start freeing task stacks early.
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>
> Do we need a similar patch for show_stack()?

Probably.  Shouldn't it go in show_stack_log_lvl() instead, though?


* Re: [PATCH 08/12] x86/dumpstack: Pin the target stack in save_stack_trace_tsk()
  2016-09-14 18:22     ` Andy Lutomirski
@ 2016-09-14 18:35       ` Josh Poimboeuf
  2016-09-15 18:04         ` Andy Lutomirski
  0 siblings, 1 reply; 41+ messages in thread
From: Josh Poimboeuf @ 2016-09-14 18:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Jann Horn

On Wed, Sep 14, 2016 at 11:22:00AM -0700, Andy Lutomirski wrote:
> On Wed, Sep 14, 2016 at 7:55 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Tue, Sep 13, 2016 at 02:29:28PM -0700, Andy Lutomirski wrote:
> >> This will prevent a crash if the target task dies before or while
> >> dumping its stack once we start freeing task stacks early.
> >>
> >> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> >
> > Do we need a similar patch for show_stack()?
> 
> Probably.  Shouldn't it go in show_stack_log_lvl() instead, though?

Yeah, that would probably be better.

-- 
Josh


* Re: [PATCH 02/12] x86/entry: Get rid of pt_regs_to_thread_info()
  2016-09-13 21:29 ` [PATCH 02/12] x86/entry: Get rid of pt_regs_to_thread_info() Andy Lutomirski
@ 2016-09-15  6:21   ` Ingo Molnar
  2016-09-15 10:42   ` [tip:x86/asm] " tip-bot for Linus Torvalds
  1 sibling, 0 replies; 41+ messages in thread
From: Ingo Molnar @ 2016-09-15  6:21 UTC (permalink / raw)
  To: Andy Lutomirski, Linus Torvalds
  Cc: x86, Borislav Petkov, linux-kernel, Brian Gerst, Jann Horn,
	Linus Torvalds


* Andy Lutomirski <luto@kernel.org> wrote:

> From: Linus Torvalds <torvalds@linux-foundation.org>
> 
> It was a nice optimization while it lasted, but thread_info is moving
> and this optimization will no longer work.
> 
> Quoting Linus:
> 
>     Oh Gods, Andy. That pt_regs_to_thread_info() thing made me want
>     to do unspeakable acts on a poor innocent wax figure that looked
>     _exactly_ like you.
> 
> [changelog written by Andy]

Linus, since this came from you I presume this is still fine, but I'm adding your 
Signed-off-by as well to make the SOB chain neat:

  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Is this ok?

Thanks,

	Ingo


* Re: [PATCH 03/12] um: Stop conflating task_struct::stack with thread_info
  2016-09-13 21:29 ` [PATCH 03/12] um: Stop conflating task_struct::stack with thread_info Andy Lutomirski
@ 2016-09-15  6:21   ` Ingo Molnar
  2016-09-15 10:42   ` [tip:x86/asm] um/Stop " tip-bot for Linus Torvalds
  1 sibling, 0 replies; 41+ messages in thread
From: Ingo Molnar @ 2016-09-15  6:21 UTC (permalink / raw)
  To: Andy Lutomirski, Linus Torvalds
  Cc: x86, Borislav Petkov, linux-kernel, Brian Gerst, Jann Horn,
	Linus Torvalds


* Andy Lutomirski <luto@kernel.org> wrote:

> From: Linus Torvalds <torvalds@linux-foundation.org>
> 
> thread_info may move in the future, so use the accessors.
> 
> Andy Lutomirski wrote this changelog message and changed
> "task_thread_info(child)->cpu" to "task_cpu(child)".
> 
> Message-Id: <CA+55aFxvZhBu9U1cqpVm4frv0p5mqu=0TxsSqE-=95ft8HvCVA@mail.gmail.com>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Like the previous patch I'm adding Linus's primary author SOB:

  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Thanks,

	Ingo


* Re: [PATCH 08/12] x86/dumpstack: Pin the target stack in save_stack_trace_tsk()
  2016-09-13 21:29 ` [PATCH 08/12] x86/dumpstack: Pin the target stack in save_stack_trace_tsk() Andy Lutomirski
  2016-09-14 14:55   ` Josh Poimboeuf
@ 2016-09-15  6:37   ` Ingo Molnar
       [not found]     ` <CA+55aFxt=HLrELBE=BXUrWdh6LYs4gtu9S=yCruiDffq4HN80w@mail.gmail.com>
  1 sibling, 1 reply; 41+ messages in thread
From: Ingo Molnar @ 2016-09-15  6:37 UTC (permalink / raw)
  To: Andy Lutomirski, Linus Torvalds
  Cc: x86, Borislav Petkov, linux-kernel, Brian Gerst, Jann Horn,
	Thomas Gleixner, H. Peter Anvin


* Andy Lutomirski <luto@kernel.org> wrote:

> This will prevent a crash if the target task dies before or while
> dumping its stack once we start freeing task stacks early.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/kernel/stacktrace.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
> index 4738f5e0f2ab..b3f32fbe3ba4 100644
> --- a/arch/x86/kernel/stacktrace.c
> +++ b/arch/x86/kernel/stacktrace.c
> @@ -79,9 +79,14 @@ void save_stack_trace_regs(struct pt_regs *regs, struct stack_trace *trace)
>  
>  void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace)
>  {
> +	if (!try_get_task_stack(tsk))
> +		return;
> +
>  	dump_trace(tsk, NULL, NULL, 0, &save_stack_ops_nosched, trace);
>  	if (trace->nr_entries < trace->max_entries)
>  		trace->entries[trace->nr_entries++] = ULONG_MAX;
> +
> +	put_task_stack(tsk);
>  }
>  EXPORT_SYMBOL_GPL(save_stack_trace_tsk);

So I very much like the first half of the series, the thread_info merge into 
task_struct (yay!) and I am in the process of applying those bits to -tip.

So the above (not yet fully correct) patch is in preparation for a later change
you are doing in this series:

    [PATCH 11/12] sched: Free the stack early if CONFIG_THREAD_INFO_IN_TASK

But I am having serious second thoughts about the fact that we now have to
sprinkle non-obvious try_get_task_stack()/put_task_stack() pairs into debugging
code that never really needed anything like that - pairs which are easy to get
wrong and easy to miss, and if we miss them the result is slightly buggy, very
unpleasant, racy, crashy behavior of the kernel.

Can we just ... not do the delayed freeing?

I mean, the cache hotness arguments don't look overly convincing to me: if we just 
start not using a piece of formerly cache hot memory then those cache lines will 
eventually decay naturally in whatever LRU scheme the CPU is using and will be 
reused without much harm done. It's just "another day in CPU land" that happens 
all the time. Yes, we could reduce the latency of the freeing and thus slightly 
improve the locality of other bits ... but is it really measurable or noticeable 
in any fashion. It's like task/thread fork/clone/exit is _that_ much of a fast 
path.

And then there's also the whole CONFIG_THREAD_INFO_IN_TASK #ifdeffery spreading.

I.e. I just don't see the justification for this kind of object lifetime
asymmetry and hard-to-debug fragility.

Am I missing something?

Thanks,

	Ingo


* Re: [PATCH 08/12] x86/dumpstack: Pin the target stack in save_stack_trace_tsk()
       [not found]     ` <CA+55aFxt=HLrELBE=BXUrWdh6LYs4gtu9S=yCruiDffq4HN80w@mail.gmail.com>
@ 2016-09-15  9:27       ` Ingo Molnar
  0 siblings, 0 replies; 41+ messages in thread
From: Ingo Molnar @ 2016-09-15  9:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Borislav Petkov, Linux Kernel Mailing List,
	Jann Horn, Andrew Lutomirski, Brian Gerst, Peter Anvin,
	the arch/x86 maintainers


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> It absolutely is - compared to the vmalloc costs. Andy has quite noticeable 
> slowdowns without the stack reuse - and the stack reuse requires the delayed 
> freeing. With the stack reuse, fork/exit actually sped up.

Ok, great, that's convincing!

I'm still applying it two-phase and I'll wait for v2 of the second half of the 
series, to have the fix Josh alluded to.

Thanks,

	Ingo


* [tip:x86/asm] x86/asm: Move the thread_info::status field to thread_struct
  2016-09-13 21:29 ` [PATCH 01/12] x86/asm: Move 'status' from struct thread_info to struct thread_struct Andy Lutomirski
@ 2016-09-15 10:41   ` tip-bot for Andy Lutomirski
  0 siblings, 0 replies; 41+ messages in thread
From: tip-bot for Andy Lutomirski @ 2016-09-15 10:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: brgerst, dvlasenk, hpa, tglx, luto, peterz, jann, linux-kernel,
	mingo, bp, jpoimboe, torvalds

Commit-ID:  b9d989c7218ac922185d82ad46f3e58b27a4bea9
Gitweb:     http://git.kernel.org/tip/b9d989c7218ac922185d82ad46f3e58b27a4bea9
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Tue, 13 Sep 2016 14:29:21 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 15 Sep 2016 08:25:12 +0200

x86/asm: Move the thread_info::status field to thread_struct

Because sched.h and thread_info.h are a tangled mess, I turned
in_compat_syscall() into a macro.  If we had current_thread_struct()
or similar and we could use it from thread_info.h, then this would
be a bit cleaner.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jann Horn <jann@thejh.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/ccc8a1b2f41f9c264a41f771bb4a6539a642ad72.1473801993.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/common.c            |  4 ++--
 arch/x86/include/asm/processor.h   | 12 ++++++++++++
 arch/x86/include/asm/syscall.h     | 20 +++++---------------
 arch/x86/include/asm/thread_info.h | 23 ++++-------------------
 arch/x86/kernel/asm-offsets.c      |  1 -
 arch/x86/kernel/fpu/init.c         |  1 -
 arch/x86/kernel/process_64.c       |  4 ++--
 arch/x86/kernel/ptrace.c           |  2 +-
 arch/x86/kernel/signal.c           |  2 +-
 9 files changed, 27 insertions(+), 42 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 1433f6b..871bbf9 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -209,7 +209,7 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 	 * special case only applies after poking regs and before the
 	 * very next return to user mode.
 	 */
-	ti->status &= ~(TS_COMPAT|TS_I386_REGS_POKED);
+	current->thread.status &= ~(TS_COMPAT|TS_I386_REGS_POKED);
 #endif
 
 	user_enter_irqoff();
@@ -307,7 +307,7 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
 	unsigned int nr = (unsigned int)regs->orig_ax;
 
 #ifdef CONFIG_IA32_EMULATION
-	ti->status |= TS_COMPAT;
+	current->thread.status |= TS_COMPAT;
 #endif
 
 	if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY) {
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index b22fb5a..984a7bf 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -389,6 +389,9 @@ struct thread_struct {
 	unsigned short		fsindex;
 	unsigned short		gsindex;
 #endif
+
+	u32			status;		/* thread synchronous flags */
+
 #ifdef CONFIG_X86_64
 	unsigned long		fsbase;
 	unsigned long		gsbase;
@@ -435,6 +438,15 @@ struct thread_struct {
 };
 
 /*
+ * Thread-synchronous status.
+ *
+ * This is different from the flags in that nobody else
+ * ever touches our thread-synchronous status, so we don't
+ * have to worry about atomic accesses.
+ */
+#define TS_COMPAT		0x0002	/* 32bit syscall active (64BIT)*/
+
+/*
  * Set IOPL bits in EFLAGS from given mask
  */
 static inline void native_set_iopl_mask(unsigned mask)
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index 4e23dd1..e3c95e8 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -60,7 +60,7 @@ static inline long syscall_get_error(struct task_struct *task,
 	 * TS_COMPAT is set for 32-bit syscall entries and then
 	 * remains set until we return to user mode.
 	 */
-	if (task_thread_info(task)->status & (TS_COMPAT|TS_I386_REGS_POKED))
+	if (task->thread.status & (TS_COMPAT|TS_I386_REGS_POKED))
 		/*
 		 * Sign-extend the value so (int)-EFOO becomes (long)-EFOO
 		 * and will match correctly in comparisons.
@@ -116,7 +116,7 @@ static inline void syscall_get_arguments(struct task_struct *task,
 					 unsigned long *args)
 {
 # ifdef CONFIG_IA32_EMULATION
-	if (task_thread_info(task)->status & TS_COMPAT)
+	if (task->thread.status & TS_COMPAT)
 		switch (i) {
 		case 0:
 			if (!n--) break;
@@ -177,7 +177,7 @@ static inline void syscall_set_arguments(struct task_struct *task,
 					 const unsigned long *args)
 {
 # ifdef CONFIG_IA32_EMULATION
-	if (task_thread_info(task)->status & TS_COMPAT)
+	if (task->thread.status & TS_COMPAT)
 		switch (i) {
 		case 0:
 			if (!n--) break;
@@ -234,18 +234,8 @@ static inline void syscall_set_arguments(struct task_struct *task,
 
 static inline int syscall_get_arch(void)
 {
-#ifdef CONFIG_IA32_EMULATION
-	/*
-	 * TS_COMPAT is set for 32-bit syscall entry and then
-	 * remains set until we return to user mode.
-	 *
-	 * x32 tasks should be considered AUDIT_ARCH_X86_64.
-	 */
-	if (task_thread_info(current)->status & TS_COMPAT)
-		return AUDIT_ARCH_I386;
-#endif
-	/* Both x32 and x86_64 are considered "64-bit". */
-	return AUDIT_ARCH_X86_64;
+	/* x32 tasks should be considered AUDIT_ARCH_X86_64. */
+	return in_ia32_syscall() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
 }
 #endif	/* CONFIG_X86_32 */
 
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 494c4b5..c9dcfe7 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -55,7 +55,6 @@ struct task_struct;
 struct thread_info {
 	struct task_struct	*task;		/* main task structure */
 	__u32			flags;		/* low level flags */
-	__u32			status;		/* thread synchronous flags */
 	__u32			cpu;		/* current CPU */
 };
 
@@ -253,31 +252,17 @@ static inline int arch_within_stack_frames(const void * const stack,
 
 #endif
 
-/*
- * Thread-synchronous status.
- *
- * This is different from the flags in that nobody else
- * ever touches our thread-synchronous status, so we don't
- * have to worry about atomic accesses.
- */
-#define TS_COMPAT		0x0002	/* 32bit syscall active (64BIT)*/
 #ifdef CONFIG_COMPAT
 #define TS_I386_REGS_POKED	0x0004	/* regs poked by 32-bit ptracer */
 #endif
-
 #ifndef __ASSEMBLY__
 
-static inline bool in_ia32_syscall(void)
-{
 #ifdef CONFIG_X86_32
-	return true;
-#endif
-#ifdef CONFIG_IA32_EMULATION
-	if (current_thread_info()->status & TS_COMPAT)
-		return true;
+#define in_ia32_syscall() true
+#else
+#define in_ia32_syscall() (IS_ENABLED(CONFIG_IA32_EMULATION) && \
+			   current->thread.status & TS_COMPAT)
 #endif
-	return false;
-}
 
 /*
  * Force syscall return via IRET by making it look as if there was
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index db3a0af..add5f90 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -36,7 +36,6 @@ void common(void) {
 
 	BLANK();
 	OFFSET(TI_flags, thread_info, flags);
-	OFFSET(TI_status, thread_info, status);
 
 	BLANK();
 	OFFSET(TASK_addr_limit, task_struct, thread.addr_limit);
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index 93982ae..2f2b8c7 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -317,7 +317,6 @@ static void __init fpu__init_system_ctx_switch(void)
 	on_boot_cpu = 0;
 
 	WARN_ON_FPU(current->thread.fpu.fpstate_active);
-	current_thread_info()->status = 0;
 
 	if (boot_cpu_has(X86_FEATURE_XSAVEOPT) && eagerfpu != DISABLE)
 		eagerfpu = ENABLE;
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index b812cd0..de9acaf 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -510,7 +510,7 @@ void set_personality_ia32(bool x32)
 		current->personality &= ~READ_IMPLIES_EXEC;
 		/* in_compat_syscall() uses the presence of the x32
 		   syscall bit flag to determine compat status */
-		current_thread_info()->status &= ~TS_COMPAT;
+		current->thread.status &= ~TS_COMPAT;
 	} else {
 		set_thread_flag(TIF_IA32);
 		clear_thread_flag(TIF_X32);
@@ -518,7 +518,7 @@ void set_personality_ia32(bool x32)
 			current->mm->context.ia32_compat = TIF_IA32;
 		current->personality |= force_personality32;
 		/* Prepare the first "return" to user space */
-		current_thread_info()->status |= TS_COMPAT;
+		current->thread.status |= TS_COMPAT;
 	}
 }
 EXPORT_SYMBOL_GPL(set_personality_ia32);
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 5b88a1b..ce94c38 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -934,7 +934,7 @@ static int putreg32(struct task_struct *child, unsigned regno, u32 value)
 		 */
 		regs->orig_ax = value;
 		if (syscall_get_nr(child, regs) >= 0)
-			task_thread_info(child)->status |= TS_I386_REGS_POKED;
+			child->thread.status |= TS_I386_REGS_POKED;
 		break;
 
 	case offsetof(struct user32, regs.eflags):
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 04cb321..da20ecb 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -783,7 +783,7 @@ static inline unsigned long get_nr_restart_syscall(const struct pt_regs *regs)
 	 * than the tracee.
 	 */
 #ifdef CONFIG_IA32_EMULATION
-	if (current_thread_info()->status & (TS_COMPAT|TS_I386_REGS_POKED))
+	if (current->thread.status & (TS_COMPAT|TS_I386_REGS_POKED))
 		return __NR_ia32_restart_syscall;
 #endif
 #ifdef CONFIG_X86_X32_ABI


* [tip:x86/asm] x86/entry: Get rid of pt_regs_to_thread_info()
  2016-09-13 21:29 ` [PATCH 02/12] x86/entry: Get rid of pt_regs_to_thread_info() Andy Lutomirski
  2016-09-15  6:21   ` Ingo Molnar
@ 2016-09-15 10:42   ` tip-bot for Linus Torvalds
  1 sibling, 0 replies; 41+ messages in thread
From: tip-bot for Linus Torvalds @ 2016-09-15 10:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, mingo, jann, bp, dvlasenk, torvalds, luto, brgerst, tglx,
	linux-kernel, hpa, jpoimboe

Commit-ID:  97245d00585d82540f4538cf72d92a1e853c7b0e
Gitweb:     http://git.kernel.org/tip/97245d00585d82540f4538cf72d92a1e853c7b0e
Author:     Linus Torvalds <torvalds@linux-foundation.org>
AuthorDate: Tue, 13 Sep 2016 14:29:22 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 15 Sep 2016 08:25:12 +0200

x86/entry: Get rid of pt_regs_to_thread_info()

It was a nice optimization while it lasted, but thread_info is moving
and this optimization will no longer work.

Quoting Linus:

    Oh Gods, Andy. That pt_regs_to_thread_info() thing made me want
    to do unspeakable acts on a poor innocent wax figure that looked
    _exactly_ like you.

[ Changelog written by Andy. ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jann Horn <jann@thejh.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/6376aa81c68798cc81631673f52bd91a3e078944.1473801993.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/common.c | 20 ++++++--------------
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 871bbf9..bdd9cc5 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -31,13 +31,6 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
 
-static struct thread_info *pt_regs_to_thread_info(struct pt_regs *regs)
-{
-	unsigned long top_of_stack =
-		(unsigned long)(regs + 1) + TOP_OF_KERNEL_STACK_PADDING;
-	return (struct thread_info *)(top_of_stack - THREAD_SIZE);
-}
-
 #ifdef CONFIG_CONTEXT_TRACKING
 /* Called on entry from user mode with IRQs off. */
 __visible inline void enter_from_user_mode(void)
@@ -71,7 +64,7 @@ static long syscall_trace_enter(struct pt_regs *regs)
 {
 	u32 arch = in_ia32_syscall() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
 
-	struct thread_info *ti = pt_regs_to_thread_info(regs);
+	struct thread_info *ti = current_thread_info();
 	unsigned long ret = 0;
 	bool emulated = false;
 	u32 work;
@@ -173,18 +166,17 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		/* Disable IRQs and retry */
 		local_irq_disable();
 
-		cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
+		cached_flags = READ_ONCE(current_thread_info()->flags);
 
 		if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
 			break;
-
 	}
 }
 
 /* Called with IRQs disabled. */
 __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 {
-	struct thread_info *ti = pt_regs_to_thread_info(regs);
+	struct thread_info *ti = current_thread_info();
 	u32 cached_flags;
 
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING) && WARN_ON(!irqs_disabled()))
@@ -247,7 +239,7 @@ static void syscall_slow_exit_work(struct pt_regs *regs, u32 cached_flags)
  */
 __visible inline void syscall_return_slowpath(struct pt_regs *regs)
 {
-	struct thread_info *ti = pt_regs_to_thread_info(regs);
+	struct thread_info *ti = current_thread_info();
 	u32 cached_flags = READ_ONCE(ti->flags);
 
 	CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
@@ -270,7 +262,7 @@ __visible inline void syscall_return_slowpath(struct pt_regs *regs)
 #ifdef CONFIG_X86_64
 __visible void do_syscall_64(struct pt_regs *regs)
 {
-	struct thread_info *ti = pt_regs_to_thread_info(regs);
+	struct thread_info *ti = current_thread_info();
 	unsigned long nr = regs->orig_ax;
 
 	enter_from_user_mode();
@@ -303,7 +295,7 @@ __visible void do_syscall_64(struct pt_regs *regs)
  */
 static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
 {
-	struct thread_info *ti = pt_regs_to_thread_info(regs);
+	struct thread_info *ti = current_thread_info();
 	unsigned int nr = (unsigned int)regs->orig_ax;
 
 #ifdef CONFIG_IA32_EMULATION


* [tip:x86/asm] um/Stop conflating task_struct::stack with thread_info
  2016-09-13 21:29 ` [PATCH 03/12] um: Stop conflating task_struct::stack with thread_info Andy Lutomirski
  2016-09-15  6:21   ` Ingo Molnar
@ 2016-09-15 10:42   ` tip-bot for Linus Torvalds
  1 sibling, 0 replies; 41+ messages in thread
From: tip-bot for Linus Torvalds @ 2016-09-15 10:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: luto, jann, bp, linux-kernel, mingo, dvlasenk, peterz, brgerst,
	jpoimboe, tglx, hpa, torvalds

Commit-ID:  d896fa20a70c9e596438728561e058a74ed3196b
Gitweb:     http://git.kernel.org/tip/d896fa20a70c9e596438728561e058a74ed3196b
Author:     Linus Torvalds <torvalds@linux-foundation.org>
AuthorDate: Tue, 13 Sep 2016 14:29:23 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 15 Sep 2016 08:25:12 +0200

um/Stop conflating task_struct::stack with thread_info

thread_info may move in the future, so use the accessors.

[ Andy Lutomirski wrote this changelog message and changed
  "task_thread_info(child)->cpu" to "task_cpu(child)". ]

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jann Horn <jann@thejh.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/3439705d9838940cc82733a7335fa8c654c37db8.1473801993.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/um/ptrace_32.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/um/ptrace_32.c b/arch/x86/um/ptrace_32.c
index a7ef7b1..5766ead 100644
--- a/arch/x86/um/ptrace_32.c
+++ b/arch/x86/um/ptrace_32.c
@@ -194,7 +194,7 @@ int peek_user(struct task_struct *child, long addr, long data)
 
 static int get_fpregs(struct user_i387_struct __user *buf, struct task_struct *child)
 {
-	int err, n, cpu = ((struct thread_info *) child->stack)->cpu;
+	int err, n, cpu = task_cpu(child);
 	struct user_i387_struct fpregs;
 
 	err = save_i387_registers(userspace_pid[cpu],
@@ -211,7 +211,7 @@ static int get_fpregs(struct user_i387_struct __user *buf, struct task_struct *c
 
 static int set_fpregs(struct user_i387_struct __user *buf, struct task_struct *child)
 {
-	int n, cpu = ((struct thread_info *) child->stack)->cpu;
+	int n, cpu = task_cpu(child);
 	struct user_i387_struct fpregs;
 
 	n = copy_from_user(&fpregs, buf, sizeof(fpregs));
@@ -224,7 +224,7 @@ static int set_fpregs(struct user_i387_struct __user *buf, struct task_struct *c
 
 static int get_fpxregs(struct user_fxsr_struct __user *buf, struct task_struct *child)
 {
-	int err, n, cpu = ((struct thread_info *) child->stack)->cpu;
+	int err, n, cpu = task_cpu(child);
 	struct user_fxsr_struct fpregs;
 
 	err = save_fpx_registers(userspace_pid[cpu], (unsigned long *) &fpregs);
@@ -240,7 +240,7 @@ static int get_fpxregs(struct user_fxsr_struct __user *buf, struct task_struct *
 
 static int set_fpxregs(struct user_fxsr_struct __user *buf, struct task_struct *child)
 {
-	int n, cpu = ((struct thread_info *) child->stack)->cpu;
+	int n, cpu = task_cpu(child);
 	struct user_fxsr_struct fpregs;
 
 	n = copy_from_user(&fpregs, buf, sizeof(fpregs));


* [tip:x86/asm] sched/core: Allow putting thread_info into task_struct
  2016-09-13 21:29 ` [PATCH 04/12] sched: Allow putting thread_info into task_struct Andy Lutomirski
@ 2016-09-15 10:43   ` tip-bot for Andy Lutomirski
  0 siblings, 0 replies; 41+ messages in thread
From: tip-bot for Andy Lutomirski @ 2016-09-15 10:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: brgerst, luto, peterz, tglx, jann, dvlasenk, linux-kernel,
	jpoimboe, torvalds, hpa, mingo, bp

Commit-ID:  c65eacbe290b8141554c71b2c94489e73ade8c8d
Gitweb:     http://git.kernel.org/tip/c65eacbe290b8141554c71b2c94489e73ade8c8d
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Tue, 13 Sep 2016 14:29:24 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 15 Sep 2016 08:25:13 +0200

sched/core: Allow putting thread_info into task_struct

If an arch opts in by setting CONFIG_THREAD_INFO_IN_TASK,
then thread_info is defined as a single 'u32 flags' and is the first
entry of task_struct.  thread_info::task is removed (it serves no
purpose if thread_info is embedded in task_struct), and
thread_info::cpu gets its own slot in task_struct.

This is heavily based on a patch written by Linus.

Originally-from: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jann Horn <jann@thejh.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a0898196f0476195ca02713691a5037a14f2aac5.1473801993.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/init_task.h   |  9 +++++++++
 include/linux/sched.h       | 36 ++++++++++++++++++++++++++++++++++--
 include/linux/thread_info.h | 15 +++++++++++++++
 init/Kconfig                |  7 +++++++
 init/init_task.c            |  7 +++++--
 kernel/sched/sched.h        |  4 ++++
 6 files changed, 74 insertions(+), 4 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index f8834f8..9c04d44 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -15,6 +15,8 @@
 #include <net/net_namespace.h>
 #include <linux/sched/rt.h>
 
+#include <asm/thread_info.h>
+
 #ifdef CONFIG_SMP
 # define INIT_PUSHABLE_TASKS(tsk)					\
 	.pushable_tasks = PLIST_NODE_INIT(tsk.pushable_tasks, MAX_PRIO),
@@ -183,12 +185,19 @@ extern struct task_group root_task_group;
 # define INIT_KASAN(tsk)
 #endif
 
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+# define INIT_TASK_TI(tsk) .thread_info = INIT_THREAD_INFO(tsk),
+#else
+# define INIT_TASK_TI(tsk)
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
  */
 #define INIT_TASK(tsk)	\
 {									\
+	INIT_TASK_TI(tsk)						\
 	.state		= 0,						\
 	.stack		= init_stack,					\
 	.usage		= ATOMIC_INIT(2),				\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 20f9f47..a287e8b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1458,6 +1458,13 @@ struct tlbflush_unmap_batch {
 };
 
 struct task_struct {
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+	/*
+	 * For reasons of header soup (see current_thread_info()), this
+	 * must be the first element of task_struct.
+	 */
+	struct thread_info thread_info;
+#endif
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
 	atomic_t usage;
@@ -1467,6 +1474,9 @@ struct task_struct {
 #ifdef CONFIG_SMP
 	struct llist_node wake_entry;
 	int on_cpu;
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+	unsigned int cpu;	/* current CPU */
+#endif
 	unsigned int wakee_flips;
 	unsigned long wakee_flip_decay_ts;
 	struct task_struct *last_wakee;
@@ -2588,7 +2598,9 @@ extern void set_curr_task(int cpu, struct task_struct *p);
 void yield(void);
 
 union thread_union {
+#ifndef CONFIG_THREAD_INFO_IN_TASK
 	struct thread_info thread_info;
+#endif
 	unsigned long stack[THREAD_SIZE/sizeof(long)];
 };
 
@@ -3076,10 +3088,26 @@ static inline void threadgroup_change_end(struct task_struct *tsk)
 	cgroup_threadgroup_change_end(tsk);
 }
 
-#ifndef __HAVE_THREAD_FUNCTIONS
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+
+static inline struct thread_info *task_thread_info(struct task_struct *task)
+{
+	return &task->thread_info;
+}
+static inline void *task_stack_page(const struct task_struct *task)
+{
+	return task->stack;
+}
+#define setup_thread_stack(new,old)	do { } while(0)
+static inline unsigned long *end_of_stack(const struct task_struct *task)
+{
+	return task->stack;
+}
+
+#elif !defined(__HAVE_THREAD_FUNCTIONS)
 
 #define task_thread_info(task)	((struct thread_info *)(task)->stack)
-#define task_stack_page(task)	((task)->stack)
+#define task_stack_page(task)	((void *)(task)->stack)
 
 static inline void setup_thread_stack(struct task_struct *p, struct task_struct *org)
 {
@@ -3379,7 +3407,11 @@ static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
 
 static inline unsigned int task_cpu(const struct task_struct *p)
 {
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+	return p->cpu;
+#else
 	return task_thread_info(p)->cpu;
+#endif
 }
 
 static inline int task_node(const struct task_struct *p)
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 2b5b10e..e2d0fd8 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -13,6 +13,21 @@
 struct timespec;
 struct compat_timespec;
 
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+struct thread_info {
+	u32			flags;		/* low level flags */
+};
+
+#define INIT_THREAD_INFO(tsk)			\
+{						\
+	.flags		= 0,			\
+}
+#endif
+
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+#define current_thread_info() ((struct thread_info *)current)
+#endif
+
 /*
  * System call restart block.
  */
diff --git a/init/Kconfig b/init/Kconfig
index cac3f09..ec8d438 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -26,6 +26,13 @@ config IRQ_WORK
 config BUILDTIME_EXTABLE_SORT
 	bool
 
+config THREAD_INFO_IN_TASK
+	bool
+	help
+	  Select this to move thread_info off the stack into task_struct.  To
+	  make this work, an arch will need to remove all thread_info fields
+	  except flags and fix any runtime bugs.
+
 menu "General setup"
 
 config BROKEN
diff --git a/init/init_task.c b/init/init_task.c
index ba0a7f36..11f83be1 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -22,5 +22,8 @@ EXPORT_SYMBOL(init_task);
  * Initial thread structure. Alignment of this is handled by a special
  * linker map entry.
  */
-union thread_union init_thread_union __init_task_data =
-	{ INIT_THREAD_INFO(init_task) };
+union thread_union init_thread_union __init_task_data = {
+#ifndef CONFIG_THREAD_INFO_IN_TASK
+	INIT_THREAD_INFO(init_task)
+#endif
+};
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c64fc51..3655c96 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1000,7 +1000,11 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 	 * per-task data have been completed by this moment.
 	 */
 	smp_wmb();
+#ifdef CONFIG_THREAD_INFO_IN_TASK
+	p->cpu = cpu;
+#else
 	task_thread_info(p)->cpu = cpu;
+#endif
 	p->wake_cpu = cpu;
 #endif
 }


* [tip:x86/asm] x86: Move thread_info into task_struct
  2016-09-13 21:29 ` [PATCH 05/12] x86: Move " Andy Lutomirski
@ 2016-09-15 10:43   ` tip-bot for Andy Lutomirski
  0 siblings, 0 replies; 41+ messages in thread
From: tip-bot for Andy Lutomirski @ 2016-09-15 10:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, brgerst, peterz, dvlasenk, luto, torvalds, bp,
	linux-kernel, mingo, jpoimboe, jann, hpa

Commit-ID:  15f4eae70d365bba26854c90b6002aaabb18c8aa
Gitweb:     http://git.kernel.org/tip/15f4eae70d365bba26854c90b6002aaabb18c8aa
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Tue, 13 Sep 2016 14:29:25 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 15 Sep 2016 08:25:13 +0200

x86: Move thread_info into task_struct

Now that most of the thread_info users have been cleaned up,
this is straightforward.

Most of this code was written by Linus.

Originally-from: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jann Horn <jann@thejh.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a50eab40abeaec9cb9a9e3cbdeafd32190206654.1473801993.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/entry/entry_64.S          |  7 ++++--
 arch/x86/include/asm/thread_info.h | 46 --------------------------------------
 arch/x86/kernel/asm-offsets.c      |  4 +---
 arch/x86/kernel/irq_64.c           |  3 +--
 arch/x86/kernel/process.c          |  6 ++---
 6 files changed, 10 insertions(+), 57 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4c39728..2a83bc8 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -157,6 +157,7 @@ config X86
 	select SPARSE_IRQ
 	select SRCU
 	select SYSCTL_EXCEPTION_TRACE
+	select THREAD_INFO_IN_TASK
 	select USER_STACKTRACE_SUPPORT
 	select VIRT_TO_BUS
 	select X86_DEV_DMA_OPS			if X86_64
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index e7fba58..2b46384 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -179,7 +179,8 @@ GLOBAL(entry_SYSCALL_64_after_swapgs)
 	 * If we need to do entry work or if we guess we'll need to do
 	 * exit work, go straight to the slow path.
 	 */
-	testl	$_TIF_WORK_SYSCALL_ENTRY|_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
+	movq	PER_CPU_VAR(current_task), %r11
+	testl	$_TIF_WORK_SYSCALL_ENTRY|_TIF_ALLWORK_MASK, TASK_TI_flags(%r11)
 	jnz	entry_SYSCALL64_slow_path
 
 entry_SYSCALL_64_fastpath:
@@ -217,7 +218,8 @@ entry_SYSCALL_64_fastpath:
 	 */
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
-	testl	$_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
+	movq	PER_CPU_VAR(current_task), %r11
+	testl	$_TIF_ALLWORK_MASK, TASK_TI_flags(%r11)
 	jnz	1f
 
 	LOCKDEP_SYS_EXIT
@@ -370,6 +372,7 @@ END(ptregs_\func)
 /*
  * %rdi: prev task
  * %rsi: next task
+ * rsi: task we're switching to
  */
 ENTRY(__switch_to_asm)
 	/*
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index c9dcfe7..2aaca53 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -52,20 +52,6 @@ struct task_struct;
 #include <asm/cpufeature.h>
 #include <linux/atomic.h>
 
-struct thread_info {
-	struct task_struct	*task;		/* main task structure */
-	__u32			flags;		/* low level flags */
-	__u32			cpu;		/* current CPU */
-};
-
-#define INIT_THREAD_INFO(tsk)			\
-{						\
-	.task		= &tsk,			\
-	.flags		= 0,			\
-	.cpu		= 0,			\
-}
-
-#define init_thread_info	(init_thread_union.thread_info)
 #define init_stack		(init_thread_union.stack)
 
 #else /* !__ASSEMBLY__ */
@@ -157,11 +143,6 @@ struct thread_info {
  */
 #ifndef __ASSEMBLY__
 
-static inline struct thread_info *current_thread_info(void)
-{
-	return (struct thread_info *)(current_top_of_stack() - THREAD_SIZE);
-}
-
 static inline unsigned long current_stack_pointer(void)
 {
 	unsigned long sp;
@@ -223,33 +204,6 @@ static inline int arch_within_stack_frames(const void * const stack,
 # define cpu_current_top_of_stack (cpu_tss + TSS_sp0)
 #endif
 
-/*
- * ASM operand which evaluates to a 'thread_info' address of
- * the current task, if it is known that "reg" is exactly "off"
- * bytes below the top of the stack currently.
- *
- * ( The kernel stack's size is known at build time, it is usually
- *   2 or 4 pages, and the bottom  of the kernel stack contains
- *   the thread_info structure. So to access the thread_info very
- *   quickly from assembly code we can calculate down from the
- *   top of the kernel stack to the bottom, using constant,
- *   build-time calculations only. )
- *
- * For example, to fetch the current thread_info->flags value into %eax
- * on x86-64 defconfig kernels, in syscall entry code where RSP is
- * currently at exactly SIZEOF_PTREGS bytes away from the top of the
- * stack:
- *
- *      mov ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS), %eax
- *
- * will translate to:
- *
- *      8b 84 24 b8 c0 ff ff      mov    -0x3f48(%rsp), %eax
- *
- * which is below the current RSP by almost 16K.
- */
-#define ASM_THREAD_INFO(field, reg, off) ((field)+(off)-THREAD_SIZE)(reg)
-
 #endif
 
 #ifdef CONFIG_COMPAT
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index add5f90..c62e015 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -35,9 +35,7 @@ void common(void) {
 #endif
 
 	BLANK();
-	OFFSET(TI_flags, thread_info, flags);
-
-	BLANK();
+	OFFSET(TASK_TI_flags, task_struct, thread_info.flags);
 	OFFSET(TASK_addr_limit, task_struct, thread.addr_limit);
 
 	BLANK();
diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
index 4a79037..9ebd0b0 100644
--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -40,8 +40,7 @@ static inline void stack_overflow_check(struct pt_regs *regs)
 	if (user_mode(regs))
 		return;
 
-	if (regs->sp >= curbase + sizeof(struct thread_info) +
-				  sizeof(struct pt_regs) + STACK_TOP_MARGIN &&
+	if (regs->sp >= curbase + sizeof(struct pt_regs) + STACK_TOP_MARGIN &&
 	    regs->sp <= curbase + THREAD_SIZE)
 		return;
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index c1fa790..0b9ed8e 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -549,9 +549,7 @@ unsigned long get_wchan(struct task_struct *p)
 	 * PADDING
 	 * ----------- top = topmax - TOP_OF_KERNEL_STACK_PADDING
 	 * stack
-	 * ----------- bottom = start + sizeof(thread_info)
-	 * thread_info
-	 * ----------- start
+	 * ----------- bottom = start
 	 *
 	 * The tasks stack pointer points at the location where the
 	 * framepointer is stored. The data on the stack is:
@@ -562,7 +560,7 @@ unsigned long get_wchan(struct task_struct *p)
 	 */
 	top = start + THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;
 	top -= 2 * sizeof(unsigned long);
-	bottom = start + sizeof(struct thread_info);
+	bottom = start;
 
 	sp = READ_ONCE(p->thread.sp);
 	if (sp < bottom || sp > top)

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 08/12] x86/dumpstack: Pin the target stack in save_stack_trace_tsk()
  2016-09-14 18:35       ` Josh Poimboeuf
@ 2016-09-15 18:04         ` Andy Lutomirski
  2016-09-15 18:37           ` Josh Poimboeuf
  0 siblings, 1 reply; 41+ messages in thread
From: Andy Lutomirski @ 2016-09-15 18:04 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Jann Horn

On Wed, Sep 14, 2016 at 11:35 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Wed, Sep 14, 2016 at 11:22:00AM -0700, Andy Lutomirski wrote:
>> On Wed, Sep 14, 2016 at 7:55 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > On Tue, Sep 13, 2016 at 02:29:28PM -0700, Andy Lutomirski wrote:
>> >> This will prevent a crash if the target task dies before or while
>> >> dumping its stack once we start freeing task stacks early.
>> >>
>> >> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> >
>> > Do we need a similar patch for show_stack()?
>>
>> Probably.  Shouldn't it go in show_stack_log_lvl() instead, though?
>
> Yeah, that would probably be better.

This code is a colossal mess.  I really hope that, some day, we can
clarify which entry points are used only in dumpstack*.c and which are
used elsewhere.  Creating an arch/x86/kernel/dumpstack.h or just
merging the three files and removing all the intermediate crap from
the headers could help a lot.

* Re: [PATCH 08/12] x86/dumpstack: Pin the target stack in save_stack_trace_tsk()
  2016-09-15 18:04         ` Andy Lutomirski
@ 2016-09-15 18:37           ` Josh Poimboeuf
  2016-09-15 18:41             ` Andy Lutomirski
  0 siblings, 1 reply; 41+ messages in thread
From: Josh Poimboeuf @ 2016-09-15 18:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Jann Horn

On Thu, Sep 15, 2016 at 11:04:47AM -0700, Andy Lutomirski wrote:
> On Wed, Sep 14, 2016 at 11:35 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Wed, Sep 14, 2016 at 11:22:00AM -0700, Andy Lutomirski wrote:
> >> On Wed, Sep 14, 2016 at 7:55 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > On Tue, Sep 13, 2016 at 02:29:28PM -0700, Andy Lutomirski wrote:
> >> >> This will prevent a crash if the target task dies before or while
> >> >> dumping its stack once we start freeing task stacks early.
> >> >>
> >> >> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> >> >
> >> > Do we need a similar patch for show_stack()?
> >>
> >> Probably.  Shouldn't it go in show_stack_log_lvl() instead, though?
> >
> > Yeah, that would probably be better.
> 
> This code is a colossal mess.  I really hope that, some day, we can
> clarify which entry points are used only in dumpstack*.c and which are
> used elsewhere.  Creating an arch/x86/kernel/dumpstack.h or just
> merging the three files and removing all the intermediate crap from
> the headers could help a lot.

Agreed, it's a mess, though it's improving with some of my changes.
dumpstack_32.c and dumpstack_64.c will be shrinking, with dump_trace()
getting removed.  And they'll be shrinking even more with Linus's
suggestion to remove show_stack_log_lvl().  Then maybe we can look at
merging those two files into dumpstack.c with an #ifdef to separate the
subarch differences.

-- 
Josh

* Re: [PATCH 08/12] x86/dumpstack: Pin the target stack in save_stack_trace_tsk()
  2016-09-15 18:37           ` Josh Poimboeuf
@ 2016-09-15 18:41             ` Andy Lutomirski
  2016-09-15 19:19               ` Josh Poimboeuf
  0 siblings, 1 reply; 41+ messages in thread
From: Andy Lutomirski @ 2016-09-15 18:41 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Jann Horn

On Thu, Sep 15, 2016 at 11:37 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Thu, Sep 15, 2016 at 11:04:47AM -0700, Andy Lutomirski wrote:
>> On Wed, Sep 14, 2016 at 11:35 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > On Wed, Sep 14, 2016 at 11:22:00AM -0700, Andy Lutomirski wrote:
>> >> On Wed, Sep 14, 2016 at 7:55 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> >> > On Tue, Sep 13, 2016 at 02:29:28PM -0700, Andy Lutomirski wrote:
>> >> >> This will prevent a crash if the target task dies before or while
>> >> >> dumping its stack once we start freeing task stacks early.
>> >> >>
>> >> >> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> >> >
>> >> > Do we need a similar patch for show_stack()?
>> >>
>> >> Probably.  Shouldn't it go in show_stack_log_lvl() instead, though?
>> >
>> > Yeah, that would probably be better.
>>
>> This code is a colossal mess.  I really hope that, some day, we can
>> clarify which entry points are used only in dumpstack*.c and which are
>> used elsewhere.  Creating an arch/x86/kernel/dumpstack.h or just
>> merging the three files and removing all the intermediate crap from
>> the headers could help a lot.
>
> Agreed, it's a mess, though it's improving with some of my changes.
> dumpstack_32.c and dumpstack_64.c will be shrinking, with dump_trace()
> getting removed.  And they'll be shrinking even more with Linus's
> suggestion to remove show_stack_log_lvl().  Then maybe we can look at
> merging those two files into dumpstack.c with an #ifdef to separate the
> subarch differences.
>

I also wouldn't mind trying to do something to prevent ever dumping
the stack of an actively running task.  It's definitely safe to dump:

 - current

 - any task that's stopped via ptrace, etc

 - any task on the current CPU if running atomically enough that the
task can't migrate (which probably covers the nasty NMI cases, I hope)

What's *not* safe AFAIK is /proc/PID/stack.  I don't know if we can
somehow fix that short of actually sending an interrupt or NMI to
freeze the task if it's running.  I'm also not sure it's worth
worrying about it.
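
The pinning this series does boils down to an inc-not-zero on a usage
count for the stack.  A userspace mock of that protocol (the struct and
mock_* names are made up here purely for illustration; they only mirror
the try_get_task_stack()/put_task_stack() helpers the series adds):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace mock of the stack-pinning protocol: the stack stays
 * alive while its usage count is non-zero, and a dumper may pin
 * it only if the count has not already dropped to zero. */
struct mock_task {
	atomic_int stack_refcount;	/* 1 while the task owns its stack */
	void *stack;
};

/* inc-not-zero: succeed only if the stack is still alive */
static bool mock_try_get_task_stack(struct mock_task *t)
{
	int old = atomic_load(&t->stack_refcount);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&t->stack_refcount,
						 &old, old + 1))
			return true;
		/* the failed CAS reloaded 'old'; retry */
	}
	return false;			/* stack already freed */
}

static void mock_put_task_stack(struct mock_task *t)
{
	if (atomic_fetch_sub(&t->stack_refcount, 1) == 1)
		t->stack = NULL;	/* dropped the last reference */
}
```

A dumper whose pin attempt fails can just bail out and report nothing
instead of touching freed memory.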

--Andy

* Re: [PATCH 08/12] x86/dumpstack: Pin the target stack in save_stack_trace_tsk()
  2016-09-15 18:41             ` Andy Lutomirski
@ 2016-09-15 19:19               ` Josh Poimboeuf
  2016-09-16  7:47                 ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Josh Poimboeuf @ 2016-09-15 19:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Jann Horn, Peter Zijlstra, Ingo Molnar,
	live-patching

On Thu, Sep 15, 2016 at 11:41:25AM -0700, Andy Lutomirski wrote:
> I also wouldn't mind trying to do something to prevent ever dumping
> the stack of an actively running task.  It's definitely safe to dump:
> 
>  - current
> 
>  - any task that's stopped via ptrace, etc
> 
>  - any task on the current CPU if running atomically enough that the
> task can't migrate (which probably covers the nasty NMI cases, I hope)
> 
> What's *not* safe AFAIK is /proc/PID/stack.  I don't know if we can
> somehow fix that short of actually sending an interrupt or NMI to
> freeze the task if it's running.  I'm also not sure it's worth
> worrying about it.

Yeah, I proposed a fix for /proc/PID/stack a while back:

  https://lkml.kernel.org/r/cover.1424109806.git.jpoimboe@redhat.com

My idea was to use task_rq_lock() to lock the runqueue and then check
tsk->on_cpu.  I think Peter wasn't too keen on it.

We'll need something similar for the livepatch consistency model.

-- 
Josh

* Re: [PATCH 08/12] x86/dumpstack: Pin the target stack in save_stack_trace_tsk()
  2016-09-15 19:19               ` Josh Poimboeuf
@ 2016-09-16  7:47                 ` Peter Zijlstra
  2016-09-16 15:12                   ` Andy Lutomirski
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2016-09-16  7:47 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, Andy Lutomirski, X86 ML, Borislav Petkov,
	linux-kernel, Brian Gerst, Jann Horn, Ingo Molnar, live-patching

On Thu, Sep 15, 2016 at 02:19:38PM -0500, Josh Poimboeuf wrote:
> On Thu, Sep 15, 2016 at 11:41:25AM -0700, Andy Lutomirski wrote:
> > I also wouldn't mind trying to do something to prevent ever dumping
> > the stack of an actively running task.  It's definitely safe to dump:
> > 
> >  - current
> > 
> >  - any task that's stopped via ptrace, etc
> > 
> >  - any task on the current CPU if running atomically enough that the
> > task can't migrate (which probably covers the nasty NMI cases, I hope)
> > 
> > What's *not* safe AFAIK is /proc/PID/stack.  I don't know if we can
> > somehow fix that short of actually sending an interrupt or NMI to
> > freeze the task if it's running.  I'm also not sure it's worth
> > worrying about it.
> 
> Yeah, I proposed a fix for /proc/PID/stack a while back:
> 
>   https://lkml.kernel.org/r/cover.1424109806.git.jpoimboe@redhat.com
> 
> My idea was to use task_rq_lock() to lock the runqueue and then check
> tsk->on_cpu.  I think Peter wasn't too keen on it.

That basically allows a DoS on the scheduler, since a user can run tasks
on every cpu (through sys_sched_setaffinity()). Then doing while (1) cat
/proc/$PID/stack would saturate the rq->lock on every CPU.

The more tasks the merrier.

* Re: [PATCH 08/12] x86/dumpstack: Pin the target stack in save_stack_trace_tsk()
  2016-09-16  7:47                 ` Peter Zijlstra
@ 2016-09-16 15:12                   ` Andy Lutomirski
  2016-09-16 15:31                     ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Andy Lutomirski @ 2016-09-16 15:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Poimboeuf, Andy Lutomirski, X86 ML, Borislav Petkov,
	linux-kernel, Brian Gerst, Jann Horn, Ingo Molnar, live-patching

On Fri, Sep 16, 2016 at 12:47 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Sep 15, 2016 at 02:19:38PM -0500, Josh Poimboeuf wrote:
>> On Thu, Sep 15, 2016 at 11:41:25AM -0700, Andy Lutomirski wrote:
>> > I also wouldn't mind trying to do something to prevent ever dumping
>> > the stack of an actively running task.  It's definitely safe to dump:
>> >
>> >  - current
>> >
>> >  - any task that's stopped via ptrace, etc
>> >
>> >  - any task on the current CPU if running atomically enough that the
>> > task can't migrate (which probably covers the nasty NMI cases, I hope)
>> >
>> > What's *not* safe AFAIK is /proc/PID/stack.  I don't know if we can
>> > somehow fix that short of actually sending an interrupt or NMI to
>> > freeze the task if it's running.  I'm also not sure it's worth
>> > worrying about it.
>>
>> Yeah, I proposed a fix for /proc/PID/stack a while back:
>>
>>   https://lkml.kernel.org/r/cover.1424109806.git.jpoimboe@redhat.com
>>
>> My idea was to use task_rq_lock() to lock the runqueue and then check
>> tsk->on_cpu.  I think Peter wasn't too keen on it.
>
> That basically allows a DoS on the scheduler, since a user can run tasks
> on every cpu (through sys_sched_setaffinity()). Then doing while (1) cat
> /proc/$PID/stack would saturate the rq->lock on every CPU.
>
> The more tasks the merrier.
>
>

Is this worse than it would be if this code used preempt_disable()
(which I think it did until very recently)?

-- 
Andy Lutomirski
AMA Capital Management, LLC

* Re: [PATCH 08/12] x86/dumpstack: Pin the target stack in save_stack_trace_tsk()
  2016-09-16 15:12                   ` Andy Lutomirski
@ 2016-09-16 15:31                     ` Peter Zijlstra
  2016-09-16 15:32                       ` Andy Lutomirski
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2016-09-16 15:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Poimboeuf, Andy Lutomirski, X86 ML, Borislav Petkov,
	linux-kernel, Brian Gerst, Jann Horn, Ingo Molnar, live-patching

On Fri, Sep 16, 2016 at 08:12:40AM -0700, Andy Lutomirski wrote:
> On Fri, Sep 16, 2016 at 12:47 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, Sep 15, 2016 at 02:19:38PM -0500, Josh Poimboeuf wrote:

> >> My idea was to use task_rq_lock() to lock the runqueue and then check
> >> tsk->on_cpu.  I think Peter wasn't too keen on it.
> >
> > That basically allows a DoS on the scheduler, since a user can run tasks
> > on every cpu (through sys_sched_setaffinity()). Then doing while (1) cat
> > /proc/$PID/stack would saturate the rq->lock on every CPU.
> >
> > The more tasks the merrier.
> 
> Is this worse than it would be if this code used preempt_disable()
> (which I think it did until very recently)?

Much worse, since the proposed task_rq_lock() not only disables
preemption, it also disables IRQs and takes 2 locks. And hogging the
rq->lock affects other tasks' ability to schedule.

* Re: [PATCH 08/12] x86/dumpstack: Pin the target stack in save_stack_trace_tsk()
  2016-09-16 15:31                     ` Peter Zijlstra
@ 2016-09-16 15:32                       ` Andy Lutomirski
  2016-09-16 16:35                         ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Andy Lutomirski @ 2016-09-16 15:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Poimboeuf, Andy Lutomirski, X86 ML, Borislav Petkov,
	linux-kernel, Brian Gerst, Jann Horn, Ingo Molnar, live-patching

On Fri, Sep 16, 2016 at 8:31 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Sep 16, 2016 at 08:12:40AM -0700, Andy Lutomirski wrote:
>> On Fri, Sep 16, 2016 at 12:47 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Thu, Sep 15, 2016 at 02:19:38PM -0500, Josh Poimboeuf wrote:
>
>> >> My idea was to use task_rq_lock() to lock the runqueue and then check
>> >> tsk->on_cpu.  I think Peter wasn't too keen on it.
>> >
>> > That basically allows a DoS on the scheduler, since a user can run tasks
>> > on every cpu (through sys_sched_setaffinity()). Then doing while (1) cat
>> > /proc/$PID/stack would saturate the rq->lock on every CPU.
>> >
>> > The more tasks the merrier.
>>
>> Is this worse than it would be if this code used preempt_disable()
>> (which I think it did until very recently)?
>
> Much worse, since the proposed task_rq_lock() not only disables
> preemption, it also disables IRQs and takes 2 locks. And hogging the
> rq->lock affects other tasks' ability to schedule.
>

Fair enough.

I'm not sure I care quite enough about /proc/PID/stack to personally
dig through the scheduler and find a way to cleanly say "please don't
run this task for a little while".

* Re: [PATCH 08/12] x86/dumpstack: Pin the target stack in save_stack_trace_tsk()
  2016-09-16 15:32                       ` Andy Lutomirski
@ 2016-09-16 16:35                         ` Peter Zijlstra
  0 siblings, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2016-09-16 16:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Poimboeuf, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Jann Horn, Ingo Molnar, live-patching

On Fri, Sep 16, 2016 at 08:32:44AM -0700, Andy Lutomirski wrote:
> 
> I'm not sure I care quite enough about /proc/PID/stack to personally
> dig through the scheduler and find a way to cleanly say "please don't
> run this task for a little while".

The 'best' we can do is prod the task awake and then have it take itself
out, which is what SIGSTOP and ptrace() do. It is also rather ugly, due
to avoiding overhead on regular hot paths.
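
A userspace illustration of the SIGSTOP variant (not kernel code): once
the task has stopped itself, it is off every run queue and its saved
state can be read without racing.  The helper name here is made up:

```c
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stop a child with SIGSTOP, wait until it has actually taken
 * itself out of the run queue, and read its state back from
 * /proc/<pid>/stat.  Only then is it safe to inspect the task.
 * (Assumes the comm field contains no spaces.) */
static char stopped_child_state(void)
{
	char path[64], comm[64], state = '?';
	int status;
	FILE *f;
	pid_t pid = fork();

	if (pid < 0)
		return '?';
	if (pid == 0) {
		for (;;)
			pause();	/* child: just sit there */
	}

	kill(pid, SIGSTOP);
	waitpid(pid, &status, WUNTRACED);	/* wait for the stop */

	snprintf(path, sizeof(path), "/proc/%d/stat", pid);
	f = fopen(path, "r");
	if (f) {
		fscanf(f, "%*d %63s %c", comm, &state);
		fclose(f);
	}

	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);
	return state;			/* 'T' == stopped */
}
```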

* Re: [PATCH 09/12] x86/process: Pin the target stack in get_wchan()
  2016-09-13 21:29 ` [PATCH 09/12] x86/process: Pin the target stack in get_wchan() Andy Lutomirski
@ 2016-09-17  2:00   ` Jann Horn
  2016-09-22 22:44     ` Andy Lutomirski
  0 siblings, 1 reply; 41+ messages in thread
From: Jann Horn @ 2016-09-17  2:00 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: x86, Borislav Petkov, linux-kernel, Brian Gerst

On Tue, Sep 13, 2016 at 02:29:29PM -0700, Andy Lutomirski wrote:
> This will prevent a crash if get_wchan() runs after the task stack
> is freed.

I think I found some more stuff. Have a look at KSTK_EIP() and KSTK_ESP(), I think
they read from the saved userspace registers area at the top of the kernel stack?

Used on remote processes in:
  vma_is_stack_for_task() (via /proc/$pid/maps)
  do_task_stat() (/proc/$pid/stat)

* Re: [PATCH 09/12] x86/process: Pin the target stack in get_wchan()
  2016-09-17  2:00   ` Jann Horn
@ 2016-09-22 22:44     ` Andy Lutomirski
  2016-09-22 22:50       ` Andy Lutomirski
  2016-09-23  7:43       ` Jann Horn
  0 siblings, 2 replies; 41+ messages in thread
From: Andy Lutomirski @ 2016-09-22 22:44 UTC (permalink / raw)
  To: Jann Horn, Linus Torvalds, Kees Cook
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel, Brian Gerst

On Fri, Sep 16, 2016 at 7:00 PM, Jann Horn <jann@thejh.net> wrote:
> On Tue, Sep 13, 2016 at 02:29:29PM -0700, Andy Lutomirski wrote:
>> This will prevent a crash if get_wchan() runs after the task stack
>> is freed.
>
> I think I found some more stuff. Have a look at KSTK_EIP() and KSTK_ESP(), I think
> they read from the saved userspace registers area at the top of the kernel stack?
>
> Used on remote processes in:
>   vma_is_stack_for_task() (via /proc/$pid/maps)

This isn't used in /proc/$pid/maps -- it's only used in
/proc/$pid/task/$tid/maps.  I wonder if anyone actually cares about it
-- it certainly won't work reliably.

I could pin the stack in vma_is_stack_for_task, but it seems
potentially better to me to change it to vma_is_stack_for_current()
and remove the offending caller in /proc, replacing it with "return
0".  Thoughts?

>   do_task_stat() (/proc/$pid/stat)

Like this:

        mm = get_task_mm(task);
        if (mm) {
                vsize = task_vsize(mm);
                if (permitted) {
                        eip = KSTK_EIP(task);
                        esp = KSTK_ESP(task);
                }
        }

Can we just delete this outright?  It seems somewhere between mostly
and entirely useless, and it also seems dangerous.  Until very
recently, on x86_64, this would have been a potential info leak, as
SYSCALL followed closely by a hardware interrupt would cause *kernel*
values to land in task_pt_regs().  I don't even want to think about
what this code does if the task is in vm86 mode.  I wouldn't be at all
surprised if non-x86 architectures have all kinds of interesting
things happen if you do this to a task that isn't running normal
non-atomic kernel code at the time.

I would advocate for unconditionally returning zeros in these two stat fields.

* Re: [PATCH 09/12] x86/process: Pin the target stack in get_wchan()
  2016-09-22 22:44     ` Andy Lutomirski
@ 2016-09-22 22:50       ` Andy Lutomirski
  2016-09-23  7:43       ` Jann Horn
  1 sibling, 0 replies; 41+ messages in thread
From: Andy Lutomirski @ 2016-09-22 22:50 UTC (permalink / raw)
  To: Jann Horn, Linus Torvalds, Kees Cook, Johannes Weiner
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel, Brian Gerst

On Thu, Sep 22, 2016 at 3:44 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, Sep 16, 2016 at 7:00 PM, Jann Horn <jann@thejh.net> wrote:
>> On Tue, Sep 13, 2016 at 02:29:29PM -0700, Andy Lutomirski wrote:
>>> This will prevent a crash if get_wchan() runs after the task stack
>>> is freed.
>>
>> I think I found some more stuff. Have a look at KSTK_EIP() and KSTK_ESP(), I think
>> they read from the saved userspace registers area at the top of the kernel stack?
>>
>> Used on remote processes in:
>>   vma_is_stack_for_task() (via /proc/$pid/maps)
>
> This isn't used in /proc/$pid/maps -- it's only used in
> /proc/$pid/task/$tid/maps.  I wonder if anyone actually cares about it
> -- it certainly won't work reliably.
>
> I could pin the stack in vma_is_stack_for_task, but it seems
> potentially better to me to change it to vma_is_stack_for_current()
> and remove the offending caller in /proc, replacing it with "return
> 0".  Thoughts?

The history here is strange:

Before March 2012, we used to only consider the "stack" of the mm --
that is, the VMA that the mm actually treats as a stack.  Then:

commit b76437579d1344b612cf1851ae610c636cec7db0
Author: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
Date:   Wed Mar 21 16:34:04 2012 -0700

    procfs: mark thread stack correctly in proc/<pid>/maps

and we did something extra horrible to try to find out whose stack was
where.  This got partially reverted by:

commit 65376df582174ffcec9e6471bf5b0dd79ba05e4a
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Tue Feb 2 16:57:29 2016 -0800

    proc: revert /proc/<pid>/maps [stack:TID] annotation

and now we're in the current situation where it's fast but still racy.

Any objection if I finish reverting the patch and restore the pre-2012
behavior?  Frankly, I wouldn't mind trying to excise KSTK_EIP and
KSTK_ESP from the kernel entirely, to be replaced with
current_user_sp() and current_user_ip().  Having those macros around
seems likely to make people think they're safe to use.

>
>>   do_task_stat() (/proc/$pid/stat)
>
> Like this:
>
>         mm = get_task_mm(task);
>         if (mm) {
>                 vsize = task_vsize(mm);
>                 if (permitted) {
>                         eip = KSTK_EIP(task);
>                         esp = KSTK_ESP(task);
>                 }
>         }
>
> Can we just delete this outright?  It seems somewhere between mostly
> and entirely useless, and it also seems dangerous.  Until very
> recently, on x86_64, this would have been a potential info leak, as
> SYSCALL followed closely by a hardware interrupt would cause *kernel*
> values to land in task_pt_regs().  I don't even want to think about
> what this code does if the task is in vm86 mode.  I wouldn't be at all
> surprised if non-x86 architectures have all kinds of interesting
> thinks happen if you do this to a task that isn't running normal
> non-atomic kernel code at the time.
>
> I would advocate for unconditionally returning zeros in these two stat fields.

--Andy

* Re: [PATCH 09/12] x86/process: Pin the target stack in get_wchan()
  2016-09-22 22:44     ` Andy Lutomirski
  2016-09-22 22:50       ` Andy Lutomirski
@ 2016-09-23  7:43       ` Jann Horn
  2016-09-23 18:28         ` Kees Cook
  1 sibling, 1 reply; 41+ messages in thread
From: Jann Horn @ 2016-09-23  7:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Kees Cook, Andy Lutomirski, X86 ML,
	Borislav Petkov, linux-kernel, Brian Gerst

On Thu, Sep 22, 2016 at 03:44:37PM -0700, Andy Lutomirski wrote:
> On Fri, Sep 16, 2016 at 7:00 PM, Jann Horn <jann@thejh.net> wrote:
> > On Tue, Sep 13, 2016 at 02:29:29PM -0700, Andy Lutomirski wrote:
> >> This will prevent a crash if get_wchan() runs after the task stack
> >> is freed.
> >
> > I think I found some more stuff. Have a look at KSTK_EIP() and KSTK_ESP(), I think
> > they read from the saved userspace registers area at the top of the kernel stack?
> >
> > Used on remote processes in:
> >   vma_is_stack_for_task() (via /proc/$pid/maps)
> 
> This isn't used in /proc/$pid/maps -- it's only used in
> /proc/$pid/task/$tid/maps.  I wonder if anyone actually cares about it
> -- it certainly won't work reliably.
> 
> I could pin the stack in vma_is_stack_for_task, but it seems
> potentially better to me to change it to vma_is_stack_for_current()
> and remove the offending caller in /proc, replacing it with "return
> 0".  Thoughts?

I just scrolled through the debian codesearch results for "\[stack\]" -
there seem to only be 105 across all of debian's packages, many of them
duplicates - and I didn't see any that looked like they used the tid map.
So I think this might work.

( https://codesearch.debian.net/search?q=%22%5C%5Bstack%5C%5D%22 )


> >   do_task_stat() (/proc/$pid/stat)
> 
> Like this:
> 
>         mm = get_task_mm(task);
>         if (mm) {
>                 vsize = task_vsize(mm);
>                 if (permitted) {
>                         eip = KSTK_EIP(task);
>                         esp = KSTK_ESP(task);
>                 }
>         }
> 
> Can we just delete this outright?  It seems somewhere between mostly
> and entirely useless, and it also seems dangerous.  Until very
> recently, on x86_64, this would have been a potential info leak, as
> SYSCALL followed closely by a hardware interrupt would cause *kernel*
> values to land in task_pt_regs().  I don't even want to think about
> what this code does if the task is in vm86 mode.  I wouldn't be at all
> surprised if non-x86 architectures have all kinds of interesting
> > things happen if you do this to a task that isn't running normal
> non-atomic kernel code at the time.
> 
> I would advocate for unconditionally returning zeros in these two stat fields.

I'd like that a lot.

I guess the two things that might theoretically use it are ptrace users
and (very theoretically) sampling profiling stuff or so?

In gdb, the only code I can find that reads this is in gdb/linux-nat.c, but
it's behind an "#if 0":

  #if 0   /* Don't know how architecture-dependent the rest is...
             Anyway the signal bitmap info is available from "status".  */
            if (fscanf (procfile, "%lu ", &ltmp) > 0)     /* FIXME arch?  */
              printf_filtered (_("Kernel stack pointer: 0x%lx\n"), ltmp);
            if (fscanf (procfile, "%lu ", &ltmp) > 0)     /* FIXME arch?  */
              printf_filtered (_("Kernel instr pointer: 0x%lx\n"), ltmp);
  [...]

strace and ltrace don't seem to be using it.
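
For reference, those are fields 29 (kstkesp) and 30 (kstkeip) of
/proc/PID/stat.  A minimal userspace reader, as a sketch (the helper
name is made up; it parses from the last ')' because comm, field 2, may
itself contain spaces and parentheses):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Pull kstkesp (field 29) and kstkeip (field 30) out of
 * /proc/<pid>/stat.  Returns 0 on success, -1 on failure. */
static int kstk_fields(const char *pid, unsigned long *esp, unsigned long *eip)
{
	char path[64], buf[4096], *p;
	size_t n;
	FILE *f;
	int i;

	snprintf(path, sizeof(path), "/proc/%s/stat", pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';

	/* comm may contain ')' itself: parse after the *last* one */
	p = strrchr(buf, ')');
	if (!p)
		return -1;

	/* the token after ')' is field 3 (state); advance to field 29 */
	p = strtok(p + 1, " ");
	for (i = 0; i < 26 && p; i++)
		p = strtok(NULL, " ");
	if (!p)
		return -1;
	*esp = strtoul(p, NULL, 10);	/* kstkesp */

	p = strtok(NULL, " ");
	if (!p)
		return -1;
	*eip = strtoul(p, NULL, 10);	/* kstkeip */
	return 0;
}
```

Whether the kernel reports real values or zeros in those two slots, the
fields themselves stay present, so a reader like this keeps parsing
either way.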

* Re: [PATCH 09/12] x86/process: Pin the target stack in get_wchan()
  2016-09-23  7:43       ` Jann Horn
@ 2016-09-23 18:28         ` Kees Cook
  2016-09-23 18:34           ` Jann Horn
  0 siblings, 1 reply; 41+ messages in thread
From: Kees Cook @ 2016-09-23 18:28 UTC (permalink / raw)
  To: Jann Horn, Tycho Andersen
  Cc: Andy Lutomirski, Linus Torvalds, Andy Lutomirski, X86 ML,
	Borislav Petkov, linux-kernel, Brian Gerst

On Fri, Sep 23, 2016 at 12:43 AM, Jann Horn <jann@thejh.net> wrote:
> On Thu, Sep 22, 2016 at 03:44:37PM -0700, Andy Lutomirski wrote:
>> On Fri, Sep 16, 2016 at 7:00 PM, Jann Horn <jann@thejh.net> wrote:
>> > On Tue, Sep 13, 2016 at 02:29:29PM -0700, Andy Lutomirski wrote:
>> >> This will prevent a crash if get_wchan() runs after the task stack
>> >> is freed.
>> >
>> > I think I found some more stuff. Have a look at KSTK_EIP() and KSTK_ESP(), I think
>> > they read from the saved userspace registers area at the top of the kernel stack?
>> >
>> > Used on remote processes in:
>> >   vma_is_stack_for_task() (via /proc/$pid/maps)
>>
>> This isn't used in /proc/$pid/maps -- it's only used in
>> /proc/$pid/task/$tid/maps.  I wonder if anyone actually cares about it
>> -- it certainly won't work reliably.
>>
>> I could pin the stack in vma_is_stack_for_task, but it seems
>> potentially better to me to change it to vma_is_stack_for_current()
>> and remove the offending caller in /proc, replacing it with "return
>> 0".  Thoughts?
>
> I just scrolled through the debian codesearch results for "\[stack\]" -
> there seem to only be 105 across all of debian's packages, many of them
> duplicates - and I didn't see any that looked like they used the tid map.
> So I think this might work.
>
> ( https://codesearch.debian.net/search?q=%22%5C%5Bstack%5C%5D%22 )
>
>
>> >   do_task_stat() (/proc/$pid/stat)
>>
>> Like this:
>>
>>         mm = get_task_mm(task);
>>         if (mm) {
>>                 vsize = task_vsize(mm);
>>                 if (permitted) {
>>                         eip = KSTK_EIP(task);
>>                         esp = KSTK_ESP(task);
>>                 }
>>         }
>>
>> Can we just delete this outright?  It seems somewhere between mostly
>> and entirely useless, and it also seems dangerous.  Until very
>> recently, on x86_64, this would have been a potential info leak, as
>> SYSCALL followed closely by a hardware interrupt would cause *kernel*
>> values to land in task_pt_regs().  I don't even want to think about
>> what this code does if the task is in vm86 mode.  I wouldn't be at all
>> surprised if non-x86 architectures have all kinds of interesting
> >> things happen if you do this to a task that isn't running normal
>> non-atomic kernel code at the time.
>>
>> I would advocate for unconditionally returning zeros in these two stat fields.
>
> I'd like that a lot.
>
> I guess the two things that might theoretically use it are ptrace users
> and (very theoretically) sampling profiling stuff or so?
>
> In gdb, the only code I can find that reads this is in gdb/linux-nat.c, but
> it's behind an "#if 0":
>
>   #if 0   /* Don't know how architecture-dependent the rest is...
>              Anyway the signal bitmap info is available from "status".  */
>             if (fscanf (procfile, "%lu ", &ltmp) > 0)     /* FIXME arch?  */
>               printf_filtered (_("Kernel stack pointer: 0x%lx\n"), ltmp);
>             if (fscanf (procfile, "%lu ", &ltmp) > 0)     /* FIXME arch?  */
>               printf_filtered (_("Kernel instr pointer: 0x%lx\n"), ltmp);
>   [...]
>
> strace and ltrace don't seem to be using it.

Does CRIU use this? I wouldn't expect so, since they're using ptrace,
IIUC, to freeze/restore.

-Kees

-- 
Kees Cook
Nexus Security

* Re: [PATCH 09/12] x86/process: Pin the target stack in get_wchan()
  2016-09-23 18:28         ` Kees Cook
@ 2016-09-23 18:34           ` Jann Horn
  2016-09-26  5:10             ` Tycho Andersen
  0 siblings, 1 reply; 41+ messages in thread
From: Jann Horn @ 2016-09-23 18:34 UTC (permalink / raw)
  To: Kees Cook
  Cc: Tycho Andersen, Andy Lutomirski, Linus Torvalds, Andy Lutomirski,
	X86 ML, Borislav Petkov, linux-kernel, Brian Gerst

On Fri, Sep 23, 2016 at 11:28:26AM -0700, Kees Cook wrote:
> On Fri, Sep 23, 2016 at 12:43 AM, Jann Horn <jann@thejh.net> wrote:
> > On Thu, Sep 22, 2016 at 03:44:37PM -0700, Andy Lutomirski wrote:
> >> On Fri, Sep 16, 2016 at 7:00 PM, Jann Horn <jann@thejh.net> wrote:
> >> > On Tue, Sep 13, 2016 at 02:29:29PM -0700, Andy Lutomirski wrote:
> >> >> This will prevent a crash if get_wchan() runs after the task stack
> >> >> is freed.
> >> >
> >> > I think I found some more stuff. Have a look at KSTK_EIP() and KSTK_ESP(), I think
> >> > they read from the saved userspace registers area at the top of the kernel stack?
> >> >
> >> > Used on remote processes in:
> >> >   vma_is_stack_for_task() (via /proc/$pid/maps)
> >>
> >> This isn't used in /proc/$pid/maps -- it's only used in
> >> /proc/$pid/task/$tid/maps.  I wonder if anyone actually cares about it
> >> -- it certainly won't work reliably.
> >>
> >> I could pin the stack in vma_is_stack_for_task, but it seems
> >> potentially better to me to change it to vma_is_stack_for_current()
> >> and remove the offending caller in /proc, replacing it with "return
> >> 0".  Thoughts?
> >
> > I just scrolled through the debian codesearch results for "\[stack\]" -
> > there seem to only be 105 across all of debian's packages, many of them
> > duplicates - and I didn't see any that looked like they used the tid map.
> > So I think this might work.
> >
> > ( https://codesearch.debian.net/search?q=%22%5C%5Bstack%5C%5D%22 )
> >
> >
> >> >   do_task_stat() (/proc/$pid/stat)
> >>
> >> Like this:
> >>
> >>         mm = get_task_mm(task);
> >>         if (mm) {
> >>                 vsize = task_vsize(mm);
> >>                 if (permitted) {
> >>                         eip = KSTK_EIP(task);
> >>                         esp = KSTK_ESP(task);
> >>                 }
> >>         }
> >>
> >> Can we just delete this outright?  It seems somewhere between mostly
> >> and entirely useless, and it also seems dangerous.  Until very
> >> recently, on x86_64, this would have been a potential info leak, as
> >> SYSCALL followed closely by a hardware interrupt would cause *kernel*
> >> values to land in task_pt_regs().  I don't even want to think about
> >> what this code does if the task is in vm86 mode.  I wouldn't be at all
> >> surprised if non-x86 architectures have all kinds of interesting
> >> things happen if you do this to a task that isn't running normal
> >> non-atomic kernel code at the time.
> >>
> >> I would advocate for unconditionally returning zeros in these two stat fields.
> >
> > I'd like that a lot.
> >
> > I guess the two things that might theoretically use it are ptrace users
> > and (very theoretically) sampling profiling stuff or so?
> >
> > In gdb, the only code I can find that reads this is in gdb/linux-nat.c, but
> > it's behind an "#if 0":
> >
> >   #if 0   /* Don't know how architecture-dependent the rest is...
> >              Anyway the signal bitmap info is available from "status".  */
> >             if (fscanf (procfile, "%lu ", &ltmp) > 0)     /* FIXME arch?  */
> >               printf_filtered (_("Kernel stack pointer: 0x%lx\n"), ltmp);
> >             if (fscanf (procfile, "%lu ", &ltmp) > 0)     /* FIXME arch?  */
> >               printf_filtered (_("Kernel instr pointer: 0x%lx\n"), ltmp);
> >   [...]
> >
> > strace and ltrace don't seem to be using it.
> 
> Does CRIU use this? I wouldn't expect so, since they're using ptrace,
> IIUC, to freeze/restore.

As far as I can tell:

parse_pid_stat() parses them into a struct proc_pid_stat as "esp" and "eip",
but those struct members are never used (like, probably, most other members
of that struct).

child_opened_proc.c just opens /proc/%d/stat and then closes it again
immediately.

So in summary: I don't think so.
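
For reference, the esp/eip values under discussion correspond to fields 29
(kstkesp) and 30 (kstkeip) of /proc/[pid]/stat as documented in proc(5).
The following is an illustrative sketch (not code from the thread) of how a
parser like CRIU's parse_pid_stat() has to locate them, skipping the
parenthesized comm field; on kernels with the proposed change, unprivileged
readers may simply see 0 here:

```python
# Sketch: locate the kstkesp/kstkeip fields in /proc/[pid]/stat.
# Field numbers follow proc(5).

def read_kstk_fields(pid="self"):
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # comm (field 2) is parenthesized and may contain spaces and even ')',
    # so split after the *last* ')' to line the rest up with proc(5) numbers.
    rest = data.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field N lives at rest[N - 3].
    kstkesp = int(rest[29 - 3])  # field 29: saved kernel stack pointer
    kstkeip = int(rest[30 - 3])  # field 30: saved kernel instruction pointer
    return kstkesp, kstkeip

esp, eip = read_kstk_fields()
print(esp, eip)
```

On post-fix kernels both values read as 0 without CAP_SYS_PTRACE-level
access, which is exactly the behavior being advocated above.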



* Re: [PATCH 09/12] x86/process: Pin the target stack in get_wchan()
  2016-09-23 18:34           ` Jann Horn
@ 2016-09-26  5:10             ` Tycho Andersen
  0 siblings, 0 replies; 41+ messages in thread
From: Tycho Andersen @ 2016-09-26  5:10 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kees Cook, Andy Lutomirski, Linus Torvalds, Andy Lutomirski,
	X86 ML, Borislav Petkov, linux-kernel, Brian Gerst

On Fri, Sep 23, 2016 at 08:34:43PM +0200, Jann Horn wrote:
> On Fri, Sep 23, 2016 at 11:28:26AM -0700, Kees Cook wrote:
> > Does CRIU use this? I wouldn't expect so, since they're using ptrace,
> > IIUC, to freeze/restore.
> 
> As far as I can tell:
> 
> parse_pid_stat() parses them into a struct proc_pid_stat as "esp" and "eip",
> but those struct members are never used (like, probably, most other members
> of that struct).

Yes, that's my reading of it too.

> child_opened_proc.c just opens /proc/%d/stat and then closes it again
> immediately.

This is just a test for the ordering of things that are restored; it
could use any file in /proc, stat was just convenient.

> So in summary: I don't think so.

Yep, agreed.

Tycho
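
Relatedly, the "[stack]" annotation question from earlier in the thread
(vma_is_stack_for_task() vs. a current-only variant) can be observed
directly. A hedged sketch (not from the thread) comparing the per-process
and per-thread maps views for the main thread, where the tid equals the pid:

```python
# Sketch: check which maps views carry the "[stack]" annotation.
# The per-process view labels the main stack; the per-thread view is the
# one whose reliability the thread questions.
import os

pid = os.getpid()
tid = pid  # the main thread's tid equals the pid

def has_stack_label(path):
    with open(path) as f:
        return any("[stack" in line for line in f)

print("process maps:", has_stack_label(f"/proc/{pid}/maps"))
print("thread maps: ", has_stack_label(f"/proc/{pid}/task/{tid}/maps"))
```

For a single-threaded reader both views come from the same mm, so the
difference only shows up for secondary threads, whose stacks are ordinary
anonymous mappings from the kernel's point of view.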

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2016-09-26  5:11 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-13 21:29 [PATCH 00/12] thread_info cleanups and stack caching Andy Lutomirski
2016-09-13 21:29 ` [PATCH 01/12] x86/asm: Move 'status' from struct thread_info to struct thread_struct Andy Lutomirski
2016-09-15 10:41   ` [tip:x86/asm] x86/asm: Move the thread_info::status field to thread_struct tip-bot for Andy Lutomirski
2016-09-13 21:29 ` [PATCH 02/12] x86/entry: Get rid of pt_regs_to_thread_info() Andy Lutomirski
2016-09-15  6:21   ` Ingo Molnar
2016-09-15 10:42   ` [tip:x86/asm] " tip-bot for Linus Torvalds
2016-09-13 21:29 ` [PATCH 03/12] um: Stop conflating task_struct::stack with thread_info Andy Lutomirski
2016-09-15  6:21   ` Ingo Molnar
2016-09-15 10:42   ` [tip:x86/asm] um/Stop " tip-bot for Linus Torvalds
2016-09-13 21:29 ` [PATCH 04/12] sched: Allow putting thread_info into task_struct Andy Lutomirski
2016-09-15 10:43   ` [tip:x86/asm] sched/core: " tip-bot for Andy Lutomirski
2016-09-13 21:29 ` [PATCH 05/12] x86: Move " Andy Lutomirski
2016-09-15 10:43   ` [tip:x86/asm] " tip-bot for Andy Lutomirski
2016-09-13 21:29 ` [PATCH 06/12] sched: Add try_get_task_stack() and put_task_stack() Andy Lutomirski
2016-09-13 21:29 ` [PATCH 07/12] kthread: to_live_kthread() needs try_get_task_stack() Andy Lutomirski
2016-09-13 21:29 ` [PATCH 08/12] x86/dumpstack: Pin the target stack in save_stack_trace_tsk() Andy Lutomirski
2016-09-14 14:55   ` Josh Poimboeuf
2016-09-14 18:22     ` Andy Lutomirski
2016-09-14 18:35       ` Josh Poimboeuf
2016-09-15 18:04         ` Andy Lutomirski
2016-09-15 18:37           ` Josh Poimboeuf
2016-09-15 18:41             ` Andy Lutomirski
2016-09-15 19:19               ` Josh Poimboeuf
2016-09-16  7:47                 ` Peter Zijlstra
2016-09-16 15:12                   ` Andy Lutomirski
2016-09-16 15:31                     ` Peter Zijlstra
2016-09-16 15:32                       ` Andy Lutomirski
2016-09-16 16:35                         ` Peter Zijlstra
2016-09-15  6:37   ` Ingo Molnar
     [not found]     ` <CA+55aFxt=HLrELBE=BXUrWdh6LYs4gtu9S=yCruiDffq4HN80w@mail.gmail.com>
2016-09-15  9:27       ` Ingo Molnar
2016-09-13 21:29 ` [PATCH 09/12] x86/process: Pin the target stack in get_wchan() Andy Lutomirski
2016-09-17  2:00   ` Jann Horn
2016-09-22 22:44     ` Andy Lutomirski
2016-09-22 22:50       ` Andy Lutomirski
2016-09-23  7:43       ` Jann Horn
2016-09-23 18:28         ` Kees Cook
2016-09-23 18:34           ` Jann Horn
2016-09-26  5:10             ` Tycho Andersen
2016-09-13 21:29 ` [PATCH 10/12] lib/syscall: Pin the task stack in collect_syscall() Andy Lutomirski
2016-09-13 21:29 ` [PATCH 11/12] sched: Free the stack early if CONFIG_THREAD_INFO_IN_TASK Andy Lutomirski
2016-09-13 21:29 ` [PATCH 12/12] fork: Cache two thread stacks per cpu if CONFIG_VMAP_STACK is set Andy Lutomirski
