linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/21] Preparatory patches for x86 KAISER support
@ 2017-11-27 10:45 Ingo Molnar
  2017-11-27 10:45 ` [PATCH 01/21] x86/unwinder/orc: Don't bail on stack overflow Ingo Molnar
                   ` (20 more replies)
  0 siblings, 21 replies; 35+ messages in thread
From: Ingo Molnar @ 2017-11-27 10:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

This is mostly Andy's syscall trampoline stack patches, but also
some other preparatory cleanups and enhancements.

Sending this separately to make it (slightly) easier to review.

The Git version of the full queue can be found at:

  git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.x86/mm

Thanks,

    Ingo

=============>

Andy Lutomirski (20):
  x86/unwinder/orc: Don't bail on stack overflow
  x86/irq: Remove an old outdated comment about context tracking races
  x86/irq/64: Print the offending IP in the stack overflow warning
  x86/entry/64: Allocate and enable the SYSENTER stack
  x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
  x86/entry/gdt: Put per-CPU GDT remaps in ascending order
  x86/mm/fixmap: Generalize the GDT fixmap mechanism, introduce 'struct cpu_entry_area'
  x86/kasan/64: Teach KASAN about the cpu_entry_area
  x86/entry: Fix assumptions that the HW TSS is at the beginning of cpu_tss
  x86/dumpstack: Handle stack overflow on all stacks
  x86/entry: Move SYSENTER_stack to the beginning of struct tss_struct
  x86/entry: Remap the TSS into the CPU entry area
  x86/entry/64: Separate cpu_current_top_of_stack from TSS.sp0
  x86/espfix/64: Stop assuming that pt_regs is on the entry stack
  x86/entry/64: Use a per-CPU trampoline stack for IDT entries
  x86/entry/64: Return to userspace from the trampoline stack
  x86/entry/64: Create a per-CPU SYSCALL entry trampoline
  x86/entry/64: Move the IST stacks into 'struct cpu_entry_area'
  x86/entry/64: Remove the SYSENTER stack canary
  x86/entry: Clean up the SYSENTER_stack code

Josh Poimboeuf (1):
  x86/unwinder: Handle stack overflows more gracefully

 arch/x86/entry/entry_32.S          |   6 +-
 arch/x86/entry/entry_64.S          | 170 ++++++++++++++++++++++++++++++++-----
 arch/x86/entry/entry_64_compat.S   |   7 +-
 arch/x86/include/asm/desc.h        |  11 +--
 arch/x86/include/asm/fixmap.h      |  61 ++++++++++++-
 arch/x86/include/asm/kdebug.h      |   1 +
 arch/x86/include/asm/processor.h   |  54 +++++++-----
 arch/x86/include/asm/stacktrace.h  |   3 +
 arch/x86/include/asm/switch_to.h   |   3 +-
 arch/x86/include/asm/thread_info.h |   2 +-
 arch/x86/include/asm/traps.h       |   1 -
 arch/x86/include/asm/unwind.h      |   7 ++
 arch/x86/kernel/asm-offsets.c      |   7 ++
 arch/x86/kernel/asm-offsets_32.c   |   5 --
 arch/x86/kernel/asm-offsets_64.c   |   1 +
 arch/x86/kernel/cpu/common.c       | 136 +++++++++++++++++++++--------
 arch/x86/kernel/doublefault.c      |  36 ++++----
 arch/x86/kernel/dumpstack.c        |  74 ++++++++++++----
 arch/x86/kernel/dumpstack_32.c     |   6 ++
 arch/x86/kernel/dumpstack_64.c     |   6 ++
 arch/x86/kernel/irq.c              |  12 ---
 arch/x86/kernel/irq_64.c           |   4 +-
 arch/x86/kernel/process.c          |  13 ++-
 arch/x86/kernel/process_64.c       |  12 +--
 arch/x86/kernel/stacktrace.c       |   2 +-
 arch/x86/kernel/traps.c            |  57 ++++++++-----
 arch/x86/kernel/unwind_orc.c       |  88 ++++++++-----------
 arch/x86/kernel/vmlinux.lds.S      |   9 ++
 arch/x86/mm/kasan_init_64.c        |  18 +++-
 arch/x86/power/cpu.c               |  16 ++--
 arch/x86/xen/mmu_pv.c              |   2 +-
 31 files changed, 590 insertions(+), 240 deletions(-)

-- 
2.14.1

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 01/21] x86/unwinder/orc: Don't bail on stack overflow
  2017-11-27 10:45 [PATCH 00/21] Preparatory patches for x86 KAISER support Ingo Molnar
@ 2017-11-27 10:45 ` Ingo Molnar
  2017-11-27 14:47   ` Borislav Petkov
  2017-11-27 10:45 ` [PATCH 02/21] x86/unwinder: Handle stack overflows more gracefully Ingo Molnar
                   ` (19 subsequent siblings)
  20 siblings, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2017-11-27 10:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

If we overflow the stack into a guard page and then try to unwind it
with ORC, it should work well: by construction, there can't be any
meaningful data in the guard page because no writes to the guard page
will have succeeded.

This patch fixes a bug that unwinding from working correctly: if the
starting register state has RSP pointing into a stack guard page, the
ORC unwinder bails out immediately.  This patch fixes that: the ORC
unwinder will start the unwind.

I tested this by intentionally overflowing the task stack.  The
result is an accurate call trace instead of a trace consisting
purely of '?' entries.

There are a few other bugs that are triggered if the unwinder
encounters a stack overflow after the first step, and Josh has WIP
patches to fix those as well.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/927042950d7f1a7007dd0f58538966a593508f8b.1511715954.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/unwind_orc.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
index a3f973b2c97a..7f6e3935666b 100644
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -553,8 +553,18 @@ void __unwind_start(struct unwind_state *state, struct task_struct *task,
 	}
 
 	if (get_stack_info((unsigned long *)state->sp, state->task,
-			   &state->stack_info, &state->stack_mask))
-		return;
+			   &state->stack_info, &state->stack_mask)) {
+		/*
+		 * We weren't on a valid stack.  It's possible that
+		 * we overflowed a valid stack into a guard page.
+		 * See if the next page up is valid so that we can
+		 * generate some kind of backtrace if this happens.
+		 */
+		void *next_page = (void *)PAGE_ALIGN((unsigned long)regs->sp);
+		if (get_stack_info(next_page, state->task, &state->stack_info,
+				   &state->stack_mask))
+			return;
+	}
 
 	/*
 	 * The caller can provide the address of the first frame directly
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 02/21] x86/unwinder: Handle stack overflows more gracefully
  2017-11-27 10:45 [PATCH 00/21] Preparatory patches for x86 KAISER support Ingo Molnar
  2017-11-27 10:45 ` [PATCH 01/21] x86/unwinder/orc: Don't bail on stack overflow Ingo Molnar
@ 2017-11-27 10:45 ` Ingo Molnar
  2017-11-27 17:36   ` Borislav Petkov
  2017-11-27 10:45 ` [PATCH 03/21] x86/irq: Remove an old outdated comment about context tracking races Ingo Molnar
                   ` (18 subsequent siblings)
  20 siblings, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2017-11-27 10:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Josh Poimboeuf <jpoimboe@redhat.com>

There are at least two unwinder bugs hindering the debugging of
stack-overflow crashes:

- It doesn't deal gracefully with the case where the stack overflows and
  the stack pointer itself isn't on a valid stack but the
  to-be-dereferenced data *is*.

- The ORC oops dump code doesn't know how to print partial pt_regs, for the
  case where if we get an interrupt/exception in *early* entry code
  before the full pt_regs have been saved.

Fix both issues.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171126024031.uxi4numpbjm5rlbr@treble
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/kdebug.h |  1 +
 arch/x86/include/asm/unwind.h |  7 ++++
 arch/x86/kernel/dumpstack.c   | 32 ++++++++++++++++---
 arch/x86/kernel/process_64.c  | 11 +++----
 arch/x86/kernel/stacktrace.c  |  2 +-
 arch/x86/kernel/unwind_orc.c  | 74 +++++++++++++++----------------------------
 6 files changed, 66 insertions(+), 61 deletions(-)

diff --git a/arch/x86/include/asm/kdebug.h b/arch/x86/include/asm/kdebug.h
index f86a8caa561e..395c9631e000 100644
--- a/arch/x86/include/asm/kdebug.h
+++ b/arch/x86/include/asm/kdebug.h
@@ -26,6 +26,7 @@ extern void die(const char *, struct pt_regs *,long);
 extern int __must_check __die(const char *, struct pt_regs *, long);
 extern void show_stack_regs(struct pt_regs *regs);
 extern void __show_regs(struct pt_regs *regs, int all);
+extern void show_iret_regs(struct pt_regs *regs);
 extern unsigned long oops_begin(void);
 extern void oops_end(unsigned long, struct pt_regs *, int signr);
 
diff --git a/arch/x86/include/asm/unwind.h b/arch/x86/include/asm/unwind.h
index e9cc6fe1fc6f..5be2fb23825a 100644
--- a/arch/x86/include/asm/unwind.h
+++ b/arch/x86/include/asm/unwind.h
@@ -7,6 +7,9 @@
 #include <asm/ptrace.h>
 #include <asm/stacktrace.h>
 
+#define IRET_FRAME_OFFSET (offsetof(struct pt_regs, ip))
+#define IRET_FRAME_SIZE   (sizeof(struct pt_regs) - IRET_FRAME_OFFSET)
+
 struct unwind_state {
 	struct stack_info stack_info;
 	unsigned long stack_mask;
@@ -52,6 +55,10 @@ void unwind_start(struct unwind_state *state, struct task_struct *task,
 }
 
 #if defined(CONFIG_UNWINDER_ORC) || defined(CONFIG_UNWINDER_FRAME_POINTER)
+/*
+ * WARNING: The entire pt_regs may not be safe to dereference.  In some cases,
+ * only the iret registers are accessible.  Use with caution!
+ */
 static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
 {
 	if (unwind_done(state))
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index f13b4c00a5de..fc918744ad7d 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -50,6 +50,28 @@ static void printk_stack_address(unsigned long address, int reliable,
 	printk("%s %s%pB\n", log_lvl, reliable ? "" : "? ", (void *)address);
 }
 
+void show_iret_regs(struct pt_regs *regs)
+{
+	printk(KERN_DEFAULT "RIP: %04x:%pS\n", (int)regs->cs, (void *)regs->ip);
+	printk(KERN_DEFAULT "RSP: %04x:%016lx EFLAGS: %08lx", (int)regs->ss,
+		regs->sp, regs->flags);
+}
+
+static void show_regs_safe(struct stack_info *info, struct pt_regs *regs)
+{
+	if (on_stack(info, regs, sizeof(*regs)))
+		__show_regs(regs, 0);
+	else if (on_stack(info, (void *)regs + IRET_FRAME_OFFSET,
+			  IRET_FRAME_SIZE)) {
+		/*
+		 * When an interrupt or exception occurs in entry code, the
+		 * full pt_regs might not have been saved yet.  In that case
+		 * just print the iret return frame.
+		 */
+		show_iret_regs(regs);
+	}
+}
+
 void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 			unsigned long *stack, char *log_lvl)
 {
@@ -94,8 +116,8 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 		if (stack_name)
 			printk("%s <%s>\n", log_lvl, stack_name);
 
-		if (regs && on_stack(&stack_info, regs, sizeof(*regs)))
-			__show_regs(regs, 0);
+		if (regs)
+			show_regs_safe(&stack_info, regs);
 
 		/*
 		 * Scan the stack, printing any text addresses we find.  At the
@@ -119,7 +141,7 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 
 			/*
 			 * Don't print regs->ip again if it was already printed
-			 * by __show_regs() below.
+			 * by show_regs_safe() below.
 			 */
 			if (regs && stack == &regs->ip)
 				goto next;
@@ -155,8 +177,8 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 
 			/* if the frame has entry regs, print them */
 			regs = unwind_get_entry_regs(&state);
-			if (regs && on_stack(&stack_info, regs, sizeof(*regs)))
-				__show_regs(regs, 0);
+			if (regs)
+				show_regs_safe(&stack_info, regs);
 		}
 
 		if (stack_name)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index eeeb34f85c25..01b119bebb68 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -69,9 +69,8 @@ void __show_regs(struct pt_regs *regs, int all)
 	unsigned int fsindex, gsindex;
 	unsigned int ds, cs, es;
 
-	printk(KERN_DEFAULT "RIP: %04lx:%pS\n", regs->cs, (void *)regs->ip);
-	printk(KERN_DEFAULT "RSP: %04lx:%016lx EFLAGS: %08lx", regs->ss,
-		regs->sp, regs->flags);
+	show_iret_regs(regs);
+
 	if (regs->orig_ax != -1)
 		pr_cont(" ORIG_RAX: %016lx\n", regs->orig_ax);
 	else
@@ -88,6 +87,9 @@ void __show_regs(struct pt_regs *regs, int all)
 	printk(KERN_DEFAULT "R13: %016lx R14: %016lx R15: %016lx\n",
 	       regs->r13, regs->r14, regs->r15);
 
+	if (!all)
+		return;
+
 	asm("movl %%ds,%0" : "=r" (ds));
 	asm("movl %%cs,%0" : "=r" (cs));
 	asm("movl %%es,%0" : "=r" (es));
@@ -98,9 +100,6 @@ void __show_regs(struct pt_regs *regs, int all)
 	rdmsrl(MSR_GS_BASE, gs);
 	rdmsrl(MSR_KERNEL_GS_BASE, shadowgs);
 
-	if (!all)
-		return;
-
 	cr0 = read_cr0();
 	cr2 = read_cr2();
 	cr3 = __read_cr3();
diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index 77835bc021c7..7dd0d2a0d142 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -102,7 +102,7 @@ __save_stack_trace_reliable(struct stack_trace *trace,
 	for (unwind_start(&state, task, NULL, NULL); !unwind_done(&state);
 	     unwind_next_frame(&state)) {
 
-		regs = unwind_get_entry_regs(&state);
+		regs = unwind_get_entry_regs(&state, NULL);
 		if (regs) {
 			/*
 			 * Kernel mode registers on the stack indicate an
diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
index 7f6e3935666b..5b9297299f72 100644
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -253,22 +253,15 @@ unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
 	return NULL;
 }
 
-static bool stack_access_ok(struct unwind_state *state, unsigned long addr,
+static bool stack_access_ok(struct unwind_state *state, unsigned long _addr,
 			    size_t len)
 {
 	struct stack_info *info = &state->stack_info;
+	void *addr = (void *)_addr;
 
-	/*
-	 * If the address isn't on the current stack, switch to the next one.
-	 *
-	 * We may have to traverse multiple stacks to deal with the possibility
-	 * that info->next_sp could point to an empty stack and the address
-	 * could be on a subsequent stack.
-	 */
-	while (!on_stack(info, (void *)addr, len))
-		if (get_stack_info(info->next_sp, state->task, info,
-				   &state->stack_mask))
-			return false;
+	if (!on_stack(info, addr, len) &&
+	    (get_stack_info(addr, state->task, info, &state->stack_mask)))
+		return false;
 
 	return true;
 }
@@ -283,42 +276,32 @@ static bool deref_stack_reg(struct unwind_state *state, unsigned long addr,
 	return true;
 }
 
-#define REGS_SIZE (sizeof(struct pt_regs))
-#define SP_OFFSET (offsetof(struct pt_regs, sp))
-#define IRET_REGS_SIZE (REGS_SIZE - offsetof(struct pt_regs, ip))
-#define IRET_SP_OFFSET (SP_OFFSET - offsetof(struct pt_regs, ip))
-
 static bool deref_stack_regs(struct unwind_state *state, unsigned long addr,
-			     unsigned long *ip, unsigned long *sp, bool full)
+			     unsigned long *ip, unsigned long *sp)
 {
-	size_t regs_size = full ? REGS_SIZE : IRET_REGS_SIZE;
-	size_t sp_offset = full ? SP_OFFSET : IRET_SP_OFFSET;
-	struct pt_regs *regs = (struct pt_regs *)(addr + regs_size - REGS_SIZE);
-
-	if (IS_ENABLED(CONFIG_X86_64)) {
-		if (!stack_access_ok(state, addr, regs_size))
-			return false;
-
-		*ip = regs->ip;
-		*sp = regs->sp;
+	struct pt_regs *regs = (struct pt_regs *)addr;
 
-		return true;
-	}
+	/* x86-32 support will be more complicated due to the &regs->sp hack */
+	BUILD_BUG_ON(IS_ENABLED(CONFIG_X86_32));
 
-	if (!stack_access_ok(state, addr, sp_offset))
+	if (!stack_access_ok(state, addr, sizeof(struct pt_regs)))
 		return false;
 
 	*ip = regs->ip;
+	*sp = regs->sp;
+	return true;
+}
 
-	if (user_mode(regs)) {
-		if (!stack_access_ok(state, addr + sp_offset,
-				     REGS_SIZE - SP_OFFSET))
-			return false;
+static bool deref_stack_iret_regs(struct unwind_state *state, unsigned long addr,
+				  unsigned long *ip, unsigned long *sp)
+{
+	struct pt_regs *regs = (void *)addr - IRET_FRAME_OFFSET;
 
-		*sp = regs->sp;
-	} else
-		*sp = (unsigned long)&regs->sp;
+	if (!stack_access_ok(state, addr, IRET_FRAME_SIZE))
+		return false;
 
+	*ip = regs->ip;
+	*sp = regs->sp;
 	return true;
 }
 
@@ -327,7 +310,6 @@ bool unwind_next_frame(struct unwind_state *state)
 	unsigned long ip_p, sp, orig_ip, prev_sp = state->sp;
 	enum stack_type prev_type = state->stack_info.type;
 	struct orc_entry *orc;
-	struct pt_regs *ptregs;
 	bool indirect = false;
 
 	if (unwind_done(state))
@@ -435,7 +417,7 @@ bool unwind_next_frame(struct unwind_state *state)
 		break;
 
 	case ORC_TYPE_REGS:
-		if (!deref_stack_regs(state, sp, &state->ip, &state->sp, true)) {
+		if (!deref_stack_regs(state, sp, &state->ip, &state->sp)) {
 			orc_warn("can't dereference registers at %p for ip %pB\n",
 				 (void *)sp, (void *)orig_ip);
 			goto done;
@@ -447,20 +429,14 @@ bool unwind_next_frame(struct unwind_state *state)
 		break;
 
 	case ORC_TYPE_REGS_IRET:
-		if (!deref_stack_regs(state, sp, &state->ip, &state->sp, false)) {
+		if (!deref_stack_iret_regs(state, sp, &state->ip, &state->sp)) {
 			orc_warn("can't dereference iret registers at %p for ip %pB\n",
 				 (void *)sp, (void *)orig_ip);
 			goto done;
 		}
 
-		ptregs = container_of((void *)sp, struct pt_regs, ip);
-		if ((unsigned long)ptregs >= prev_sp &&
-		    on_stack(&state->stack_info, ptregs, REGS_SIZE)) {
-			state->regs = ptregs;
-			state->full_regs = false;
-		} else
-			state->regs = NULL;
-
+		state->regs = (void *)sp - IRET_FRAME_OFFSET;
+		state->full_regs = false;
 		state->signal = true;
 		break;
 
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 03/21] x86/irq: Remove an old outdated comment about context tracking races
  2017-11-27 10:45 [PATCH 00/21] Preparatory patches for x86 KAISER support Ingo Molnar
  2017-11-27 10:45 ` [PATCH 01/21] x86/unwinder/orc: Don't bail on stack overflow Ingo Molnar
  2017-11-27 10:45 ` [PATCH 02/21] x86/unwinder: Handle stack overflows more gracefully Ingo Molnar
@ 2017-11-27 10:45 ` Ingo Molnar
  2017-11-27 10:45 ` [PATCH 04/21] x86/irq/64: Print the offending IP in the stack overflow warning Ingo Molnar
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 35+ messages in thread
From: Ingo Molnar @ 2017-11-27 10:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

That race has been fixed and code cleaned up for a while now.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Link: https://lkml.kernel.org/r/12e75976dbbb7ece2b0a64238f1d3892dfed1e16.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/irq.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 49cfd9fe7589..68e1867cca80 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -219,18 +219,6 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
 	/* high bit used in ret_from_ code  */
 	unsigned vector = ~regs->orig_ax;
 
-	/*
-	 * NB: Unlike exception entries, IRQ entries do not reliably
-	 * handle context tracking in the low-level entry code.  This is
-	 * because syscall entries execute briefly with IRQs on before
-	 * updating context tracking state, so we can take an IRQ from
-	 * kernel mode with CONTEXT_USER.  The low-level entry code only
-	 * updates the context if we came from user mode, so we won't
-	 * switch to CONTEXT_KERNEL.  We'll fix that once the syscall
-	 * code is cleaned up enough that we can cleanly defer enabling
-	 * IRQs.
-	 */
-
 	entering_irq();
 
 	/* entering_irq() tells RCU that we're not quiescent.  Check it. */
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 04/21] x86/irq/64: Print the offending IP in the stack overflow warning
  2017-11-27 10:45 [PATCH 00/21] Preparatory patches for x86 KAISER support Ingo Molnar
                   ` (2 preceding siblings ...)
  2017-11-27 10:45 ` [PATCH 03/21] x86/irq: Remove an old outdated comment about context tracking races Ingo Molnar
@ 2017-11-27 10:45 ` Ingo Molnar
  2017-11-27 10:45 ` [PATCH 05/21] x86/entry/64: Allocate and enable the SYSENTER stack Ingo Molnar
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 35+ messages in thread
From: Ingo Molnar @ 2017-11-27 10:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

In case something goes wrong with unwind (not unlikely in case of
overflow), print the offending IP where we detected the overflow.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Link: https://lkml.kernel.org/r/6fcf700cc5ee884fb739b67d1246ab4185c41409.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/irq_64.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
index 020efbf5786b..d86e344f5b3d 100644
--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -57,10 +57,10 @@ static inline void stack_overflow_check(struct pt_regs *regs)
 	if (regs->sp >= estack_top && regs->sp <= estack_bottom)
 		return;
 
-	WARN_ONCE(1, "do_IRQ(): %s has overflown the kernel stack (cur:%Lx,sp:%lx,irq stk top-bottom:%Lx-%Lx,exception stk top-bottom:%Lx-%Lx)\n",
+	WARN_ONCE(1, "do_IRQ(): %s has overflown the kernel stack (cur:%Lx,sp:%lx,irq stk top-bottom:%Lx-%Lx,exception stk top-bottom:%Lx-%Lx,ip:%pF)\n",
 		current->comm, curbase, regs->sp,
 		irq_stack_top, irq_stack_bottom,
-		estack_top, estack_bottom);
+		estack_top, estack_bottom, (void *)regs->ip);
 
 	if (sysctl_panic_on_stackoverflow)
 		panic("low stack detected by irq handler - check messages\n");
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 05/21] x86/entry/64: Allocate and enable the SYSENTER stack
  2017-11-27 10:45 [PATCH 00/21] Preparatory patches for x86 KAISER support Ingo Molnar
                   ` (3 preceding siblings ...)
  2017-11-27 10:45 ` [PATCH 04/21] x86/irq/64: Print the offending IP in the stack overflow warning Ingo Molnar
@ 2017-11-27 10:45 ` Ingo Molnar
  2017-11-27 10:45 ` [PATCH 06/21] x86/dumpstack: Add get_stack_info() support for " Ingo Molnar
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 35+ messages in thread
From: Ingo Molnar @ 2017-11-27 10:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

This will simplify future changes that want scratch variables early in
the SYSENTER handler -- they'll be able to spill registers to the
stack.  It also lets us get rid of a SWAPGS_UNSAFE_STACK user.

This does not depend on CONFIG_IA32_EMULATION=y because we'll want the
stack space even without IA32 emulation.

As far as I can tell, the reason that this wasn't done from day 1 is
that we use IST for #DB and #BP, which is IMO rather nasty and causes
a lot more problems than it solves.  But, since #DB uses IST, we don't
actually need a real stack for SYSENTER (because SYSENTER with TF set
will invoke #DB on the IST stack rather than the SYSENTER stack).

I want to remove IST usage from these vectors some day, and this patch
is a prerequisite for that as well.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Link: https://lkml.kernel.org/r/c37d6e68a73e1b5b1203e0e95b488fa8092b3cfb.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/entry_64_compat.S | 2 +-
 arch/x86/include/asm/processor.h | 3 ---
 arch/x86/kernel/asm-offsets.c    | 5 +++++
 arch/x86/kernel/asm-offsets_32.c | 5 -----
 arch/x86/kernel/cpu/common.c     | 4 +++-
 arch/x86/kernel/process.c        | 2 --
 arch/x86/kernel/traps.c          | 3 +--
 7 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 568e130d932c..dcc6987f9bae 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -48,7 +48,7 @@
  */
 ENTRY(entry_SYSENTER_compat)
 	/* Interrupts are off on entry. */
-	SWAPGS_UNSAFE_STACK
+	SWAPGS
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/*
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index cc16fa882e3e..504a3bb4d5f0 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -340,14 +340,11 @@ struct tss_struct {
 	 */
 	unsigned long		io_bitmap[IO_BITMAP_LONGS + 1];
 
-#ifdef CONFIG_X86_32
 	/*
 	 * Space for the temporary SYSENTER stack.
 	 */
 	unsigned long		SYSENTER_stack_canary;
 	unsigned long		SYSENTER_stack[64];
-#endif
-
 } ____cacheline_aligned;
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss);
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 8ea78275480d..b275863128eb 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -93,4 +93,9 @@ void common(void) {
 
 	BLANK();
 	DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));
+
+	/* Offset from cpu_tss to SYSENTER_stack */
+	OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
+	/* Size of SYSENTER_stack */
+	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
 }
diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index dedf428b20b6..52ce4ea16e53 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -50,11 +50,6 @@ void foo(void)
 	DEFINE(TSS_sysenter_sp0, offsetof(struct tss_struct, x86_tss.sp0) -
 	       offsetofend(struct tss_struct, SYSENTER_stack));
 
-	/* Offset from cpu_tss to SYSENTER_stack */
-	OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
-	/* Size of SYSENTER_stack */
-	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
-
 #ifdef CONFIG_CC_STACKPROTECTOR
 	BLANK();
 	OFFSET(stack_canary_offset, stack_canary, canary);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index fa998ca8aa5a..ccb5f66c4e5b 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1386,7 +1386,9 @@ void syscall_init(void)
 	 * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
 	 */
 	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
-	wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
+	wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
+		    (unsigned long)this_cpu_ptr(&cpu_tss) +
+		    offsetofend(struct tss_struct, SYSENTER_stack));
 	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
 #else
 	wrmsrl(MSR_CSTAR, (unsigned long)ignore_sysret);
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index bb988a24db92..ceedb9c431cc 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -71,9 +71,7 @@ __visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
 	  */
 	.io_bitmap		= { [0 ... IO_BITMAP_LONGS] = ~0 },
 #endif
-#ifdef CONFIG_X86_32
 	.SYSENTER_stack_canary	= STACK_END_MAGIC,
-#endif
 };
 EXPORT_PER_CPU_SYMBOL(cpu_tss);
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index b7b0f74a2150..2008dd0f8ccb 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -800,14 +800,13 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
 	debug_stack_usage_dec();
 
 exit:
-#if defined(CONFIG_X86_32)
 	/*
 	 * This is the most likely code path that involves non-trivial use
 	 * of the SYSENTER stack.  Check that we haven't overrun it.
 	 */
 	WARN(this_cpu_read(cpu_tss.SYSENTER_stack_canary) != STACK_END_MAGIC,
 	     "Overran or corrupted SYSENTER stack\n");
-#endif
+
 	ist_exit(regs);
 }
 NOKPROBE_SYMBOL(do_debug);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 06/21] x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
  2017-11-27 10:45 [PATCH 00/21] Preparatory patches for x86 KAISER support Ingo Molnar
                   ` (4 preceding siblings ...)
  2017-11-27 10:45 ` [PATCH 05/21] x86/entry/64: Allocate and enable the SYSENTER stack Ingo Molnar
@ 2017-11-27 10:45 ` Ingo Molnar
  2017-11-27 10:45 ` [PATCH 07/21] x86/entry/gdt: Put per-CPU GDT remaps in ascending order Ingo Molnar
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 35+ messages in thread
From: Ingo Molnar @ 2017-11-27 10:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

get_stack_info() doesn't currently know about the SYSENTER stack, so
unwinding will fail if we entered the kernel on the SYSENTER stack
and haven't fully switched off.  Teach get_stack_info() about the
SYSENTER stack.

With future patches applied that run part of the entry code on the
SYSENTER stack and introduce an intentional BUG(), I would get:

  PANIC: double fault, error_code: 0x0
  ...
  RIP: 0010:do_error_trap+0x33/0x1c0
  ...
  Call Trace:
  Code: ...

With this patch, I get:

  PANIC: double fault, error_code: 0x0
  ...
  Call Trace:
   <SYSENTER>
   ? async_page_fault+0x36/0x60
   ? invalid_op+0x22/0x40
   ? async_page_fault+0x36/0x60
   ? sync_regs+0x3c/0x40
   ? sync_regs+0x2e/0x40
   ? error_entry+0x6c/0xd0
   ? async_page_fault+0x36/0x60
   </SYSENTER>
  Code: ...

which is a lot more informative.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Link: https://lkml.kernel.org/r/c32ce8b363e27fa9b4a4773297d5b4b0f4b39e94.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/stacktrace.h |  3 +++
 arch/x86/kernel/dumpstack.c       | 19 +++++++++++++++++++
 arch/x86/kernel/dumpstack_32.c    |  6 ++++++
 arch/x86/kernel/dumpstack_64.c    |  6 ++++++
 4 files changed, 34 insertions(+)

diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index 8da111b3c342..f8062bfd43a0 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -16,6 +16,7 @@ enum stack_type {
 	STACK_TYPE_TASK,
 	STACK_TYPE_IRQ,
 	STACK_TYPE_SOFTIRQ,
+	STACK_TYPE_SYSENTER,
 	STACK_TYPE_EXCEPTION,
 	STACK_TYPE_EXCEPTION_LAST = STACK_TYPE_EXCEPTION + N_EXCEPTION_STACKS-1,
 };
@@ -28,6 +29,8 @@ struct stack_info {
 bool in_task_stack(unsigned long *stack, struct task_struct *task,
 		   struct stack_info *info);
 
+bool in_sysenter_stack(unsigned long *stack, struct stack_info *info);
+
 int get_stack_info(unsigned long *stack, struct task_struct *task,
 		   struct stack_info *info, unsigned long *visit_mask);
 
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index fc918744ad7d..a49d3c86d324 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -43,6 +43,25 @@ bool in_task_stack(unsigned long *stack, struct task_struct *task,
 	return true;
 }
 
+bool in_sysenter_stack(unsigned long *stack, struct stack_info *info)
+{
+	struct tss_struct *tss = this_cpu_ptr(&cpu_tss);
+
+	/* Treat the canary as part of the stack for unwinding purposes. */
+	void *begin = &tss->SYSENTER_stack_canary;
+	void *end = (void *)&tss->SYSENTER_stack + sizeof(tss->SYSENTER_stack);
+
+	if ((void *)stack < begin || (void *)stack >= end)
+		return false;
+
+	info->type	= STACK_TYPE_SYSENTER;
+	info->begin	= begin;
+	info->end	= end;
+	info->next_sp	= NULL;
+
+	return true;
+}
+
 static void printk_stack_address(unsigned long address, int reliable,
 				 char *log_lvl)
 {
diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index daefae83a3aa..5ff13a6b3680 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -26,6 +26,9 @@ const char *stack_type_name(enum stack_type type)
 	if (type == STACK_TYPE_SOFTIRQ)
 		return "SOFTIRQ";
 
+	if (type == STACK_TYPE_SYSENTER)
+		return "SYSENTER";
+
 	return NULL;
 }
 
@@ -93,6 +96,9 @@ int get_stack_info(unsigned long *stack, struct task_struct *task,
 	if (task != current)
 		goto unknown;
 
+	if (in_sysenter_stack(stack, info))
+		goto recursion_check;
+
 	if (in_hardirq_stack(stack, info))
 		goto recursion_check;
 
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 88ce2ffdb110..abc828f8c297 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -37,6 +37,9 @@ const char *stack_type_name(enum stack_type type)
 	if (type == STACK_TYPE_IRQ)
 		return "IRQ";
 
+	if (type == STACK_TYPE_SYSENTER)
+		return "SYSENTER";
+
 	if (type >= STACK_TYPE_EXCEPTION && type <= STACK_TYPE_EXCEPTION_LAST)
 		return exception_stack_names[type - STACK_TYPE_EXCEPTION];
 
@@ -115,6 +118,9 @@ int get_stack_info(unsigned long *stack, struct task_struct *task,
 	if (in_irq_stack(stack, info))
 		goto recursion_check;
 
+	if (in_sysenter_stack(stack, info))
+		goto recursion_check;
+
 	goto unknown;
 
 recursion_check:
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 07/21] x86/entry/gdt: Put per-CPU GDT remaps in ascending order
  2017-11-27 10:45 [PATCH 00/21] Preparatory patches for x86 KAISER support Ingo Molnar
                   ` (5 preceding siblings ...)
  2017-11-27 10:45 ` [PATCH 06/21] x86/dumpstack: Add get_stack_info() support for " Ingo Molnar
@ 2017-11-27 10:45 ` Ingo Molnar
  2017-11-27 10:45 ` [PATCH 08/21] x86/mm/fixmap: Generalize the GDT fixmap mechanism, introduce 'struct cpu_entry_area' Ingo Molnar
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 35+ messages in thread
From: Ingo Molnar @ 2017-11-27 10:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

We currently have CPU 0's GDT at the top of the GDT range and
higher-numbered CPUs at lower addresses.  This happens because the
fixmap is upside down (index 0 is the top of the fixmap).

Flip it so that GDTs are in ascending order by virtual address.
This will simplify a future patch that will generalize the GDT
remap to contain multiple pages.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Link: https://lkml.kernel.org/r/3966a6edf6fd45deca4cf52a9b9276402499dda9.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/desc.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 4011cb03ef08..95cd95eb7285 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -63,7 +63,7 @@ static inline struct desc_struct *get_current_gdt_rw(void)
 /* Get the fixmap index for a specific processor */
 static inline unsigned int get_cpu_gdt_ro_index(int cpu)
 {
-	return FIX_GDT_REMAP_BEGIN + cpu;
+	return FIX_GDT_REMAP_END - cpu;
 }
 
 /* Provide the fixmap address of the remapped GDT */
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 08/21] x86/mm/fixmap: Generalize the GDT fixmap mechanism, introduce 'struct cpu_entry_area'
  2017-11-27 10:45 [PATCH 00/21] Preparatory patches for x86 KAISER support Ingo Molnar
                   ` (6 preceding siblings ...)
  2017-11-27 10:45 ` [PATCH 07/21] x86/entry/gdt: Put per-CPU GDT remaps in ascending order Ingo Molnar
@ 2017-11-27 10:45 ` Ingo Molnar
  2017-11-27 10:45 ` [PATCH 09/21] x86/kasan/64: Teach KASAN about the cpu_entry_area Ingo Molnar
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 35+ messages in thread
From: Ingo Molnar @ 2017-11-27 10:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

Currently, the GDT is an ad-hoc array of pages, one per CPU, in the
fixmap.  Generalize it to be an array of a new 'struct cpu_entry_area'
so that we can cleanly add new things to it.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Link: https://lkml.kernel.org/r/22571d77ba1f3c714df9fa37db9a58218bc17597.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/desc.h   |  9 +--------
 arch/x86/include/asm/fixmap.h | 37 +++++++++++++++++++++++++++++++++++--
 arch/x86/kernel/cpu/common.c  | 14 +++++++-------
 arch/x86/xen/mmu_pv.c         |  2 +-
 4 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 95cd95eb7285..194ffab00ebe 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -60,17 +60,10 @@ static inline struct desc_struct *get_current_gdt_rw(void)
 	return this_cpu_ptr(&gdt_page)->gdt;
 }
 
-/* Get the fixmap index for a specific processor */
-static inline unsigned int get_cpu_gdt_ro_index(int cpu)
-{
-	return FIX_GDT_REMAP_END - cpu;
-}
-
 /* Provide the fixmap address of the remapped GDT */
 static inline struct desc_struct *get_cpu_gdt_ro(int cpu)
 {
-	unsigned int idx = get_cpu_gdt_ro_index(cpu);
-	return (struct desc_struct *)__fix_to_virt(idx);
+	return (struct desc_struct *)&get_cpu_entry_area(cpu)->gdt;
 }
 
 /* Provide the current read-only GDT */
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index dcd9fb55e679..b5037e426f1f 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -44,6 +44,19 @@ extern unsigned long __FIXADDR_TOP;
 			 PAGE_SIZE)
 #endif
 
+/*
+ * cpu_entry_area is a percpu region in the fixmap that contains things
+ * needed by the CPU and early entry/exit code.  Real types aren't used
+ * for all fields here to avoid circular header dependencies.
+ *
+ * Every field is a virtual alias of some other allocated backing store.
+ * There is no direct allocation of a struct cpu_entry_area.
+ */
+struct cpu_entry_area {
+	char gdt[PAGE_SIZE];
+};
+
+#define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
 
 /*
  * Here we define all the compile-time 'special' virtual
@@ -101,8 +114,8 @@ enum fixed_addresses {
 	FIX_LNW_VRTC,
 #endif
 	/* Fixmap entries to remap the GDTs, one per processor. */
-	FIX_GDT_REMAP_BEGIN,
-	FIX_GDT_REMAP_END = FIX_GDT_REMAP_BEGIN + NR_CPUS - 1,
+	FIX_CPU_ENTRY_AREA_TOP,
+	FIX_CPU_ENTRY_AREA_BOTTOM = FIX_CPU_ENTRY_AREA_TOP + (CPU_ENTRY_AREA_PAGES * NR_CPUS) - 1,
 
 	__end_of_permanent_fixed_addresses,
 
@@ -185,5 +198,25 @@ void __init *early_memremap_decrypted_wp(resource_size_t phys_addr,
 void __early_set_fixmap(enum fixed_addresses idx,
 			phys_addr_t phys, pgprot_t flags);
 
+static inline unsigned int __get_cpu_entry_area_page_index(int cpu, int page)
+{
+	BUILD_BUG_ON(sizeof(struct cpu_entry_area) % PAGE_SIZE != 0);
+
+	return FIX_CPU_ENTRY_AREA_BOTTOM - cpu*CPU_ENTRY_AREA_PAGES - page;
+}
+
+#define __get_cpu_entry_area_offset_index(cpu, offset) ({		\
+	BUILD_BUG_ON(offset % PAGE_SIZE != 0);				\
+	__get_cpu_entry_area_page_index(cpu, offset / PAGE_SIZE);	\
+	})
+
+#define get_cpu_entry_area_index(cpu, field)				\
+	__get_cpu_entry_area_offset_index((cpu), offsetof(struct cpu_entry_area, field))
+
+static inline struct cpu_entry_area *get_cpu_entry_area(int cpu)
+{
+	return (struct cpu_entry_area *)__fix_to_virt(__get_cpu_entry_area_page_index(cpu, 0));
+}
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_FIXMAP_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index ccb5f66c4e5b..c0fb3eb37ee0 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -490,12 +490,12 @@ void load_percpu_segment(int cpu)
 	load_stack_canary_segment();
 }
 
-/* Setup the fixmap mapping only once per-processor */
-static inline void setup_fixmap_gdt(int cpu)
+/* Setup the fixmap mappings only once per-processor */
+static inline void setup_cpu_entry_area(int cpu)
 {
 #ifdef CONFIG_X86_64
 	/* On 64-bit systems, we use a read-only fixmap GDT. */
-	pgprot_t prot = PAGE_KERNEL_RO;
+	pgprot_t gdt_prot = PAGE_KERNEL_RO;
 #else
 	/*
 	 * On native 32-bit systems, the GDT cannot be read-only because
@@ -506,11 +506,11 @@ static inline void setup_fixmap_gdt(int cpu)
 	 * On Xen PV, the GDT must be read-only because the hypervisor requires
 	 * it.
 	 */
-	pgprot_t prot = boot_cpu_has(X86_FEATURE_XENPV) ?
+	pgprot_t gdt_prot = boot_cpu_has(X86_FEATURE_XENPV) ?
 		PAGE_KERNEL_RO : PAGE_KERNEL;
 #endif
 
-	__set_fixmap(get_cpu_gdt_ro_index(cpu), get_cpu_gdt_paddr(cpu), prot);
+	__set_fixmap(get_cpu_entry_area_index(cpu, gdt), get_cpu_gdt_paddr(cpu), gdt_prot);
 }
 
 /* Load the original GDT from the per-cpu structure */
@@ -1614,7 +1614,7 @@ void cpu_init(void)
 	if (is_uv_system())
 		uv_cpu_init();
 
-	setup_fixmap_gdt(cpu);
+	setup_cpu_entry_area(cpu);
 	load_fixmap_gdt(cpu);
 }
 
@@ -1676,7 +1676,7 @@ void cpu_init(void)
 
 	fpu__init_cpu();
 
-	setup_fixmap_gdt(cpu);
+	setup_cpu_entry_area(cpu);
 	load_fixmap_gdt(cpu);
 }
 #endif
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 2ccdaba31a07..c2454237fa67 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2272,7 +2272,7 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
 #endif
 	case FIX_TEXT_POKE0:
 	case FIX_TEXT_POKE1:
-	case FIX_GDT_REMAP_BEGIN ... FIX_GDT_REMAP_END:
+	case FIX_CPU_ENTRY_AREA_TOP ... FIX_CPU_ENTRY_AREA_BOTTOM:
 		/* All local page mappings */
 		pte = pfn_pte(phys, prot);
 		break;
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 09/21] x86/kasan/64: Teach KASAN about the cpu_entry_area
  2017-11-27 10:45 [PATCH 00/21] Preparatory patches for x86 KAISER support Ingo Molnar
                   ` (7 preceding siblings ...)
  2017-11-27 10:45 ` [PATCH 08/21] x86/mm/fixmap: Generalize the GDT fixmap mechanism, introduce 'struct cpu_entry_area' Ingo Molnar
@ 2017-11-27 10:45 ` Ingo Molnar
  2017-11-27 10:45 ` [PATCH 10/21] x86/entry: Fix assumptions that the HW TSS is at the beginning of cpu_tss Ingo Molnar
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 35+ messages in thread
From: Ingo Molnar @ 2017-11-27 10:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

The cpu_entry_area will contain stacks.  Make sure that KASAN has
appropriate shadow mappings for them.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: kasan-dev@googlegroups.com
Link: https://lkml.kernel.org/r/8407adf9126440d6467dade88fdb3e3b75fc1019.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/kasan_init_64.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 99dfed6dfef8..9ec70d780f1f 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -277,6 +277,7 @@ void __init kasan_early_init(void)
 void __init kasan_init(void)
 {
 	int i;
+	void *shadow_cpu_entry_begin, *shadow_cpu_entry_end;
 
 #ifdef CONFIG_KASAN_INLINE
 	register_die_notifier(&kasan_die_notifier);
@@ -329,8 +330,23 @@ void __init kasan_init(void)
 			      (unsigned long)kasan_mem_to_shadow(_end),
 			      early_pfn_to_nid(__pa(_stext)));
 
+	shadow_cpu_entry_begin = (void *)__fix_to_virt(FIX_CPU_ENTRY_AREA_BOTTOM);
+	shadow_cpu_entry_begin = kasan_mem_to_shadow(shadow_cpu_entry_begin);
+	shadow_cpu_entry_begin = (void *)round_down((unsigned long)shadow_cpu_entry_begin,
+						PAGE_SIZE);
+
+	shadow_cpu_entry_end = (void *)(__fix_to_virt(FIX_CPU_ENTRY_AREA_TOP) + PAGE_SIZE);
+	shadow_cpu_entry_end = kasan_mem_to_shadow(shadow_cpu_entry_end);
+	shadow_cpu_entry_end = (void *)round_up((unsigned long)shadow_cpu_entry_end,
+					PAGE_SIZE);
+
 	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
-			(void *)KASAN_SHADOW_END);
+				   shadow_cpu_entry_begin);
+
+	kasan_populate_shadow((unsigned long)shadow_cpu_entry_begin,
+			      (unsigned long)shadow_cpu_entry_end, 0);
+
+	kasan_populate_zero_shadow(shadow_cpu_entry_end, (void *)KASAN_SHADOW_END);
 
 	load_cr3(init_top_pgt);
 	__flush_tlb_all();
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 10/21] x86/entry: Fix assumptions that the HW TSS is at the beginning of cpu_tss
  2017-11-27 10:45 [PATCH 00/21] Preparatory patches for x86 KAISER support Ingo Molnar
                   ` (8 preceding siblings ...)
  2017-11-27 10:45 ` [PATCH 09/21] x86/kasan/64: Teach KASAN about the cpu_entry_area Ingo Molnar
@ 2017-11-27 10:45 ` Ingo Molnar
  2017-11-27 10:45 ` [PATCH 11/21] x86/dumpstack: Handle stack overflow on all stacks Ingo Molnar
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 35+ messages in thread
From: Ingo Molnar @ 2017-11-27 10:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

A future patch will move SYSENTER_stack to the beginning of cpu_tss
to help detect overflow.  Before this can happen, fix several code
paths that hardcode assumptions about the old layout.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Link: https://lkml.kernel.org/r/d40a2c5ae4539d64090849a374f3169ec492f4e2.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/desc.h      |  2 +-
 arch/x86/include/asm/processor.h |  9 +++++++--
 arch/x86/kernel/cpu/common.c     |  8 ++++----
 arch/x86/kernel/doublefault.c    | 36 +++++++++++++++++-------------------
 arch/x86/power/cpu.c             | 13 +++++++------
 5 files changed, 36 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 194ffab00ebe..aab4fe9f49f8 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -178,7 +178,7 @@ static inline void set_tssldt_descriptor(void *d, unsigned long addr,
 #endif
 }
 
-static inline void __set_tss_desc(unsigned cpu, unsigned int entry, void *addr)
+static inline void __set_tss_desc(unsigned cpu, unsigned int entry, struct x86_hw_tss *addr)
 {
 	struct desc_struct *d = get_cpu_gdt_rw(cpu);
 	tss_desc tss;
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 504a3bb4d5f0..6f0391f5d307 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -163,7 +163,7 @@ enum cpuid_regs_idx {
 extern struct cpuinfo_x86	boot_cpu_data;
 extern struct cpuinfo_x86	new_cpu_data;
 
-extern struct tss_struct	doublefault_tss;
+extern struct x86_hw_tss	doublefault_tss;
 extern __u32			cpu_caps_cleared[NCAPINTS];
 extern __u32			cpu_caps_set[NCAPINTS];
 
@@ -253,6 +253,11 @@ static inline void load_cr3(pgd_t *pgdir)
 	write_cr3(__sme_pa(pgdir));
 }
 
+/*
+ * Note that while the legacy 'TSS' name comes from 'Task State Segment',
+ * on modern x86 CPUs the TSS also holds information important to 64-bit mode,
+ * unrelated to the task-switch mechanism:
+ */
 #ifdef CONFIG_X86_32
 /* This is the TSS defined by the hardware. */
 struct x86_hw_tss {
@@ -323,7 +328,7 @@ struct x86_hw_tss {
 #define IO_BITMAP_BITS			65536
 #define IO_BITMAP_BYTES			(IO_BITMAP_BITS/8)
 #define IO_BITMAP_LONGS			(IO_BITMAP_BYTES/sizeof(long))
-#define IO_BITMAP_OFFSET		offsetof(struct tss_struct, io_bitmap)
+#define IO_BITMAP_OFFSET		(offsetof(struct tss_struct, io_bitmap) - offsetof(struct tss_struct, x86_tss))
 #define INVALID_IO_BITMAP_OFFSET	0x8000
 
 struct tss_struct {
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c0fb3eb37ee0..62cdc10a7d94 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1582,7 +1582,7 @@ void cpu_init(void)
 		}
 	}
 
-	t->x86_tss.io_bitmap_base = offsetof(struct tss_struct, io_bitmap);
+	t->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET;
 
 	/*
 	 * <= is required because the CPU will access up to
@@ -1601,7 +1601,7 @@ void cpu_init(void)
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
 	 */
-	set_tss_desc(cpu, t);
+	set_tss_desc(cpu, &t->x86_tss);
 	load_TR_desc();
 
 	load_mm_ldt(&init_mm);
@@ -1659,12 +1659,12 @@ void cpu_init(void)
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
 	 */
-	set_tss_desc(cpu, t);
+	set_tss_desc(cpu, &t->x86_tss);
 	load_TR_desc();
 
 	load_mm_ldt(&init_mm);
 
-	t->x86_tss.io_bitmap_base = offsetof(struct tss_struct, io_bitmap);
+	t->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET;
 
 #ifdef CONFIG_DOUBLEFAULT
 	/* Set up doublefault TSS pointer in the GDT */
diff --git a/arch/x86/kernel/doublefault.c b/arch/x86/kernel/doublefault.c
index 0e662c55ae90..0b8cedb20d6d 100644
--- a/arch/x86/kernel/doublefault.c
+++ b/arch/x86/kernel/doublefault.c
@@ -50,25 +50,23 @@ static void doublefault_fn(void)
 		cpu_relax();
 }
 
-struct tss_struct doublefault_tss __cacheline_aligned = {
-	.x86_tss = {
-		.sp0		= STACK_START,
-		.ss0		= __KERNEL_DS,
-		.ldt		= 0,
-		.io_bitmap_base	= INVALID_IO_BITMAP_OFFSET,
-
-		.ip		= (unsigned long) doublefault_fn,
-		/* 0x2 bit is always set */
-		.flags		= X86_EFLAGS_SF | 0x2,
-		.sp		= STACK_START,
-		.es		= __USER_DS,
-		.cs		= __KERNEL_CS,
-		.ss		= __KERNEL_DS,
-		.ds		= __USER_DS,
-		.fs		= __KERNEL_PERCPU,
-
-		.__cr3		= __pa_nodebug(swapper_pg_dir),
-	}
+struct x86_hw_tss doublefault_tss __cacheline_aligned = {
+	.sp0		= STACK_START,
+	.ss0		= __KERNEL_DS,
+	.ldt		= 0,
+	.io_bitmap_base	= INVALID_IO_BITMAP_OFFSET,
+
+	.ip		= (unsigned long) doublefault_fn,
+	/* 0x2 bit is always set */
+	.flags		= X86_EFLAGS_SF | 0x2,
+	.sp		= STACK_START,
+	.es		= __USER_DS,
+	.cs		= __KERNEL_CS,
+	.ss		= __KERNEL_DS,
+	.ds		= __USER_DS,
+	.fs		= __KERNEL_PERCPU,
+
+	.__cr3		= __pa_nodebug(swapper_pg_dir),
 };
 
 /* dummy for do_double_fault() call */
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index 84fcfde53f8f..50593e138281 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -165,12 +165,13 @@ static void fix_processor_context(void)
 	struct desc_struct *desc = get_cpu_gdt_rw(cpu);
 	tss_desc tss;
 #endif
-	set_tss_desc(cpu, t);	/*
-				 * This just modifies memory; should not be
-				 * necessary. But... This is necessary, because
-				 * 386 hardware has concept of busy TSS or some
-				 * similar stupidity.
-				 */
+
+	/*
+	 * This just modifies memory; should not be necessary. But... This is
+	 * necessary, because 386 hardware has concept of busy TSS or some
+	 * similar stupidity.
+	 */
+	set_tss_desc(cpu, &t->x86_tss);
 
 #ifdef CONFIG_X86_64
 	memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc));
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 11/21] x86/dumpstack: Handle stack overflow on all stacks
  2017-11-27 10:45 [PATCH 00/21] Preparatory patches for x86 KAISER support Ingo Molnar
                   ` (9 preceding siblings ...)
  2017-11-27 10:45 ` [PATCH 10/21] x86/entry: Fix assumptions that the HW TSS is at the beginning of cpu_tss Ingo Molnar
@ 2017-11-27 10:45 ` Ingo Molnar
  2017-11-27 19:26   ` Linus Torvalds
  2017-11-27 10:45 ` [PATCH 12/21] x86/entry: Move SYSENTER_stack to the beginning of struct tss_struct Ingo Molnar
                   ` (9 subsequent siblings)
  20 siblings, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2017-11-27 10:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

We currently special-case stack overflow on the task stack.  We're
going to start putting special stacks in the fixmap with a custom
layout, so they'll have guard pages, too.  Teach the unwinder to be
able to unwind an overflow of any of the stacks.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Link: https://lkml.kernel.org/r/5454bb325cb30a70457a47b50f22317be65eba7d.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/dumpstack.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index a49d3c86d324..de3083a1f820 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -112,24 +112,28 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	 * - task stack
 	 * - interrupt stack
 	 * - HW exception stacks (double fault, nmi, debug, mce)
+	 * - SYSENTER stack
 	 *
-	 * x86-32 can have up to three stacks:
+	 * x86-32 can have up to four stacks:
 	 * - task stack
 	 * - softirq stack
 	 * - hardirq stack
+	 * - SYSENTER stack
 	 */
 	for (regs = NULL; stack; stack = PTR_ALIGN(stack_info.next_sp, sizeof(long))) {
 		const char *stack_name;
 
-		/*
-		 * If we overflowed the task stack into a guard page, jump back
-		 * to the bottom of the usable stack.
-		 */
-		if (task_stack_page(task) - (void *)stack < PAGE_SIZE)
-			stack = task_stack_page(task);
-
-		if (get_stack_info(stack, task, &stack_info, &visit_mask))
-			break;
+		if (get_stack_info(stack, task, &stack_info, &visit_mask)) {
+			/*
+			 * We weren't on a valid stack.  It's possible that
+			 * we overflowed a valid stack into a guard page.
+			 * See if the next page up is valid so that we can
+			 * generate some kind of backtrace if this happens.
+			 */
+			stack = (unsigned long *)PAGE_ALIGN((unsigned long)stack);
+			if (get_stack_info(stack, task, &stack_info, &visit_mask))
+				break;
+		}
 
 		stack_name = stack_type_name(stack_info.type);
 		if (stack_name)
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 12/21] x86/entry: Move SYSENTER_stack to the beginning of struct tss_struct
  2017-11-27 10:45 [PATCH 00/21] Preparatory patches for x86 KAISER support Ingo Molnar
                   ` (10 preceding siblings ...)
  2017-11-27 10:45 ` [PATCH 11/21] x86/dumpstack: Handle stack overflow on all stacks Ingo Molnar
@ 2017-11-27 10:45 ` Ingo Molnar
  2017-11-27 10:45 ` [PATCH 13/21] x86/entry: Remap the TSS into the CPU entry area Ingo Molnar
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 35+ messages in thread
From: Ingo Molnar @ 2017-11-27 10:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

SYSENTER_stack should have reliable overflow detection, which
means that it needs to be at the bottom of a page, not the top.
Move it to the beginning of struct tss_struct and page-align it.

Also add an assertion to make sure that the fixed hardware TSS
doesn't cross a page boundary.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Link: https://lkml.kernel.org/r/8de9901e7c3a6aa8fac95b37b9c7b96f1900f11a.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/processor.h | 21 ++++++++++++---------
 arch/x86/kernel/cpu/common.c     | 21 +++++++++++++++++++++
 2 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 6f0391f5d307..522bb5f1a7ff 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -333,7 +333,16 @@ struct x86_hw_tss {
 
 struct tss_struct {
 	/*
-	 * The hardware state:
+	 * Space for the temporary SYSENTER stack, used for SYSENTER
+	 * and the entry trampoline as well.
+	 */
+	unsigned long		SYSENTER_stack_canary;
+	unsigned long		SYSENTER_stack[64];
+
+	/*
+	 * The fixed hardware portion.  This must not cross a page boundary
+	 * at risk of violating the SDM's advice and potentially triggering
+	 * errata.
 	 */
 	struct x86_hw_tss	x86_tss;
 
@@ -344,15 +353,9 @@ struct tss_struct {
 	 * be within the limit.
 	 */
 	unsigned long		io_bitmap[IO_BITMAP_LONGS + 1];
+} __aligned(PAGE_SIZE);
 
-	/*
-	 * Space for the temporary SYSENTER stack.
-	 */
-	unsigned long		SYSENTER_stack_canary;
-	unsigned long		SYSENTER_stack[64];
-} ____cacheline_aligned;
-
-DECLARE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss);
+DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss);
 
 /*
  * sizeof(unsigned long) coming from an extra "long" at the end
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 62cdc10a7d94..d173f6013467 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -511,6 +511,27 @@ static inline void setup_cpu_entry_area(int cpu)
 #endif
 
 	__set_fixmap(get_cpu_entry_area_index(cpu, gdt), get_cpu_gdt_paddr(cpu), gdt_prot);
+
+	/*
+	 * The Intel SDM says (Volume 3, 7.2.1):
+	 *
+	 *  Avoid placing a page boundary in the part of the TSS that the
+	 *  processor reads during a task switch (the first 104 bytes). The
+	 *  processor may not correctly perform address translations if a
+	 *  boundary occurs in this area. During a task switch, the processor
+	 *  reads and writes into the first 104 bytes of each TSS (using
+	 *  contiguous physical addresses beginning with the physical address
+	 *  of the first byte of the TSS). So, after TSS access begins, if
+	 *  part of the 104 bytes is not physically contiguous, the processor
+	 *  will access incorrect information without generating a page-fault
+	 *  exception.
+	 *
+	 * There are also a lot of errata involving the TSS spanning a page
+	 * boundary.  Assert that we're not doing that.
+	 */
+	BUILD_BUG_ON((offsetof(struct tss_struct, x86_tss) ^
+		      offsetofend(struct tss_struct, x86_tss)) & PAGE_MASK);
+
 }
 
 /* Load the original GDT from the per-cpu structure */
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 13/21] x86/entry: Remap the TSS into the CPU entry area
  2017-11-27 10:45 [PATCH 00/21] Preparatory patches for x86 KAISER support Ingo Molnar
                   ` (11 preceding siblings ...)
  2017-11-27 10:45 ` [PATCH 12/21] x86/entry: Move SYSENTER_stack to the beginning of struct tss_struct Ingo Molnar
@ 2017-11-27 10:45 ` Ingo Molnar
  2017-11-27 10:45 ` [PATCH 14/21] x86/entry/64: Separate cpu_current_top_of_stack from TSS.sp0 Ingo Molnar
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 35+ messages in thread
From: Ingo Molnar @ 2017-11-27 10:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

This has a secondary purpose: it puts the entry stack into a region
with a well-controlled layout.  A subsequent patch will take
advantage of this to streamline the SYSCALL entry code to be able to
find it more easily.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Link: https://lkml.kernel.org/r/cdcba7e1e82122461b3ca36bb3ef6713ba605e35.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/entry_32.S     |  6 ++++--
 arch/x86/include/asm/fixmap.h |  7 +++++++
 arch/x86/kernel/asm-offsets.c |  3 +++
 arch/x86/kernel/cpu/common.c  | 38 ++++++++++++++++++++++++++++++++------
 arch/x86/kernel/dumpstack.c   |  3 ++-
 arch/x86/power/cpu.c          | 11 ++++++-----
 6 files changed, 54 insertions(+), 14 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 4838037f97f6..0ab316c46806 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -941,7 +941,8 @@ ENTRY(debug)
 	movl	%esp, %eax			# pt_regs pointer
 
 	/* Are we currently on the SYSENTER stack? */
-	PER_CPU(cpu_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx)
+	movl	PER_CPU_VAR(cpu_entry_area), %ecx
+	addl	$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Ldebug_from_sysenter_stack
@@ -984,7 +985,8 @@ ENTRY(nmi)
 	movl	%esp, %eax			# pt_regs pointer
 
 	/* Are we currently on the SYSENTER stack? */
-	PER_CPU(cpu_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx)
+	movl	PER_CPU_VAR(cpu_entry_area), %ecx
+	addl	$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Lnmi_from_sysenter_stack
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index b5037e426f1f..c8aaf4b56699 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -54,6 +54,13 @@ extern unsigned long __FIXADDR_TOP;
  */
 struct cpu_entry_area {
 	char gdt[PAGE_SIZE];
+
+	/*
+	 * The GDT is just below cpu_tss and thus serves (on x86_64) as a
+	 * a read-only guard page for the SYSENTER stack at the bottom
+	 * of the TSS region.
+	 */
+	struct tss_struct tss;
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index b275863128eb..55858b277cf6 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -98,4 +98,7 @@ void common(void) {
 	OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
 	/* Size of SYSENTER_stack */
 	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
+
+	/* Layout info for cpu_entry_area */
+	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
 }
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index d173f6013467..c67742df569a 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -490,6 +490,19 @@ void load_percpu_segment(int cpu)
 	load_stack_canary_segment();
 }
 
+static void set_percpu_fixmap_pages(int fixmap_index, void *ptr, int pages, pgprot_t prot)
+{
+	int i;
+
+	for (i = 0; i < pages; i++)
+		__set_fixmap(fixmap_index - i, per_cpu_ptr_to_phys(ptr + i*PAGE_SIZE), prot);
+}
+
+#ifdef CONFIG_X86_32
+/* The 32-bit entry code needs to find cpu_entry_area. */
+DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
+#endif
+
 /* Setup the fixmap mappings only once per-processor */
 static inline void setup_cpu_entry_area(int cpu)
 {
@@ -531,7 +544,15 @@ static inline void setup_cpu_entry_area(int cpu)
 	 */
 	BUILD_BUG_ON((offsetof(struct tss_struct, x86_tss) ^
 		      offsetofend(struct tss_struct, x86_tss)) & PAGE_MASK);
+	BUILD_BUG_ON(sizeof(struct tss_struct) % PAGE_SIZE != 0);
+	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, tss),
+				&per_cpu(cpu_tss, cpu),
+				sizeof(struct tss_struct) / PAGE_SIZE,
+				PAGE_KERNEL);
 
+#ifdef CONFIG_X86_32
+	this_cpu_write(cpu_entry_area, get_cpu_entry_area(cpu));
+#endif
 }
 
 /* Load the original GDT from the per-cpu structure */
@@ -1282,7 +1303,8 @@ void enable_sep_cpu(void)
 	wrmsr(MSR_IA32_SYSENTER_CS, tss->x86_tss.ss1, 0);
 
 	wrmsr(MSR_IA32_SYSENTER_ESP,
-	      (unsigned long)tss + offsetofend(struct tss_struct, SYSENTER_stack),
+	      (unsigned long)&get_cpu_entry_area(cpu)->tss +
+	      offsetofend(struct tss_struct, SYSENTER_stack),
 	      0);
 
 	wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long)entry_SYSENTER_32, 0);
@@ -1395,6 +1417,8 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
+	int cpu = smp_processor_id();
+
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
 	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
 
@@ -1408,7 +1432,7 @@ void syscall_init(void)
 	 */
 	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
 	wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
-		    (unsigned long)this_cpu_ptr(&cpu_tss) +
+		    (unsigned long)&get_cpu_entry_area(cpu)->tss +
 		    offsetofend(struct tss_struct, SYSENTER_stack));
 	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
 #else
@@ -1618,11 +1642,13 @@ void cpu_init(void)
 	initialize_tlbstate_and_flush();
 	enter_lazy_tlb(&init_mm, me);
 
+	setup_cpu_entry_area(cpu);
+
 	/*
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
 	 */
-	set_tss_desc(cpu, &t->x86_tss);
+	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
 
 	load_mm_ldt(&init_mm);
@@ -1635,7 +1661,6 @@ void cpu_init(void)
 	if (is_uv_system())
 		uv_cpu_init();
 
-	setup_cpu_entry_area(cpu);
 	load_fixmap_gdt(cpu);
 }
 
@@ -1676,11 +1701,13 @@ void cpu_init(void)
 	initialize_tlbstate_and_flush();
 	enter_lazy_tlb(&init_mm, curr);
 
+	setup_cpu_entry_area(cpu);
+
 	/*
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
 	 */
-	set_tss_desc(cpu, &t->x86_tss);
+	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
 
 	load_mm_ldt(&init_mm);
@@ -1697,7 +1724,6 @@ void cpu_init(void)
 
 	fpu__init_cpu();
 
-	setup_cpu_entry_area(cpu);
 	load_fixmap_gdt(cpu);
 }
 #endif
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index de3083a1f820..47ea8e0daf81 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -45,7 +45,8 @@ bool in_task_stack(unsigned long *stack, struct task_struct *task,
 
 bool in_sysenter_stack(unsigned long *stack, struct stack_info *info)
 {
-	struct tss_struct *tss = this_cpu_ptr(&cpu_tss);
+	int cpu = smp_processor_id();
+	struct tss_struct *tss = &get_cpu_entry_area(cpu)->tss;
 
 	/* Treat the canary as part of the stack for unwinding purposes. */
 	void *begin = &tss->SYSENTER_stack_canary;
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index 50593e138281..04d5157fe7f8 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -160,18 +160,19 @@ static void do_fpu_end(void)
 static void fix_processor_context(void)
 {
 	int cpu = smp_processor_id();
-	struct tss_struct *t = &per_cpu(cpu_tss, cpu);
 #ifdef CONFIG_X86_64
 	struct desc_struct *desc = get_cpu_gdt_rw(cpu);
 	tss_desc tss;
 #endif
 
 	/*
-	 * This just modifies memory; should not be necessary. But... This is
-	 * necessary, because 386 hardware has concept of busy TSS or some
-	 * similar stupidity.
+	 * We need to reload TR, which requires that we change the
+	 * GDT entry to indicate "available" first.
+	 *
+	 * XXX: This could probably all be replaced by a call to
+	 * force_reload_TR().
 	 */
-	set_tss_desc(cpu, &t->x86_tss);
+	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 
 #ifdef CONFIG_X86_64
 	memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc));
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 14/21] x86/entry/64: Separate cpu_current_top_of_stack from TSS.sp0
  2017-11-27 10:45 [PATCH 00/21] Preparatory patches for x86 KAISER support Ingo Molnar
                   ` (12 preceding siblings ...)
  2017-11-27 10:45 ` [PATCH 13/21] x86/entry: Remap the TSS into the CPU entry area Ingo Molnar
@ 2017-11-27 10:45 ` Ingo Molnar
  2017-11-27 10:45 ` [PATCH 15/21] x86/espfix/64: Stop assuming that pt_regs is on the entry stack Ingo Molnar
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 35+ messages in thread
From: Ingo Molnar @ 2017-11-27 10:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

On 64-bit kernels, we used to assume that TSS.sp0 was the current
top of stack.  With the addition of an entry trampoline, this will
no longer be the case.  Store the current top of stack in TSS.sp1,
which is otherwise unused but shares the same cacheline.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Link: https://lkml.kernel.org/r/f56634c746a2926eb7bae61e7b80ed51a1940769.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/processor.h   | 18 +++++++++++++-----
 arch/x86/include/asm/thread_info.h |  2 +-
 arch/x86/kernel/asm-offsets_64.c   |  1 +
 arch/x86/kernel/process.c          | 10 ++++++++++
 arch/x86/kernel/process_64.c       |  1 +
 5 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 522bb5f1a7ff..2f6d5634da55 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -310,7 +310,13 @@ struct x86_hw_tss {
 struct x86_hw_tss {
 	u32			reserved1;
 	u64			sp0;
+
+	/*
+	 * We store cpu_current_top_of_stack in sp1 so it's always accessible.
+	 * Linux does not use ring 1, so sp1 is not otherwise needed.
+	 */
 	u64			sp1;
+
 	u64			sp2;
 	u64			reserved2;
 	u64			ist[7];
@@ -369,6 +375,8 @@ DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss);
 
 #ifdef CONFIG_X86_32
 DECLARE_PER_CPU(unsigned long, cpu_current_top_of_stack);
+#else
+#define cpu_current_top_of_stack cpu_tss.x86_tss.sp1
 #endif
 
 /*
@@ -540,12 +548,12 @@ static inline void native_swapgs(void)
 
 static inline unsigned long current_top_of_stack(void)
 {
-#ifdef CONFIG_X86_64
-	return this_cpu_read_stable(cpu_tss.x86_tss.sp0);
-#else
-	/* sp0 on x86_32 is special in and around vm86 mode. */
+	/*
+	 *  We can't read directly from tss.sp0: sp0 on x86_32 is special in
+	 *  and around vm86 mode and sp0 on x86_64 is special because of the
+	 *  entry trampoline.
+	 */
 	return this_cpu_read_stable(cpu_current_top_of_stack);
-#endif
 }
 
 static inline bool on_thread_stack(void)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 70f425947dc5..44a04999791e 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -207,7 +207,7 @@ static inline int arch_within_stack_frames(const void * const stack,
 #else /* !__ASSEMBLY__ */
 
 #ifdef CONFIG_X86_64
-# define cpu_current_top_of_stack (cpu_tss + TSS_sp0)
+# define cpu_current_top_of_stack (cpu_tss + TSS_sp1)
 #endif
 
 #endif
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index 630212fa9b9d..ad649a8a74a0 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -63,6 +63,7 @@ int main(void)
 
 	OFFSET(TSS_ist, tss_struct, x86_tss.ist);
 	OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
+	OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
 	BLANK();
 
 #ifdef CONFIG_CC_STACKPROTECTOR
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index ceedb9c431cc..2cf7e2aa1730 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -56,6 +56,16 @@ __visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
 		 * Poison it.
 		 */
 		.sp0 = (1UL << (BITS_PER_LONG-1)) + 1,
+
+#ifdef CONFIG_X86_64
+		/*
+		 * .sp1 is cpu_current_top_of_stack.  The init task never
+		 * runs user code, but cpu_current_top_of_stack should still
+		 * be well defined before the first context switch.
+		 */
+		.sp1 = TOP_OF_INIT_STACK,
+#endif
+
 #ifdef CONFIG_X86_32
 		.ss0 = __KERNEL_DS,
 		.ss1 = __KERNEL_CS,
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 01b119bebb68..157f81816915 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -461,6 +461,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	 * Switch the PDA and FPU contexts.
 	 */
 	this_cpu_write(current_task, next_p);
+	this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));
 
 	/* Reload sp0. */
 	update_sp0(next_p);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 15/21] x86/espfix/64: Stop assuming that pt_regs is on the entry stack
  2017-11-27 10:45 [PATCH 00/21] Preparatory patches for x86 KAISER support Ingo Molnar
                   ` (13 preceding siblings ...)
  2017-11-27 10:45 ` [PATCH 14/21] x86/entry/64: Separate cpu_current_top_of_stack from TSS.sp0 Ingo Molnar
@ 2017-11-27 10:45 ` Ingo Molnar
  2017-11-27 10:45 ` [PATCH 16/21] x86/entry/64: Use a per-CPU trampoline stack for IDT entries Ingo Molnar
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 35+ messages in thread
From: Ingo Molnar @ 2017-11-27 10:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

When we start using an entry trampoline, a #GP from userspace will
be delivered on the entry stack, not on the task stack.  Fix the
espfix64 #DF fixup to set up #GP according to TSS.SP0, rather than
assuming that pt_regs + 1 == SP0.  This won't change anything
without an entry stack, but it will make the code continue to work
when an entry stack is added.

While we're at it, improve the comments to explain what's actually
going on.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Link: https://lkml.kernel.org/r/b1ef4136616c6bd2a75d1fd2736d1d54437d65a8.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/traps.c | 37 ++++++++++++++++++++++++++++---------
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 2008dd0f8ccb..7d74d84896c6 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -349,9 +349,15 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 
 	/*
 	 * If IRET takes a non-IST fault on the espfix64 stack, then we
-	 * end up promoting it to a doublefault.  In that case, modify
-	 * the stack to make it look like we just entered the #GP
-	 * handler from user space, similar to bad_iret.
+	 * end up promoting it to a doublefault.  In that case, take
+	 * advantage of the fact that we're not using the normal (TSS.sp0)
+	 * stack right now.  We can write a fake #GP(0) frame at TSS.sp0
+	 * and then modify our own IRET frame so that, when we return,
+	 * we land directly at the #GP(0) vector with the stack already
+	 * set up according to its expectations.
+	 *
+	 * The net result is that our #GP handler will think that we
+	 * entered from usermode with the bad user context.
 	 *
 	 * No need for ist_enter here because we don't use RCU.
 	 */
@@ -359,13 +365,26 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 		regs->cs == __KERNEL_CS &&
 		regs->ip == (unsigned long)native_irq_return_iret)
 	{
-		struct pt_regs *normal_regs = task_pt_regs(current);
+		struct pt_regs *gpregs = (struct pt_regs *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
+
+		/*
+		 * regs->sp points to the failing IRET frame on the
+		 * ESPFIX64 stack.  Copy it to the entry stack.  This fills
+		 * in gpregs->ss through gpregs->ip.
+		 *
+		 */
+		memmove(&gpregs->ip, (void *)regs->sp, 5*8);
+		gpregs->orig_ax = 0;  /* Missing (lost) #GP error code */
 
-		/* Fake a #GP(0) from userspace. */
-		memmove(&normal_regs->ip, (void *)regs->sp, 5*8);
-		normal_regs->orig_ax = 0;  /* Missing (lost) #GP error code */
+		/*
+		 * Adjust our frame so that we return straight to the #GP
+		 * vector with the expected RSP value.  This is safe because
+		 * we won't enable interupts or schedule before we invoke
+		 * general_protection, so nothing will clobber the stack
+		 * frame we just set up.
+		 */
 		regs->ip = (unsigned long)general_protection;
-		regs->sp = (unsigned long)&normal_regs->orig_ax;
+		regs->sp = (unsigned long)&gpregs->orig_ax;
 
 		return;
 	}
@@ -390,7 +409,7 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 	 *
 	 *   Processors update CR2 whenever a page fault is detected. If a
 	 *   second page fault occurs while an earlier page fault is being
-	 *   deliv- ered, the faulting linear address of the second fault will
+	 *   delivered, the faulting linear address of the second fault will
 	 *   overwrite the contents of CR2 (replacing the previous
 	 *   address). These updates to CR2 occur even if the page fault
 	 *   results in a double fault or occurs during the delivery of a
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 16/21] x86/entry/64: Use a per-CPU trampoline stack for IDT entries
  2017-11-27 10:45 [PATCH 00/21] Preparatory patches for x86 KAISER support Ingo Molnar
                   ` (14 preceding siblings ...)
  2017-11-27 10:45 ` [PATCH 15/21] x86/espfix/64: Stop assuming that pt_regs is on the entry stack Ingo Molnar
@ 2017-11-27 10:45 ` Ingo Molnar
  2017-12-01 17:06   ` Dave Hansen
  2017-11-27 10:45 ` [PATCH 17/21] x86/entry/64: Return to userspace from the trampoline stack Ingo Molnar
                   ` (4 subsequent siblings)
  20 siblings, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2017-11-27 10:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

Historically, IDT entries from usermode have always gone directly
to the running task's kernel stack.  Rearrange it so that we enter on
a per-CPU trampoline stack and then manually switch to the task's stack.
This touches a couple of extra cachelines, but it gives us a chance
to run some code before we touch the kernel stack.

The asm isn't exactly beautiful, but I think that fully refactoring
it can wait.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Link: https://lkml.kernel.org/r/fa3958723a1a85baeaf309c735b775841205800e.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/entry_64.S        | 67 ++++++++++++++++++++++++++++++----------
 arch/x86/entry/entry_64_compat.S |  5 ++-
 arch/x86/include/asm/switch_to.h |  3 +-
 arch/x86/include/asm/traps.h     |  1 -
 arch/x86/kernel/cpu/common.c     |  6 ++--
 arch/x86/kernel/traps.c          | 12 +++----
 6 files changed, 65 insertions(+), 29 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index f81d50d7ceac..7d47199f405f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -563,6 +563,13 @@ END(irq_entries_start)
 /* 0(%rsp): ~(interrupt number) */
 	.macro interrupt func
 	cld
+
+	testb	$3, CS-ORIG_RAX(%rsp)
+	jz	1f
+	SWAPGS
+	call	switch_to_thread_stack
+1:
+
 	ALLOC_PT_GPREGS_ON_STACK
 	SAVE_C_REGS
 	SAVE_EXTRA_REGS
@@ -572,12 +579,8 @@ END(irq_entries_start)
 	jz	1f
 
 	/*
-	 * IRQ from user mode.  Switch to kernel gsbase and inform context
-	 * tracking that we're in kernel mode.
-	 */
-	SWAPGS
-
-	/*
+	 * IRQ from user mode.
+	 *
 	 * We need to tell lockdep that IRQs are off.  We can't do this until
 	 * we fix gsbase, and we should do it before enter_from_user_mode
 	 * (which can take locks).  Since TRACE_IRQS_OFF idempotent,
@@ -831,6 +834,32 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
  */
 #define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss) + (TSS_ist + ((x) - 1) * 8)
 
+/*
+ * Switch to the thread stack.  This is called with the IRET frame and
+ * orig_ax on the stack.  (That is, RDI..R12 are not on the stack and
+ * space has not been allocated for them.)
+ */
+ENTRY(switch_to_thread_stack)
+	UNWIND_HINT_FUNC
+
+	pushq	%rdi
+	movq	%rsp, %rdi
+	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
+	UNWIND_HINT sp_offset=16 sp_reg=ORC_REG_DI
+
+	pushq	7*8(%rdi)		/* regs->ss */
+	pushq	6*8(%rdi)		/* regs->rsp */
+	pushq	5*8(%rdi)		/* regs->eflags */
+	pushq	4*8(%rdi)		/* regs->cs */
+	pushq	3*8(%rdi)		/* regs->ip */
+	pushq	2*8(%rdi)		/* regs->orig_ax */
+	pushq	8(%rdi)			/* return address */
+	UNWIND_HINT_FUNC
+
+	movq	(%rdi), %rdi
+	ret
+END(switch_to_thread_stack)
+
 .macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
 ENTRY(\sym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
@@ -848,11 +877,12 @@ ENTRY(\sym)
 
 	ALLOC_PT_GPREGS_ON_STACK
 
-	.if \paranoid
-	.if \paranoid == 1
+	.if \paranoid < 2
 	testb	$3, CS(%rsp)			/* If coming from userspace, switch stacks */
-	jnz	1f
+	jnz	.Lfrom_usermode_switch_stack_\@
 	.endif
+
+	.if \paranoid
 	call	paranoid_entry
 	.else
 	call	error_entry
@@ -894,20 +924,15 @@ ENTRY(\sym)
 	jmp	error_exit
 	.endif
 
-	.if \paranoid == 1
+	.if \paranoid < 2
 	/*
-	 * Paranoid entry from userspace.  Switch stacks and treat it
+	 * Entry from userspace.  Switch stacks and treat it
 	 * as a normal entry.  This means that paranoid handlers
 	 * run in real process context if user_mode(regs).
 	 */
-1:
+.Lfrom_usermode_switch_stack_\@:
 	call	error_entry
 
-
-	movq	%rsp, %rdi			/* pt_regs pointer */
-	call	sync_regs
-	movq	%rax, %rsp			/* switch stack */
-
 	movq	%rsp, %rdi			/* pt_regs pointer */
 
 	.if \has_error_code
@@ -1170,6 +1195,14 @@ ENTRY(error_entry)
 	SWAPGS
 
 .Lerror_entry_from_usermode_after_swapgs:
+	/* Put us onto the real thread stack. */
+	popq	%r12				/* save return addr in %12 */
+	movq	%rsp, %rdi			/* arg0 = pt_regs pointer */
+	call	sync_regs
+	movq	%rax, %rsp			/* switch stack */
+	ENCODE_FRAME_POINTER
+	pushq	%r12
+
 	/*
 	 * We need to tell lockdep that IRQs are off.  We can't do this until
 	 * we fix gsbase, and we should do it before enter_from_user_mode
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index dcc6987f9bae..95ad40eb7eff 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -306,8 +306,11 @@ ENTRY(entry_INT80_compat)
 	 */
 	movl	%eax, %eax
 
-	/* Construct struct pt_regs on stack (iret frame is already on stack) */
 	pushq	%rax			/* pt_regs->orig_ax */
+
+	/* switch to thread stack expects orig_ax to be pushed */
+	call	switch_to_thread_stack
+
 	pushq	%rdi			/* pt_regs->di */
 	pushq	%rsi			/* pt_regs->si */
 	pushq	%rdx			/* pt_regs->dx */
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 8c6bd6863db9..fab453ad2460 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -90,10 +90,9 @@ static inline void refresh_sysenter_cs(struct thread_struct *thread)
 /* This is used when switching tasks or entering/exiting vm86 mode. */
 static inline void update_sp0(struct task_struct *task)
 {
+	/* On x86_64, sp0 always points to the entry trampoline stack, which is constant: */
 #ifdef CONFIG_X86_32
 	load_sp0(task->thread.sp0);
-#else
-	load_sp0(task_top_of_stack(task));
 #endif
 }
 
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 1fadd310ff68..31051f35cbb7 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -75,7 +75,6 @@ dotraplinkage void do_segment_not_present(struct pt_regs *, long);
 dotraplinkage void do_stack_segment(struct pt_regs *, long);
 #ifdef CONFIG_X86_64
 dotraplinkage void do_double_fault(struct pt_regs *, long);
-asmlinkage struct pt_regs *sync_regs(struct pt_regs *);
 #endif
 dotraplinkage void do_general_protection(struct pt_regs *, long);
 dotraplinkage void do_page_fault(struct pt_regs *, unsigned long);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c67742df569a..7c82a8a8bfda 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1645,11 +1645,13 @@ void cpu_init(void)
 	setup_cpu_entry_area(cpu);
 
 	/*
-	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
-	 * task never enters user mode.
+	 * Initialize the TSS.  sp0 points to the entry trampoline stack
+	 * regardless of what task is running.
 	 */
 	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
+	load_sp0((unsigned long)&get_cpu_entry_area(cpu)->tss +
+		 offsetofend(struct tss_struct, SYSENTER_stack));
 
 	load_mm_ldt(&init_mm);
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 7d74d84896c6..450688c75b3b 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -631,7 +631,7 @@ NOKPROBE_SYMBOL(do_int3);
  */
 asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs)
 {
-	struct pt_regs *regs = task_pt_regs(current);
+	struct pt_regs *regs = (struct pt_regs *)this_cpu_read(cpu_current_top_of_stack) - 1;
 	*regs = *eregs;
 	return regs;
 }
@@ -648,13 +648,13 @@ struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s)
 	/*
 	 * This is called from entry_64.S early in handling a fault
 	 * caused by a bad iret to user mode.  To handle the fault
-	 * correctly, we want move our stack frame to task_pt_regs
-	 * and we want to pretend that the exception came from the
-	 * iret target.
+	 * correctly, we want to move our stack frame to where it would
+	 * be had we entered directly on the entry stack (rather than
+	 * just below the IRET frame) and we want to pretend that the
+	 * exception came from the IRET target.
 	 */
 	struct bad_iret_stack *new_stack =
-		container_of(task_pt_regs(current),
-			     struct bad_iret_stack, regs);
+		(struct bad_iret_stack *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
 
 	/* Copy the IRET target to the new stack. */
 	memmove(&new_stack->regs.ip, (void *)s->regs.sp, 5*8);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 17/21] x86/entry/64: Return to userspace from the trampoline stack
  2017-11-27 10:45 [PATCH 00/21] Preparatory patches for x86 KAISER support Ingo Molnar
                   ` (15 preceding siblings ...)
  2017-11-27 10:45 ` [PATCH 16/21] x86/entry/64: Use a per-CPU trampoline stack for IDT entries Ingo Molnar
@ 2017-11-27 10:45 ` Ingo Molnar
  2017-11-27 10:45 ` [PATCH 18/21] x86/entry/64: Create a per-CPU SYSCALL entry trampoline Ingo Molnar
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 35+ messages in thread
From: Ingo Molnar @ 2017-11-27 10:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

By itself, this is useless.  It gives us the ability to run some final
code before exit that cannnot run on the kernel stack.  This could
include a CR3 switch a la KAISER or some kernel stack erasing, for
example.  (Or even weird things like *changing* which kernel stack
gets used as an ASLR-strengthening mechanism.)

The SYSRET32 path is not covered yet.  It could be in the future or
we could just ignore it and force the slow path if needed.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Link: https://lkml.kernel.org/r/d350017000eed20922c3b2711a2d9229dc809256.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/entry_64.S | 55 +++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 51 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 7d47199f405f..426b8c669d6a 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -330,8 +330,24 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
 	popq	%rsi	/* skip rcx */
 	popq	%rdx
 	popq	%rsi
+
+	/*
+	 * Now all regs are restored except RSP and RDI.
+	 * Save old stack pointer and switch to trampoline stack.
+	 */
+	movq	%rsp, %rdi
+	movq	PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
+
+	pushq	RSP-RDI(%rdi)	/* RSP */
+	pushq	(%rdi)		/* RDI */
+
+	/*
+	 * We are on the trampoline stack.  All regs except RDI are live.
+	 * We can do future final exit work right here.
+	 */
+
 	popq	%rdi
-	movq	RSP-ORIG_RAX(%rsp), %rsp
+	popq	%rsp
 	USERGS_SYSRET64
 END(entry_SYSCALL_64)
 
@@ -633,10 +649,41 @@ GLOBAL(swapgs_restore_regs_and_return_to_usermode)
 	ud2
 1:
 #endif
-	SWAPGS
 	POP_EXTRA_REGS
-	POP_C_REGS
-	addq	$8, %rsp	/* skip regs->orig_ax */
+	popq	%r11
+	popq	%r10
+	popq	%r9
+	popq	%r8
+	popq	%rax
+	popq	%rcx
+	popq	%rdx
+	popq	%rsi
+
+	/*
+	 * The stack is now user RDI, orig_ax, RIP, CS, EFLAGS, RSP, SS.
+	 * Save old stack pointer and switch to trampoline stack.
+	 */
+	movq	%rsp, %rdi
+	movq	PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
+
+	/* Copy the IRET frame to the trampoline stack. */
+	pushq	6*8(%rdi)	/* SS */
+	pushq	5*8(%rdi)	/* RSP */
+	pushq	4*8(%rdi)	/* EFLAGS */
+	pushq	3*8(%rdi)	/* CS */
+	pushq	2*8(%rdi)	/* RIP */
+
+	/* Push user RDI on the trampoline stack. */
+	pushq	(%rdi)
+
+	/*
+	 * We are on the trampoline stack.  All regs except RDI are live.
+	 * We can do future final exit work right here.
+	 */
+
+	/* Restore RDI. */
+	popq	%rdi
+	SWAPGS
 	INTERRUPT_RETURN
 
 
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 18/21] x86/entry/64: Create a per-CPU SYSCALL entry trampoline
  2017-11-27 10:45 [PATCH 00/21] Preparatory patches for x86 KAISER support Ingo Molnar
                   ` (16 preceding siblings ...)
  2017-11-27 10:45 ` [PATCH 17/21] x86/entry/64: Return to userspace from the trampoline stack Ingo Molnar
@ 2017-11-27 10:45 ` Ingo Molnar
  2017-11-27 10:45 ` [PATCH 19/21] x86/entry/64: Move the IST stacks into 'struct cpu_entry_area' Ingo Molnar
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 35+ messages in thread
From: Ingo Molnar @ 2017-11-27 10:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

Handling SYSCALL is tricky: the SYSCALL handler is entered with every
single register (except FLAGS), including RSP, live.  It somehow needs
to set RSP to point to a valid stack, which means it needs to save the
user RSP somewhere and find its own stack pointer.  The canonical way
to do this is with SWAPGS, which lets us access percpu data using the
%gs prefix.

With KAISER-like pagetable switching, this is problematic.  Without a
scratch register, switching CR3 is impossible, so %gs-based percpu
memory would need to be mapped in the user pagetables.  Doing that
without information leaks is difficult or impossible.

Instead, use a different sneaky trick.  Map a copy of the first part
of the SYSCALL asm at a different address for each CPU.  Now RIP
varies depending on the CPU, so we can use RIP-relative memory access
to access percpu memory.  By putting the relevant information (one
scratch slot and the stack address) at a constant offset relative to
RIP, we can make SYSCALL work without relying on %gs.

A nice thing about this approach is that we can easily switch it on
and off if we want pagetable switching to be configurable.

The compat variant of SYSCALL doesn't have this problem in the first
place -- there are plenty of scratch registers, since we don't care
about preserving r8-r15.  This patch therefore doesn't touch SYSCALL32
at all.

XXX: Whenever we settle how KAISER gets turned on and off, we should do
the same to this.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Link: https://lkml.kernel.org/r/b95ccae0a5a2f090c901e49fce7c9e8ff6acd40d.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/entry_64.S     | 48 +++++++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/fixmap.h |  2 ++
 arch/x86/kernel/asm-offsets.c |  1 +
 arch/x86/kernel/cpu/common.c  | 15 +++++++++++++-
 arch/x86/kernel/vmlinux.lds.S |  9 ++++++++
 5 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 426b8c669d6a..5e674b6362e3 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -140,6 +140,54 @@ END(native_usergs_sysret64)
  * with them due to bugs in both AMD and Intel CPUs.
  */
 
+	.pushsection .entry_trampoline, "ax"
+
+/*
+ * The code in here gets remapped into cpu_entry_area's trampoline.  This means
+ * that the assembler and linker have the wrong idea as to where this code
+ * lives (and, in fact, it's mapped more than once, so it's not even at a
+ * fixed address).  So we can't reference any symbols outside the entry
+ * trampoline and expect it to work.
+ *
+ * Instead, we carefully abuse %rip-relative addressing.
+ * _entry_trampoline(%rip) refers to the start of the remapped) entry
+ * trampoline.  We can thus find cpu_entry_area with this macro:
+ */
+
+#define CPU_ENTRY_AREA \
+	_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
+
+/* The top word of the SYSENTER stack is hot and is usable as scratch space. */
+#define RSP_SCRATCH	CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \
+			SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA
+
+ENTRY(entry_SYSCALL_64_trampoline)
+	UNWIND_HINT_EMPTY
+	swapgs
+
+	/* Stash the user RSP. */
+	movq	%rsp, RSP_SCRATCH
+
+	/* Load the top of the task stack into RSP */
+	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
+
+	/* Start building the simulated IRET frame. */
+	pushq	$__USER_DS			/* pt_regs->ss */
+	pushq	RSP_SCRATCH			/* pt_regs->sp */
+	pushq	%r11				/* pt_regs->flags */
+	pushq	$__USER_CS			/* pt_regs->cs */
+	pushq	%rcx				/* pt_regs->ip */
+
+	/*
+	 * x86 lacks a near absolute jump, and we can't jump to the real
+	 * entry text with a relative jump, so we fake it using retq.
+	 */
+	pushq	$entry_SYSCALL_64_after_hwframe
+	retq
+END(entry_SYSCALL_64_trampoline)
+
+	.popsection
+
 ENTRY(entry_SYSCALL_64)
 	UNWIND_HINT_EMPTY
 	/*
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index c8aaf4b56699..17be88e73c92 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -61,6 +61,8 @@ struct cpu_entry_area {
 	 * of the TSS region.
 	 */
 	struct tss_struct tss;
+
+	char entry_trampoline[PAGE_SIZE];
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 55858b277cf6..61b1af88ac07 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -101,4 +101,5 @@ void common(void) {
 
 	/* Layout info for cpu_entry_area */
 	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
+	OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
 }
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 7c82a8a8bfda..9af1f176cc5b 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -507,6 +507,8 @@ DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
 static inline void setup_cpu_entry_area(int cpu)
 {
 #ifdef CONFIG_X86_64
+	extern char _entry_trampoline[];
+
 	/* On 64-bit systems, we use a read-only fixmap GDT. */
 	pgprot_t gdt_prot = PAGE_KERNEL_RO;
 #else
@@ -553,6 +555,11 @@ static inline void setup_cpu_entry_area(int cpu)
 #ifdef CONFIG_X86_32
 	this_cpu_write(cpu_entry_area, get_cpu_entry_area(cpu));
 #endif
+
+#ifdef CONFIG_X86_64
+	__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
+		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
+#endif
 }
 
 /* Load the original GDT from the per-cpu structure */
@@ -1417,10 +1424,16 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
+	extern char _entry_trampoline[];
+	extern char entry_SYSCALL_64_trampoline[];
+
 	int cpu = smp_processor_id();
+	unsigned long SYSCALL64_entry_trampoline =
+		(unsigned long)get_cpu_entry_area(cpu)->entry_trampoline +
+		(entry_SYSCALL_64_trampoline - _entry_trampoline);
 
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
-	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
+	wrmsrl(MSR_LSTAR, SYSCALL64_entry_trampoline);
 
 #ifdef CONFIG_IA32_EMULATION
 	wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index a4009fb9be87..d2a8b5a24a44 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -107,6 +107,15 @@ SECTIONS
 		SOFTIRQENTRY_TEXT
 		*(.fixup)
 		*(.gnu.warning)
+
+#ifdef CONFIG_X86_64
+		. = ALIGN(PAGE_SIZE);
+		_entry_trampoline = .;
+		*(.entry_trampoline)
+		. = ALIGN(PAGE_SIZE);
+		ASSERT(. - _entry_trampoline == PAGE_SIZE, "entry trampoline is too big");
+#endif
+
 		/* End of text section */
 		_etext = .;
 	} :text = 0x9090
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 19/21] x86/entry/64: Move the IST stacks into 'struct cpu_entry_area'
  2017-11-27 10:45 [PATCH 00/21] Preparatory patches for x86 KAISER support Ingo Molnar
                   ` (17 preceding siblings ...)
  2017-11-27 10:45 ` [PATCH 18/21] x86/entry/64: Create a per-CPU SYSCALL entry trampoline Ingo Molnar
@ 2017-11-27 10:45 ` Ingo Molnar
  2017-11-27 10:45 ` [PATCH 20/21] x86/entry/64: Remove the SYSENTER stack canary Ingo Molnar
  2017-11-27 10:45 ` [PATCH 21/21] x86/entry: Clean up the SYSENTER_stack code Ingo Molnar
  20 siblings, 0 replies; 35+ messages in thread
From: Ingo Molnar @ 2017-11-27 10:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

The IST stacks are needed when an IST exception occurs and are
accessed before any kernel code at all runs.  Move them into
struct cpu_entry_area.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Link: https://lkml.kernel.org/r/0ffddccdc0ce1953f950a553142662cf68258fb7.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/fixmap.h | 10 ++++++++++
 arch/x86/kernel/cpu/common.c  | 40 +++++++++++++++++++++++++---------------
 2 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 17be88e73c92..81fdc405ab1f 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -63,6 +63,16 @@ struct cpu_entry_area {
 	struct tss_struct tss;
 
 	char entry_trampoline[PAGE_SIZE];
+
+#ifdef CONFIG_X86_64
+	/*
+	 * Exception stacks used for IST entries.
+	 *
+	 * In the future, this should have a separate slot for each stack
+	 * with guard pages between them.
+	 */
+	char exception_stacks[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ];
+#endif
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 9af1f176cc5b..f24d13cd73dc 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -503,6 +503,22 @@ static void set_percpu_fixmap_pages(int fixmap_index, void *ptr, int pages, pgpr
 DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
 #endif
 
+#ifdef CONFIG_X86_64
+/*
+ * Special IST stacks which the CPU switches to when it calls
+ * an IST-marked descriptor entry. Up to 7 stacks (hardware
+ * limit), all of them are 4K, except the debug stack which
+ * is 8K.
+ */
+static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
+	  [0 ... N_EXCEPTION_STACKS - 1]	= EXCEPTION_STKSZ,
+	  [DEBUG_STACK - 1]			= DEBUG_STKSZ
+};
+
+static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
+	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
+#endif
+
 /* Setup the fixmap mappings only once per-processor */
 static inline void setup_cpu_entry_area(int cpu)
 {
@@ -557,6 +573,14 @@ static inline void setup_cpu_entry_area(int cpu)
 #endif
 
 #ifdef CONFIG_X86_64
+	BUILD_BUG_ON(sizeof(exception_stacks) % PAGE_SIZE != 0);
+	BUILD_BUG_ON(sizeof(exception_stacks) !=
+		     sizeof(((struct cpu_entry_area *)0)->exception_stacks));
+	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, exception_stacks),
+				&per_cpu(exception_stacks, cpu),
+				sizeof(exception_stacks) / PAGE_SIZE,
+				PAGE_KERNEL);
+
 	__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
 		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
 #endif
@@ -1407,20 +1431,6 @@ DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
 DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
 EXPORT_PER_CPU_SYMBOL(__preempt_count);
 
-/*
- * Special IST stacks which the CPU switches to when it calls
- * an IST-marked descriptor entry. Up to 7 stacks (hardware
- * limit), all of them are 4K, except the debug stack which
- * is 8K.
- */
-static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
-	  [0 ... N_EXCEPTION_STACKS - 1]	= EXCEPTION_STKSZ,
-	  [DEBUG_STACK - 1]			= DEBUG_STKSZ
-};
-
-static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
-	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
-
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
@@ -1629,7 +1639,7 @@ void cpu_init(void)
 	 * set up and load the per-CPU TSS
 	 */
 	if (!oist->ist[0]) {
-		char *estacks = per_cpu(exception_stacks, cpu);
+		char *estacks = get_cpu_entry_area(cpu)->exception_stacks;
 
 		for (v = 0; v < N_EXCEPTION_STACKS; v++) {
 			estacks += exception_stack_sizes[v];
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 20/21] x86/entry/64: Remove the SYSENTER stack canary
  2017-11-27 10:45 [PATCH 00/21] Preparatory patches for x86 KAISER support Ingo Molnar
                   ` (18 preceding siblings ...)
  2017-11-27 10:45 ` [PATCH 19/21] x86/entry/64: Move the IST stacks into 'struct cpu_entry_area' Ingo Molnar
@ 2017-11-27 10:45 ` Ingo Molnar
  2017-11-27 10:45 ` [PATCH 21/21] x86/entry: Clean up the SYSENTER_stack code Ingo Molnar
  20 siblings, 0 replies; 35+ messages in thread
From: Ingo Molnar @ 2017-11-27 10:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

Now that the SYSENTER stack has a guard page, there's no need for a
canary to detect overflow after the fact.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Link: https://lkml.kernel.org/r/be3179c0a38c392fa44ebeb7dd89391ff5c010c3.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/processor.h | 1 -
 arch/x86/kernel/dumpstack.c      | 3 +--
 arch/x86/kernel/process.c        | 1 -
 arch/x86/kernel/traps.c          | 7 -------
 4 files changed, 1 insertion(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 2f6d5634da55..806011202733 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -342,7 +342,6 @@ struct tss_struct {
 	 * Space for the temporary SYSENTER stack, used for SYSENTER
 	 * and the entry trampoline as well.
 	 */
-	unsigned long		SYSENTER_stack_canary;
 	unsigned long		SYSENTER_stack[64];
 
 	/*
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 47ea8e0daf81..35876e94286d 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -48,8 +48,7 @@ bool in_sysenter_stack(unsigned long *stack, struct stack_info *info)
 	int cpu = smp_processor_id();
 	struct tss_struct *tss = &get_cpu_entry_area(cpu)->tss;
 
-	/* Treat the canary as part of the stack for unwinding purposes. */
-	void *begin = &tss->SYSENTER_stack_canary;
+	void *begin = &tss->SYSENTER_stack;
 	void *end = (void *)&tss->SYSENTER_stack + sizeof(tss->SYSENTER_stack);
 
 	if ((void *)stack < begin || (void *)stack >= end)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 2cf7e2aa1730..298be43d63de 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -81,7 +81,6 @@ __visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
 	  */
 	.io_bitmap		= { [0 ... IO_BITMAP_LONGS] = ~0 },
 #endif
-	.SYSENTER_stack_canary	= STACK_END_MAGIC,
 };
 EXPORT_PER_CPU_SYMBOL(cpu_tss);
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 450688c75b3b..de4e3c80cf34 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -819,13 +819,6 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
 	debug_stack_usage_dec();
 
 exit:
-	/*
-	 * This is the most likely code path that involves non-trivial use
-	 * of the SYSENTER stack.  Check that we haven't overrun it.
-	 */
-	WARN(this_cpu_read(cpu_tss.SYSENTER_stack_canary) != STACK_END_MAGIC,
-	     "Overran or corrupted SYSENTER stack\n");
-
 	ist_exit(regs);
 }
 NOKPROBE_SYMBOL(do_debug);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 21/21] x86/entry: Clean up the SYSENTER_stack code
  2017-11-27 10:45 [PATCH 00/21] Preparatory patches for x86 KAISER support Ingo Molnar
                   ` (19 preceding siblings ...)
  2017-11-27 10:45 ` [PATCH 20/21] x86/entry/64: Remove the SYSENTER stack canary Ingo Molnar
@ 2017-11-27 10:45 ` Ingo Molnar
  2017-12-01 17:59   ` Borislav Petkov
  20 siblings, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2017-11-27 10:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

The existing code was a mess, mainly because C arrays are nasty.
Turn SYSENTER_stack into a struct, add a helper to find it, and do
all the obvious cleanups this enables.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Link: https://lkml.kernel.org/r/38ff640712c9b591b32de24a080daf13afaba234.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/entry_32.S        |  4 ++--
 arch/x86/entry/entry_64.S        |  2 +-
 arch/x86/include/asm/fixmap.h    |  5 +++++
 arch/x86/include/asm/processor.h |  6 +++++-
 arch/x86/kernel/asm-offsets.c    |  6 ++----
 arch/x86/kernel/cpu/common.c     | 14 +++-----------
 arch/x86/kernel/dumpstack.c      |  7 +++----
 7 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 0ab316c46806..3629bcbf85a2 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -942,7 +942,7 @@ ENTRY(debug)
 
 	/* Are we currently on the SYSENTER stack? */
 	movl	PER_CPU_VAR(cpu_entry_area), %ecx
-	addl	$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
+	addl	$CPU_ENTRY_AREA_tss + TSS_STRUCT_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Ldebug_from_sysenter_stack
@@ -986,7 +986,7 @@ ENTRY(nmi)
 
 	/* Are we currently on the SYSENTER stack? */
 	movl	PER_CPU_VAR(cpu_entry_area), %ecx
-	addl	$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
+	addl	$CPU_ENTRY_AREA_tss + TSS_STRUCT_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Lnmi_from_sysenter_stack
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 5e674b6362e3..caf74a1bb3de 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -158,7 +158,7 @@ END(native_usergs_sysret64)
 	_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
 
 /* The top word of the SYSENTER stack is hot and is usable as scratch space. */
-#define RSP_SCRATCH	CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \
+#define RSP_SCRATCH	CPU_ENTRY_AREA_tss + TSS_STRUCT_SYSENTER_stack + \
 			SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA
 
 ENTRY(entry_SYSCALL_64_trampoline)
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 81fdc405ab1f..5a1013df456e 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -237,5 +237,10 @@ static inline struct cpu_entry_area *get_cpu_entry_area(int cpu)
 	return (struct cpu_entry_area *)__fix_to_virt(__get_cpu_entry_area_page_index(cpu, 0));
 }
 
+static inline struct SYSENTER_stack *cpu_SYSENTER_stack(int cpu)
+{
+	return &get_cpu_entry_area((cpu))->tss.SYSENTER_stack;
+}
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_FIXMAP_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 806011202733..0eb62e7eeb33 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -337,12 +337,16 @@ struct x86_hw_tss {
 #define IO_BITMAP_OFFSET		(offsetof(struct tss_struct, io_bitmap) - offsetof(struct tss_struct, x86_tss))
 #define INVALID_IO_BITMAP_OFFSET	0x8000
 
+struct SYSENTER_stack {
+	unsigned long		words[64];
+};
+
 struct tss_struct {
 	/*
 	 * Space for the temporary SYSENTER stack, used for SYSENTER
 	 * and the entry trampoline as well.
 	 */
-	unsigned long		SYSENTER_stack[64];
+	struct SYSENTER_stack	SYSENTER_stack;
 
 	/*
 	 * The fixed hardware portion.  This must not cross a page boundary
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 61b1af88ac07..46c0995344aa 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -94,10 +94,8 @@ void common(void) {
 	BLANK();
 	DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));
 
-	/* Offset from cpu_tss to SYSENTER_stack */
-	OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
-	/* Size of SYSENTER_stack */
-	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
+	OFFSET(TSS_STRUCT_SYSENTER_stack, tss_struct, SYSENTER_stack);
+	DEFINE(SIZEOF_SYSENTER_stack, sizeof(struct SYSENTER_stack));
 
 	/* Layout info for cpu_entry_area */
 	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f24d13cd73dc..1509f09abf5e 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1332,12 +1332,7 @@ void enable_sep_cpu(void)
 
 	tss->x86_tss.ss1 = __KERNEL_CS;
 	wrmsr(MSR_IA32_SYSENTER_CS, tss->x86_tss.ss1, 0);
-
-	wrmsr(MSR_IA32_SYSENTER_ESP,
-	      (unsigned long)&get_cpu_entry_area(cpu)->tss +
-	      offsetofend(struct tss_struct, SYSENTER_stack),
-	      0);
-
+	wrmsr(MSR_IA32_SYSENTER_ESP, (unsigned long)(cpu_SYSENTER_stack(cpu) + 1), 0);
 	wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long)entry_SYSENTER_32, 0);
 
 	put_cpu();
@@ -1454,9 +1449,7 @@ void syscall_init(void)
 	 * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
 	 */
 	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
-	wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
-		    (unsigned long)&get_cpu_entry_area(cpu)->tss +
-		    offsetofend(struct tss_struct, SYSENTER_stack));
+	wrmsrl_safe(MSR_IA32_SYSENTER_ESP, (unsigned long)(cpu_SYSENTER_stack(cpu) + 1));
 	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
 #else
 	wrmsrl(MSR_CSTAR, (unsigned long)ignore_sysret);
@@ -1673,8 +1666,7 @@ void cpu_init(void)
 	 */
 	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
-	load_sp0((unsigned long)&get_cpu_entry_area(cpu)->tss +
-		 offsetofend(struct tss_struct, SYSENTER_stack));
+	load_sp0((unsigned long)(cpu_SYSENTER_stack(cpu) + 1));
 
 	load_mm_ldt(&init_mm);
 
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 35876e94286d..f214f7a6ccad 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -45,11 +45,10 @@ bool in_task_stack(unsigned long *stack, struct task_struct *task,
 
 bool in_sysenter_stack(unsigned long *stack, struct stack_info *info)
 {
-	int cpu = smp_processor_id();
-	struct tss_struct *tss = &get_cpu_entry_area(cpu)->tss;
+	struct SYSENTER_stack *ss = cpu_SYSENTER_stack(smp_processor_id());
 
-	void *begin = &tss->SYSENTER_stack;
-	void *end = (void *)&tss->SYSENTER_stack + sizeof(tss->SYSENTER_stack);
+	void *begin = ss;
+	void *end = ss + 1;
 
 	if ((void *)stack < begin || (void *)stack >= end)
 		return false;
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 01/21] x86/unwinder/orc: Don't bail on stack overflow
  2017-11-27 10:45 ` [PATCH 01/21] x86/unwinder/orc: Don't bail on stack overflow Ingo Molnar
@ 2017-11-27 14:47   ` Borislav Petkov
  0 siblings, 0 replies; 35+ messages in thread
From: Borislav Petkov @ 2017-11-27 14:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Mon, Nov 27, 2017 at 11:45:09AM +0100, Ingo Molnar wrote:
> From: Andy Lutomirski <luto@kernel.org>
> 
> If we overflow the stack into a guard page and then try to unwind it
> with ORC, it should work well: by construction, there can't be any
> meaningful data in the guard page because no writes to the guard page
> will have succeeded.
> 
> This patch fixes a bug that unwinding from working correctly: if the
			     ^
			     prevents

> starting register state has RSP pointing into a stack guard page, the
> ORC unwinder bails out immediately.  This patch fixes that: the ORC

I believe here we can kill the second "This patch" :)

> unwinder will start the unwind.
> 
> I tested this by intentionally overflowing the task stack.  The
> result is an accurate call trace instead of a trace consisting
> purely of '?' entries.
> 
> There are a few other bugs that are triggered if the unwinder
> encounters a stack overflow after the first step, and Josh has WIP
> patches to fix those as well.

I guess we don't need that paragraph.

> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Cc: Borislav Petkov <bpetkov@suse.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Link: http://lkml.kernel.org/r/927042950d7f1a7007dd0f58538966a593508f8b.1511715954.git.luto@kernel.org
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  arch/x86/kernel/unwind_orc.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
> index a3f973b2c97a..7f6e3935666b 100644
> --- a/arch/x86/kernel/unwind_orc.c
> +++ b/arch/x86/kernel/unwind_orc.c
> @@ -553,8 +553,18 @@ void __unwind_start(struct unwind_state *state, struct task_struct *task,
>  	}
>  
>  	if (get_stack_info((unsigned long *)state->sp, state->task,
> -			   &state->stack_info, &state->stack_mask))
> -		return;
> +			   &state->stack_info, &state->stack_mask)) {
> +		/*
> +		 * We weren't on a valid stack.  It's possible that
> +		 * we overflowed a valid stack into a guard page.
> +		 * See if the next page up is valid so that we can
> +		 * generate some kind of backtrace if this happens.
> +		 */

Right, should we issue a marker or somesuch here to denote that we somehow
walked into the guard page?

It might be helpful when debugging issues, to see the big picture...

> +		void *next_page = (void *)PAGE_ALIGN((unsigned long)regs->sp);
> +		if (get_stack_info(next_page, state->task, &state->stack_info,
> +				   &state->stack_mask))
> +			return;
> +	}
>  
>  	/*
>  	 * The caller can provide the address of the first frame directly
> -- 

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 02/21] x86/unwinder: Handle stack overflows more gracefully
  2017-11-27 10:45 ` [PATCH 02/21] x86/unwinder: Handle stack overflows more gracefully Ingo Molnar
@ 2017-11-27 17:36   ` Borislav Petkov
  2017-11-28  4:55     ` Josh Poimboeuf
  0 siblings, 1 reply; 35+ messages in thread
From: Borislav Petkov @ 2017-11-27 17:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Mon, Nov 27, 2017 at 11:45:10AM +0100, Ingo Molnar wrote:
> From: Josh Poimboeuf <jpoimboe@redhat.com>
> 
> There are at least two unwinder bugs hindering the debugging of
> stack-overflow crashes:
> 
> - It doesn't deal gracefully with the case where the stack overflows and
>   the stack pointer itself isn't on a valid stack but the
>   to-be-dereferenced data *is*.
> 
> - The ORC oops dump code doesn't know how to print partial pt_regs, for the
>   case where if we get an interrupt/exception in *early* entry code
>   before the full pt_regs have been saved.
> 
> Fix both issues.
> 
> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Borislav Petkov <bpetkov@suse.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Link: http://lkml.kernel.org/r/20171126024031.uxi4numpbjm5rlbr@treble
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  arch/x86/include/asm/kdebug.h |  1 +
>  arch/x86/include/asm/unwind.h |  7 ++++
>  arch/x86/kernel/dumpstack.c   | 32 ++++++++++++++++---
>  arch/x86/kernel/process_64.c  | 11 +++----
>  arch/x86/kernel/stacktrace.c  |  2 +-
>  arch/x86/kernel/unwind_orc.c  | 74 +++++++++++++++----------------------------
>  6 files changed, 66 insertions(+), 61 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kdebug.h b/arch/x86/include/asm/kdebug.h
> index f86a8caa561e..395c9631e000 100644
> --- a/arch/x86/include/asm/kdebug.h
> +++ b/arch/x86/include/asm/kdebug.h
> @@ -26,6 +26,7 @@ extern void die(const char *, struct pt_regs *,long);
>  extern int __must_check __die(const char *, struct pt_regs *, long);
>  extern void show_stack_regs(struct pt_regs *regs);
>  extern void __show_regs(struct pt_regs *regs, int all);
> +extern void show_iret_regs(struct pt_regs *regs);
>  extern unsigned long oops_begin(void);
>  extern void oops_end(unsigned long, struct pt_regs *, int signr);
>  
> diff --git a/arch/x86/include/asm/unwind.h b/arch/x86/include/asm/unwind.h
> index e9cc6fe1fc6f..5be2fb23825a 100644
> --- a/arch/x86/include/asm/unwind.h
> +++ b/arch/x86/include/asm/unwind.h
> @@ -7,6 +7,9 @@
>  #include <asm/ptrace.h>
>  #include <asm/stacktrace.h>
>  
> +#define IRET_FRAME_OFFSET (offsetof(struct pt_regs, ip))
> +#define IRET_FRAME_SIZE   (sizeof(struct pt_regs) - IRET_FRAME_OFFSET)
> +
>  struct unwind_state {
>  	struct stack_info stack_info;
>  	unsigned long stack_mask;
> @@ -52,6 +55,10 @@ void unwind_start(struct unwind_state *state, struct task_struct *task,
>  }
>  
>  #if defined(CONFIG_UNWINDER_ORC) || defined(CONFIG_UNWINDER_FRAME_POINTER)
> +/*
> + * WARNING: The entire pt_regs may not be safe to dereference.  In some cases,
> + * only the iret registers are accessible.  Use with caution!
	   ^^^^^^^^^^^^^^^^^^

You mean the interrupt stack frame here, right?

> + */
>  static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
>  {
>  	if (unwind_done(state))
> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
> index f13b4c00a5de..fc918744ad7d 100644
> --- a/arch/x86/kernel/dumpstack.c
> +++ b/arch/x86/kernel/dumpstack.c
> @@ -50,6 +50,28 @@ static void printk_stack_address(unsigned long address, int reliable,
>  	printk("%s %s%pB\n", log_lvl, reliable ? "" : "? ", (void *)address);
>  }
>  
> +void show_iret_regs(struct pt_regs *regs)
> +{
> +	printk(KERN_DEFAULT "RIP: %04x:%pS\n", (int)regs->cs, (void *)regs->ip);
> +	printk(KERN_DEFAULT "RSP: %04x:%016lx EFLAGS: %08lx", (int)regs->ss,
> +		regs->sp, regs->flags);
> +}
> +
> +static void show_regs_safe(struct stack_info *info, struct pt_regs *regs)
> +{
> +	if (on_stack(info, regs, sizeof(*regs)))
> +		__show_regs(regs, 0);
> +	else if (on_stack(info, (void *)regs + IRET_FRAME_OFFSET,
> +			  IRET_FRAME_SIZE)) {
> +		/*
> +		 * When an interrupt or exception occurs in entry code, the
> +		 * full pt_regs might not have been saved yet.  In that case
> +		 * just print the iret return frame.

Right, it is the interrupt stack frame. But "iret frame" is shorter so
let's stick to that :)

...

> @@ -283,42 +276,32 @@ static bool deref_stack_reg(struct unwind_state *state, unsigned long addr,
>  	return true;
>  }
>  
> -#define REGS_SIZE (sizeof(struct pt_regs))
> -#define SP_OFFSET (offsetof(struct pt_regs, sp))
> -#define IRET_REGS_SIZE (REGS_SIZE - offsetof(struct pt_regs, ip))
> -#define IRET_SP_OFFSET (SP_OFFSET - offsetof(struct pt_regs, ip))
> -
>  static bool deref_stack_regs(struct unwind_state *state, unsigned long addr,
> -			     unsigned long *ip, unsigned long *sp, bool full)
> +			     unsigned long *ip, unsigned long *sp)
>  {
> -	size_t regs_size = full ? REGS_SIZE : IRET_REGS_SIZE;
> -	size_t sp_offset = full ? SP_OFFSET : IRET_SP_OFFSET;
> -	struct pt_regs *regs = (struct pt_regs *)(addr + regs_size - REGS_SIZE);
> -
> -	if (IS_ENABLED(CONFIG_X86_64)) {
> -		if (!stack_access_ok(state, addr, regs_size))
> -			return false;
> -
> -		*ip = regs->ip;
> -		*sp = regs->sp;
> +	struct pt_regs *regs = (struct pt_regs *)addr;
>  
> -		return true;
> -	}
> +	/* x86-32 support will be more complicated due to the &regs->sp hack */
> +	BUILD_BUG_ON(IS_ENABLED(CONFIG_X86_32));
>  
> -	if (!stack_access_ok(state, addr, sp_offset))
> +	if (!stack_access_ok(state, addr, sizeof(struct pt_regs)))
>  		return false;
>  
>  	*ip = regs->ip;
> +	*sp = regs->sp;
> +	return true;
> +}
>  
> -	if (user_mode(regs)) {
> -		if (!stack_access_ok(state, addr + sp_offset,
> -				     REGS_SIZE - SP_OFFSET))
> -			return false;
> +static bool deref_stack_iret_regs(struct unwind_state *state, unsigned long addr,
> +				  unsigned long *ip, unsigned long *sp)
> +{
> +	struct pt_regs *regs = (void *)addr - IRET_FRAME_OFFSET;

I guess those are traditionally done with container_of()...

But yeah, FWIW, looks ok to me:

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 11/21] x86/dumpstack: Handle stack overflow on all stacks
  2017-11-27 10:45 ` [PATCH 11/21] x86/dumpstack: Handle stack overflow on all stacks Ingo Molnar
@ 2017-11-27 19:26   ` Linus Torvalds
  2017-11-28  4:29     ` Josh Poimboeuf
  0 siblings, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2017-11-27 19:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, Dave Hansen, Andy Lutomirski,
	Thomas Gleixner, H . Peter Anvin, Peter Zijlstra,
	Borislav Petkov

On Mon, Nov 27, 2017 at 2:45 AM, Ingo Molnar <mingo@kernel.org> wrote:
> From: Andy Lutomirski <luto@kernel.org>
>
> We currently special-case stack overflow on the task stack.  We're
> going to start putting special stacks in the fixmap with a custom
> layout, so they'll have guard pages, too.  Teach the unwinder to be
> able to unwind an overflow of any of the stacks.

Why isn't this together with 01/21? The two cases seem to be entirely
identical and fundamentally the same issue.

In fact, maybe the whole "stack overflow" special cases should be in
"get_stack_info()" itself, rather than be special-cased in the
callers?

                Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 11/21] x86/dumpstack: Handle stack overflow on all stacks
  2017-11-27 19:26   ` Linus Torvalds
@ 2017-11-28  4:29     ` Josh Poimboeuf
  2017-11-28  5:29       ` Andy Lutomirski
  0 siblings, 1 reply; 35+ messages in thread
From: Josh Poimboeuf @ 2017-11-28  4:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Linux Kernel Mailing List, Dave Hansen,
	Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov

On Mon, Nov 27, 2017 at 11:26:30AM -0800, Linus Torvalds wrote:
> On Mon, Nov 27, 2017 at 2:45 AM, Ingo Molnar <mingo@kernel.org> wrote:
> > From: Andy Lutomirski <luto@kernel.org>
> >
> > We currently special-case stack overflow on the task stack.  We're
> > going to start putting special stacks in the fixmap with a custom
> > layout, so they'll have guard pages, too.  Teach the unwinder to be
> > able to unwind an overflow of any of the stacks.
> 
> Why isn't this together with 01/21? The two cases seem to be entirely
> identical and fundamentally the same issue.

Yeah, they probably do belong in the same patch.

> In fact, maybe the whole "stack overflow" special cases should be in
> "get_stack_info()" itself, rather than be special-cased in the
> callers?

I would be nervous about doing that.  Several of the get_stack_info()
callers rely on it being honest.

In fact, looking deeper at the above patch, it doesn't seem convincingly
safe to me.  What if the adjacent page doesn't exist?  Then when the
oops dumping code dereferences the 'stack' variable, you get an oops in
your oops.

-- 
Josh

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 02/21] x86/unwinder: Handle stack overflows more gracefully
  2017-11-27 17:36   ` Borislav Petkov
@ 2017-11-28  4:55     ` Josh Poimboeuf
  0 siblings, 0 replies; 35+ messages in thread
From: Josh Poimboeuf @ 2017-11-28  4:55 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ingo Molnar, linux-kernel, Dave Hansen, Andy Lutomirski,
	Thomas Gleixner, H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Mon, Nov 27, 2017 at 06:36:04PM +0100, Borislav Petkov wrote:
> > +static bool deref_stack_iret_regs(struct unwind_state *state, unsigned long addr,
> > +				  unsigned long *ip, unsigned long *sp)
> > +{
> > +	struct pt_regs *regs = (void *)addr - IRET_FRAME_OFFSET;
> 
> I guess those are traditionally done with container_of()...

Without modifying pt_regs, there's no container to speak of, but at
least IRET_FRAME_OFFSET uses offsetof(), which is similar.

> But yeah, FWIW, looks ok to me:
> 
> Reviewed-by: Borislav Petkov <bp@suse.de>

Thanks!


Ingo, can you fold in the below changes based on Boris' comments?


diff --git a/arch/x86/include/asm/unwind.h b/arch/x86/include/asm/unwind.h
index 5be2fb23825a..c1688c2d0a12 100644
--- a/arch/x86/include/asm/unwind.h
+++ b/arch/x86/include/asm/unwind.h
@@ -57,7 +57,7 @@ void unwind_start(struct unwind_state *state, struct task_struct *task,
 #if defined(CONFIG_UNWINDER_ORC) || defined(CONFIG_UNWINDER_FRAME_POINTER)
 /*
  * WARNING: The entire pt_regs may not be safe to dereference.  In some cases,
- * only the iret registers are accessible.  Use with caution!
+ * only the iret frame registers are accessible.  Use with caution!
  */
 static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
 {
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index fc918744ad7d..0bc95be5c638 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -66,7 +66,7 @@ static void show_regs_safe(struct stack_info *info, struct pt_regs *regs)
 		/*
 		 * When an interrupt or exception occurs in entry code, the
 		 * full pt_regs might not have been saved yet.  In that case
-		 * just print the iret return frame.
+		 * just print the iret frame.
 		 */
 		show_iret_regs(regs);
 	}

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 11/21] x86/dumpstack: Handle stack overflow on all stacks
  2017-11-28  4:29     ` Josh Poimboeuf
@ 2017-11-28  5:29       ` Andy Lutomirski
  2017-11-28 18:15         ` Josh Poimboeuf
  0 siblings, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2017-11-28  5:29 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Linus Torvalds, Ingo Molnar, Linux Kernel Mailing List,
	Dave Hansen, Thomas Gleixner, H . Peter Anvin, Peter Zijlstra,
	Borislav Petkov

On Mon, Nov 27, 2017 at 8:29 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Mon, Nov 27, 2017 at 11:26:30AM -0800, Linus Torvalds wrote:
>> On Mon, Nov 27, 2017 at 2:45 AM, Ingo Molnar <mingo@kernel.org> wrote:
>> > From: Andy Lutomirski <luto@kernel.org>
>> >
>> > We currently special-case stack overflow on the task stack.  We're
>> > going to start putting special stacks in the fixmap with a custom
>> > layout, so they'll have guard pages, too.  Teach the unwinder to be
>> > able to unwind an overflow of any of the stacks.
>>
>> Why isn't this together with 01/21? The two cases seem to be entirely
>> identical and fundamentally the same issue.
>
> Yeah, they probably do belong in the same patch.
>
>> In fact, maybe the whole "stack overflow" special cases should be in
>> "get_stack_info()" itself, rather than be special-cased in the
>> callers?
>
> I would be nervous about doing that.  Several of the get_stack_info()
> callers rely on it being honest.
>
> In fact, looking deeper at the above patch, it doesn't seem convincingly
> safe to me.  What if the adjacent page doesn't exist?  Then when the
> oops dumping code dereferences the 'stack' variable, you get an oops in
> your oops.
>

Isn't the oops dumping code supposed to dereference everything using a
special safe function?

Anyway, get_stack_info() wouldn't really be lying.  It would just be
returning something where begin..end doesn't contain the requested
pointer.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 11/21] x86/dumpstack: Handle stack overflow on all stacks
  2017-11-28  5:29       ` Andy Lutomirski
@ 2017-11-28 18:15         ` Josh Poimboeuf
  0 siblings, 0 replies; 35+ messages in thread
From: Josh Poimboeuf @ 2017-11-28 18:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Ingo Molnar, Linux Kernel Mailing List,
	Dave Hansen, Thomas Gleixner, H . Peter Anvin, Peter Zijlstra,
	Borislav Petkov

On Mon, Nov 27, 2017 at 09:29:26PM -0800, Andy Lutomirski wrote:
> On Mon, Nov 27, 2017 at 8:29 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Mon, Nov 27, 2017 at 11:26:30AM -0800, Linus Torvalds wrote:
> >> On Mon, Nov 27, 2017 at 2:45 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >> > From: Andy Lutomirski <luto@kernel.org>
> >> >
> >> > We currently special-case stack overflow on the task stack.  We're
> >> > going to start putting special stacks in the fixmap with a custom
> >> > layout, so they'll have guard pages, too.  Teach the unwinder to be
> >> > able to unwind an overflow of any of the stacks.
> >>
> >> Why isn't this together with 01/21? The two cases seem to be entirely
> >> identical and fundamentally the same issue.
> >
> > Yeah, they probably do belong in the same patch.
> >
> >> In fact, maybe the whole "stack overflow" special cases should be in
> >> "get_stack_info()" itself, rather than be special-cased in the
> >> callers?
> >
> > I would be nervous about doing that.  Several of the get_stack_info()
> > callers rely on it being honest.
> >
> > In fact, looking deeper at the above patch, it doesn't seem convincingly
> > safe to me.  What if the adjacent page doesn't exist?  Then when the
> > oops dumping code dereferences the 'stack' variable, you get an oops in
> > your oops.

I was wrong here.  Too much thinking, not enough sleep.  Looking at it
today, the patch looks safe again.

The 'stack' variable is actually safe to dereference afterwards because
it gets updated with the safe aligned address.

> Isn't the oops dumping code supposed to dereference everything using a
> special safe function?

That's not what it does now.  The 'stack' pointer variable is
dereferenced normally, unless KASAN is enabled.  It relies on
get_stack_info() doing the safety check.

> Anyway, get_stack_info() wouldn't really be lying.  It would just be
> returning something where begin..end doesn't contain the requested
> pointer.

At the very least, it would be surprising.  I prefer the current
approach, it's much more explicit.

-- 
Josh

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 16/21] x86/entry/64: Use a per-CPU trampoline stack for IDT entries
  2017-11-27 10:45 ` [PATCH 16/21] x86/entry/64: Use a per-CPU trampoline stack for IDT entries Ingo Molnar
@ 2017-12-01 17:06   ` Dave Hansen
  2017-12-01 17:44     ` Andy Lutomirski
  0 siblings, 1 reply; 35+ messages in thread
From: Dave Hansen @ 2017-12-01 17:06 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

On 11/27/2017 02:45 AM, Ingo Molnar wrote:
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -1645,11 +1645,13 @@ void cpu_init(void)
>  	setup_cpu_entry_area(cpu);
>  
>  	/*
> -	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
> -	 * task never enters user mode.
> +	 * Initialize the TSS.  sp0 points to the entry trampoline stack
> +	 * regardless of what task is running.
>  	 */
>  	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
>  	load_TR_desc();
> +	load_sp0((unsigned long)&get_cpu_entry_area(cpu)->tss +
> +		 offsetofend(struct tss_struct, SYSENTER_stack));
>  
>  	load_mm_ldt(&init_mm);

Does this also mean that our stack dumps will say "<SYSENTER>" in oopses?

> [   30.811750] CR2: fffffffffdeb2f98 CR3: 0000000423fae001 CR4: 00000000001607e0                                                                                                           
> [   30.819712] Call Trace:                                                                                                                                                                 
> [   30.822442]  <SYSENTER>                                                                                                                                                                 
> [   30.825170]  trace_hardirqs_on_thunk+0x1c/0x1c                                                                                                                                          
...
> [   31.000571] R13: 0000000000000050 R14: 0000000000000076 R15: 00007f59f76f2d60                                                                                                           
> [   31.008533]  </SYSENTER>                                                                                                    

Should we change that string to something more descriptive?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 16/21] x86/entry/64: Use a per-CPU trampoline stack for IDT entries
  2017-12-01 17:06   ` Dave Hansen
@ 2017-12-01 17:44     ` Andy Lutomirski
  2017-12-01 21:21       ` Dave Hansen
  0 siblings, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2017-12-01 17:44 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ingo Molnar, linux-kernel, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

> On Dec 1, 2017, at 9:06 AM, Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
>> On 11/27/2017 02:45 AM, Ingo Molnar wrote:
>> --- a/arch/x86/kernel/cpu/common.c
>> +++ b/arch/x86/kernel/cpu/common.c
>> @@ -1645,11 +1645,13 @@ void cpu_init(void)
>>    setup_cpu_entry_area(cpu);
>>
>>    /*
>> -     * Initialize the TSS.  Don't bother initializing sp0, as the initial
>> -     * task never enters user mode.
>> +     * Initialize the TSS.  sp0 points to the entry trampoline stack
>> +     * regardless of what task is running.
>>     */
>>    set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
>>    load_TR_desc();
>> +    load_sp0((unsigned long)&get_cpu_entry_area(cpu)->tss +
>> +         offsetofend(struct tss_struct, SYSENTER_stack));
>>
>>    load_mm_ldt(&init_mm);
>
> Does this also mean that our stack dumps will say "<SYSENTER>" in oopses?

Only if we oops while we're on the trampoline stack.  If the oops is
due to a bug that isn't in the entry code, then we won't see it.

>
>> [   30.811750] CR2: fffffffffdeb2f98 CR3: 0000000423fae001 CR4: 00000000001607e0
>> [   30.819712] Call Trace:
>> [   30.822442]  <SYSENTER>
>> [   30.825170]  trace_hardirqs_on_thunk+0x1c/0x1c
> ...
>> [   31.000571] R13: 0000000000000050 R14: 0000000000000076 R15: 00007f59f76f2d60
>> [   31.008533]  </SYSENTER>
>
> Should we change that string to something more descriptive?

I suppose we could rename it to "ENTRY_TRAMPOLINE" or something like that.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 21/21] x86/entry: Clean up the SYSENTER_stack code
  2017-11-27 10:45 ` [PATCH 21/21] x86/entry: Clean up the SYSENTER_stack code Ingo Molnar
@ 2017-12-01 17:59   ` Borislav Petkov
  0 siblings, 0 replies; 35+ messages in thread
From: Borislav Petkov @ 2017-12-01 17:59 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Mon, Nov 27, 2017 at 11:45:29AM +0100, Ingo Molnar wrote:
> diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
> index 81fdc405ab1f..5a1013df456e 100644
> --- a/arch/x86/include/asm/fixmap.h
> +++ b/arch/x86/include/asm/fixmap.h
> @@ -237,5 +237,10 @@ static inline struct cpu_entry_area *get_cpu_entry_area(int cpu)
>  	return (struct cpu_entry_area *)__fix_to_virt(__get_cpu_entry_area_page_index(cpu, 0));
>  }
>  
> +static inline struct SYSENTER_stack *cpu_SYSENTER_stack(int cpu)
> +{
> +	return &get_cpu_entry_area((cpu))->tss.SYSENTER_stack;
				^^^^^^^^^
Those inner brackets are not needed, right?

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 16/21] x86/entry/64: Use a per-CPU trampoline stack for IDT entries
  2017-12-01 17:44     ` Andy Lutomirski
@ 2017-12-01 21:21       ` Dave Hansen
  2017-12-01 21:59         ` Andy Lutomirski
  2017-12-02  6:41         ` Kevin Easton
  0 siblings, 2 replies; 35+ messages in thread
From: Dave Hansen @ 2017-12-01 21:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, linux-kernel, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 524 bytes --]

>>> [   30.811750] CR2: fffffffffdeb2f98 CR3: 0000000423fae001 CR4: 00000000001607e0
>>> [   30.819712] Call Trace:
>>> [   30.822442]  <SYSENTER>
>>> [   30.825170]  trace_hardirqs_on_thunk+0x1c/0x1c
>> ...
>>> [   31.000571] R13: 0000000000000050 R14: 0000000000000076 R15: 00007f59f76f2d60
>>> [   31.008533]  </SYSENTER>
>>
>> Should we change that string to something more descriptive?
> 
> I suppose we could rename it to "ENTRY_TRAMPOLINE" or something like that.

The attached patch does just that.  Any objections?

[-- Attachment #2: SYSENTER-rename.patch --]
[-- Type: text/x-patch, Size: 1199 bytes --]


From: Dave Hansen <dave.hansen@linux.intel.com>

The "SYSENTER" stack is used for a lot more than SYSENTER now.
Give it a better string to display in stack dumps.

We should probably cleanse the 64-bit code of the remaining
"SYSENTER" nomenclature too at some point.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch//x86/kernel/dumpstack_64.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff -puN arch//x86/kernel/dumpstack_64.c~SYSENTER-rename arch//x86/kernel/dumpstack_64.c
--- a/arch//x86/kernel/dumpstack_64.c~SYSENTER-rename	2017-12-01 12:43:16.768707737 -0800
+++ b/arch//x86/kernel/dumpstack_64.c	2017-12-01 13:19:21.741702337 -0800
@@ -37,8 +37,14 @@ const char *stack_type_name(enum stack_t
 	if (type == STACK_TYPE_IRQ)
 		return "IRQ";
 
-	if (type == STACK_TYPE_SYSENTER)
-		return "SYSENTER";
+	if (type == STACK_TYPE_SYSENTER) {
+		/*
+		 * On 64-bit, we have a generic entry stack that we
+		 * use for all the kernel try points, including
+		 * SYSENTER.
+		 */
+		return "ENTRY_TRAMPOLINE";
+	}
 
 	if (type >= STACK_TYPE_EXCEPTION && type <= STACK_TYPE_EXCEPTION_LAST)
 		return exception_stack_names[type - STACK_TYPE_EXCEPTION];
_

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 16/21] x86/entry/64: Use a per-CPU trampoline stack for IDT entries
  2017-12-01 21:21       ` Dave Hansen
@ 2017-12-01 21:59         ` Andy Lutomirski
  2017-12-02  6:41         ` Kevin Easton
  1 sibling, 0 replies; 35+ messages in thread
From: Andy Lutomirski @ 2017-12-01 21:59 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Ingo Molnar, linux-kernel, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Borislav Petkov, Linus Torvalds



On Dec 1, 2017, at 1:21 PM, Dave Hansen <dave.hansen@linux.intel.com> wrote:

>>>> [   30.811750] CR2: fffffffffdeb2f98 CR3: 0000000423fae001 CR4: 00000000001607e0
>>>> [   30.819712] Call Trace:
>>>> [   30.822442]  <SYSENTER>
>>>> [   30.825170]  trace_hardirqs_on_thunk+0x1c/0x1c
>>> ...
>>>> [   31.000571] R13: 0000000000000050 R14: 0000000000000076 R15: 00007f59f76f2d60
>>>> [   31.008533]  </SYSENTER>
>>> 
>>> Should we change that string to something more descriptive?
>> 
>> I suppose we could rename it to "ENTRY_TRAMPOLINE" or something like that.
> 
> The attached patch does just that.  Any objections?
> <SYSENTER-rename.patch>

I think that, if we do this, we should rename it in the code, too.  Calling it one thing in the oops and something else in the code is just going to add confusion.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 16/21] x86/entry/64: Use a per-CPU trampoline stack for IDT entries
  2017-12-01 21:21       ` Dave Hansen
  2017-12-01 21:59         ` Andy Lutomirski
@ 2017-12-02  6:41         ` Kevin Easton
  1 sibling, 0 replies; 35+ messages in thread
From: Kevin Easton @ 2017-12-02  6:41 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Ingo Molnar, linux-kernel, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Borislav Petkov, Linus Torvalds

> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> The "SYSENTER" stack is used for a lot more than SYSENTER now.
> Give it a better string to display in stack dumps.
> 
> We should probably cleanse the 64-bit code of the remaining
> "SYSENTER" nomenclature too at some point.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
> 
>  b/arch//x86/kernel/dumpstack_64.c |   10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff -puN arch//x86/kernel/dumpstack_64.c~SYSENTER-rename arch//x86/kernel/dumpstack_64.c
> --- a/arch//x86/kernel/dumpstack_64.c~SYSENTER-rename	2017-12-01 12:43:16.768707737 -0800
> +++ b/arch//x86/kernel/dumpstack_64.c	2017-12-01 13:19:21.741702337 -0800
> @@ -37,8 +37,14 @@ const char *stack_type_name(enum stack_t
>  	if (type == STACK_TYPE_IRQ)
>  		return "IRQ";
>  
> -	if (type == STACK_TYPE_SYSENTER)
> -		return "SYSENTER";
> +	if (type == STACK_TYPE_SYSENTER) {
> +		/*
> +		 * On 64-bit, we have a generic entry stack that we
> +		 * use for all the kernel try points, including
> +		 * SYSENTER.

ITYM "kernel entry points".

    - Kevin

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2017-12-02  6:41 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-27 10:45 [PATCH 00/21] Preparatory patches for x86 KAISER support Ingo Molnar
2017-11-27 10:45 ` [PATCH 01/21] x86/unwinder/orc: Don't bail on stack overflow Ingo Molnar
2017-11-27 14:47   ` Borislav Petkov
2017-11-27 10:45 ` [PATCH 02/21] x86/unwinder: Handle stack overflows more gracefully Ingo Molnar
2017-11-27 17:36   ` Borislav Petkov
2017-11-28  4:55     ` Josh Poimboeuf
2017-11-27 10:45 ` [PATCH 03/21] x86/irq: Remove an old outdated comment about context tracking races Ingo Molnar
2017-11-27 10:45 ` [PATCH 04/21] x86/irq/64: Print the offending IP in the stack overflow warning Ingo Molnar
2017-11-27 10:45 ` [PATCH 05/21] x86/entry/64: Allocate and enable the SYSENTER stack Ingo Molnar
2017-11-27 10:45 ` [PATCH 06/21] x86/dumpstack: Add get_stack_info() support for " Ingo Molnar
2017-11-27 10:45 ` [PATCH 07/21] x86/entry/gdt: Put per-CPU GDT remaps in ascending order Ingo Molnar
2017-11-27 10:45 ` [PATCH 08/21] x86/mm/fixmap: Generalize the GDT fixmap mechanism, introduce 'struct cpu_entry_area' Ingo Molnar
2017-11-27 10:45 ` [PATCH 09/21] x86/kasan/64: Teach KASAN about the cpu_entry_area Ingo Molnar
2017-11-27 10:45 ` [PATCH 10/21] x86/entry: Fix assumptions that the HW TSS is at the beginning of cpu_tss Ingo Molnar
2017-11-27 10:45 ` [PATCH 11/21] x86/dumpstack: Handle stack overflow on all stacks Ingo Molnar
2017-11-27 19:26   ` Linus Torvalds
2017-11-28  4:29     ` Josh Poimboeuf
2017-11-28  5:29       ` Andy Lutomirski
2017-11-28 18:15         ` Josh Poimboeuf
2017-11-27 10:45 ` [PATCH 12/21] x86/entry: Move SYSENTER_stack to the beginning of struct tss_struct Ingo Molnar
2017-11-27 10:45 ` [PATCH 13/21] x86/entry: Remap the TSS into the CPU entry area Ingo Molnar
2017-11-27 10:45 ` [PATCH 14/21] x86/entry/64: Separate cpu_current_top_of_stack from TSS.sp0 Ingo Molnar
2017-11-27 10:45 ` [PATCH 15/21] x86/espfix/64: Stop assuming that pt_regs is on the entry stack Ingo Molnar
2017-11-27 10:45 ` [PATCH 16/21] x86/entry/64: Use a per-CPU trampoline stack for IDT entries Ingo Molnar
2017-12-01 17:06   ` Dave Hansen
2017-12-01 17:44     ` Andy Lutomirski
2017-12-01 21:21       ` Dave Hansen
2017-12-01 21:59         ` Andy Lutomirski
2017-12-02  6:41         ` Kevin Easton
2017-11-27 10:45 ` [PATCH 17/21] x86/entry/64: Return to userspace from the trampoline stack Ingo Molnar
2017-11-27 10:45 ` [PATCH 18/21] x86/entry/64: Create a per-CPU SYSCALL entry trampoline Ingo Molnar
2017-11-27 10:45 ` [PATCH 19/21] x86/entry/64: Move the IST stacks into 'struct cpu_entry_area' Ingo Molnar
2017-11-27 10:45 ` [PATCH 20/21] x86/entry/64: Remove the SYSENTER stack canary Ingo Molnar
2017-11-27 10:45 ` [PATCH 21/21] x86/entry: Clean up the SYSENTER_stack code Ingo Molnar
2017-12-01 17:59   ` Borislav Petkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).