LKML Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v3 00/19] Entry stack switching
@ 2017-11-24  4:32 Andy Lutomirski
  2017-11-24  4:32 ` [PATCH v3 01/19] x86/asm/64: Allocate and enable the SYSENTER stack Andy Lutomirski
                   ` (18 more replies)
  0 siblings, 19 replies; 27+ messages in thread
From: Andy Lutomirski @ 2017-11-24  4:32 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

This sets up stack switching, including for SYSCALL.  I think it's
in decent shape.

I'm fiddling with a patch to make the TSS remap read-only on 64-bit.

Known issues:
 - I think we're going to want a way to turn the stack switching on and
   off either at boot time or at runtime.  It should be fairly straightforward
   to make it work.

 - I think the ORC unwinder isn't so good at dealing with stack overflows.
   It bails too early (I think), resulting in lots of ? entries.  This
   isn't a regression with this series -- it's just something that could
   be improved.

Changes from v2:
 - Tons of review comments
 - The IDT entry stack switching code is less messy
 - The last patch is new

Changes from v1:
 - Fixed KASAN for real (as far as I can tell).
 - Compiler errors due to my old hackish KASAN fix are gone.
 - Context tracking lockdep errors are fixed (and maybe 

Changes from farther back:
 - This is quite massively changed from last time.
 - 32-bit seems to build and mostly work
 - KASAN is less broken now

Andy Lutomirski (19):
  x86/asm/64: Allocate and enable the SYSENTER stack
  x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
  x86/gdt: Put per-cpu GDT remaps in ascending order
  x86/fixmap: Generalize the GDT fixmap mechanism
  x86/kasan/64: Teach KASAN about the cpu_entry_area
  x86/asm: Fix assumptions that the HW TSS is at the beginning of
    cpu_tss
  x86/dumpstack: Handle stack overflow on all stacks
  x86/asm: Move SYSENTER_stack to the beginning of struct tss_struct
  x86/asm: Remap the TSS into the cpu entry area
  x86/asm/64: Separate cpu_current_top_of_stack from TSS.sp0
  x86/espfix/64: Stop assuming that pt_regs is on the entry stack
  x86/asm/64: Use a percpu trampoline stack for IDT entries
  x86/asm/64: Return to userspace from the trampoline stack
  x86/entry/64: Create a percpu SYSCALL entry trampoline
  x86/irq: Remove an old outdated comment about context tracking races
  x86/irq/64: In the stack overflow warning, print the offending IP
  x86/entry/64: Move the IST stacks into cpu_entry_area
  x86/entry/64: Remove the SYSENTER stack canary
  x86/entry: Clean up SYSENTER_stack code

 arch/x86/entry/entry_32.S          |   6 +-
 arch/x86/entry/entry_64.S          | 170 ++++++++++++++++++++++++++++++++-----
 arch/x86/entry/entry_64_compat.S   |   7 +-
 arch/x86/include/asm/desc.h        |  11 +--
 arch/x86/include/asm/fixmap.h      |  58 ++++++++++++-
 arch/x86/include/asm/processor.h   |  49 ++++++-----
 arch/x86/include/asm/stacktrace.h  |   3 +
 arch/x86/include/asm/switch_to.h   |   2 +-
 arch/x86/include/asm/thread_info.h |   2 +-
 arch/x86/include/asm/traps.h       |   1 -
 arch/x86/kernel/asm-offsets.c      |   7 ++
 arch/x86/kernel/asm-offsets_32.c   |   5 --
 arch/x86/kernel/asm-offsets_64.c   |   1 +
 arch/x86/kernel/cpu/common.c       | 133 +++++++++++++++++++++--------
 arch/x86/kernel/doublefault.c      |  36 ++++----
 arch/x86/kernel/dumpstack.c        |  42 ++++++---
 arch/x86/kernel/dumpstack_32.c     |   6 ++
 arch/x86/kernel/dumpstack_64.c     |   6 ++
 arch/x86/kernel/irq.c              |  12 ---
 arch/x86/kernel/irq_64.c           |   4 +-
 arch/x86/kernel/process.c          |  13 ++-
 arch/x86/kernel/process_64.c       |   1 +
 arch/x86/kernel/traps.c            |  27 +++---
 arch/x86/kernel/vmlinux.lds.S      |  10 +++
 arch/x86/mm/kasan_init_64.c        |  13 ++-
 arch/x86/power/cpu.c               |  16 ++--
 arch/x86/xen/mmu_pv.c              |   2 +-
 27 files changed, 473 insertions(+), 170 deletions(-)

-- 
2.13.6

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 01/19] x86/asm/64: Allocate and enable the SYSENTER stack
  2017-11-24  4:32 [PATCH v3 00/19] Entry stack switching Andy Lutomirski
@ 2017-11-24  4:32 ` Andy Lutomirski
  2017-11-24  4:32 ` [PATCH v3 02/19] x86/dumpstack: Add get_stack_info() support for " Andy Lutomirski
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2017-11-24  4:32 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

This will simplify future changes that want scratch variables early in
the SYSENTER handler -- they'll be able to spill registers to the
stack.  It also lets us get rid of a SWAPGS_UNSAFE_STACK user.

This does not depend on CONFIG_IA32_EMULATION because we'll want the
stack space even without IA32 emulation.

As far as I can tell, the reason that this wasn't done from day 1 is
that we use IST for #DB and #BP, which is IMO rather nasty and causes
a lot more problems than it solves.  But, since #DB uses IST, we don't
actually need a real stack for SYSENTER (because SYSENTER with TF set
will invoke #DB on the IST stack rather than the SYSENTER stack).
I want to remove IST usage from these vectors some day, and this patch
is a prerequisite for that as well.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64_compat.S | 2 +-
 arch/x86/include/asm/processor.h | 3 ---
 arch/x86/kernel/asm-offsets.c    | 5 +++++
 arch/x86/kernel/asm-offsets_32.c | 5 -----
 arch/x86/kernel/cpu/common.c     | 4 +++-
 arch/x86/kernel/process.c        | 2 --
 arch/x86/kernel/traps.c          | 3 +--
 7 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 568e130d932c..dcc6987f9bae 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -48,7 +48,7 @@
  */
 ENTRY(entry_SYSENTER_compat)
 	/* Interrupts are off on entry. */
-	SWAPGS_UNSAFE_STACK
+	SWAPGS
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/*
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index cc16fa882e3e..504a3bb4d5f0 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -340,14 +340,11 @@ struct tss_struct {
 	 */
 	unsigned long		io_bitmap[IO_BITMAP_LONGS + 1];
 
-#ifdef CONFIG_X86_32
 	/*
 	 * Space for the temporary SYSENTER stack.
 	 */
 	unsigned long		SYSENTER_stack_canary;
 	unsigned long		SYSENTER_stack[64];
-#endif
-
 } ____cacheline_aligned;
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss);
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 8ea78275480d..b275863128eb 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -93,4 +93,9 @@ void common(void) {
 
 	BLANK();
 	DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));
+
+	/* Offset from cpu_tss to SYSENTER_stack */
+	OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
+	/* Size of SYSENTER_stack */
+	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
 }
diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index dedf428b20b6..52ce4ea16e53 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -50,11 +50,6 @@ void foo(void)
 	DEFINE(TSS_sysenter_sp0, offsetof(struct tss_struct, x86_tss.sp0) -
 	       offsetofend(struct tss_struct, SYSENTER_stack));
 
-	/* Offset from cpu_tss to SYSENTER_stack */
-	OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
-	/* Size of SYSENTER_stack */
-	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
-
 #ifdef CONFIG_CC_STACKPROTECTOR
 	BLANK();
 	OFFSET(stack_canary_offset, stack_canary, canary);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index fa998ca8aa5a..ccb5f66c4e5b 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1386,7 +1386,9 @@ void syscall_init(void)
 	 * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
 	 */
 	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
-	wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
+	wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
+		    (unsigned long)this_cpu_ptr(&cpu_tss) +
+		    offsetofend(struct tss_struct, SYSENTER_stack));
 	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
 #else
 	wrmsrl(MSR_CSTAR, (unsigned long)ignore_sysret);
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 97fb3e5737f5..35d674157fda 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -71,9 +71,7 @@ __visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
 	  */
 	.io_bitmap		= { [0 ... IO_BITMAP_LONGS] = ~0 },
 #endif
-#ifdef CONFIG_X86_32
 	.SYSENTER_stack_canary	= STACK_END_MAGIC,
-#endif
 };
 EXPORT_PER_CPU_SYMBOL(cpu_tss);
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index b7b0f74a2150..2008dd0f8ccb 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -800,14 +800,13 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
 	debug_stack_usage_dec();
 
 exit:
-#if defined(CONFIG_X86_32)
 	/*
 	 * This is the most likely code path that involves non-trivial use
 	 * of the SYSENTER stack.  Check that we haven't overrun it.
 	 */
 	WARN(this_cpu_read(cpu_tss.SYSENTER_stack_canary) != STACK_END_MAGIC,
 	     "Overran or corrupted SYSENTER stack\n");
-#endif
+
 	ist_exit(regs);
 }
 NOKPROBE_SYMBOL(do_debug);
-- 
2.13.6

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 02/19] x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
  2017-11-24  4:32 [PATCH v3 00/19] Entry stack switching Andy Lutomirski
  2017-11-24  4:32 ` [PATCH v3 01/19] x86/asm/64: Allocate and enable the SYSENTER stack Andy Lutomirski
@ 2017-11-24  4:32 ` Andy Lutomirski
  2017-11-24  4:32 ` [PATCH v3 03/19] x86/gdt: Put per-cpu GDT remaps in ascending order Andy Lutomirski
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2017-11-24  4:32 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

get_stack_info() doesn't currently know about the SYSENTER stack, so
unwinding will fail if we entered the kernel on the SYSENTER stack
and haven't fully switched off.  Teach get_stack_info() about the
SYSENTER stack.

With future patches applied that run part of the entry code on the
SYSENTER stack and introduce an intentional BUG(), I would get:

PANIC: double fault, error_code: 0x0
...
RIP: 0010:do_error_trap+0x33/0x1c0
...
Call Trace:
Code: ...

With this patch, I get:

PANIC: double fault, error_code: 0x0
...
Call Trace:
 <SYSENTER>
 ? async_page_fault+0x36/0x60
 ? invalid_op+0x22/0x40
 ? async_page_fault+0x36/0x60
 ? sync_regs+0x3c/0x40
 ? sync_regs+0x2e/0x40
 ? error_entry+0x6c/0xd0
 ? async_page_fault+0x36/0x60
 </SYSENTER>
Code: ...

Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/stacktrace.h |  3 +++
 arch/x86/kernel/dumpstack.c       | 19 +++++++++++++++++++
 arch/x86/kernel/dumpstack_32.c    |  6 ++++++
 arch/x86/kernel/dumpstack_64.c    |  6 ++++++
 4 files changed, 34 insertions(+)

diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index 8da111b3c342..f8062bfd43a0 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -16,6 +16,7 @@ enum stack_type {
 	STACK_TYPE_TASK,
 	STACK_TYPE_IRQ,
 	STACK_TYPE_SOFTIRQ,
+	STACK_TYPE_SYSENTER,
 	STACK_TYPE_EXCEPTION,
 	STACK_TYPE_EXCEPTION_LAST = STACK_TYPE_EXCEPTION + N_EXCEPTION_STACKS-1,
 };
@@ -28,6 +29,8 @@ struct stack_info {
 bool in_task_stack(unsigned long *stack, struct task_struct *task,
 		   struct stack_info *info);
 
+bool in_sysenter_stack(unsigned long *stack, struct stack_info *info);
+
 int get_stack_info(unsigned long *stack, struct task_struct *task,
 		   struct stack_info *info, unsigned long *visit_mask);
 
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index f13b4c00a5de..5e7d10e8ca25 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -43,6 +43,25 @@ bool in_task_stack(unsigned long *stack, struct task_struct *task,
 	return true;
 }
 
+bool in_sysenter_stack(unsigned long *stack, struct stack_info *info)
+{
+	struct tss_struct *tss = this_cpu_ptr(&cpu_tss);
+
+	/* Treat the canary as part of the stack for unwinding purposes. */
+	void *begin = &tss->SYSENTER_stack_canary;
+	void *end = (void *)&tss->SYSENTER_stack + sizeof(tss->SYSENTER_stack);
+
+	if ((void *)stack < begin || (void *)stack >= end)
+		return false;
+
+	info->type	= STACK_TYPE_SYSENTER;
+	info->begin	= begin;
+	info->end	= end;
+	info->next_sp	= NULL;
+
+	return true;
+}
+
 static void printk_stack_address(unsigned long address, int reliable,
 				 char *log_lvl)
 {
diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index daefae83a3aa..5ff13a6b3680 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -26,6 +26,9 @@ const char *stack_type_name(enum stack_type type)
 	if (type == STACK_TYPE_SOFTIRQ)
 		return "SOFTIRQ";
 
+	if (type == STACK_TYPE_SYSENTER)
+		return "SYSENTER";
+
 	return NULL;
 }
 
@@ -93,6 +96,9 @@ int get_stack_info(unsigned long *stack, struct task_struct *task,
 	if (task != current)
 		goto unknown;
 
+	if (in_sysenter_stack(stack, info))
+		goto recursion_check;
+
 	if (in_hardirq_stack(stack, info))
 		goto recursion_check;
 
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 88ce2ffdb110..abc828f8c297 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -37,6 +37,9 @@ const char *stack_type_name(enum stack_type type)
 	if (type == STACK_TYPE_IRQ)
 		return "IRQ";
 
+	if (type == STACK_TYPE_SYSENTER)
+		return "SYSENTER";
+
 	if (type >= STACK_TYPE_EXCEPTION && type <= STACK_TYPE_EXCEPTION_LAST)
 		return exception_stack_names[type - STACK_TYPE_EXCEPTION];
 
@@ -115,6 +118,9 @@ int get_stack_info(unsigned long *stack, struct task_struct *task,
 	if (in_irq_stack(stack, info))
 		goto recursion_check;
 
+	if (in_sysenter_stack(stack, info))
+		goto recursion_check;
+
 	goto unknown;
 
 recursion_check:
-- 
2.13.6

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 03/19] x86/gdt: Put per-cpu GDT remaps in ascending order
  2017-11-24  4:32 [PATCH v3 00/19] Entry stack switching Andy Lutomirski
  2017-11-24  4:32 ` [PATCH v3 01/19] x86/asm/64: Allocate and enable the SYSENTER stack Andy Lutomirski
  2017-11-24  4:32 ` [PATCH v3 02/19] x86/dumpstack: Add get_stack_info() support for " Andy Lutomirski
@ 2017-11-24  4:32 ` Andy Lutomirski
  2017-11-24  4:32 ` [PATCH v3 04/19] x86/fixmap: Generalize the GDT fixmap mechanism Andy Lutomirski
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2017-11-24  4:32 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

We currently have CPU 0's GDT at the top of the GDT range and
higher-numbered CPUs at lower addresses.  This happens because the
fixmap is upside down (index 0 is the top of the fixmap).

Flip it so that GDTs are in ascending order by virtual address.
This will simplify a future patch that will generalize the GDT
remap to contain multiple pages.

Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/desc.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 4011cb03ef08..95cd95eb7285 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -63,7 +63,7 @@ static inline struct desc_struct *get_current_gdt_rw(void)
 /* Get the fixmap index for a specific processor */
 static inline unsigned int get_cpu_gdt_ro_index(int cpu)
 {
-	return FIX_GDT_REMAP_BEGIN + cpu;
+	return FIX_GDT_REMAP_END - cpu;
 }
 
 /* Provide the fixmap address of the remapped GDT */
-- 
2.13.6

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 04/19] x86/fixmap: Generalize the GDT fixmap mechanism
  2017-11-24  4:32 [PATCH v3 00/19] Entry stack switching Andy Lutomirski
                   ` (2 preceding siblings ...)
  2017-11-24  4:32 ` [PATCH v3 03/19] x86/gdt: Put per-cpu GDT remaps in ascending order Andy Lutomirski
@ 2017-11-24  4:32 ` Andy Lutomirski
  2017-11-24  4:32 ` [PATCH v3 05/19] x86/kasan/64: Teach KASAN about the cpu_entry_area Andy Lutomirski
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2017-11-24  4:32 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

Currently, the GDT is an ad-hoc array of pages, one per CPU, in the
fixmap.  Generalize it to be an array of a new struct cpu_entry_area
so that we can cleanly add new things to it.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/desc.h   |  9 +--------
 arch/x86/include/asm/fixmap.h | 34 ++++++++++++++++++++++++++++++++--
 arch/x86/kernel/cpu/common.c  | 14 +++++++-------
 arch/x86/xen/mmu_pv.c         |  2 +-
 4 files changed, 41 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 95cd95eb7285..194ffab00ebe 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -60,17 +60,10 @@ static inline struct desc_struct *get_current_gdt_rw(void)
 	return this_cpu_ptr(&gdt_page)->gdt;
 }
 
-/* Get the fixmap index for a specific processor */
-static inline unsigned int get_cpu_gdt_ro_index(int cpu)
-{
-	return FIX_GDT_REMAP_END - cpu;
-}
-
 /* Provide the fixmap address of the remapped GDT */
 static inline struct desc_struct *get_cpu_gdt_ro(int cpu)
 {
-	unsigned int idx = get_cpu_gdt_ro_index(cpu);
-	return (struct desc_struct *)__fix_to_virt(idx);
+	return (struct desc_struct *)&get_cpu_entry_area(cpu)->gdt;
 }
 
 /* Provide the current read-only GDT */
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index dcd9fb55e679..0f4c92f02968 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -44,6 +44,16 @@ extern unsigned long __FIXADDR_TOP;
 			 PAGE_SIZE)
 #endif
 
+/*
+ * cpu_entry_area is a percpu region in the fixmap that contains things
+ * needed by the CPU and early entry/exit code.  Real types aren't used
+ * for all fields here to avoid circular header dependencies.
+ */
+struct cpu_entry_area {
+	char gdt[PAGE_SIZE];
+};
+
+#define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
 
 /*
  * Here we define all the compile-time 'special' virtual
@@ -101,8 +111,8 @@ enum fixed_addresses {
 	FIX_LNW_VRTC,
 #endif
 	/* Fixmap entries to remap the GDTs, one per processor. */
-	FIX_GDT_REMAP_BEGIN,
-	FIX_GDT_REMAP_END = FIX_GDT_REMAP_BEGIN + NR_CPUS - 1,
+	FIX_CPU_ENTRY_AREA_TOP,
+	FIX_CPU_ENTRY_AREA_BOTTOM = FIX_CPU_ENTRY_AREA_TOP + (CPU_ENTRY_AREA_PAGES * NR_CPUS) - 1,
 
 	__end_of_permanent_fixed_addresses,
 
@@ -185,5 +195,25 @@ void __init *early_memremap_decrypted_wp(resource_size_t phys_addr,
 void __early_set_fixmap(enum fixed_addresses idx,
 			phys_addr_t phys, pgprot_t flags);
 
+static inline unsigned int __get_cpu_entry_area_page_index(int cpu, int page)
+{
+	BUILD_BUG_ON(sizeof(struct cpu_entry_area) % PAGE_SIZE != 0);
+
+	return FIX_CPU_ENTRY_AREA_BOTTOM - cpu*CPU_ENTRY_AREA_PAGES - page;
+}
+
+#define __get_cpu_entry_area_offset_index(cpu, offset) ({		\
+	BUILD_BUG_ON(offset % PAGE_SIZE != 0);				\
+	__get_cpu_entry_area_page_index(cpu, offset / PAGE_SIZE);	\
+	})
+
+#define get_cpu_entry_area_index(cpu, field)				\
+	__get_cpu_entry_area_offset_index((cpu), offsetof(struct cpu_entry_area, field))
+
+static inline struct cpu_entry_area *get_cpu_entry_area(int cpu)
+{
+	return (struct cpu_entry_area *)__fix_to_virt(__get_cpu_entry_area_page_index(cpu, 0));
+}
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_FIXMAP_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index ccb5f66c4e5b..c0fb3eb37ee0 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -490,12 +490,12 @@ void load_percpu_segment(int cpu)
 	load_stack_canary_segment();
 }
 
-/* Setup the fixmap mapping only once per-processor */
-static inline void setup_fixmap_gdt(int cpu)
+/* Setup the fixmap mappings only once per-processor */
+static inline void setup_cpu_entry_area(int cpu)
 {
 #ifdef CONFIG_X86_64
 	/* On 64-bit systems, we use a read-only fixmap GDT. */
-	pgprot_t prot = PAGE_KERNEL_RO;
+	pgprot_t gdt_prot = PAGE_KERNEL_RO;
 #else
 	/*
 	 * On native 32-bit systems, the GDT cannot be read-only because
@@ -506,11 +506,11 @@ static inline void setup_fixmap_gdt(int cpu)
 	 * On Xen PV, the GDT must be read-only because the hypervisor requires
 	 * it.
 	 */
-	pgprot_t prot = boot_cpu_has(X86_FEATURE_XENPV) ?
+	pgprot_t gdt_prot = boot_cpu_has(X86_FEATURE_XENPV) ?
 		PAGE_KERNEL_RO : PAGE_KERNEL;
 #endif
 
-	__set_fixmap(get_cpu_gdt_ro_index(cpu), get_cpu_gdt_paddr(cpu), prot);
+	__set_fixmap(get_cpu_entry_area_index(cpu, gdt), get_cpu_gdt_paddr(cpu), gdt_prot);
 }
 
 /* Load the original GDT from the per-cpu structure */
@@ -1614,7 +1614,7 @@ void cpu_init(void)
 	if (is_uv_system())
 		uv_cpu_init();
 
-	setup_fixmap_gdt(cpu);
+	setup_cpu_entry_area(cpu);
 	load_fixmap_gdt(cpu);
 }
 
@@ -1676,7 +1676,7 @@ void cpu_init(void)
 
 	fpu__init_cpu();
 
-	setup_fixmap_gdt(cpu);
+	setup_cpu_entry_area(cpu);
 	load_fixmap_gdt(cpu);
 }
 #endif
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 2ccdaba31a07..c2454237fa67 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2272,7 +2272,7 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
 #endif
 	case FIX_TEXT_POKE0:
 	case FIX_TEXT_POKE1:
-	case FIX_GDT_REMAP_BEGIN ... FIX_GDT_REMAP_END:
+	case FIX_CPU_ENTRY_AREA_TOP ... FIX_CPU_ENTRY_AREA_BOTTOM:
 		/* All local page mappings */
 		pte = pfn_pte(phys, prot);
 		break;
-- 
2.13.6

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 05/19] x86/kasan/64: Teach KASAN about the cpu_entry_area
  2017-11-24  4:32 [PATCH v3 00/19] Entry stack switching Andy Lutomirski
                   ` (3 preceding siblings ...)
  2017-11-24  4:32 ` [PATCH v3 04/19] x86/fixmap: Generalize the GDT fixmap mechanism Andy Lutomirski
@ 2017-11-24  4:32 ` Andy Lutomirski
  2017-11-24 13:16   ` Andrey Ryabinin
  2017-11-24  4:32 ` [PATCH v3 06/19] x86/asm: Fix assumptions that the HW TSS is at the beginning of cpu_tss Andy Lutomirski
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 27+ messages in thread
From: Andy Lutomirski @ 2017-11-24  4:32 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski, Andrey Ryabinin,
	Alexander Potapenko, Dmitry Vyukov, kasan-dev

The cpu_entry_area will contain stacks.  Make sure that KASAN has
appropriate shadow mappings for them.

Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: kasan-dev@googlegroups.com
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/mm/kasan_init_64.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 99dfed6dfef8..54561dce742e 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -277,6 +277,7 @@ void __init kasan_early_init(void)
 void __init kasan_init(void)
 {
 	int i;
+	void *cpu_entry_area_begin, *cpu_entry_area_end;
 
 #ifdef CONFIG_KASAN_INLINE
 	register_die_notifier(&kasan_die_notifier);
@@ -329,8 +330,18 @@ void __init kasan_init(void)
 			      (unsigned long)kasan_mem_to_shadow(_end),
 			      early_pfn_to_nid(__pa(_stext)));
 
+	cpu_entry_area_begin = (void *)(__fix_to_virt(FIX_CPU_ENTRY_AREA_BOTTOM));
+	cpu_entry_area_end = (void *)(__fix_to_virt(FIX_CPU_ENTRY_AREA_TOP) + PAGE_SIZE);
+
 	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
-			(void *)KASAN_SHADOW_END);
+				   kasan_mem_to_shadow(cpu_entry_area_begin));
+
+	kasan_populate_shadow((unsigned long)kasan_mem_to_shadow(cpu_entry_area_begin),
+			      (unsigned long)kasan_mem_to_shadow(cpu_entry_area_end),
+		0);
+
+	kasan_populate_zero_shadow(kasan_mem_to_shadow(cpu_entry_area_end),
+				   (void *)KASAN_SHADOW_END);
 
 	load_cr3(init_top_pgt);
 	__flush_tlb_all();
-- 
2.13.6

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 06/19] x86/asm: Fix assumptions that the HW TSS is at the beginning of cpu_tss
  2017-11-24  4:32 [PATCH v3 00/19] Entry stack switching Andy Lutomirski
                   ` (4 preceding siblings ...)
  2017-11-24  4:32 ` [PATCH v3 05/19] x86/kasan/64: Teach KASAN about the cpu_entry_area Andy Lutomirski
@ 2017-11-24  4:32 ` Andy Lutomirski
  2017-11-24  4:32 ` [PATCH v3 07/19] x86/dumpstack: Handle stack overflow on all stacks Andy Lutomirski
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2017-11-24  4:32 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

A future patch will move SYSENTER_stack to the beginning of cpu_tss
to help detect overflow.  Before this can happen, fix several code
paths that hardcode assumptions about the old layout

Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/desc.h      |  2 +-
 arch/x86/include/asm/processor.h |  4 ++--
 arch/x86/kernel/cpu/common.c     |  8 ++++----
 arch/x86/kernel/doublefault.c    | 36 +++++++++++++++++-------------------
 arch/x86/power/cpu.c             | 13 +++++++------
 5 files changed, 31 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 194ffab00ebe..aab4fe9f49f8 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -178,7 +178,7 @@ static inline void set_tssldt_descriptor(void *d, unsigned long addr,
 #endif
 }
 
-static inline void __set_tss_desc(unsigned cpu, unsigned int entry, void *addr)
+static inline void __set_tss_desc(unsigned cpu, unsigned int entry, struct x86_hw_tss *addr)
 {
 	struct desc_struct *d = get_cpu_gdt_rw(cpu);
 	tss_desc tss;
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 504a3bb4d5f0..c24456429c7d 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -163,7 +163,7 @@ enum cpuid_regs_idx {
 extern struct cpuinfo_x86	boot_cpu_data;
 extern struct cpuinfo_x86	new_cpu_data;
 
-extern struct tss_struct	doublefault_tss;
+extern struct x86_hw_tss	doublefault_tss;
 extern __u32			cpu_caps_cleared[NCAPINTS];
 extern __u32			cpu_caps_set[NCAPINTS];
 
@@ -323,7 +323,7 @@ struct x86_hw_tss {
 #define IO_BITMAP_BITS			65536
 #define IO_BITMAP_BYTES			(IO_BITMAP_BITS/8)
 #define IO_BITMAP_LONGS			(IO_BITMAP_BYTES/sizeof(long))
-#define IO_BITMAP_OFFSET		offsetof(struct tss_struct, io_bitmap)
+#define IO_BITMAP_OFFSET		(offsetof(struct tss_struct, io_bitmap) - offsetof(struct tss_struct, x86_tss))
 #define INVALID_IO_BITMAP_OFFSET	0x8000
 
 struct tss_struct {
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c0fb3eb37ee0..62cdc10a7d94 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1582,7 +1582,7 @@ void cpu_init(void)
 		}
 	}
 
-	t->x86_tss.io_bitmap_base = offsetof(struct tss_struct, io_bitmap);
+	t->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET;
 
 	/*
 	 * <= is required because the CPU will access up to
@@ -1601,7 +1601,7 @@ void cpu_init(void)
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
 	 */
-	set_tss_desc(cpu, t);
+	set_tss_desc(cpu, &t->x86_tss);
 	load_TR_desc();
 
 	load_mm_ldt(&init_mm);
@@ -1659,12 +1659,12 @@ void cpu_init(void)
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
 	 */
-	set_tss_desc(cpu, t);
+	set_tss_desc(cpu, &t->x86_tss);
 	load_TR_desc();
 
 	load_mm_ldt(&init_mm);
 
-	t->x86_tss.io_bitmap_base = offsetof(struct tss_struct, io_bitmap);
+	t->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET;
 
 #ifdef CONFIG_DOUBLEFAULT
 	/* Set up doublefault TSS pointer in the GDT */
diff --git a/arch/x86/kernel/doublefault.c b/arch/x86/kernel/doublefault.c
index 0e662c55ae90..0b8cedb20d6d 100644
--- a/arch/x86/kernel/doublefault.c
+++ b/arch/x86/kernel/doublefault.c
@@ -50,25 +50,23 @@ static void doublefault_fn(void)
 		cpu_relax();
 }
 
-struct tss_struct doublefault_tss __cacheline_aligned = {
-	.x86_tss = {
-		.sp0		= STACK_START,
-		.ss0		= __KERNEL_DS,
-		.ldt		= 0,
-		.io_bitmap_base	= INVALID_IO_BITMAP_OFFSET,
-
-		.ip		= (unsigned long) doublefault_fn,
-		/* 0x2 bit is always set */
-		.flags		= X86_EFLAGS_SF | 0x2,
-		.sp		= STACK_START,
-		.es		= __USER_DS,
-		.cs		= __KERNEL_CS,
-		.ss		= __KERNEL_DS,
-		.ds		= __USER_DS,
-		.fs		= __KERNEL_PERCPU,
-
-		.__cr3		= __pa_nodebug(swapper_pg_dir),
-	}
+struct x86_hw_tss doublefault_tss __cacheline_aligned = {
+	.sp0		= STACK_START,
+	.ss0		= __KERNEL_DS,
+	.ldt		= 0,
+	.io_bitmap_base	= INVALID_IO_BITMAP_OFFSET,
+
+	.ip		= (unsigned long) doublefault_fn,
+	/* 0x2 bit is always set */
+	.flags		= X86_EFLAGS_SF | 0x2,
+	.sp		= STACK_START,
+	.es		= __USER_DS,
+	.cs		= __KERNEL_CS,
+	.ss		= __KERNEL_DS,
+	.ds		= __USER_DS,
+	.fs		= __KERNEL_PERCPU,
+
+	.__cr3		= __pa_nodebug(swapper_pg_dir),
 };
 
 /* dummy for do_double_fault() call */
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index 84fcfde53f8f..50593e138281 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -165,12 +165,13 @@ static void fix_processor_context(void)
 	struct desc_struct *desc = get_cpu_gdt_rw(cpu);
 	tss_desc tss;
 #endif
-	set_tss_desc(cpu, t);	/*
-				 * This just modifies memory; should not be
-				 * necessary. But... This is necessary, because
-				 * 386 hardware has concept of busy TSS or some
-				 * similar stupidity.
-				 */
+
+	/*
+	 * This just modifies memory; should not be necessary. But... This is
+	 * necessary, because 386 hardware has concept of busy TSS or some
+	 * similar stupidity.
+	 */
+	set_tss_desc(cpu, &t->x86_tss);
 
 #ifdef CONFIG_X86_64
 	memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc));
-- 
2.13.6

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 07/19] x86/dumpstack: Handle stack overflow on all stacks
  2017-11-24  4:32 [PATCH v3 00/19] Entry stack switching Andy Lutomirski
                   ` (5 preceding siblings ...)
  2017-11-24  4:32 ` [PATCH v3 06/19] x86/asm: Fix assumptions that the HW TSS is at the beginning of cpu_tss Andy Lutomirski
@ 2017-11-24  4:32 ` Andy Lutomirski
  2017-11-24  4:32 ` [PATCH v3 08/19] x86/asm: Move SYSENTER_stack to the beginning of struct tss_struct Andy Lutomirski
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2017-11-24  4:32 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

We currently special-case stack overflow on the task stack.  We're
going to start putting special stacks in the fixmap with a custom
layout, so they'll have guard pages, too.  Teach the unwinder to be
able to unwind an overflow of any of the stacks.

Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 5e7d10e8ca25..a8aa70c05489 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -90,24 +90,28 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	 * - task stack
 	 * - interrupt stack
 	 * - HW exception stacks (double fault, nmi, debug, mce)
+	 * - SYSENTER stack
 	 *
-	 * x86-32 can have up to three stacks:
+	 * x86-32 can have up to four stacks:
 	 * - task stack
 	 * - softirq stack
 	 * - hardirq stack
+	 * - SYSENTER stack
 	 */
 	for (regs = NULL; stack; stack = PTR_ALIGN(stack_info.next_sp, sizeof(long))) {
 		const char *stack_name;
 
-		/*
-		 * If we overflowed the task stack into a guard page, jump back
-		 * to the bottom of the usable stack.
-		 */
-		if (task_stack_page(task) - (void *)stack < PAGE_SIZE)
-			stack = task_stack_page(task);
-
-		if (get_stack_info(stack, task, &stack_info, &visit_mask))
-			break;
+		if (get_stack_info(stack, task, &stack_info, &visit_mask)) {
+			/*
+			 * We weren't on a valid stack.  It's possible that
+			 * we overflowed a valid stack into a guard page.
+			 * See if the next page up is valid so that we can
+			 * generate some kind of backtrace if this happens.
+			 */
+			stack = (unsigned long *)PAGE_ALIGN((unsigned long)stack);
+			if (get_stack_info(stack, task, &stack_info, &visit_mask))
+				break;
+		}
 
 		stack_name = stack_type_name(stack_info.type);
 		if (stack_name)
-- 
2.13.6

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 08/19] x86/asm: Move SYSENTER_stack to the beginning of struct tss_struct
  2017-11-24  4:32 [PATCH v3 00/19] Entry stack switching Andy Lutomirski
                   ` (6 preceding siblings ...)
  2017-11-24  4:32 ` [PATCH v3 07/19] x86/dumpstack: Handle stack overflow on all stacks Andy Lutomirski
@ 2017-11-24  4:32 ` Andy Lutomirski
  2017-11-24  4:32 ` [PATCH v3 09/19] x86/asm: Remap the TSS into the cpu entry area Andy Lutomirski
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2017-11-24  4:32 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

SYSENTER_stack should have reliable overflow detection, which
means that it needs to be at the bottom of a page, not the top.
Move it to the beginning of struct tss_struct and page-align it.

Also add an assertion to make sure that the fixed hardware TSS
doesn't cross a page boundary.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/processor.h | 21 ++++++++++++---------
 arch/x86/kernel/cpu/common.c     | 21 +++++++++++++++++++++
 2 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index c24456429c7d..48d44fae3d27 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -328,7 +328,16 @@ struct x86_hw_tss {
 
 struct tss_struct {
 	/*
-	 * The hardware state:
+	 * Space for the temporary SYSENTER stack, used for SYSENTER
+	 * and the entry trampoline as well.
+	 */
+	unsigned long		SYSENTER_stack_canary;
+	unsigned long		SYSENTER_stack[64];
+
+	/*
+	 * The fixed hardware portion.  This must not cross a page boundary
+	 * at risk of violating the SDM's advice and potentially triggering
+	 * errata.
 	 */
 	struct x86_hw_tss	x86_tss;
 
@@ -339,15 +348,9 @@ struct tss_struct {
 	 * be within the limit.
 	 */
 	unsigned long		io_bitmap[IO_BITMAP_LONGS + 1];
+} __aligned(PAGE_SIZE);
 
-	/*
-	 * Space for the temporary SYSENTER stack.
-	 */
-	unsigned long		SYSENTER_stack_canary;
-	unsigned long		SYSENTER_stack[64];
-} ____cacheline_aligned;
-
-DECLARE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss);
+DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss);
 
 /*
  * sizeof(unsigned long) coming from an extra "long" at the end
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 62cdc10a7d94..d173f6013467 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -511,6 +511,27 @@ static inline void setup_cpu_entry_area(int cpu)
 #endif
 
 	__set_fixmap(get_cpu_entry_area_index(cpu, gdt), get_cpu_gdt_paddr(cpu), gdt_prot);
+
+	/*
+	 * The Intel SDM says (Volume 3, 7.2.1):
+	 *
+	 *  Avoid placing a page boundary in the part of the TSS that the
+	 *  processor reads during a task switch (the first 104 bytes). The
+	 *  processor may not correctly perform address translations if a
+	 *  boundary occurs in this area. During a task switch, the processor
+	 *  reads and writes into the first 104 bytes of each TSS (using
+	 *  contiguous physical addresses beginning with the physical address
+	 *  of the first byte of the TSS). So, after TSS access begins, if
+	 *  part of the 104 bytes is not physically contiguous, the processor
+	 *  will access incorrect information without generating a page-fault
+	 *  exception.
+	 *
+	 * There are also a lot of errata involving the TSS spanning a page
+	 * boundary.  Assert that we're not doing that.
+	 */
+	BUILD_BUG_ON((offsetof(struct tss_struct, x86_tss) ^
+		      offsetofend(struct tss_struct, x86_tss)) & PAGE_MASK);
+
 }
 
 /* Load the original GDT from the per-cpu structure */
-- 
2.13.6

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 09/19] x86/asm: Remap the TSS into the cpu entry area
  2017-11-24  4:32 [PATCH v3 00/19] Entry stack switching Andy Lutomirski
                   ` (7 preceding siblings ...)
  2017-11-24  4:32 ` [PATCH v3 08/19] x86/asm: Move SYSENTER_stack to the beginning of struct tss_struct Andy Lutomirski
@ 2017-11-24  4:32 ` Andy Lutomirski
  2017-11-24  4:32 ` [PATCH v3 10/19] x86/asm/64: Separate cpu_current_top_of_stack from TSS.sp0 Andy Lutomirski
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2017-11-24  4:32 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

This has a secondary purpose: it puts the entry stack into a region
with a well-controlled layout.  A subsequent patch will take
advantage of this to streamline the SYSCALL entry code to be able to
find it more easily.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_32.S     |  6 ++++--
 arch/x86/include/asm/fixmap.h |  7 +++++++
 arch/x86/kernel/asm-offsets.c |  3 +++
 arch/x86/kernel/cpu/common.c  | 38 ++++++++++++++++++++++++++++++++------
 arch/x86/kernel/dumpstack.c   |  3 ++-
 arch/x86/power/cpu.c          | 11 ++++++-----
 6 files changed, 54 insertions(+), 14 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 4838037f97f6..0ab316c46806 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -941,7 +941,8 @@ ENTRY(debug)
 	movl	%esp, %eax			# pt_regs pointer
 
 	/* Are we currently on the SYSENTER stack? */
-	PER_CPU(cpu_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx)
+	movl	PER_CPU_VAR(cpu_entry_area), %ecx
+	addl	$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Ldebug_from_sysenter_stack
@@ -984,7 +985,8 @@ ENTRY(nmi)
 	movl	%esp, %eax			# pt_regs pointer
 
 	/* Are we currently on the SYSENTER stack? */
-	PER_CPU(cpu_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx)
+	movl	PER_CPU_VAR(cpu_entry_area), %ecx
+	addl	$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Lnmi_from_sysenter_stack
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 0f4c92f02968..3a42da14c2cb 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -51,6 +51,13 @@ extern unsigned long __FIXADDR_TOP;
  */
 struct cpu_entry_area {
 	char gdt[PAGE_SIZE];
+
+	/*
+	 * The GDT is just below cpu_tss and thus serves (on x86_64) as a
+	 * a read-only guard page for the SYSENTER stack at the bottom
+	 * of the TSS region.
+	 */
+	struct tss_struct tss;
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index b275863128eb..55858b277cf6 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -98,4 +98,7 @@ void common(void) {
 	OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
 	/* Size of SYSENTER_stack */
 	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
+
+	/* Layout info for cpu_entry_area */
+	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
 }
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index d173f6013467..c67742df569a 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -490,6 +490,19 @@ void load_percpu_segment(int cpu)
 	load_stack_canary_segment();
 }
 
+static void set_percpu_fixmap_pages(int fixmap_index, void *ptr, int pages, pgprot_t prot)
+{
+	int i;
+
+	for (i = 0; i < pages; i++)
+		__set_fixmap(fixmap_index - i, per_cpu_ptr_to_phys(ptr + i*PAGE_SIZE), prot);
+}
+
+#ifdef CONFIG_X86_32
+/* The 32-bit entry code needs to find cpu_entry_area. */
+DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
+#endif
+
 /* Setup the fixmap mappings only once per-processor */
 static inline void setup_cpu_entry_area(int cpu)
 {
@@ -531,7 +544,15 @@ static inline void setup_cpu_entry_area(int cpu)
 	 */
 	BUILD_BUG_ON((offsetof(struct tss_struct, x86_tss) ^
 		      offsetofend(struct tss_struct, x86_tss)) & PAGE_MASK);
+	BUILD_BUG_ON(sizeof(struct tss_struct) % PAGE_SIZE != 0);
+	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, tss),
+				&per_cpu(cpu_tss, cpu),
+				sizeof(struct tss_struct) / PAGE_SIZE,
+				PAGE_KERNEL);
 
+#ifdef CONFIG_X86_32
+	this_cpu_write(cpu_entry_area, get_cpu_entry_area(cpu));
+#endif
 }
 
 /* Load the original GDT from the per-cpu structure */
@@ -1282,7 +1303,8 @@ void enable_sep_cpu(void)
 	wrmsr(MSR_IA32_SYSENTER_CS, tss->x86_tss.ss1, 0);
 
 	wrmsr(MSR_IA32_SYSENTER_ESP,
-	      (unsigned long)tss + offsetofend(struct tss_struct, SYSENTER_stack),
+	      (unsigned long)&get_cpu_entry_area(cpu)->tss +
+	      offsetofend(struct tss_struct, SYSENTER_stack),
 	      0);
 
 	wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long)entry_SYSENTER_32, 0);
@@ -1395,6 +1417,8 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
+	int cpu = smp_processor_id();
+
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
 	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
 
@@ -1408,7 +1432,7 @@ void syscall_init(void)
 	 */
 	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
 	wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
-		    (unsigned long)this_cpu_ptr(&cpu_tss) +
+		    (unsigned long)&get_cpu_entry_area(cpu)->tss +
 		    offsetofend(struct tss_struct, SYSENTER_stack));
 	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
 #else
@@ -1618,11 +1642,13 @@ void cpu_init(void)
 	initialize_tlbstate_and_flush();
 	enter_lazy_tlb(&init_mm, me);
 
+	setup_cpu_entry_area(cpu);
+
 	/*
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
 	 */
-	set_tss_desc(cpu, &t->x86_tss);
+	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
 
 	load_mm_ldt(&init_mm);
@@ -1635,7 +1661,6 @@ void cpu_init(void)
 	if (is_uv_system())
 		uv_cpu_init();
 
-	setup_cpu_entry_area(cpu);
 	load_fixmap_gdt(cpu);
 }
 
@@ -1676,11 +1701,13 @@ void cpu_init(void)
 	initialize_tlbstate_and_flush();
 	enter_lazy_tlb(&init_mm, curr);
 
+	setup_cpu_entry_area(cpu);
+
 	/*
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
 	 */
-	set_tss_desc(cpu, &t->x86_tss);
+	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
 
 	load_mm_ldt(&init_mm);
@@ -1697,7 +1724,6 @@ void cpu_init(void)
 
 	fpu__init_cpu();
 
-	setup_cpu_entry_area(cpu);
 	load_fixmap_gdt(cpu);
 }
 #endif
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index a8aa70c05489..bb61919c9335 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -45,7 +45,8 @@ bool in_task_stack(unsigned long *stack, struct task_struct *task,
 
 bool in_sysenter_stack(unsigned long *stack, struct stack_info *info)
 {
-	struct tss_struct *tss = this_cpu_ptr(&cpu_tss);
+	int cpu = smp_processor_id();
+	struct tss_struct *tss = &get_cpu_entry_area(cpu)->tss;
 
 	/* Treat the canary as part of the stack for unwinding purposes. */
 	void *begin = &tss->SYSENTER_stack_canary;
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index 50593e138281..04d5157fe7f8 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -160,18 +160,19 @@ static void do_fpu_end(void)
 static void fix_processor_context(void)
 {
 	int cpu = smp_processor_id();
-	struct tss_struct *t = &per_cpu(cpu_tss, cpu);
 #ifdef CONFIG_X86_64
 	struct desc_struct *desc = get_cpu_gdt_rw(cpu);
 	tss_desc tss;
 #endif
 
 	/*
-	 * This just modifies memory; should not be necessary. But... This is
-	 * necessary, because 386 hardware has concept of busy TSS or some
-	 * similar stupidity.
+	 * We need to reload TR, which requires that we change the
+	 * GDT entry to indicate "available" first.
+	 *
+	 * XXX: This could probably all be replaced by a call to
+	 * force_reload_TR().
 	 */
-	set_tss_desc(cpu, &t->x86_tss);
+	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 
 #ifdef CONFIG_X86_64
 	memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc));
-- 
2.13.6

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 10/19] x86/asm/64: Separate cpu_current_top_of_stack from TSS.sp0
  2017-11-24  4:32 [PATCH v3 00/19] Entry stack switching Andy Lutomirski
                   ` (8 preceding siblings ...)
  2017-11-24  4:32 ` [PATCH v3 09/19] x86/asm: Remap the TSS into the cpu entry area Andy Lutomirski
@ 2017-11-24  4:32 ` Andy Lutomirski
  2017-11-24  4:32 ` [PATCH v3 11/19] x86/espfix/64: Stop assuming that pt_regs is on the entry stack Andy Lutomirski
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2017-11-24  4:32 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

On 64-bit kernels, we used to assume that TSS.sp0 was the current
top of stack.  With the addition of an entry trampoline, this will
no longer be the case.  Store the current top of stack in TSS.sp1,
which is otherwise unused but shares the same cacheline.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/processor.h   | 18 +++++++++++++-----
 arch/x86/include/asm/thread_info.h |  2 +-
 arch/x86/kernel/asm-offsets_64.c   |  1 +
 arch/x86/kernel/process.c          | 10 ++++++++++
 arch/x86/kernel/process_64.c       |  1 +
 5 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 48d44fae3d27..3a09e5571a92 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -305,7 +305,13 @@ struct x86_hw_tss {
 struct x86_hw_tss {
 	u32			reserved1;
 	u64			sp0;
+
+	/*
+	 * We store cpu_current_top_of_stack in sp1 so it's always accessible.
+	 * Linux does not use ring 1, so sp1 is not otherwise needed.
+	 */
 	u64			sp1;
+
 	u64			sp2;
 	u64			reserved2;
 	u64			ist[7];
@@ -364,6 +370,8 @@ DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss);
 
 #ifdef CONFIG_X86_32
 DECLARE_PER_CPU(unsigned long, cpu_current_top_of_stack);
+#else
+#define cpu_current_top_of_stack cpu_tss.x86_tss.sp1
 #endif
 
 /*
@@ -535,12 +543,12 @@ static inline void native_swapgs(void)
 
 static inline unsigned long current_top_of_stack(void)
 {
-#ifdef CONFIG_X86_64
-	return this_cpu_read_stable(cpu_tss.x86_tss.sp0);
-#else
-	/* sp0 on x86_32 is special in and around vm86 mode. */
+	/*
+	 *  We can't read directly from tss.sp0: sp0 on x86_32 is special in
+	 *  and around vm86 mode and sp0 on x86_64 is special because of the
+	 *  entry trampoline.
+	 */
 	return this_cpu_read_stable(cpu_current_top_of_stack);
-#endif
 }
 
 static inline bool on_thread_stack(void)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 70f425947dc5..44a04999791e 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -207,7 +207,7 @@ static inline int arch_within_stack_frames(const void * const stack,
 #else /* !__ASSEMBLY__ */
 
 #ifdef CONFIG_X86_64
-# define cpu_current_top_of_stack (cpu_tss + TSS_sp0)
+# define cpu_current_top_of_stack (cpu_tss + TSS_sp1)
 #endif
 
 #endif
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index 630212fa9b9d..ad649a8a74a0 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -63,6 +63,7 @@ int main(void)
 
 	OFFSET(TSS_ist, tss_struct, x86_tss.ist);
 	OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
+	OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
 	BLANK();
 
 #ifdef CONFIG_CC_STACKPROTECTOR
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 35d674157fda..86e83762e3b3 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -56,6 +56,16 @@ __visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
 		 * Poison it.
 		 */
 		.sp0 = (1UL << (BITS_PER_LONG-1)) + 1,
+
+#ifdef CONFIG_X86_64
+		/*
+		 * .sp1 is cpu_current_top_of_stack.  The init task never
+		 * runs user code, but cpu_current_top_of_stack should still
+		 * be well defined before the first context switch.
+		 */
+		.sp1 = TOP_OF_INIT_STACK,
+#endif
+
 #ifdef CONFIG_X86_32
 		.ss0 = __KERNEL_DS,
 		.ss1 = __KERNEL_CS,
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index eeeb34f85c25..bafe65b08697 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -462,6 +462,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	 * Switch the PDA and FPU contexts.
 	 */
 	this_cpu_write(current_task, next_p);
+	this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));
 
 	/* Reload sp0. */
 	update_sp0(next_p);
-- 
2.13.6

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 11/19] x86/espfix/64: Stop assuming that pt_regs is on the entry stack
  2017-11-24  4:32 [PATCH v3 00/19] Entry stack switching Andy Lutomirski
                   ` (9 preceding siblings ...)
  2017-11-24  4:32 ` [PATCH v3 10/19] x86/asm/64: Separate cpu_current_top_of_stack from TSS.sp0 Andy Lutomirski
@ 2017-11-24  4:32 ` Andy Lutomirski
  2017-11-24  4:32 ` [PATCH v3 12/19] x86/asm/64: Use a percpu trampoline stack for IDT entries Andy Lutomirski
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2017-11-24  4:32 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

When we start using an entry trampoline, a #GP from userspace will
be delivered on the entry stack, not on the task stack.  Fix the
espfix64 #DF fixup to set up #GP according to TSS.SP0, rather than
assuming that pt_regs + 1 == SP0.  This won't change anything
without an entry stack, but it will make the code continue to work
when an entry stack is added.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/traps.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 2008dd0f8ccb..1bd43f044c62 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -359,7 +359,8 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 		regs->cs == __KERNEL_CS &&
 		regs->ip == (unsigned long)native_irq_return_iret)
 	{
-		struct pt_regs *normal_regs = task_pt_regs(current);
+		struct pt_regs *normal_regs =
+			(struct pt_regs *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
 
 		/* Fake a #GP(0) from userspace. */
 		memmove(&normal_regs->ip, (void *)regs->sp, 5*8);
@@ -390,7 +391,7 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 	 *
 	 *   Processors update CR2 whenever a page fault is detected. If a
 	 *   second page fault occurs while an earlier page fault is being
-	 *   deliv- ered, the faulting linear address of the second fault will
+	 *   delivered, the faulting linear address of the second fault will
 	 *   overwrite the contents of CR2 (replacing the previous
 	 *   address). These updates to CR2 occur even if the page fault
 	 *   results in a double fault or occurs during the delivery of a
-- 
2.13.6

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 12/19] x86/asm/64: Use a percpu trampoline stack for IDT entries
  2017-11-24  4:32 [PATCH v3 00/19] Entry stack switching Andy Lutomirski
                   ` (10 preceding siblings ...)
  2017-11-24  4:32 ` [PATCH v3 11/19] x86/espfix/64: Stop assuming that pt_regs is on the entry stack Andy Lutomirski
@ 2017-11-24  4:32 ` Andy Lutomirski
  2017-11-24  4:32 ` [PATCH v3 13/19] x86/asm/64: Return to userspace from the trampoline stack Andy Lutomirski
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2017-11-24  4:32 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

Historically, IDT entries from usermode have always gone directly
to the running task's kernel stack.  Rearrange it so that we enter on
a percpu trampoline stack and then manually switch to the task's stack.
This touches a couple of extra cachelines, but it gives us a chance
to run some code before we touch the kernel stack.

The asm isn't exactly beautiful, but I think that fully refactoring
it can wait.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64.S        | 67 ++++++++++++++++++++++++++++++----------
 arch/x86/entry/entry_64_compat.S |  5 ++-
 arch/x86/include/asm/switch_to.h |  2 +-
 arch/x86/include/asm/traps.h     |  1 -
 arch/x86/kernel/cpu/common.c     |  6 ++--
 arch/x86/kernel/traps.c          | 18 +++++------
 6 files changed, 68 insertions(+), 31 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index f81d50d7ceac..7d47199f405f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -563,6 +563,13 @@ END(irq_entries_start)
 /* 0(%rsp): ~(interrupt number) */
 	.macro interrupt func
 	cld
+
+	testb	$3, CS-ORIG_RAX(%rsp)
+	jz	1f
+	SWAPGS
+	call	switch_to_thread_stack
+1:
+
 	ALLOC_PT_GPREGS_ON_STACK
 	SAVE_C_REGS
 	SAVE_EXTRA_REGS
@@ -572,12 +579,8 @@ END(irq_entries_start)
 	jz	1f
 
 	/*
-	 * IRQ from user mode.  Switch to kernel gsbase and inform context
-	 * tracking that we're in kernel mode.
-	 */
-	SWAPGS
-
-	/*
+	 * IRQ from user mode.
+	 *
 	 * We need to tell lockdep that IRQs are off.  We can't do this until
 	 * we fix gsbase, and we should do it before enter_from_user_mode
 	 * (which can take locks).  Since TRACE_IRQS_OFF idempotent,
@@ -831,6 +834,32 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
  */
 #define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss) + (TSS_ist + ((x) - 1) * 8)
 
+/*
+ * Switch to the thread stack.  This is called with the IRET frame and
+ * orig_ax on the stack.  (That is, RDI..R12 are not on the stack and
+ * space has not been allocated for them.)
+ */
+ENTRY(switch_to_thread_stack)
+	UNWIND_HINT_FUNC
+
+	pushq	%rdi
+	movq	%rsp, %rdi
+	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
+	UNWIND_HINT sp_offset=16 sp_reg=ORC_REG_DI
+
+	pushq	7*8(%rdi)		/* regs->ss */
+	pushq	6*8(%rdi)		/* regs->rsp */
+	pushq	5*8(%rdi)		/* regs->eflags */
+	pushq	4*8(%rdi)		/* regs->cs */
+	pushq	3*8(%rdi)		/* regs->ip */
+	pushq	2*8(%rdi)		/* regs->orig_ax */
+	pushq	8(%rdi)			/* return address */
+	UNWIND_HINT_FUNC
+
+	movq	(%rdi), %rdi
+	ret
+END(switch_to_thread_stack)
+
 .macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
 ENTRY(\sym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
@@ -848,11 +877,12 @@ ENTRY(\sym)
 
 	ALLOC_PT_GPREGS_ON_STACK
 
-	.if \paranoid
-	.if \paranoid == 1
+	.if \paranoid < 2
 	testb	$3, CS(%rsp)			/* If coming from userspace, switch stacks */
-	jnz	1f
+	jnz	.Lfrom_usermode_switch_stack_\@
 	.endif
+
+	.if \paranoid
 	call	paranoid_entry
 	.else
 	call	error_entry
@@ -894,20 +924,15 @@ ENTRY(\sym)
 	jmp	error_exit
 	.endif
 
-	.if \paranoid == 1
+	.if \paranoid < 2
 	/*
-	 * Paranoid entry from userspace.  Switch stacks and treat it
+	 * Entry from userspace.  Switch stacks and treat it
 	 * as a normal entry.  This means that paranoid handlers
 	 * run in real process context if user_mode(regs).
 	 */
-1:
+.Lfrom_usermode_switch_stack_\@:
 	call	error_entry
 
-
-	movq	%rsp, %rdi			/* pt_regs pointer */
-	call	sync_regs
-	movq	%rax, %rsp			/* switch stack */
-
 	movq	%rsp, %rdi			/* pt_regs pointer */
 
 	.if \has_error_code
@@ -1170,6 +1195,14 @@ ENTRY(error_entry)
 	SWAPGS
 
 .Lerror_entry_from_usermode_after_swapgs:
+	/* Put us onto the real thread stack. */
+	popq	%r12				/* save return addr in %12 */
+	movq	%rsp, %rdi			/* arg0 = pt_regs pointer */
+	call	sync_regs
+	movq	%rax, %rsp			/* switch stack */
+	ENCODE_FRAME_POINTER
+	pushq	%r12
+
 	/*
 	 * We need to tell lockdep that IRQs are off.  We can't do this until
 	 * we fix gsbase, and we should do it before enter_from_user_mode
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index dcc6987f9bae..95ad40eb7eff 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -306,8 +306,11 @@ ENTRY(entry_INT80_compat)
 	 */
 	movl	%eax, %eax
 
-	/* Construct struct pt_regs on stack (iret frame is already on stack) */
 	pushq	%rax			/* pt_regs->orig_ax */
+
+	/* switch to thread stack expects orig_ax to be pushed */
+	call	switch_to_thread_stack
+
 	pushq	%rdi			/* pt_regs->di */
 	pushq	%rsi			/* pt_regs->si */
 	pushq	%rdx			/* pt_regs->dx */
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 8c6bd6863db9..a6796ac8d311 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -93,7 +93,7 @@ static inline void update_sp0(struct task_struct *task)
 #ifdef CONFIG_X86_32
 	load_sp0(task->thread.sp0);
 #else
-	load_sp0(task_top_of_stack(task));
+	/* On x86_64, sp0 always points to the entry trampoline stack. */
 #endif
 }
 
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 1fadd310ff68..31051f35cbb7 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -75,7 +75,6 @@ dotraplinkage void do_segment_not_present(struct pt_regs *, long);
 dotraplinkage void do_stack_segment(struct pt_regs *, long);
 #ifdef CONFIG_X86_64
 dotraplinkage void do_double_fault(struct pt_regs *, long);
-asmlinkage struct pt_regs *sync_regs(struct pt_regs *);
 #endif
 dotraplinkage void do_general_protection(struct pt_regs *, long);
 dotraplinkage void do_page_fault(struct pt_regs *, unsigned long);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c67742df569a..7c82a8a8bfda 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1645,11 +1645,13 @@ void cpu_init(void)
 	setup_cpu_entry_area(cpu);
 
 	/*
-	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
-	 * task never enters user mode.
+	 * Initialize the TSS.  sp0 points to the entry trampoline stack
+	 * regardless of what task is running.
 	 */
 	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
+	load_sp0((unsigned long)&get_cpu_entry_area(cpu)->tss +
+		 offsetofend(struct tss_struct, SYSENTER_stack));
 
 	load_mm_ldt(&init_mm);
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 1bd43f044c62..cbc4272bb9dd 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -359,8 +359,7 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 		regs->cs == __KERNEL_CS &&
 		regs->ip == (unsigned long)native_irq_return_iret)
 	{
-		struct pt_regs *normal_regs =
-			(struct pt_regs *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
+		struct pt_regs *normal_regs = (struct pt_regs *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
 
 		/* Fake a #GP(0) from userspace. */
 		memmove(&normal_regs->ip, (void *)regs->sp, 5*8);
@@ -611,9 +610,10 @@ NOKPROBE_SYMBOL(do_int3);
  * interrupted code was in user mode. The actual stack switch is done in
  * entry_64.S
  */
-asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs)
+asmlinkage __visible notrace
+struct pt_regs *sync_regs(struct pt_regs *eregs)
 {
-	struct pt_regs *regs = task_pt_regs(current);
+	struct pt_regs *regs = (struct pt_regs *)this_cpu_read(cpu_current_top_of_stack) - 1;
 	*regs = *eregs;
 	return regs;
 }
@@ -630,13 +630,13 @@ struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s)
 	/*
 	 * This is called from entry_64.S early in handling a fault
 	 * caused by a bad iret to user mode.  To handle the fault
-	 * correctly, we want move our stack frame to task_pt_regs
-	 * and we want to pretend that the exception came from the
-	 * iret target.
+	 * correctly, we want move our stack frame to where it would
+	 * be had we entered directly on the entry stack (rather than
+	 * just below the IRET frame) and we want to pretend that the
+	 * exception came from the iret target.
 	 */
 	struct bad_iret_stack *new_stack =
-		container_of(task_pt_regs(current),
-			     struct bad_iret_stack, regs);
+		(struct bad_iret_stack *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
 
 	/* Copy the IRET target to the new stack. */
 	memmove(&new_stack->regs.ip, (void *)s->regs.sp, 5*8);
-- 
2.13.6

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 13/19] x86/asm/64: Return to userspace from the trampoline stack
  2017-11-24  4:32 [PATCH v3 00/19] Entry stack switching Andy Lutomirski
                   ` (11 preceding siblings ...)
  2017-11-24  4:32 ` [PATCH v3 12/19] x86/asm/64: Use a percpu trampoline stack for IDT entries Andy Lutomirski
@ 2017-11-24  4:32 ` Andy Lutomirski
  2017-11-24  4:33 ` [PATCH v3 14/19] x86/entry/64: Create a percpu SYSCALL entry trampoline Andy Lutomirski
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2017-11-24  4:32 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

By itself, this is useless.  It gives us the ability to run some final
code before exit that cannnot run on the kernel stack.  This could
include a CR3 switch a la KAISER or some kernel stack erasing, for
example.  (Or even weird things like *changing* which kernel stack
gets used as an ASLR-strengthening mechanism.)

The SYSRET32 path is not covered yet.  It could be in the future or
we could just ignore it and force the slow path if needed.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64.S | 55 +++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 51 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 7d47199f405f..426b8c669d6a 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -330,8 +330,24 @@ syscall_return_via_sysret:
 	popq	%rsi	/* skip rcx */
 	popq	%rdx
 	popq	%rsi
+
+	/*
+	 * Now all regs are restored except RSP and RDI.
+	 * Save old stack pointer and switch to trampoline stack.
+	 */
+	movq	%rsp, %rdi
+	movq	PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
+
+	pushq	RSP-RDI(%rdi)	/* RSP */
+	pushq	(%rdi)		/* RDI */
+
+	/*
+	 * We are on the trampoline stack.  All regs except RDI are live.
+	 * We can do future final exit work right here.
+	 */
+
 	popq	%rdi
-	movq	RSP-ORIG_RAX(%rsp), %rsp
+	popq	%rsp
 	USERGS_SYSRET64
 END(entry_SYSCALL_64)
 
@@ -633,10 +649,41 @@ GLOBAL(swapgs_restore_regs_and_return_to_usermode)
 	ud2
 1:
 #endif
-	SWAPGS
 	POP_EXTRA_REGS
-	POP_C_REGS
-	addq	$8, %rsp	/* skip regs->orig_ax */
+	popq	%r11
+	popq	%r10
+	popq	%r9
+	popq	%r8
+	popq	%rax
+	popq	%rcx
+	popq	%rdx
+	popq	%rsi
+
+	/*
+	 * The stack is now user RDI, orig_ax, RIP, CS, EFLAGS, RSP, SS.
+	 * Save old stack pointer and switch to trampoline stack.
+	 */
+	movq	%rsp, %rdi
+	movq	PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
+
+	/* Copy the IRET frame to the trampoline stack. */
+	pushq	6*8(%rdi)	/* SS */
+	pushq	5*8(%rdi)	/* RSP */
+	pushq	4*8(%rdi)	/* EFLAGS */
+	pushq	3*8(%rdi)	/* CS */
+	pushq	2*8(%rdi)	/* RIP */
+
+	/* Push user RDI on the trampoline stack. */
+	pushq	(%rdi)
+
+	/*
+	 * We are on the trampoline stack.  All regs except RDI are live.
+	 * We can do future final exit work right here.
+	 */
+
+	/* Restore RDI. */
+	popq	%rdi
+	SWAPGS
 	INTERRUPT_RETURN
 
 
-- 
2.13.6

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 14/19] x86/entry/64: Create a percpu SYSCALL entry trampoline
  2017-11-24  4:32 [PATCH v3 00/19] Entry stack switching Andy Lutomirski
                   ` (12 preceding siblings ...)
  2017-11-24  4:32 ` [PATCH v3 13/19] x86/asm/64: Return to userspace from the trampoline stack Andy Lutomirski
@ 2017-11-24  4:33 ` Andy Lutomirski
  2017-11-24 11:39   ` David Laight
  2017-11-24  4:33 ` [PATCH v3 15/19] x86/irq: Remove an old outdated comment about context tracking races Andy Lutomirski
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 27+ messages in thread
From: Andy Lutomirski @ 2017-11-24  4:33 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

Handling SYSCALL is tricky: the SYSCALL handler is entered with every
single register (except FLAGS), including RSP, live.  It somehow needs
to set RSP to point to a valid stack, which means it needs to save the
user RSP somewhere and find its own stack pointer.  The canonical way
to do this is with SWAPGS, which lets us access percpu data using the
%gs prefix.

With KAISER-like pagetable switching, this is problematic.  Without a
scratch register, switching CR3 is impossible, so %gs-based percpu
memory would need to be mapped in the user pagetables.  Doing that
without information leaks is difficult or impossible.

Instead, use a different sneaky trick.  Map a copy of the first part
of the SYSCALL asm at a different address for each CPU.  Now RIP
varies depending on the CPU, so we can use RIP-relative memory access
to access percpu memory.  By putting the relevant information (one
scratch slot and the stack address) at a constant offset relative to
RIP, we can make SYSCALL work without relying on %gs.

A nice thing about this approach is that we can easily switch it on
and off if we want pagetable switching to be configurable.

The compat variant of SYSCALL doesn't have this problem in the first
place -- there are plenty of scratch registers, since we don't care
about preserving r8-r15.  This patch therefore doesn't touch SYSCALL32
at all.

XXX: Whenever we settle how KAISER gets turned on and off, we should do
the same to this.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64.S     | 48 +++++++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/fixmap.h |  2 ++
 arch/x86/kernel/asm-offsets.c |  1 +
 arch/x86/kernel/cpu/common.c  | 12 ++++++++++-
 arch/x86/kernel/vmlinux.lds.S | 10 +++++++++
 5 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 426b8c669d6a..0cde243b7542 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -140,6 +140,54 @@ END(native_usergs_sysret64)
  * with them due to bugs in both AMD and Intel CPUs.
  */
 
+	.pushsection .entry_trampoline, "ax"
+
+/*
+ * The code in here gets remapped into cpu_entry_area's trampoline.  This means
+ * that the assembler and linker have the wrong idea as to where this code
+ * lives (and, in fact, it's mapped more than once, so it's not even at a
+ * fixed address).  So we can't reference any symbols outside the entry
+ * trampoline and expect it to work.
+ *
+ * Instead, we carefully abuse %rip-relative addressing.
+ * .Lentry_trampoline(%rip) refers to the start of the remapped) entry
+ * trampoline.  We can thus find cpu_entry_area with this macro:
+ */
+
+#define CPU_ENTRY_AREA \
+	_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
+
+/* The top word of the SYSENTER stack is hot and is usable as scratch space. */
+#define RSP_SCRATCH CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \
+	SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA
+
+ENTRY(entry_SYSCALL_64_trampoline)
+	UNWIND_HINT_EMPTY
+	swapgs
+
+	/* Stash the user RSP. */
+	movq	%rsp, RSP_SCRATCH
+
+	/* Load the top of the task stack into RSP */
+	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
+
+	/* Start building the simulated IRET frame. */
+	pushq	$__USER_DS			/* pt_regs->ss */
+	pushq	RSP_SCRATCH			/* pt_regs->sp */
+	pushq	%r11				/* pt_regs->flags */
+	pushq	$__USER_CS			/* pt_regs->cs */
+	pushq	%rcx				/* pt_regs->ip */
+
+	/*
+	 * x86 lacks a near absolute jump, and we can't jump to the real
+	 * entry text with a relative jump, so we fake it using retq.
+	 */
+	pushq	$entry_SYSCALL_64_after_hwframe
+	retq
+END(entry_SYSCALL_64_trampoline)
+
+	.popsection
+
 ENTRY(entry_SYSCALL_64)
 	UNWIND_HINT_EMPTY
 	/*
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 3a42da14c2cb..7eb1b5490395 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -58,6 +58,8 @@ struct cpu_entry_area {
 	 * of the TSS region.
 	 */
 	struct tss_struct tss;
+
+	char entry_trampoline[PAGE_SIZE];
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 55858b277cf6..61b1af88ac07 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -101,4 +101,5 @@ void common(void) {
 
 	/* Layout info for cpu_entry_area */
 	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
+	OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
 }
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 7c82a8a8bfda..5a05db084659 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -507,6 +507,8 @@ DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
 static inline void setup_cpu_entry_area(int cpu)
 {
 #ifdef CONFIG_X86_64
+	extern char _entry_trampoline[];
+
 	/* On 64-bit systems, we use a read-only fixmap GDT. */
 	pgprot_t gdt_prot = PAGE_KERNEL_RO;
 #else
@@ -553,6 +555,11 @@ static inline void setup_cpu_entry_area(int cpu)
 #ifdef CONFIG_X86_32
 	this_cpu_write(cpu_entry_area, get_cpu_entry_area(cpu));
 #endif
+
+#ifdef CONFIG_X86_64
+	__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
+		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
+#endif
 }
 
 /* Load the original GDT from the per-cpu structure */
@@ -1417,10 +1424,13 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
+	extern char _entry_trampoline[];
+	extern char entry_SYSCALL_64_trampoline[];
+
 	int cpu = smp_processor_id();
 
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
-	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
+	wrmsrl(MSR_LSTAR, (unsigned long)get_cpu_entry_area(cpu)->entry_trampoline + (entry_SYSCALL_64_trampoline - _entry_trampoline));
 
 #ifdef CONFIG_IA32_EMULATION
 	wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index a4009fb9be87..2738cfb6c8c8 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -107,6 +107,16 @@ SECTIONS
 		SOFTIRQENTRY_TEXT
 		*(.fixup)
 		*(.gnu.warning)
+
+#ifdef CONFIG_X86_64
+		/* Entry trampoline */
+		. = ALIGN(PAGE_SIZE);
+		_entry_trampoline = .;
+		*(.entry_trampoline)
+		. = ALIGN(PAGE_SIZE);
+		ASSERT(. - _entry_trampoline == PAGE_SIZE, "entry trampoline is too big");
+#endif
+
 		/* End of text section */
 		_etext = .;
 	} :text = 0x9090
-- 
2.13.6

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 15/19] x86/irq: Remove an old outdated comment about context tracking races
  2017-11-24  4:32 [PATCH v3 00/19] Entry stack switching Andy Lutomirski
                   ` (13 preceding siblings ...)
  2017-11-24  4:33 ` [PATCH v3 14/19] x86/entry/64: Create a percpu SYSCALL entry trampoline Andy Lutomirski
@ 2017-11-24  4:33 ` Andy Lutomirski
  2017-11-24  4:33 ` [PATCH v3 16/19] x86/irq/64: In the stack overflow warning, print the offending IP Andy Lutomirski
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2017-11-24  4:33 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

That race has been fixed and code cleaned up for a while now.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/irq.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 49cfd9fe7589..68e1867cca80 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -219,18 +219,6 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
 	/* high bit used in ret_from_ code  */
 	unsigned vector = ~regs->orig_ax;
 
-	/*
-	 * NB: Unlike exception entries, IRQ entries do not reliably
-	 * handle context tracking in the low-level entry code.  This is
-	 * because syscall entries execute briefly with IRQs on before
-	 * updating context tracking state, so we can take an IRQ from
-	 * kernel mode with CONTEXT_USER.  The low-level entry code only
-	 * updates the context if we came from user mode, so we won't
-	 * switch to CONTEXT_KERNEL.  We'll fix that once the syscall
-	 * code is cleaned up enough that we can cleanly defer enabling
-	 * IRQs.
-	 */
-
 	entering_irq();
 
 	/* entering_irq() tells RCU that we're not quiescent.  Check it. */
-- 
2.13.6

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 16/19] x86/irq/64: In the stack overflow warning, print the offending IP
  2017-11-24  4:32 [PATCH v3 00/19] Entry stack switching Andy Lutomirski
                   ` (14 preceding siblings ...)
  2017-11-24  4:33 ` [PATCH v3 15/19] x86/irq: Remove an old outdated comment about context tracking races Andy Lutomirski
@ 2017-11-24  4:33 ` Andy Lutomirski
  2017-11-24  4:33 ` [PATCH v3 17/19] x86/entry/64: Move the IST stacks into cpu_entry_area Andy Lutomirski
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2017-11-24  4:33 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

In case something goes wrong with unwind (not unlikely in case of
overflow), print the offending IP where we detected the overflow.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/irq_64.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
index 020efbf5786b..d86e344f5b3d 100644
--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -57,10 +57,10 @@ static inline void stack_overflow_check(struct pt_regs *regs)
 	if (regs->sp >= estack_top && regs->sp <= estack_bottom)
 		return;
 
-	WARN_ONCE(1, "do_IRQ(): %s has overflown the kernel stack (cur:%Lx,sp:%lx,irq stk top-bottom:%Lx-%Lx,exception stk top-bottom:%Lx-%Lx)\n",
+	WARN_ONCE(1, "do_IRQ(): %s has overflown the kernel stack (cur:%Lx,sp:%lx,irq stk top-bottom:%Lx-%Lx,exception stk top-bottom:%Lx-%Lx,ip:%pF)\n",
 		current->comm, curbase, regs->sp,
 		irq_stack_top, irq_stack_bottom,
-		estack_top, estack_bottom);
+		estack_top, estack_bottom, (void *)regs->ip);
 
 	if (sysctl_panic_on_stackoverflow)
 		panic("low stack detected by irq handler - check messages\n");
-- 
2.13.6

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 17/19] x86/entry/64: Move the IST stacks into cpu_entry_area
  2017-11-24  4:32 [PATCH v3 00/19] Entry stack switching Andy Lutomirski
                   ` (15 preceding siblings ...)
  2017-11-24  4:33 ` [PATCH v3 16/19] x86/irq/64: In the stack overflow warning, print the offending IP Andy Lutomirski
@ 2017-11-24  4:33 ` Andy Lutomirski
  2017-11-24  4:33 ` [PATCH v3 18/19] x86/entry/64: Remove the SYSENTER stack canary Andy Lutomirski
  2017-11-24  4:33 ` [PATCH v3 19/19] x86/entry: Clean up SYSENTER_stack code Andy Lutomirski
  18 siblings, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2017-11-24  4:33 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

The IST stacks are needed when an IST exception occurs and are
accessed before any kernel code at all runs.  Move them into
cpu_entry_area.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/fixmap.h | 10 ++++++++++
 arch/x86/kernel/cpu/common.c  | 40 +++++++++++++++++++++++++---------------
 2 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 7eb1b5490395..15cf010225c9 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -60,6 +60,16 @@ struct cpu_entry_area {
 	struct tss_struct tss;
 
 	char entry_trampoline[PAGE_SIZE];
+
+#ifdef CONFIG_X86_64
+	/*
+	 * Exception stacks used for IST entries.
+	 *
+	 * In the future, this should have a separate slot for each stack
+	 * with guard pages between them.
+	 */
+	char exception_stacks[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ];
+#endif
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 5a05db084659..6b949e6ea0f9 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -503,6 +503,22 @@ static void set_percpu_fixmap_pages(int fixmap_index, void *ptr, int pages, pgpr
 DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
 #endif
 
+#ifdef CONFIG_X86_64
+/*
+ * Special IST stacks which the CPU switches to when it calls
+ * an IST-marked descriptor entry. Up to 7 stacks (hardware
+ * limit), all of them are 4K, except the debug stack which
+ * is 8K.
+ */
+static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
+	  [0 ... N_EXCEPTION_STACKS - 1]	= EXCEPTION_STKSZ,
+	  [DEBUG_STACK - 1]			= DEBUG_STKSZ
+};
+
+static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
+	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
+#endif
+
 /* Setup the fixmap mappings only once per-processor */
 static inline void setup_cpu_entry_area(int cpu)
 {
@@ -557,6 +573,14 @@ static inline void setup_cpu_entry_area(int cpu)
 #endif
 
 #ifdef CONFIG_X86_64
+	BUILD_BUG_ON(sizeof(exception_stacks) % PAGE_SIZE != 0);
+	BUILD_BUG_ON(sizeof(exception_stacks) !=
+		     sizeof(((struct cpu_entry_area *)0)->exception_stacks));
+	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, exception_stacks),
+				&per_cpu(exception_stacks, cpu),
+				sizeof(exception_stacks) / PAGE_SIZE,
+				PAGE_KERNEL);
+
 	__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
 		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
 #endif
@@ -1407,20 +1431,6 @@ DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
 DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
 EXPORT_PER_CPU_SYMBOL(__preempt_count);
 
-/*
- * Special IST stacks which the CPU switches to when it calls
- * an IST-marked descriptor entry. Up to 7 stacks (hardware
- * limit), all of them are 4K, except the debug stack which
- * is 8K.
- */
-static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
-	  [0 ... N_EXCEPTION_STACKS - 1]	= EXCEPTION_STKSZ,
-	  [DEBUG_STACK - 1]			= DEBUG_STKSZ
-};
-
-static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
-	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
-
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
@@ -1626,7 +1636,7 @@ void cpu_init(void)
 	 * set up and load the per-CPU TSS
 	 */
 	if (!oist->ist[0]) {
-		char *estacks = per_cpu(exception_stacks, cpu);
+		char *estacks = get_cpu_entry_area(cpu)->exception_stacks;
 
 		for (v = 0; v < N_EXCEPTION_STACKS; v++) {
 			estacks += exception_stack_sizes[v];
-- 
2.13.6

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 18/19] x86/entry/64: Remove the SYSENTER stack canary
  2017-11-24  4:32 [PATCH v3 00/19] Entry stack switching Andy Lutomirski
                   ` (16 preceding siblings ...)
  2017-11-24  4:33 ` [PATCH v3 17/19] x86/entry/64: Move the IST stacks into cpu_entry_area Andy Lutomirski
@ 2017-11-24  4:33 ` Andy Lutomirski
  2017-11-24  4:33 ` [PATCH v3 19/19] x86/entry: Clean up SYSENTER_stack code Andy Lutomirski
  18 siblings, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2017-11-24  4:33 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

Now that the SYSENTER stack has a guard page, there's no need for a
canary to detect overflow after the fact.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/processor.h | 1 -
 arch/x86/kernel/dumpstack.c      | 3 +--
 arch/x86/kernel/process.c        | 1 -
 arch/x86/kernel/traps.c          | 7 -------
 4 files changed, 1 insertion(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3a09e5571a92..7743aedb82ea 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -337,7 +337,6 @@ struct tss_struct {
 	 * Space for the temporary SYSENTER stack, used for SYSENTER
 	 * and the entry trampoline as well.
 	 */
-	unsigned long		SYSENTER_stack_canary;
 	unsigned long		SYSENTER_stack[64];
 
 	/*
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index bb61919c9335..9ce5fcf7d14d 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -48,8 +48,7 @@ bool in_sysenter_stack(unsigned long *stack, struct stack_info *info)
 	int cpu = smp_processor_id();
 	struct tss_struct *tss = &get_cpu_entry_area(cpu)->tss;
 
-	/* Treat the canary as part of the stack for unwinding purposes. */
-	void *begin = &tss->SYSENTER_stack_canary;
+	void *begin = &tss->SYSENTER_stack;
 	void *end = (void *)&tss->SYSENTER_stack + sizeof(tss->SYSENTER_stack);
 
 	if ((void *)stack < begin || (void *)stack >= end)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 86e83762e3b3..6a04287f222b 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -81,7 +81,6 @@ __visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
 	  */
 	.io_bitmap		= { [0 ... IO_BITMAP_LONGS] = ~0 },
 #endif
-	.SYSENTER_stack_canary	= STACK_END_MAGIC,
 };
 EXPORT_PER_CPU_SYMBOL(cpu_tss);
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index cbc4272bb9dd..19475dbff068 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -801,13 +801,6 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
 	debug_stack_usage_dec();
 
 exit:
-	/*
-	 * This is the most likely code path that involves non-trivial use
-	 * of the SYSENTER stack.  Check that we haven't overrun it.
-	 */
-	WARN(this_cpu_read(cpu_tss.SYSENTER_stack_canary) != STACK_END_MAGIC,
-	     "Overran or corrupted SYSENTER stack\n");
-
 	ist_exit(regs);
 }
 NOKPROBE_SYMBOL(do_debug);
-- 
2.13.6

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 19/19] x86/entry: Clean up SYSENTER_stack code
  2017-11-24  4:32 [PATCH v3 00/19] Entry stack switching Andy Lutomirski
                   ` (17 preceding siblings ...)
  2017-11-24  4:33 ` [PATCH v3 18/19] x86/entry/64: Remove the SYSENTER stack canary Andy Lutomirski
@ 2017-11-24  4:33 ` Andy Lutomirski
  18 siblings, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2017-11-24  4:33 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

The existing code was a mess, mainly because C arrays are nasty.
Turn SYSENTER_stack into a struct, add a helper to find it, and do
all the obvious cleanups this enables.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_32.S        |  4 ++--
 arch/x86/entry/entry_64.S        |  2 +-
 arch/x86/include/asm/fixmap.h    |  5 +++++
 arch/x86/include/asm/processor.h |  6 +++++-
 arch/x86/kernel/asm-offsets.c    |  6 ++----
 arch/x86/kernel/cpu/common.c     | 14 +++-----------
 arch/x86/kernel/dumpstack.c      |  7 +++----
 7 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 0ab316c46806..3629bcbf85a2 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -942,7 +942,7 @@ ENTRY(debug)
 
 	/* Are we currently on the SYSENTER stack? */
 	movl	PER_CPU_VAR(cpu_entry_area), %ecx
-	addl	$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
+	addl	$CPU_ENTRY_AREA_tss + TSS_STRUCT_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Ldebug_from_sysenter_stack
@@ -986,7 +986,7 @@ ENTRY(nmi)
 
 	/* Are we currently on the SYSENTER stack? */
 	movl	PER_CPU_VAR(cpu_entry_area), %ecx
-	addl	$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
+	addl	$CPU_ENTRY_AREA_tss + TSS_STRUCT_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Lnmi_from_sysenter_stack
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 0cde243b7542..34e3110b0876 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -158,7 +158,7 @@ END(native_usergs_sysret64)
 	_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
 
 /* The top word of the SYSENTER stack is hot and is usable as scratch space. */
-#define RSP_SCRATCH CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \
+#define RSP_SCRATCH CPU_ENTRY_AREA_tss + TSS_STRUCT_SYSENTER_stack + \
 	SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA
 
 ENTRY(entry_SYSCALL_64_trampoline)
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 15cf010225c9..ceb04ab0a642 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -234,5 +234,10 @@ static inline struct cpu_entry_area *get_cpu_entry_area(int cpu)
 	return (struct cpu_entry_area *)__fix_to_virt(__get_cpu_entry_area_page_index(cpu, 0));
 }
 
+static inline struct SYSENTER_stack *cpu_SYSENTER_stack(int cpu)
+{
+	return &get_cpu_entry_area((cpu))->tss.SYSENTER_stack;
+}
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_FIXMAP_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 7743aedb82ea..54f3ee3bc8a0 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -332,12 +332,16 @@ struct x86_hw_tss {
 #define IO_BITMAP_OFFSET		(offsetof(struct tss_struct, io_bitmap) - offsetof(struct tss_struct, x86_tss))
 #define INVALID_IO_BITMAP_OFFSET	0x8000
 
+struct SYSENTER_stack {
+	unsigned long		words[64];
+};
+
 struct tss_struct {
 	/*
 	 * Space for the temporary SYSENTER stack, used for SYSENTER
 	 * and the entry trampoline as well.
 	 */
-	unsigned long		SYSENTER_stack[64];
+	struct SYSENTER_stack	SYSENTER_stack;
 
 	/*
 	 * The fixed hardware portion.  This must not cross a page boundary
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 61b1af88ac07..46c0995344aa 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -94,10 +94,8 @@ void common(void) {
 	BLANK();
 	DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));
 
-	/* Offset from cpu_tss to SYSENTER_stack */
-	OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
-	/* Size of SYSENTER_stack */
-	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
+	OFFSET(TSS_STRUCT_SYSENTER_stack, tss_struct, SYSENTER_stack);
+	DEFINE(SIZEOF_SYSENTER_stack, sizeof(struct SYSENTER_stack));
 
 	/* Layout info for cpu_entry_area */
 	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 6b949e6ea0f9..f9c7e6852874 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1332,12 +1332,7 @@ void enable_sep_cpu(void)
 
 	tss->x86_tss.ss1 = __KERNEL_CS;
 	wrmsr(MSR_IA32_SYSENTER_CS, tss->x86_tss.ss1, 0);
-
-	wrmsr(MSR_IA32_SYSENTER_ESP,
-	      (unsigned long)&get_cpu_entry_area(cpu)->tss +
-	      offsetofend(struct tss_struct, SYSENTER_stack),
-	      0);
-
+	wrmsr(MSR_IA32_SYSENTER_ESP, (unsigned long)(cpu_SYSENTER_stack(cpu) + 1), 0);
 	wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long)entry_SYSENTER_32, 0);
 
 	put_cpu();
@@ -1451,9 +1446,7 @@ void syscall_init(void)
 	 * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
 	 */
 	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
-	wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
-		    (unsigned long)&get_cpu_entry_area(cpu)->tss +
-		    offsetofend(struct tss_struct, SYSENTER_stack));
+	wrmsrl_safe(MSR_IA32_SYSENTER_ESP, (unsigned long)(cpu_SYSENTER_stack(cpu) + 1));
 	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
 #else
 	wrmsrl(MSR_CSTAR, (unsigned long)ignore_sysret);
@@ -1670,8 +1663,7 @@ void cpu_init(void)
 	 */
 	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
-	load_sp0((unsigned long)&get_cpu_entry_area(cpu)->tss +
-		 offsetofend(struct tss_struct, SYSENTER_stack));
+	load_sp0((unsigned long)(cpu_SYSENTER_stack(cpu) + 1));
 
 	load_mm_ldt(&init_mm);
 
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 9ce5fcf7d14d..d58dd121c0af 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -45,11 +45,10 @@ bool in_task_stack(unsigned long *stack, struct task_struct *task,
 
 bool in_sysenter_stack(unsigned long *stack, struct stack_info *info)
 {
-	int cpu = smp_processor_id();
-	struct tss_struct *tss = &get_cpu_entry_area(cpu)->tss;
+	struct SYSENTER_stack *ss = cpu_SYSENTER_stack(smp_processor_id());
 
-	void *begin = &tss->SYSENTER_stack;
-	void *end = (void *)&tss->SYSENTER_stack + sizeof(tss->SYSENTER_stack);
+	void *begin = ss;
+	void *end = ss + 1;
 
 	if ((void *)stack < begin || (void *)stack >= end)
 		return false;
-- 
2.13.6

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH v3 14/19] x86/entry/64: Create a percpu SYSCALL entry trampoline
  2017-11-24  4:33 ` [PATCH v3 14/19] x86/entry/64: Create a percpu SYSCALL entry trampoline Andy Lutomirski
@ 2017-11-24 11:39   ` David Laight
  2017-12-01  5:56     ` Andy Lutomirski
  0 siblings, 1 reply; 27+ messages in thread
From: David Laight @ 2017-11-24 11:39 UTC (permalink / raw)
  To: 'Andy Lutomirski', X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf

From: Andy Lutomirski
> Sent: 24 November 2017 04:33
...
> +	/*
> +	 * x86 lacks a near absolute jump, and we can't jump to the real
> +	 * entry text with a relative jump, so we fake it using retq.
> +	 */
> +	pushq	$entry_SYSCALL_64_after_hwframe
> +	retq

Don't some of the cpus have hardware logic that follows call and return?
The above will break that logic and slow things down (unless it is an
expected sequence).

Is there a scratch register that can be used for an indirect jump?

	David

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 05/19] x86/kasan/64: Teach KASAN about the cpu_entry_area
  2017-11-24  4:32 ` [PATCH v3 05/19] x86/kasan/64: Teach KASAN about the cpu_entry_area Andy Lutomirski
@ 2017-11-24 13:16   ` Andrey Ryabinin
  2017-11-24 16:27     ` Andy Lutomirski
  0 siblings, 1 reply; 27+ messages in thread
From: Andrey Ryabinin @ 2017-11-24 13:16 UTC (permalink / raw)
  To: Andy Lutomirski, X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Alexander Potapenko,
	Dmitry Vyukov, kasan-dev



On 11/24/2017 07:32 AM, Andy Lutomirski wrote:
> The cpu_entry_area will contain stacks.  Make sure that KASAN has
> appropriate shadow mappings for them.
> 
> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Cc: Alexander Potapenko <glider@google.com>
> Cc: Dmitry Vyukov <dvyukov@google.com>
> Cc: kasan-dev@googlegroups.com
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/mm/kasan_init_64.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
> index 99dfed6dfef8..54561dce742e 100644
> --- a/arch/x86/mm/kasan_init_64.c
> +++ b/arch/x86/mm/kasan_init_64.c
> @@ -277,6 +277,7 @@ void __init kasan_early_init(void)
>  void __init kasan_init(void)
>  {
>  	int i;
> +	void *cpu_entry_area_begin, *cpu_entry_area_end;
>  
>  #ifdef CONFIG_KASAN_INLINE
>  	register_die_notifier(&kasan_die_notifier);
> @@ -329,8 +330,18 @@ void __init kasan_init(void)
>  			      (unsigned long)kasan_mem_to_shadow(_end),
>  			      early_pfn_to_nid(__pa(_stext)));
>  
> +	cpu_entry_area_begin = (void *)(__fix_to_virt(FIX_CPU_ENTRY_AREA_BOTTOM));
> +	cpu_entry_area_end = (void *)(__fix_to_virt(FIX_CPU_ENTRY_AREA_TOP) + PAGE_SIZE);
> +
>  	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
> -			(void *)KASAN_SHADOW_END);
> +				   kasan_mem_to_shadow(cpu_entry_area_begin));
> +
> +	kasan_populate_shadow((unsigned long)kasan_mem_to_shadow(cpu_entry_area_begin),
> +			      (unsigned long)kasan_mem_to_shadow(cpu_entry_area_end),
> +		0);
> +
> +	kasan_populate_zero_shadow(kasan_mem_to_shadow(cpu_entry_area_end),

Seems we need to round_up kasan_mem_to_shadow(cpu_entry_area_end) to the next page
(or alternatively - round_up(cpu_entry_area_end, KASAN_SHADOW_SCALE_SIZE*PAGE_SIZE)).
Otherwise, kasan_populate_zero_shadow() will overpopulate the last shadow page of cpu_entry area with kasan_zero_page.

We don't necessarily need to round_down(kasan_mem_to_shadow(cpu_entry_area_begin), PAGE_SIZE) because
kasan_populate_zero_shadow() will not populate the last 'incomplete' page and kasan_populate_shadow()
does round_down() internally, which is exactly what we want here. But it might be better to round_down()
explicitly anyway, to avoid relying on such subtle implementation details.

> +				   (void *)KASAN_SHADOW_END);
>  
>  	load_cr3(init_top_pgt);
>  	__flush_tlb_all();
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 05/19] x86/kasan/64: Teach KASAN about the cpu_entry_area
  2017-11-24 13:16   ` Andrey Ryabinin
@ 2017-11-24 16:27     ` Andy Lutomirski
  2017-11-24 17:21       ` [PATCH v4] x86/mm/kasan: " Andrey Ryabinin
  2017-11-24 17:26       ` [PATCH v3 05/19] x86/kasan/64: " Andrey Ryabinin
  0 siblings, 2 replies; 27+ messages in thread
From: Andy Lutomirski @ 2017-11-24 16:27 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Dave Hansen, Linus Torvalds, Josh Poimboeuf,
	Alexander Potapenko, Dmitry Vyukov, kasan-dev

On Fri, Nov 24, 2017 at 5:16 AM, Andrey Ryabinin
<aryabinin@virtuozzo.com> wrote:
>
>
> On 11/24/2017 07:32 AM, Andy Lutomirski wrote:
>> The cpu_entry_area will contain stacks.  Make sure that KASAN has
>> appropriate shadow mappings for them.
>>
>> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
>> Cc: Alexander Potapenko <glider@google.com>
>> Cc: Dmitry Vyukov <dvyukov@google.com>
>> Cc: kasan-dev@googlegroups.com
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  arch/x86/mm/kasan_init_64.c | 13 ++++++++++++-
>>  1 file changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
>> index 99dfed6dfef8..54561dce742e 100644
>> --- a/arch/x86/mm/kasan_init_64.c
>> +++ b/arch/x86/mm/kasan_init_64.c
>> @@ -277,6 +277,7 @@ void __init kasan_early_init(void)
>>  void __init kasan_init(void)
>>  {
>>       int i;
>> +     void *cpu_entry_area_begin, *cpu_entry_area_end;
>>
>>  #ifdef CONFIG_KASAN_INLINE
>>       register_die_notifier(&kasan_die_notifier);
>> @@ -329,8 +330,18 @@ void __init kasan_init(void)
>>                             (unsigned long)kasan_mem_to_shadow(_end),
>>                             early_pfn_to_nid(__pa(_stext)));
>>
>> +     cpu_entry_area_begin = (void *)(__fix_to_virt(FIX_CPU_ENTRY_AREA_BOTTOM));
>> +     cpu_entry_area_end = (void *)(__fix_to_virt(FIX_CPU_ENTRY_AREA_TOP) + PAGE_SIZE);
>> +
>>       kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
>> -                     (void *)KASAN_SHADOW_END);
>> +                                kasan_mem_to_shadow(cpu_entry_area_begin));
>> +
>> +     kasan_populate_shadow((unsigned long)kasan_mem_to_shadow(cpu_entry_area_begin),
>> +                           (unsigned long)kasan_mem_to_shadow(cpu_entry_area_end),
>> +             0);
>> +
>> +     kasan_populate_zero_shadow(kasan_mem_to_shadow(cpu_entry_area_end),
>
> Seems we need to round_up kasan_mem_to_shadow(cpu_entry_area_end) to the next page
> (or alternatively - round_up(cpu_entry_area_end, KASAN_SHADOW_SCALE_SIZE*PAGE_SIZE)).
> Otherwise, kasan_populate_zero_shadow() will overpopulate the last shadow page of cpu_entry area with kasan_zero_page.
>
> We don't necessarily need to round_down(kasan_mem_to_shadow(cpu_entry_area_begin), PAGE_SIZE) because
> kasan_populate_zero_shadow() will not populate the last 'incomplete' page and kasan_populate_shadow()
> does round_down() internally, which is exactly what we want here. But it might be better to round_down()
> explicitly anyway, to avoid relying on such subtle implementation details.

Any chance you could send a fixup patch or a replacement patch?  You
obviously understand this code *way* better than I do.

Or you could do my table-based approach and fix it permanently... :)

>
>> +                                (void *)KASAN_SHADOW_END);
>>
>>       load_cr3(init_top_pgt);
>>       __flush_tlb_all();
>>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v4] x86/mm/kasan: Teach KASAN about the cpu_entry_area
  2017-11-24 16:27     ` Andy Lutomirski
@ 2017-11-24 17:21       ` Andrey Ryabinin
  2017-11-24 19:26         ` Andy Lutomirski
  2017-11-24 17:26       ` [PATCH v3 05/19] x86/kasan/64: " Andrey Ryabinin
  1 sibling, 1 reply; 27+ messages in thread
From: Andrey Ryabinin @ 2017-11-24 17:21 UTC (permalink / raw)
  To: Andy Lutomirski, X86 ML
  Cc: Alexander Potapenko, Dmitry Vyukov, kasan-dev, Andrey Ryabinin,
	Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf

From: Andy Lutomirski <luto@kernel.org>

The cpu_entry_area will contain stacks.  Make sure that KASAN has
appropriate shadow mappings for them.

Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: kasan-dev@googlegroups.com
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
---
 arch/x86/mm/kasan_init_64.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 99dfed6dfef8..9ec70d780f1f 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -277,6 +277,7 @@ void __init kasan_early_init(void)
 void __init kasan_init(void)
 {
 	int i;
+	void *shadow_cpu_entry_begin, *shadow_cpu_entry_end;
 
 #ifdef CONFIG_KASAN_INLINE
 	register_die_notifier(&kasan_die_notifier);
@@ -329,8 +330,23 @@ void __init kasan_init(void)
 			      (unsigned long)kasan_mem_to_shadow(_end),
 			      early_pfn_to_nid(__pa(_stext)));
 
+	shadow_cpu_entry_begin = (void *)__fix_to_virt(FIX_CPU_ENTRY_AREA_BOTTOM);
+	shadow_cpu_entry_begin = kasan_mem_to_shadow(shadow_cpu_entry_begin);
+	shadow_cpu_entry_begin = (void *)round_down((unsigned long)shadow_cpu_entry_begin,
+						PAGE_SIZE);
+
+	shadow_cpu_entry_end = (void *)(__fix_to_virt(FIX_CPU_ENTRY_AREA_TOP) + PAGE_SIZE);
+	shadow_cpu_entry_end = kasan_mem_to_shadow(shadow_cpu_entry_end);
+	shadow_cpu_entry_end = (void *)round_up((unsigned long)shadow_cpu_entry_end,
+					PAGE_SIZE);
+
 	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
-			(void *)KASAN_SHADOW_END);
+				   shadow_cpu_entry_begin);
+
+	kasan_populate_shadow((unsigned long)shadow_cpu_entry_begin,
+			      (unsigned long)shadow_cpu_entry_end, 0);
+
+	kasan_populate_zero_shadow(shadow_cpu_entry_end, (void *)KASAN_SHADOW_END);
 
 	load_cr3(init_top_pgt);
 	__flush_tlb_all();
-- 
2.13.6

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 05/19] x86/kasan/64: Teach KASAN about the cpu_entry_area
  2017-11-24 16:27     ` Andy Lutomirski
  2017-11-24 17:21       ` [PATCH v4] x86/mm/kasan: " Andrey Ryabinin
@ 2017-11-24 17:26       ` Andrey Ryabinin
  1 sibling, 0 replies; 27+ messages in thread
From: Andrey Ryabinin @ 2017-11-24 17:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Alexander Potapenko,
	Dmitry Vyukov, kasan-dev

On 11/24/2017 07:27 PM, Andy Lutomirski wrote:

>>> +     cpu_entry_area_begin = (void *)(__fix_to_virt(FIX_CPU_ENTRY_AREA_BOTTOM));
>>> +     cpu_entry_area_end = (void *)(__fix_to_virt(FIX_CPU_ENTRY_AREA_TOP) + PAGE_SIZE);
>>> +
>>>       kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
>>> -                     (void *)KASAN_SHADOW_END);
>>> +                                kasan_mem_to_shadow(cpu_entry_area_begin));
>>> +
>>> +     kasan_populate_shadow((unsigned long)kasan_mem_to_shadow(cpu_entry_area_begin),
>>> +                           (unsigned long)kasan_mem_to_shadow(cpu_entry_area_end),
>>> +             0);
>>> +
>>> +     kasan_populate_zero_shadow(kasan_mem_to_shadow(cpu_entry_area_end),
>>
>> Seems we need to round_up kasan_mem_to_shadow(cpu_entry_area_end) to the next page
>> (or alternatively - round_up(cpu_entry_area_end, KASAN_SHADOW_SCALE_SIZE*PAGE_SIZE)).
>> Otherwise, kasan_populate_zero_shadow() will overpopulate the last shadow page of cpu_entry area with kasan_zero_page.
>>
>> We don't necessarily need to round_down(kasan_mem_to_shadow(cpu_entry_area_begin), PAGE_SIZE) because
>> kasan_populate_zero_shadow() will not populate the last 'incomplete' page and kasan_populate_shadow()
>> does round_down() internally, which is exactly what we want here. But it might be better to round_down()
>> explicitly anyway, to avoid relying on such subtle implementation details.
> 
> Any chance you could send a fixup patch or a replacement patch?  You
> obviously understand this code *way* better than I do.
> 
> Or you could do my table-based approach and fix it permanently... :)
> 

Perhaps I'll look at table-based approach later. I've send you a fixed patch for now, to not slow you down.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4] x86/mm/kasan: Teach KASAN about the cpu_entry_area
  2017-11-24 17:21       ` [PATCH v4] x86/mm/kasan: " Andrey Ryabinin
@ 2017-11-24 19:26         ` Andy Lutomirski
  0 siblings, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2017-11-24 19:26 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Andy Lutomirski, X86 ML, Alexander Potapenko, Dmitry Vyukov,
	kasan-dev, Borislav Petkov, linux-kernel, Brian Gerst,
	Dave Hansen, Linus Torvalds, Josh Poimboeuf

On Fri, Nov 24, 2017 at 9:21 AM, Andrey Ryabinin
<aryabinin@virtuozzo.com> wrote:
> From: Andy Lutomirski <luto@kernel.org>
>
> The cpu_entry_area will contain stacks.  Make sure that KASAN has
> appropriate shadow mappings for them.

LGTM.  I'll update my tree, and Ingo can pick it up as a replacement
if he wants.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 14/19] x86/entry/64: Create a percpu SYSCALL entry trampoline
  2017-11-24 11:39   ` David Laight
@ 2017-12-01  5:56     ` Andy Lutomirski
  0 siblings, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2017-12-01  5:56 UTC (permalink / raw)
  To: David Laight
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Dave Hansen, Linus Torvalds, Josh Poimboeuf

On Fri, Nov 24, 2017 at 3:39 AM, David Laight <David.Laight@aculab.com> wrote:
> From: Andy Lutomirski
>> Sent: 24 November 2017 04:33
> ...
>> +     /*
>> +      * x86 lacks a near absolute jump, and we can't jump to the real
>> +      * entry text with a relative jump, so we fake it using retq.
>> +      */
>> +     pushq   $entry_SYSCALL_64_after_hwframe
>> +     retq
>
> Don't some of the cpus have hardware logic that follows call and return?
> The above will break that logic and slow things down (unless it is an
> expected sequence).

Indeed, and it makes a *huge* difference.

>
> Is there a scratch register that can be used for an indirect jump?

No, but I can fudge it.

>
>         David
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, back to index

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-24  4:32 [PATCH v3 00/19] Entry stack switching Andy Lutomirski
2017-11-24  4:32 ` [PATCH v3 01/19] x86/asm/64: Allocate and enable the SYSENTER stack Andy Lutomirski
2017-11-24  4:32 ` [PATCH v3 02/19] x86/dumpstack: Add get_stack_info() support for " Andy Lutomirski
2017-11-24  4:32 ` [PATCH v3 03/19] x86/gdt: Put per-cpu GDT remaps in ascending order Andy Lutomirski
2017-11-24  4:32 ` [PATCH v3 04/19] x86/fixmap: Generalize the GDT fixmap mechanism Andy Lutomirski
2017-11-24  4:32 ` [PATCH v3 05/19] x86/kasan/64: Teach KASAN about the cpu_entry_area Andy Lutomirski
2017-11-24 13:16   ` Andrey Ryabinin
2017-11-24 16:27     ` Andy Lutomirski
2017-11-24 17:21       ` [PATCH v4] x86/mm/kasan: " Andrey Ryabinin
2017-11-24 19:26         ` Andy Lutomirski
2017-11-24 17:26       ` [PATCH v3 05/19] x86/kasan/64: " Andrey Ryabinin
2017-11-24  4:32 ` [PATCH v3 06/19] x86/asm: Fix assumptions that the HW TSS is at the beginning of cpu_tss Andy Lutomirski
2017-11-24  4:32 ` [PATCH v3 07/19] x86/dumpstack: Handle stack overflow on all stacks Andy Lutomirski
2017-11-24  4:32 ` [PATCH v3 08/19] x86/asm: Move SYSENTER_stack to the beginning of struct tss_struct Andy Lutomirski
2017-11-24  4:32 ` [PATCH v3 09/19] x86/asm: Remap the TSS into the cpu entry area Andy Lutomirski
2017-11-24  4:32 ` [PATCH v3 10/19] x86/asm/64: Separate cpu_current_top_of_stack from TSS.sp0 Andy Lutomirski
2017-11-24  4:32 ` [PATCH v3 11/19] x86/espfix/64: Stop assuming that pt_regs is on the entry stack Andy Lutomirski
2017-11-24  4:32 ` [PATCH v3 12/19] x86/asm/64: Use a percpu trampoline stack for IDT entries Andy Lutomirski
2017-11-24  4:32 ` [PATCH v3 13/19] x86/asm/64: Return to userspace from the trampoline stack Andy Lutomirski
2017-11-24  4:33 ` [PATCH v3 14/19] x86/entry/64: Create a percpu SYSCALL entry trampoline Andy Lutomirski
2017-11-24 11:39   ` David Laight
2017-12-01  5:56     ` Andy Lutomirski
2017-11-24  4:33 ` [PATCH v3 15/19] x86/irq: Remove an old outdated comment about context tracking races Andy Lutomirski
2017-11-24  4:33 ` [PATCH v3 16/19] x86/irq/64: In the stack overflow warning, print the offending IP Andy Lutomirski
2017-11-24  4:33 ` [PATCH v3 17/19] x86/entry/64: Move the IST stacks into cpu_entry_area Andy Lutomirski
2017-11-24  4:33 ` [PATCH v3 18/19] x86/entry/64: Remove the SYSENTER stack canary Andy Lutomirski
2017-11-24  4:33 ` [PATCH v3 19/19] x86/entry: Clean up SYSENTER_stack code Andy Lutomirski

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git
	git clone --mirror https://lore.kernel.org/lkml/8 lkml/git/8.git
	git clone --mirror https://lore.kernel.org/lkml/9 lkml/git/9.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git