* [PATCH 00/16] Entry stuff, in decent shape now
@ 2017-11-20 17:07 Andy Lutomirski
  2017-11-20 17:07 ` [PATCH 01/16] x86/asm/64: Allocate and enable the SYSENTER stack Andy Lutomirski
  0 siblings, 18 replies; 52+ messages in thread
From: Andy Lutomirski @ 2017-11-20 17:07 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

This sets up stack switching, including for SYSCALL.  I think it's
in decent shape.

Known issues:
 - KASAN is likely to be busted.  This could be fixed either by teaching
   KASAN that cpu_entry_area contains valid stacks (I have no clue how
   to go about doing this) or by rigging up the IST entry code to switch
   RSP to point to the direct-mapped copy of the stacks before calling
   into non-KASAN-excluded C code.

 - 32-bit kernels are failing the sigreturn_32 test.  But they're also
   failing without the patches, so I'm not sure this is a bug in the
   series per se.  Needs further investigation.  (Off the top of my head,
   this could be further fallout from Thomas's IDT rework.)

 - I think we're going to want a way to turn the stack switching on and
   off either at boot time or at runtime.  It should be fairly straightforward
   to make it work.

 - I think the ORC unwinder isn't so good at dealing with stack overflows.
   It seems to bail out too early, resulting in lots of '?' entries.  This
   isn't a regression with this series -- it's just something that could
   be improved.

Changes:
 - This is quite massively changed from last time.
 - 32-bit seems to build and mostly work
 - KASAN is less broken now

Andy Lutomirski (16):
  x86/asm/64: Allocate and enable the SYSENTER stack
  x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
  x86/gdt: Put per-cpu GDT remaps in ascending order
  x86/fixmap: Generalize the GDT fixmap mechanism
  x86/asm: Fix assumptions that the HW TSS is at the beginning of
    cpu_tss
  x86/dumpstack: Handle stack overflow on all stacks
  x86/asm: Move SYSENTER_stack to the beginning of struct tss_struct
  x86/asm: Remap the TSS into the cpu entry area
  x86/asm/64: Separate cpu_current_top_of_stack from TSS.sp0
  x86/espfix/64: Stop assuming that pt_regs is on the entry stack
  x86/asm/64: Use a percpu trampoline stack for IDT entries
  x86/asm/64: Return to userspace from the trampoline stack
  x86/entry/64: Create a percpu SYSCALL entry trampoline
  x86/irq: Remove an old outdated comment about context tracking races
  x86/irq/64: In the stack overflow warning, print the offending IP
  x86/entry/64: Move the IST stacks into cpu_entry_area

 arch/x86/entry/entry_32.S          |   6 +-
 arch/x86/entry/entry_64.S          | 177 +++++++++++++++++++++++++++++++++----
 arch/x86/entry/entry_64_compat.S   |   6 +-
 arch/x86/include/asm/desc.h        |  11 +--
 arch/x86/include/asm/fixmap.h      |  55 +++++++++++-
 arch/x86/include/asm/processor.h   |  49 ++++++----
 arch/x86/include/asm/stacktrace.h  |   3 +
 arch/x86/include/asm/switch_to.h   |   2 +-
 arch/x86/include/asm/thread_info.h |   2 +-
 arch/x86/include/asm/traps.h       |   1 -
 arch/x86/kernel/asm-offsets.c      |   9 ++
 arch/x86/kernel/asm-offsets_32.c   |   5 --
 arch/x86/kernel/asm-offsets_64.c   |   1 +
 arch/x86/kernel/cpu/common.c       | 136 +++++++++++++++++++++-------
 arch/x86/kernel/doublefault.c      |  36 ++++----
 arch/x86/kernel/dumpstack.c        |  46 +++++++---
 arch/x86/kernel/dumpstack_32.c     |   6 ++
 arch/x86/kernel/dumpstack_64.c     |   6 ++
 arch/x86/kernel/irq.c              |  12 ---
 arch/x86/kernel/irq_64.c           |   4 +-
 arch/x86/kernel/process.c          |  12 ++-
 arch/x86/kernel/process_64.c       |   1 +
 arch/x86/kernel/traps.c            |  25 +++---
 arch/x86/kernel/vmlinux.lds.S      |  10 +++
 arch/x86/power/cpu.c               |  16 ++--
 arch/x86/xen/mmu_pv.c              |   2 +-
 26 files changed, 487 insertions(+), 152 deletions(-)

-- 
2.13.6

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 01/16] x86/asm/64: Allocate and enable the SYSENTER stack
  2017-11-20 17:07 [PATCH 00/16] Entry stuff, in decent shape now Andy Lutomirski
@ 2017-11-20 17:07 ` Andy Lutomirski
  2017-11-20 21:55   ` Thomas Gleixner
  2017-11-21 10:47   ` Borislav Petkov
  2017-11-20 17:07 ` [PATCH 02/16] x86/dumpstack: Add get_stack_info() support for " Andy Lutomirski
From: Andy Lutomirski @ 2017-11-20 17:07 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

This will simplify future changes that want scratch variables early in
the SYSENTER handler -- they'll be able to spill registers to the
stack.  It also lets us get rid of a SWAPGS_UNSAFE_STACK user.

This does not depend on CONFIG_IA32_EMULATION because we'll want the
stack space even without IA32 emulation.

As far as I can tell, the reason that this wasn't done from day 1 is
that we use IST for #DB and #BP, which is IMO rather nasty and causes
a lot more problems than it solves.  But, since #DB uses IST, we don't
actually need a real stack for SYSENTER (because SYSENTER with TF set
will invoke #DB on the IST stack rather than the SYSENTER stack).
I want to remove IST usage from these vectors some day, and this patch
is a prerequisite for that as well.
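As a standalone illustration (plain userspace C; toy_tss and sysenter_sp() are
made-up names, not the kernel's), the offsetofend() arithmetic used below to
point MSR_IA32_SYSENTER_ESP at the top of the embedded stack works like this --
x86 stacks grow down, so the MSR wants the address one past the end of the
array:

```c
/*
 * Standalone sketch, not kernel code: toy_tss and sysenter_sp() are
 * made up for illustration.  offsetofend() yields the offset one past
 * the end of a member, which is exactly the initial stack pointer a
 * fresh, downward-growing stack wants.
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define offsetofend(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

struct toy_tss {
	unsigned long SYSENTER_stack_canary;
	unsigned long SYSENTER_stack[64];
};

/* What would be written to MSR_IA32_SYSENTER_ESP for this instance. */
static uintptr_t sysenter_sp(const struct toy_tss *t)
{
	return (uintptr_t)t + offsetofend(struct toy_tss, SYSENTER_stack);
}
```

The result is the first byte after SYSENTER_stack[], so the very first push on
SYSENTER entry lands in SYSENTER_stack[63].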

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64_compat.S | 2 +-
 arch/x86/include/asm/processor.h | 3 ---
 arch/x86/kernel/asm-offsets.c    | 5 +++++
 arch/x86/kernel/asm-offsets_32.c | 5 -----
 arch/x86/kernel/cpu/common.c     | 4 +++-
 arch/x86/kernel/process.c        | 2 --
 arch/x86/kernel/traps.c          | 3 +--
 7 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 568e130d932c..dcc6987f9bae 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -48,7 +48,7 @@
  */
 ENTRY(entry_SYSENTER_compat)
 	/* Interrupts are off on entry. */
-	SWAPGS_UNSAFE_STACK
+	SWAPGS
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/*
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 2db7cf720b04..789dad5da20f 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -339,14 +339,11 @@ struct tss_struct {
 	 */
 	unsigned long		io_bitmap[IO_BITMAP_LONGS + 1];
 
-#ifdef CONFIG_X86_32
 	/*
 	 * Space for the temporary SYSENTER stack.
 	 */
 	unsigned long		SYSENTER_stack_canary;
 	unsigned long		SYSENTER_stack[64];
-#endif
-
 } ____cacheline_aligned;
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss);
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 8ea78275480d..b275863128eb 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -93,4 +93,9 @@ void common(void) {
 
 	BLANK();
 	DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));
+
+	/* Offset from cpu_tss to SYSENTER_stack */
+	OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
+	/* Size of SYSENTER_stack */
+	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
 }
diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index dedf428b20b6..52ce4ea16e53 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -50,11 +50,6 @@ void foo(void)
 	DEFINE(TSS_sysenter_sp0, offsetof(struct tss_struct, x86_tss.sp0) -
 	       offsetofend(struct tss_struct, SYSENTER_stack));
 
-	/* Offset from cpu_tss to SYSENTER_stack */
-	OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
-	/* Size of SYSENTER_stack */
-	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
-
 #ifdef CONFIG_CC_STACKPROTECTOR
 	BLANK();
 	OFFSET(stack_canary_offset, stack_canary, canary);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 13ae9e5eec2f..1243f732810b 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1384,7 +1384,9 @@ void syscall_init(void)
 	 * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
 	 */
 	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
-	wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
+	wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
+		    (unsigned long)this_cpu_ptr(&cpu_tss) +
+		    offsetofend(struct tss_struct, SYSENTER_stack));
 	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
 #else
 	wrmsrl(MSR_CSTAR, (unsigned long)ignore_sysret);
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 97fb3e5737f5..35d674157fda 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -71,9 +71,7 @@ __visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
 	  */
 	.io_bitmap		= { [0 ... IO_BITMAP_LONGS] = ~0 },
 #endif
-#ifdef CONFIG_X86_32
 	.SYSENTER_stack_canary	= STACK_END_MAGIC,
-#endif
 };
 EXPORT_PER_CPU_SYMBOL(cpu_tss);
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 989514c94a55..81841f27beff 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -795,14 +795,13 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
 	debug_stack_usage_dec();
 
 exit:
-#if defined(CONFIG_X86_32)
 	/*
 	 * This is the most likely code path that involves non-trivial use
 	 * of the SYSENTER stack.  Check that we haven't overrun it.
 	 */
 	WARN(this_cpu_read(cpu_tss.SYSENTER_stack_canary) != STACK_END_MAGIC,
 	     "Overran or corrupted SYSENTER stack\n");
-#endif
+
 	ist_exit(regs);
 }
 NOKPROBE_SYMBOL(do_debug);
-- 
2.13.6


* [PATCH 02/16] x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
  2017-11-20 17:07 [PATCH 00/16] Entry stuff, in decent shape now Andy Lutomirski
  2017-11-20 17:07 ` [PATCH 01/16] x86/asm/64: Allocate and enable the SYSENTER stack Andy Lutomirski
@ 2017-11-20 17:07 ` Andy Lutomirski
  2017-11-20 20:42   ` Josh Poimboeuf
  2017-11-20 17:07 ` [PATCH 03/16] x86/gdt: Put per-cpu GDT remaps in ascending order Andy Lutomirski
From: Andy Lutomirski @ 2017-11-20 17:07 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

get_stack_info() doesn't currently know about the SYSENTER stack, so
unwinding will fail if we entered the kernel on the SYSENTER stack
and haven't fully switched off.  Teach get_stack_info() about the
SYSENTER stack.
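For illustration, the half-open range check that the new in_SYSENTER_stack()
helper performs can be sketched in plain C (toy types, not the kernel's
cpu_tss); the canary word is deliberately counted as part of the stack so the
unwinder can still classify a pointer that landed on a clobbered canary:

```c
/*
 * Sketch of the half-open range check (toy types; the real version
 * operates on this CPU's cpu_tss).  A pointer is "on" the stack iff
 * begin <= p < end.
 */
#include <assert.h>

struct toy_stack_info {
	char *begin, *end;
};

struct toy_tss {
	unsigned long SYSENTER_stack_canary;
	unsigned long SYSENTER_stack[64];
};

static int in_toy_SYSENTER_stack(void *p, struct toy_tss *tss,
				 struct toy_stack_info *info)
{
	/* Treat the canary as part of the stack for unwinding purposes. */
	char *begin = (char *)&tss->SYSENTER_stack_canary;
	char *end   = (char *)(tss->SYSENTER_stack + 64);

	if ((char *)p < begin || (char *)p >= end)
		return 0;

	info->begin = begin;
	info->end   = end;
	return 1;
}
```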

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/stacktrace.h |  3 +++
 arch/x86/kernel/dumpstack.c       | 19 +++++++++++++++++++
 arch/x86/kernel/dumpstack_32.c    |  6 ++++++
 arch/x86/kernel/dumpstack_64.c    |  6 ++++++
 4 files changed, 34 insertions(+)

diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index 8da111b3c342..b6318b320ddb 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -16,6 +16,7 @@ enum stack_type {
 	STACK_TYPE_TASK,
 	STACK_TYPE_IRQ,
 	STACK_TYPE_SOFTIRQ,
+	STACK_TYPE_SYSENTER,
 	STACK_TYPE_EXCEPTION,
 	STACK_TYPE_EXCEPTION_LAST = STACK_TYPE_EXCEPTION + N_EXCEPTION_STACKS-1,
 };
@@ -28,6 +29,8 @@ struct stack_info {
 bool in_task_stack(unsigned long *stack, struct task_struct *task,
 		   struct stack_info *info);
 
+bool in_SYSENTER_stack(unsigned long *stack, struct stack_info *info);
+
 int get_stack_info(unsigned long *stack, struct task_struct *task,
 		   struct stack_info *info, unsigned long *visit_mask);
 
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index f13b4c00a5de..dba63f286ea8 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -43,6 +43,25 @@ bool in_task_stack(unsigned long *stack, struct task_struct *task,
 	return true;
 }
 
+bool in_SYSENTER_stack(unsigned long *stack, struct stack_info *info)
+{
+	struct tss_struct *tss = this_cpu_ptr(&cpu_tss);
+
+	/* Treat the canary as part of the stack for unwinding purposes. */
+	void *begin = &tss->SYSENTER_stack_canary;
+	void *end = (void *)&tss->SYSENTER_stack + sizeof(tss->SYSENTER_stack);
+
+	if ((void *)stack < begin || (void *)stack >= end)
+		return false;
+
+	info->type	= STACK_TYPE_SYSENTER;
+	info->begin	= begin;
+	info->end	= end;
+	info->next_sp	= NULL;
+
+	return true;
+}
+
 static void printk_stack_address(unsigned long address, int reliable,
 				 char *log_lvl)
 {
diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index daefae83a3aa..75d11ccd4611 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -26,6 +26,9 @@ const char *stack_type_name(enum stack_type type)
 	if (type == STACK_TYPE_SOFTIRQ)
 		return "SOFTIRQ";
 
+	if (type == STACK_TYPE_SYSENTER)
+		return "SYSENTER";
+
 	return NULL;
 }
 
@@ -93,6 +96,9 @@ int get_stack_info(unsigned long *stack, struct task_struct *task,
 	if (task != current)
 		goto unknown;
 
+	if (in_SYSENTER_stack(stack, info))
+		goto recursion_check;
+
 	if (in_hardirq_stack(stack, info))
 		goto recursion_check;
 
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 88ce2ffdb110..2ccca6321ae5 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -37,6 +37,9 @@ const char *stack_type_name(enum stack_type type)
 	if (type == STACK_TYPE_IRQ)
 		return "IRQ";
 
+	if (type == STACK_TYPE_SYSENTER)
+		return "SYSENTER";
+
 	if (type >= STACK_TYPE_EXCEPTION && type <= STACK_TYPE_EXCEPTION_LAST)
 		return exception_stack_names[type - STACK_TYPE_EXCEPTION];
 
@@ -115,6 +118,9 @@ int get_stack_info(unsigned long *stack, struct task_struct *task,
 	if (in_irq_stack(stack, info))
 		goto recursion_check;
 
+	if (in_SYSENTER_stack(stack, info))
+		goto recursion_check;
+
 	goto unknown;
 
 recursion_check:
-- 
2.13.6


* [PATCH 03/16] x86/gdt: Put per-cpu GDT remaps in ascending order
  2017-11-20 17:07 [PATCH 00/16] Entry stuff, in decent shape now Andy Lutomirski
  2017-11-20 17:07 ` [PATCH 01/16] x86/asm/64: Allocate and enable the SYSENTER stack Andy Lutomirski
  2017-11-20 17:07 ` [PATCH 02/16] x86/dumpstack: Add get_stack_info() support for " Andy Lutomirski
@ 2017-11-20 17:07 ` Andy Lutomirski
  2017-11-20 21:56   ` Thomas Gleixner
  2017-11-21 17:16   ` Borislav Petkov
  2017-11-20 17:07 ` [PATCH 04/16] x86/fixmap: Generalize the GDT fixmap mechanism Andy Lutomirski
From: Andy Lutomirski @ 2017-11-20 17:07 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

We currently have CPU 0's GDT at the top of the GDT range and
higher-numbered CPUs at lower addresses.  This happens because the
fixmap is upside down (index 0 is the top of the fixmap).

Flip it so that GDTs are in ascending order by virtual address.
This will simplify a future patch that will generalize the GDT
remap to contain multiple pages.
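The inverted index arithmetic is easier to see in a toy model (the constants
below are illustrative, not the real x86 fixmap layout): with index 0 at the
highest address, mapping cpu N to END - N makes addresses ascend with the CPU
number:

```c
/*
 * Toy model of the fixmap's inverted indexing.  All constants are
 * made up for illustration; only the shape of the arithmetic matches
 * the patch.
 */
#include <assert.h>
#include <stdint.h>

#define TOY_FIXADDR_TOP		0xffffffff00000000ULL
#define TOY_PAGE_SIZE		4096ULL
#define TOY_GDT_REMAP_BEGIN	100U
#define TOY_NR_CPUS		8U
#define TOY_GDT_REMAP_END	(TOY_GDT_REMAP_BEGIN + TOY_NR_CPUS - 1)

static uint64_t toy_fix_to_virt(unsigned int idx)
{
	/* Higher index => lower virtual address, as in the real fixmap. */
	return TOY_FIXADDR_TOP - (uint64_t)(idx + 1) * TOY_PAGE_SIZE;
}

static unsigned int toy_gdt_index(unsigned int cpu)
{
	return TOY_GDT_REMAP_END - cpu;	/* the ascending-order mapping */
}
```

With BEGIN + cpu, CPU 0 would sit at the top and addresses would descend; with
END - cpu, each successive CPU's GDT sits exactly one page higher.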

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/desc.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 4011cb03ef08..95cd95eb7285 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -63,7 +63,7 @@ static inline struct desc_struct *get_current_gdt_rw(void)
 /* Get the fixmap index for a specific processor */
 static inline unsigned int get_cpu_gdt_ro_index(int cpu)
 {
-	return FIX_GDT_REMAP_BEGIN + cpu;
+	return FIX_GDT_REMAP_END - cpu;
 }
 
 /* Provide the fixmap address of the remapped GDT */
-- 
2.13.6


* [PATCH 04/16] x86/fixmap: Generalize the GDT fixmap mechanism
  2017-11-20 17:07 [PATCH 00/16] Entry stuff, in decent shape now Andy Lutomirski
  2017-11-20 17:07 ` [PATCH 03/16] x86/gdt: Put per-cpu GDT remaps in ascending order Andy Lutomirski
@ 2017-11-20 17:07 ` Andy Lutomirski
  2017-11-20 22:01   ` Thomas Gleixner
  2017-11-20 17:07 ` [PATCH 05/16] x86/asm: Fix assumptions that the HW TSS is at the beginning of cpu_tss Andy Lutomirski
From: Andy Lutomirski @ 2017-11-20 17:07 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

Currently, the GDT is an ad-hoc array of pages, one per CPU, in the
fixmap.  Generalize it to be an array of a new struct cpu_entry_area
so that we can cleanly add new things to it.
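A sketch of the page-index arithmetic introduced below, with small illustrative
constants (the real values come from the fixmap enum and sizeof(struct
cpu_entry_area)): each CPU owns CPU_ENTRY_AREA_PAGES consecutive fixmap slots,
and because fixmap indices grow toward lower addresses, CPU 0's area starts at
the BOTTOM index:

```c
/*
 * Toy version of __get_cpu_entry_area_page_index().  Constants are
 * illustrative only.
 */
#include <assert.h>

#define TOY_ENTRY_AREA_PAGES	2
#define TOY_NR_CPUS		4
#define TOY_AREA_TOP		50
#define TOY_AREA_BOTTOM \
	(TOY_AREA_TOP + TOY_ENTRY_AREA_PAGES * TOY_NR_CPUS - 1)

static int toy_entry_area_page_index(int cpu, int page)
{
	return TOY_AREA_BOTTOM - cpu * TOY_ENTRY_AREA_PAGES - page;
}
```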

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/desc.h   |  9 +--------
 arch/x86/include/asm/fixmap.h | 36 ++++++++++++++++++++++++++++++++++--
 arch/x86/kernel/cpu/common.c  | 14 +++++++-------
 arch/x86/xen/mmu_pv.c         |  2 +-
 4 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 95cd95eb7285..194ffab00ebe 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -60,17 +60,10 @@ static inline struct desc_struct *get_current_gdt_rw(void)
 	return this_cpu_ptr(&gdt_page)->gdt;
 }
 
-/* Get the fixmap index for a specific processor */
-static inline unsigned int get_cpu_gdt_ro_index(int cpu)
-{
-	return FIX_GDT_REMAP_END - cpu;
-}
-
 /* Provide the fixmap address of the remapped GDT */
 static inline struct desc_struct *get_cpu_gdt_ro(int cpu)
 {
-	unsigned int idx = get_cpu_gdt_ro_index(cpu);
-	return (struct desc_struct *)__fix_to_virt(idx);
+	return (struct desc_struct *)&get_cpu_entry_area(cpu)->gdt;
 }
 
 /* Provide the current read-only GDT */
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index b0c505fe9a95..5ff8e251772a 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -44,6 +44,17 @@ extern unsigned long __FIXADDR_TOP;
 			 PAGE_SIZE)
 #endif
 
+/*
+ * cpu_entry_area is a percpu region in the fixmap that contains things
+ * needed by the CPU and early entry/exit code.  Real types aren't used here
+ * to avoid circular header dependencies.
+ */
+struct cpu_entry_area
+{
+	char gdt[PAGE_SIZE];
+};
+
+#define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
 
 /*
  * Here we define all the compile-time 'special' virtual
@@ -101,8 +112,8 @@ enum fixed_addresses {
 	FIX_LNW_VRTC,
 #endif
 	/* Fixmap entries to remap the GDTs, one per processor. */
-	FIX_GDT_REMAP_BEGIN,
-	FIX_GDT_REMAP_END = FIX_GDT_REMAP_BEGIN + NR_CPUS - 1,
+	FIX_CPU_ENTRY_AREA_TOP,
+	FIX_CPU_ENTRY_AREA_BOTTOM = FIX_CPU_ENTRY_AREA_TOP + (CPU_ENTRY_AREA_PAGES * NR_CPUS) - 1,
 
 #ifdef CONFIG_ACPI_APEI_GHES
 	/* Used for GHES mapping from assorted contexts */
@@ -191,5 +202,26 @@ void __init *early_memremap_decrypted_wp(resource_size_t phys_addr,
 void __early_set_fixmap(enum fixed_addresses idx,
 			phys_addr_t phys, pgprot_t flags);
 
+static inline unsigned int __get_cpu_entry_area_page_index(int cpu, int page)
+{
+	BUILD_BUG_ON(sizeof(struct cpu_entry_area) % PAGE_SIZE != 0);
+
+	return FIX_CPU_ENTRY_AREA_BOTTOM - cpu*CPU_ENTRY_AREA_PAGES - page;
+}
+
+#define __get_cpu_entry_area_offset_index(cpu, offset) ({		\
+	BUILD_BUG_ON(offset % PAGE_SIZE != 0);				\
+	__get_cpu_entry_area_page_index(cpu, offset / PAGE_SIZE);	\
+	})
+
+#define get_cpu_entry_area_index(cpu, field)				\
+	__get_cpu_entry_area_offset_index((cpu), offsetof(struct cpu_entry_area, field))
+
+static inline struct cpu_entry_area *get_cpu_entry_area(int cpu)
+{
+	return (struct cpu_entry_area *)
+		__fix_to_virt(__get_cpu_entry_area_page_index(cpu, 0));
+}
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_FIXMAP_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 1243f732810b..1e0843d15e64 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -488,12 +488,12 @@ void load_percpu_segment(int cpu)
 	load_stack_canary_segment();
 }
 
-/* Setup the fixmap mapping only once per-processor */
-static inline void setup_fixmap_gdt(int cpu)
+/* Setup the fixmap mappings only once per-processor */
+static inline void setup_cpu_entry_area(int cpu)
 {
 #ifdef CONFIG_X86_64
 	/* On 64-bit systems, we use a read-only fixmap GDT. */
-	pgprot_t prot = PAGE_KERNEL_RO;
+	pgprot_t gdt_prot = PAGE_KERNEL_RO;
 #else
 	/*
 	 * On native 32-bit systems, the GDT cannot be read-only because
@@ -504,11 +504,11 @@ static inline void setup_fixmap_gdt(int cpu)
 	 * On Xen PV, the GDT must be read-only because the hypervisor requires
 	 * it.
 	 */
-	pgprot_t prot = boot_cpu_has(X86_FEATURE_XENPV) ?
+	pgprot_t gdt_prot = boot_cpu_has(X86_FEATURE_XENPV) ?
 		PAGE_KERNEL_RO : PAGE_KERNEL;
 #endif
 
-	__set_fixmap(get_cpu_gdt_ro_index(cpu), get_cpu_gdt_paddr(cpu), prot);
+	__set_fixmap(get_cpu_entry_area_index(cpu, gdt), get_cpu_gdt_paddr(cpu), gdt_prot);
 }
 
 /* Load the original GDT from the per-cpu structure */
@@ -1612,7 +1612,7 @@ void cpu_init(void)
 	if (is_uv_system())
 		uv_cpu_init();
 
-	setup_fixmap_gdt(cpu);
+	setup_cpu_entry_area(cpu);
 	load_fixmap_gdt(cpu);
 }
 
@@ -1674,7 +1674,7 @@ void cpu_init(void)
 
 	fpu__init_cpu();
 
-	setup_fixmap_gdt(cpu);
+	setup_cpu_entry_area(cpu);
 	load_fixmap_gdt(cpu);
 }
 #endif
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index fc048ec686e7..6cf801ca1142 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2272,7 +2272,7 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
 #endif
 	case FIX_TEXT_POKE0:
 	case FIX_TEXT_POKE1:
-	case FIX_GDT_REMAP_BEGIN ... FIX_GDT_REMAP_END:
+	case FIX_CPU_ENTRY_AREA_TOP ... FIX_CPU_ENTRY_AREA_BOTTOM:
 		/* All local page mappings */
 		pte = pfn_pte(phys, prot);
 		break;
-- 
2.13.6


* [PATCH 05/16] x86/asm: Fix assumptions that the HW TSS is at the beginning of cpu_tss
  2017-11-20 17:07 [PATCH 00/16] Entry stuff, in decent shape now Andy Lutomirski
  2017-11-20 17:07 ` [PATCH 04/16] x86/fixmap: Generalize the GDT fixmap mechanism Andy Lutomirski
@ 2017-11-20 17:07 ` Andy Lutomirski
  2017-11-20 22:22   ` Thomas Gleixner
  2017-11-20 17:07 ` [PATCH 06/16] x86/dumpstack: Handle stack overflow on all stacks Andy Lutomirski
From: Andy Lutomirski @ 2017-11-20 17:07 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

I'm going to move SYSENTER_stack to the beginning of cpu_tss to help
detect overflow.  Before this can happen, I need to fix several code
paths that hardcode assumptions about the old layout.
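The offset fix can be illustrated standalone (toy struct layout, not the
kernel's): once the hardware TSS is no longer at offset 0 of the containing
struct, the I/O bitmap base programmed into it must be measured from the
x86_tss member, not from the start of the outer struct:

```c
/*
 * Standalone illustration.  The hardware interprets io_bitmap_base
 * relative to the start of the hardware TSS, so once x86_tss moves
 * away from offset 0, offsetof(outer, io_bitmap) alone is wrong.
 */
#include <assert.h>
#include <stddef.h>

struct toy_hw_tss {
	unsigned long sp0;
	unsigned short io_bitmap_base;
};

struct toy_tss_struct {
	unsigned long entry_stack[8];	/* x86_tss no longer at offset 0 */
	struct toy_hw_tss x86_tss;
	unsigned long io_bitmap[4];
};

#define TOY_IO_BITMAP_OFFSET \
	(offsetof(struct toy_tss_struct, io_bitmap) - \
	 offsetof(struct toy_tss_struct, x86_tss))
```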

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/desc.h      |  2 +-
 arch/x86/include/asm/processor.h |  4 ++--
 arch/x86/kernel/cpu/common.c     |  8 ++++----
 arch/x86/kernel/doublefault.c    | 36 +++++++++++++++++-------------------
 arch/x86/power/cpu.c             | 13 +++++++------
 5 files changed, 31 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 194ffab00ebe..aab4fe9f49f8 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -178,7 +178,7 @@ static inline void set_tssldt_descriptor(void *d, unsigned long addr,
 #endif
 }
 
-static inline void __set_tss_desc(unsigned cpu, unsigned int entry, void *addr)
+static inline void __set_tss_desc(unsigned cpu, unsigned int entry, struct x86_hw_tss *addr)
 {
 	struct desc_struct *d = get_cpu_gdt_rw(cpu);
 	tss_desc tss;
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 789dad5da20f..d32a3c88a968 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -162,7 +162,7 @@ enum cpuid_regs_idx {
 extern struct cpuinfo_x86	boot_cpu_data;
 extern struct cpuinfo_x86	new_cpu_data;
 
-extern struct tss_struct	doublefault_tss;
+extern struct x86_hw_tss	doublefault_tss;
 extern __u32			cpu_caps_cleared[NCAPINTS];
 extern __u32			cpu_caps_set[NCAPINTS];
 
@@ -322,7 +322,7 @@ struct x86_hw_tss {
 #define IO_BITMAP_BITS			65536
 #define IO_BITMAP_BYTES			(IO_BITMAP_BITS/8)
 #define IO_BITMAP_LONGS			(IO_BITMAP_BYTES/sizeof(long))
-#define IO_BITMAP_OFFSET		offsetof(struct tss_struct, io_bitmap)
+#define IO_BITMAP_OFFSET		(offsetof(struct tss_struct, io_bitmap) - offsetof(struct tss_struct, x86_tss))
 #define INVALID_IO_BITMAP_OFFSET	0x8000
 
 struct tss_struct {
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 1e0843d15e64..099fca92f6be 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1580,7 +1580,7 @@ void cpu_init(void)
 		}
 	}
 
-	t->x86_tss.io_bitmap_base = offsetof(struct tss_struct, io_bitmap);
+	t->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET;
 
 	/*
 	 * <= is required because the CPU will access up to
@@ -1599,7 +1599,7 @@ void cpu_init(void)
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
 	 */
-	set_tss_desc(cpu, t);
+	set_tss_desc(cpu, &t->x86_tss);
 	load_TR_desc();
 
 	load_mm_ldt(&init_mm);
@@ -1657,12 +1657,12 @@ void cpu_init(void)
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
 	 */
-	set_tss_desc(cpu, t);
+	set_tss_desc(cpu, &t->x86_tss);
 	load_TR_desc();
 
 	load_mm_ldt(&init_mm);
 
-	t->x86_tss.io_bitmap_base = offsetof(struct tss_struct, io_bitmap);
+	t->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET;
 
 #ifdef CONFIG_DOUBLEFAULT
 	/* Set up doublefault TSS pointer in the GDT */
diff --git a/arch/x86/kernel/doublefault.c b/arch/x86/kernel/doublefault.c
index 0e662c55ae90..0b8cedb20d6d 100644
--- a/arch/x86/kernel/doublefault.c
+++ b/arch/x86/kernel/doublefault.c
@@ -50,25 +50,23 @@ static void doublefault_fn(void)
 		cpu_relax();
 }
 
-struct tss_struct doublefault_tss __cacheline_aligned = {
-	.x86_tss = {
-		.sp0		= STACK_START,
-		.ss0		= __KERNEL_DS,
-		.ldt		= 0,
-		.io_bitmap_base	= INVALID_IO_BITMAP_OFFSET,
-
-		.ip		= (unsigned long) doublefault_fn,
-		/* 0x2 bit is always set */
-		.flags		= X86_EFLAGS_SF | 0x2,
-		.sp		= STACK_START,
-		.es		= __USER_DS,
-		.cs		= __KERNEL_CS,
-		.ss		= __KERNEL_DS,
-		.ds		= __USER_DS,
-		.fs		= __KERNEL_PERCPU,
-
-		.__cr3		= __pa_nodebug(swapper_pg_dir),
-	}
+struct x86_hw_tss doublefault_tss __cacheline_aligned = {
+	.sp0		= STACK_START,
+	.ss0		= __KERNEL_DS,
+	.ldt		= 0,
+	.io_bitmap_base	= INVALID_IO_BITMAP_OFFSET,
+
+	.ip		= (unsigned long) doublefault_fn,
+	/* 0x2 bit is always set */
+	.flags		= X86_EFLAGS_SF | 0x2,
+	.sp		= STACK_START,
+	.es		= __USER_DS,
+	.cs		= __KERNEL_CS,
+	.ss		= __KERNEL_DS,
+	.ds		= __USER_DS,
+	.fs		= __KERNEL_PERCPU,
+
+	.__cr3		= __pa_nodebug(swapper_pg_dir),
 };
 
 /* dummy for do_double_fault() call */
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index 84fcfde53f8f..50593e138281 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -165,12 +165,13 @@ static void fix_processor_context(void)
 	struct desc_struct *desc = get_cpu_gdt_rw(cpu);
 	tss_desc tss;
 #endif
-	set_tss_desc(cpu, t);	/*
-				 * This just modifies memory; should not be
-				 * necessary. But... This is necessary, because
-				 * 386 hardware has concept of busy TSS or some
-				 * similar stupidity.
-				 */
+
+	/*
+	 * This just modifies memory; should not be necessary. But... This is
+	 * necessary, because 386 hardware has concept of busy TSS or some
+	 * similar stupidity.
+	 */
+	set_tss_desc(cpu, &t->x86_tss);
 
 #ifdef CONFIG_X86_64
 	memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc));
-- 
2.13.6


* [PATCH 06/16] x86/dumpstack: Handle stack overflow on all stacks
  2017-11-20 17:07 [PATCH 00/16] Entry stuff, in decent shape now Andy Lutomirski
  2017-11-20 17:07 ` [PATCH 05/16] x86/asm: Fix assumptions that the HW TSS is at the beginning of cpu_tss Andy Lutomirski
@ 2017-11-20 17:07 ` Andy Lutomirski
  2017-11-20 17:07 ` [PATCH 07/16] x86/asm: Move SYSENTER_stack to the beginning of struct tss_struct Andy Lutomirski
From: Andy Lutomirski @ 2017-11-20 17:07 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

We currently special-case stack overflow on the task stack.  We're
going to start putting special stacks in the fixmap with a custom
layout, so they'll have guard pages, too.  Teach the unwinder to be
able to unwind an overflow of any of the stacks.
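The recovery step can be sketched standalone: rounding the overflowed pointer
up to the next page boundary (a kernel-style PAGE_ALIGN round-up) lands at the
bottom of the stack that was overflowed, assuming the guard page sits directly
below it:

```c
/*
 * Minimal sketch of the guard-page recovery step.  Assumes 4 KiB pages
 * and a guard page directly below the stack it protects.
 */
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE	4096UL
#define TOY_PAGE_ALIGN(x) (((x) + TOY_PAGE_SIZE - 1) & ~(TOY_PAGE_SIZE - 1))

static uintptr_t recover_from_guard_page(uintptr_t sp)
{
	/* Round up into the next page: the bottom of the overflowed stack. */
	return TOY_PAGE_ALIGN(sp);
}
```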

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/dumpstack.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index dba63f286ea8..e53a7e823c69 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -90,24 +90,30 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	 * - task stack
 	 * - interrupt stack
 	 * - HW exception stacks (double fault, nmi, debug, mce)
+	 * - SYSENTER stack
 	 *
-	 * x86-32 can have up to three stacks:
+	 * x86-32 can have up to four stacks:
 	 * - task stack
 	 * - softirq stack
 	 * - hardirq stack
+	 * - SYSENTER stack
 	 */
 	for (regs = NULL; stack; stack = PTR_ALIGN(stack_info.next_sp, sizeof(long))) {
 		const char *stack_name;
 
-		/*
-		 * If we overflowed the task stack into a guard page, jump back
-		 * to the bottom of the usable stack.
-		 */
-		if (task_stack_page(task) - (void *)stack < PAGE_SIZE)
-			stack = task_stack_page(task);
-
-		if (get_stack_info(stack, task, &stack_info, &visit_mask))
-			break;
+		if (get_stack_info(stack, task, &stack_info, &visit_mask)) {
+			/*
+			 * We weren't on a valid stack.  It's possible that
+			 * we overflowed a valid stack into a guard page.
+			 * See if the next page up is valid so that we can
+			 * generate some kind of backtrace if this happens.
+			 */
+			stack = (unsigned long *)
+				PAGE_ALIGN((unsigned long)stack);
+			if (get_stack_info(stack, task, &stack_info,
+					   &visit_mask))
+				break;
+		}
 
 		stack_name = stack_type_name(stack_info.type);
 		if (stack_name)
-- 
2.13.6


* [PATCH 07/16] x86/asm: Move SYSENTER_stack to the beginning of struct tss_struct
  2017-11-20 17:07 [PATCH 00/16] Entry stuff, in decent shape now Andy Lutomirski
  2017-11-20 17:07 ` [PATCH 06/16] x86/dumpstack: Handle stack overflow on all stacks Andy Lutomirski
@ 2017-11-20 17:07 ` Andy Lutomirski
  2017-11-20 23:37   ` Thomas Gleixner
  2017-11-20 17:07 ` [PATCH 08/16] x86/asm: Remap the TSS into the cpu entry area Andy Lutomirski
From: Andy Lutomirski @ 2017-11-20 17:07 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

I want SYSENTER_stack to have reliable overflow detection, which
means that it needs to be at the bottom of a page, not the top.
Move it to the beginning of struct tss_struct and page-align it.

Also add an assertion to make sure that the fixed hardware TSS
doesn't cross a page boundary.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/processor.h | 24 +++++++++++++++---------
 arch/x86/kernel/cpu/common.c     | 22 ++++++++++++++++++++++
 2 files changed, 37 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index d32a3c88a968..f816ed2601d4 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -327,7 +327,19 @@ struct x86_hw_tss {
 
 struct tss_struct {
 	/*
-	 * The hardware state:
+	 * Space for the temporary SYSENTER stack.  Used for the entry
+	 * trampoline as well.  Size it such that tss_struct ends up
+	 * as a multiple of PAGE_SIZE.  This calculation assumes that
+	 * io_bitmap is a multiple of PAGE_SIZE (8192 bytes) plus one
+	 * long.
+	 */
+	unsigned long		SYSENTER_stack_canary;
+	unsigned long		SYSENTER_stack[64];
+
+	/*
+	 * The fixed hardware portion.  This must not cross a page boundary,
+	 * lest we violate the SDM's advice and potentially trigger
+	 * errata.
 	 */
 	struct x86_hw_tss	x86_tss;
 
@@ -338,15 +350,9 @@ struct tss_struct {
 	 * be within the limit.
 	 */
 	unsigned long		io_bitmap[IO_BITMAP_LONGS + 1];
+} __attribute__((__aligned__(PAGE_SIZE)));
 
-	/*
-	 * Space for the temporary SYSENTER stack.
-	 */
-	unsigned long		SYSENTER_stack_canary;
-	unsigned long		SYSENTER_stack[64];
-} ____cacheline_aligned;
-
-DECLARE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss);
+DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss);
 
 /*
  * sizeof(unsigned long) coming from an extra "long" at the end
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 099fca92f6be..1868fe693fe1 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -509,6 +509,28 @@ static inline void setup_cpu_entry_area(int cpu)
 #endif
 
 	__set_fixmap(get_cpu_entry_area_index(cpu, gdt), get_cpu_gdt_paddr(cpu), gdt_prot);
+
+	/*
+	 *
+	 * The Intel SDM says (Volume 3, 7.2.1):
+	 *
+	 *  Avoid placing a page boundary in the part of the TSS that the
+	 *  processor reads during a task switch (the first 104 bytes). The
+	 *  processor may not correctly perform address translations if a
+	 *  boundary occurs in this area. During a task switch, the processor
+	 *  reads and writes into the first 104 bytes of each TSS (using
+	 *  contiguous physical addresses beginning with the physical address
+	 *  of the first byte of the TSS). So, after TSS access begins, if
+	 *  part of the 104 bytes is not physically contiguous, the processor
+	 *  will access incorrect information without generating a page-fault
+	 *  exception.
+	 *
+	 * There are also a lot of errata involving the TSS spanning a page
+	 * boundary.  Assert that we're not doing that.
+	 */
+	BUILD_BUG_ON((offsetof(struct tss_struct, x86_tss) ^
+		      offsetofend(struct tss_struct, x86_tss)) & PAGE_MASK);
+
 }
 
 /* Load the original GDT from the per-cpu structure */
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 08/16] x86/asm: Remap the TSS into the cpu entry area
  2017-11-20 17:07 [PATCH 00/16] Entry stuff, in decent shape now Andy Lutomirski
                   ` (6 preceding siblings ...)
  2017-11-20 17:07 ` [PATCH 07/16] x86/asm: Move SYSENTER_stack to the beginning of struct tss_struct Andy Lutomirski
@ 2017-11-20 17:07 ` Andy Lutomirski
  2017-11-20 17:07 ` [PATCH 09/16] x86/asm/64: Separate cpu_current_top_of_stack from TSS.sp0 Andy Lutomirski
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2017-11-20 17:07 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

This has a secondary purpose: it puts the entry stack into a region
with a well-controlled layout.  A subsequent patch will take
advantage of this to streamline the SYSCALL entry code to be able to
find it more easily.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_32.S     |  6 ++++--
 arch/x86/include/asm/fixmap.h |  7 +++++++
 arch/x86/kernel/asm-offsets.c |  3 +++
 arch/x86/kernel/cpu/common.c  | 40 ++++++++++++++++++++++++++++++++++------
 arch/x86/kernel/dumpstack.c   |  3 ++-
 arch/x86/power/cpu.c          | 11 ++++++-----
 6 files changed, 56 insertions(+), 14 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 4838037f97f6..0ab316c46806 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -941,7 +941,8 @@ ENTRY(debug)
 	movl	%esp, %eax			# pt_regs pointer
 
 	/* Are we currently on the SYSENTER stack? */
-	PER_CPU(cpu_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx)
+	movl	PER_CPU_VAR(cpu_entry_area), %ecx
+	addl	$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Ldebug_from_sysenter_stack
@@ -984,7 +985,8 @@ ENTRY(nmi)
 	movl	%esp, %eax			# pt_regs pointer
 
 	/* Are we currently on the SYSENTER stack? */
-	PER_CPU(cpu_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx)
+	movl	PER_CPU_VAR(cpu_entry_area), %ecx
+	addl	$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Lnmi_from_sysenter_stack
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 5ff8e251772a..b512882f9b0e 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -52,6 +52,13 @@ extern unsigned long __FIXADDR_TOP;
 struct cpu_entry_area
 {
 	char gdt[PAGE_SIZE];
+
+	/*
+	 * The gdt is just below cpu_tss and thus serves (on x86_64) as
+	 * a read-only guard page for the SYSENTER stack at the bottom
+	 * of the TSS region.
+	 */
+	struct tss_struct tss;
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index b275863128eb..55858b277cf6 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -98,4 +98,7 @@ void common(void) {
 	OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
 	/* Size of SYSENTER_stack */
 	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
+
+	/* Layout info for cpu_entry_area */
+	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
 }
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 1868fe693fe1..d551bb4339ec 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -488,6 +488,21 @@ void load_percpu_segment(int cpu)
 	load_stack_canary_segment();
 }
 
+static void set_percpu_fixmap_pages(int fixmap_index, void *ptr, int pages,
+				    pgprot_t prot)
+{
+	int i;
+
+	for (i = 0; i < pages; i++)
+		__set_fixmap(fixmap_index - i,
+			     per_cpu_ptr_to_phys(ptr + i*PAGE_SIZE), prot);
+}
+
+#ifdef CONFIG_X86_32
+/* The 32-bit entry code needs to find cpu_entry_area. */
+DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
+#endif
+
 /* Setup the fixmap mappings only once per-processor */
 static inline void setup_cpu_entry_area(int cpu)
 {
@@ -530,7 +545,15 @@ static inline void setup_cpu_entry_area(int cpu)
 	 */
 	BUILD_BUG_ON((offsetof(struct tss_struct, x86_tss) ^
 		      offsetofend(struct tss_struct, x86_tss)) & PAGE_MASK);
+	BUILD_BUG_ON(sizeof(struct tss_struct) % PAGE_SIZE != 0);
+	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, tss),
+				&per_cpu(cpu_tss, cpu),
+				sizeof(struct tss_struct) / PAGE_SIZE,
+				PAGE_KERNEL);
 
+#ifdef CONFIG_X86_32
+	this_cpu_write(cpu_entry_area, get_cpu_entry_area(cpu));
+#endif
 }
 
 /* Load the original GDT from the per-cpu structure */
@@ -1281,7 +1304,8 @@ void enable_sep_cpu(void)
 	wrmsr(MSR_IA32_SYSENTER_CS, tss->x86_tss.ss1, 0);
 
 	wrmsr(MSR_IA32_SYSENTER_ESP,
-	      (unsigned long)tss + offsetofend(struct tss_struct, SYSENTER_stack),
+	      (unsigned long)&get_cpu_entry_area(cpu)->tss +
+	      offsetofend(struct tss_struct, SYSENTER_stack),
 	      0);
 
 	wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long)entry_SYSENTER_32, 0);
@@ -1394,6 +1418,8 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
+	int cpu = smp_processor_id();
+
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
 	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
 
@@ -1407,7 +1433,7 @@ void syscall_init(void)
 	 */
 	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
 	wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
-		    (unsigned long)this_cpu_ptr(&cpu_tss) +
+		    (unsigned long)&get_cpu_entry_area(cpu)->tss +
 		    offsetofend(struct tss_struct, SYSENTER_stack));
 	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
 #else
@@ -1617,11 +1643,13 @@ void cpu_init(void)
 	initialize_tlbstate_and_flush();
 	enter_lazy_tlb(&init_mm, me);
 
+	setup_cpu_entry_area(cpu);
+
 	/*
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
 	 */
-	set_tss_desc(cpu, &t->x86_tss);
+	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
 
 	load_mm_ldt(&init_mm);
@@ -1634,7 +1662,6 @@ void cpu_init(void)
 	if (is_uv_system())
 		uv_cpu_init();
 
-	setup_cpu_entry_area(cpu);
 	load_fixmap_gdt(cpu);
 }
 
@@ -1675,11 +1702,13 @@ void cpu_init(void)
 	initialize_tlbstate_and_flush();
 	enter_lazy_tlb(&init_mm, curr);
 
+	setup_cpu_entry_area(cpu);
+
 	/*
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
 	 */
-	set_tss_desc(cpu, &t->x86_tss);
+	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
 
 	load_mm_ldt(&init_mm);
@@ -1696,7 +1725,6 @@ void cpu_init(void)
 
 	fpu__init_cpu();
 
-	setup_cpu_entry_area(cpu);
 	load_fixmap_gdt(cpu);
 }
 #endif
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index e53a7e823c69..6d98cfb64ed1 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -45,7 +45,8 @@ bool in_task_stack(unsigned long *stack, struct task_struct *task,
 
 bool in_SYSENTER_stack(unsigned long *stack, struct stack_info *info)
 {
-	struct tss_struct *tss = this_cpu_ptr(&cpu_tss);
+	int cpu = smp_processor_id();
+	struct tss_struct *tss = &get_cpu_entry_area(cpu)->tss;
 
 	/* Treat the canary as part of the stack for unwinding purposes. */
 	void *begin = &tss->SYSENTER_stack_canary;
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index 50593e138281..04d5157fe7f8 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -160,18 +160,19 @@ static void do_fpu_end(void)
 static void fix_processor_context(void)
 {
 	int cpu = smp_processor_id();
-	struct tss_struct *t = &per_cpu(cpu_tss, cpu);
 #ifdef CONFIG_X86_64
 	struct desc_struct *desc = get_cpu_gdt_rw(cpu);
 	tss_desc tss;
 #endif
 
 	/*
-	 * This just modifies memory; should not be necessary. But... This is
-	 * necessary, because 386 hardware has concept of busy TSS or some
-	 * similar stupidity.
+	 * We need to reload TR, which requires that we change the
+	 * GDT entry to indicate "available" first.
+	 *
+	 * XXX: This could probably all be replaced by a call to
+	 * force_reload_TR().
 	 */
-	set_tss_desc(cpu, &t->x86_tss);
+	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 
 #ifdef CONFIG_X86_64
 	memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc));
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 09/16] x86/asm/64: Separate cpu_current_top_of_stack from TSS.sp0
  2017-11-20 17:07 [PATCH 00/16] Entry stuff, in decent shape now Andy Lutomirski
                   ` (7 preceding siblings ...)
  2017-11-20 17:07 ` [PATCH 08/16] x86/asm: Remap the TSS into the cpu entry area Andy Lutomirski
@ 2017-11-20 17:07 ` Andy Lutomirski
  2017-11-20 17:07 ` [PATCH 10/16] x86/espfix/64: Stop assuming that pt_regs is on the entry stack Andy Lutomirski
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2017-11-20 17:07 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

On 64-bit kernels, we used to assume that TSS.sp0 was the current
top of stack.  With the addition of an entry trampoline, this will
no longer be the case.  Store the current top of stack in TSS.sp1,
which is otherwise unused but shares the same cacheline.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/processor.h   | 18 +++++++++++++-----
 arch/x86/include/asm/thread_info.h |  2 +-
 arch/x86/kernel/asm-offsets_64.c   |  1 +
 arch/x86/kernel/process.c          | 10 ++++++++++
 arch/x86/kernel/process_64.c       |  1 +
 5 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index f816ed2601d4..9a100be69dbf 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -304,7 +304,13 @@ struct x86_hw_tss {
 struct x86_hw_tss {
 	u32			reserved1;
 	u64			sp0;
+
+	/*
+	 * We store cpu_current_top_of_stack in sp1 so it's always accessible.
+	 * Linux does not use ring 1, so sp1 is not otherwise needed.
+	 */
 	u64			sp1;
+
 	u64			sp2;
 	u64			reserved2;
 	u64			ist[7];
@@ -366,6 +372,8 @@ DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss);
 
 #ifdef CONFIG_X86_32
 DECLARE_PER_CPU(unsigned long, cpu_current_top_of_stack);
+#else
+#define cpu_current_top_of_stack cpu_tss.x86_tss.sp1
 #endif
 
 /*
@@ -537,12 +545,12 @@ static inline void native_swapgs(void)
 
 static inline unsigned long current_top_of_stack(void)
 {
-#ifdef CONFIG_X86_64
-	return this_cpu_read_stable(cpu_tss.x86_tss.sp0);
-#else
-	/* sp0 on x86_32 is special in and around vm86 mode. */
+	/*
+	 *  We can't read directly from tss.sp0: sp0 on x86_32 is special in
+	 *  and around vm86 mode and sp0 on x86_64 is special because of the
+	 *  entry trampoline.
+	 */
 	return this_cpu_read_stable(cpu_current_top_of_stack);
-#endif
 }
 
 static inline bool on_thread_stack(void)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 70f425947dc5..44a04999791e 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -207,7 +207,7 @@ static inline int arch_within_stack_frames(const void * const stack,
 #else /* !__ASSEMBLY__ */
 
 #ifdef CONFIG_X86_64
-# define cpu_current_top_of_stack (cpu_tss + TSS_sp0)
+# define cpu_current_top_of_stack (cpu_tss + TSS_sp1)
 #endif
 
 #endif
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index 630212fa9b9d..ad649a8a74a0 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -63,6 +63,7 @@ int main(void)
 
 	OFFSET(TSS_ist, tss_struct, x86_tss.ist);
 	OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
+	OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
 	BLANK();
 
 #ifdef CONFIG_CC_STACKPROTECTOR
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 35d674157fda..86e83762e3b3 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -56,6 +56,16 @@ __visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
 		 * Poison it.
 		 */
 		.sp0 = (1UL << (BITS_PER_LONG-1)) + 1,
+
+#ifdef CONFIG_X86_64
+		/*
+		 * .sp1 is cpu_current_top_of_stack.  The init task never
+		 * runs user code, but cpu_current_top_of_stack should still
+		 * be well defined before the first context switch.
+		 */
+		.sp1 = TOP_OF_INIT_STACK,
+#endif
+
 #ifdef CONFIG_X86_32
 		.ss0 = __KERNEL_DS,
 		.ss1 = __KERNEL_CS,
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index eeeb34f85c25..bafe65b08697 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -462,6 +462,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	 * Switch the PDA and FPU contexts.
 	 */
 	this_cpu_write(current_task, next_p);
+	this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));
 
 	/* Reload sp0. */
 	update_sp0(next_p);
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 10/16] x86/espfix/64: Stop assuming that pt_regs is on the entry stack
  2017-11-20 17:07 [PATCH 00/16] Entry stuff, in decent shape now Andy Lutomirski
                   ` (8 preceding siblings ...)
  2017-11-20 17:07 ` [PATCH 09/16] x86/asm/64: Separate cpu_current_top_of_stack from TSS.sp0 Andy Lutomirski
@ 2017-11-20 17:07 ` Andy Lutomirski
  2017-11-20 17:07 ` [PATCH 11/16] x86/asm/64: Use a percpu trampoline stack for IDT entries Andy Lutomirski
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2017-11-20 17:07 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

When we start using an entry trampoline, a #GP from userspace will
be delivered on the entry stack, not on the task stack.  Fix the
espfix64 #DF fixup to set up #GP according to TSS.SP0, rather than
assuming that pt_regs + 1 == SP0.  This won't change anything
without an entry stack, but it will make the code continue to work
when an entry stack is added.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/traps.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 81841f27beff..1ea03027a4a9 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -358,7 +358,8 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 		regs->cs == __KERNEL_CS &&
 		regs->ip == (unsigned long)native_irq_return_iret)
 	{
-		struct pt_regs *normal_regs = task_pt_regs(current);
+		struct pt_regs *normal_regs =
+			(struct pt_regs *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
 
 		/* Fake a #GP(0) from userspace. */
 		memmove(&normal_regs->ip, (void *)regs->sp, 5*8);
@@ -389,7 +390,7 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 	 *
 	 *   Processors update CR2 whenever a page fault is detected. If a
 	 *   second page fault occurs while an earlier page fault is being
-	 *   deliv- ered, the faulting linear address of the second fault will
+	 *   delivered, the faulting linear address of the second fault will
 	 *   overwrite the contents of CR2 (replacing the previous
 	 *   address). These updates to CR2 occur even if the page fault
 	 *   results in a double fault or occurs during the delivery of a
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 11/16] x86/asm/64: Use a percpu trampoline stack for IDT entries
  2017-11-20 17:07 [PATCH 00/16] Entry stuff, in decent shape now Andy Lutomirski
                   ` (9 preceding siblings ...)
  2017-11-20 17:07 ` [PATCH 10/16] x86/espfix/64: Stop assuming that pt_regs is on the entry stack Andy Lutomirski
@ 2017-11-20 17:07 ` Andy Lutomirski
  2017-11-21  7:20   ` Ingo Molnar
  2017-11-21 18:57   ` Dave Hansen
  2017-11-20 17:07 ` [PATCH 12/16] x86/asm/64: Return to userspace from the trampoline stack Andy Lutomirski
                   ` (6 subsequent siblings)
  17 siblings, 2 replies; 52+ messages in thread
From: Andy Lutomirski @ 2017-11-20 17:07 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

Historically, IDT entries from usermode have always gone directly
to the running task's kernel stack.  Rearrange it so that we enter on
a percpu trampoline stack and then manually switch to the task's stack.
This touches a couple of extra cachelines, but it gives us a chance
to run some code before we touch the kernel stack.

XXX: The asm is gross and needs some significant cleanup.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64.S        | 61 ++++++++++++++++++++++++++++++++--------
 arch/x86/entry/entry_64_compat.S |  4 +++
 arch/x86/include/asm/switch_to.h |  2 +-
 arch/x86/include/asm/traps.h     |  1 -
 arch/x86/kernel/cpu/common.c     |  6 ++--
 arch/x86/kernel/traps.c          | 17 +++++------
 6 files changed, 68 insertions(+), 23 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index a2b30ec69497..9bd49bd79e9b 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -560,6 +560,14 @@ END(irq_entries_start)
 	.macro interrupt func
 	cld
 	ALLOC_PT_GPREGS_ON_STACK
+
+	testb	$3, CS(%rsp)
+	jz	1f
+	SWAPGS
+	call	switch_to_thread_stack
+	SWAPGS
+1:
+
 	SAVE_C_REGS
 	SAVE_EXTRA_REGS
 	ENCODE_FRAME_POINTER
@@ -827,6 +835,33 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
  */
 #define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss) + (TSS_ist + ((x) - 1) * 8)
 
+/*
+ * Switch to the thread stack.  This is called with the IRET frame and
+ * orig_ax in pt_regs and the rest of pt_regs allocated, but with all GPRs
+ * in the CPU registers.
+ */
+ENTRY(switch_to_thread_stack)
+	UNWIND_HINT_IRET_REGS offset=17*8
+
+	movq	%rdi, RDI+8(%rsp)
+	movq	%rsp, %rdi
+	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
+	UNWIND_HINT_IRET_REGS offset=17*8 base=%rdi
+
+	pushq	SS+8(%rdi)		/* regs->ss */
+	pushq	RSP+8(%rdi)		/* regs->rsp */
+	pushq	EFLAGS+8(%rdi)		/* regs->eflags */
+	pushq	CS+8(%rdi)		/* regs->cs */
+	pushq	RIP+8(%rdi)		/* regs->ip */
+	pushq	ORIG_RAX+8(%rdi)	/* regs->orig_ax */
+	ALLOC_PT_GPREGS_ON_STACK	/* allocate the rest of regs */
+	pushq	(%rdi)			/* return address */
+
+	movq	RDI+8(%rdi), %rdi
+	UNWIND_HINT_IRET_REGS offset=17*8
+	ret
+END(switch_to_thread_stack)
+
 .macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
 ENTRY(\sym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
@@ -844,11 +879,12 @@ ENTRY(\sym)
 
 	ALLOC_PT_GPREGS_ON_STACK
 
-	.if \paranoid
-	.if \paranoid == 1
+	.if \paranoid < 2
 	testb	$3, CS(%rsp)			/* If coming from userspace, switch stacks */
-	jnz	1f
+	jnz	.Lfrom_usermode_switch_stack_\@
 	.endif
+
+	.if \paranoid
 	call	paranoid_entry
 	.else
 	call	error_entry
@@ -890,20 +926,15 @@ ENTRY(\sym)
 	jmp	error_exit
 	.endif
 
-	.if \paranoid == 1
+	.if \paranoid < 2
 	/*
-	 * Paranoid entry from userspace.  Switch stacks and treat it
+	 * Entry from userspace.  Switch stacks and treat it
 	 * as a normal entry.  This means that paranoid handlers
 	 * run in real process context if user_mode(regs).
 	 */
-1:
+.Lfrom_usermode_switch_stack_\@:
 	call	error_entry
 
-
-	movq	%rsp, %rdi			/* pt_regs pointer */
-	call	sync_regs
-	movq	%rax, %rsp			/* switch stack */
-
 	movq	%rsp, %rdi			/* pt_regs pointer */
 
 	.if \has_error_code
@@ -1164,6 +1195,14 @@ ENTRY(error_entry)
 	SWAPGS
 
 .Lerror_entry_from_usermode_after_swapgs:
+	/* Put us onto the real thread stack. */
+	leaq	8(%rsp), %rdi			/* pt_regs pointer */
+	movq	(%rsp), %r12
+	call	sync_regs
+	movq	%rax, %rsp			/* switch stack */
+	ENCODE_FRAME_POINTER
+	pushq	%r12
+
 	/*
 	 * We need to tell lockdep that IRQs are off.  We can't do this until
 	 * we fix gsbase, and we should do it before enter_from_user_mode
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index dcc6987f9bae..20398e3c5e2e 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -297,6 +297,10 @@ ENTRY(entry_INT80_compat)
 	ASM_CLAC			/* Do this early to minimize exposure */
 	SWAPGS
 
+	subq	$16*8, %rsp
+	call	switch_to_thread_stack
+	addq	$16*8, %rsp
+
 	/*
 	 * User tracing code (ptrace or signal handlers) might assume that
 	 * the saved RAX contains a 32-bit number when we're invoking a 32-bit
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 8c6bd6863db9..a6796ac8d311 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -93,7 +93,7 @@ static inline void update_sp0(struct task_struct *task)
 #ifdef CONFIG_X86_32
 	load_sp0(task->thread.sp0);
 #else
-	load_sp0(task_top_of_stack(task));
+	/* On x86_64, sp0 always points to the entry trampoline stack. */
 #endif
 }
 
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 1fadd310ff68..31051f35cbb7 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -75,7 +75,6 @@ dotraplinkage void do_segment_not_present(struct pt_regs *, long);
 dotraplinkage void do_stack_segment(struct pt_regs *, long);
 #ifdef CONFIG_X86_64
 dotraplinkage void do_double_fault(struct pt_regs *, long);
-asmlinkage struct pt_regs *sync_regs(struct pt_regs *);
 #endif
 dotraplinkage void do_general_protection(struct pt_regs *, long);
 dotraplinkage void do_page_fault(struct pt_regs *, unsigned long);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index d551bb4339ec..ddf99d50c26c 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1646,11 +1646,13 @@ void cpu_init(void)
 	setup_cpu_entry_area(cpu);
 
 	/*
-	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
-	 * task never enters user mode.
+	 * Initialize the TSS.  sp0 points to the entry trampoline stack
+	 * regardless of what task is running.
 	 */
 	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
+	load_sp0((unsigned long)&get_cpu_entry_area(cpu)->tss +
+		 offsetofend(struct tss_struct, SYSENTER_stack));
 
 	load_mm_ldt(&init_mm);
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 1ea03027a4a9..e4a941be96cf 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -610,9 +610,10 @@ NOKPROBE_SYMBOL(do_int3);
  * interrupted code was in user mode. The actual stack switch is done in
  * entry_64.S
  */
-asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs)
+asmlinkage __visible notrace __no_sanitize_address
+struct pt_regs *sync_regs(struct pt_regs *eregs)
 {
-	struct pt_regs *regs = task_pt_regs(current);
+	struct pt_regs *regs = (struct pt_regs *)this_cpu_read(cpu_current_top_of_stack) - 1;
 	*regs = *eregs;
 	return regs;
 }
@@ -623,19 +624,19 @@ struct bad_iret_stack {
 	struct pt_regs regs;
 };
 
-asmlinkage __visible notrace
+asmlinkage __visible notrace __no_sanitize_address
 struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s)
 {
 	/*
 	 * This is called from entry_64.S early in handling a fault
 	 * caused by a bad iret to user mode.  To handle the fault
-	 * correctly, we want move our stack frame to task_pt_regs
-	 * and we want to pretend that the exception came from the
-	 * iret target.
+	 * correctly, we want to move our stack frame to where it would
+	 * be had we entered directly on the entry stack (rather than
+	 * just below the IRET frame) and we want to pretend that the
+	 * exception came from the iret target.
 	 */
 	struct bad_iret_stack *new_stack =
-		container_of(task_pt_regs(current),
-			     struct bad_iret_stack, regs);
+		(struct bad_iret_stack *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
 
 	/* Copy the IRET target to the new stack. */
 	memmove(&new_stack->regs.ip, (void *)s->regs.sp, 5*8);
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 12/16] x86/asm/64: Return to userspace from the trampoline stack
  2017-11-20 17:07 [PATCH 00/16] Entry stuff, in decent shape now Andy Lutomirski
                   ` (10 preceding siblings ...)
  2017-11-20 17:07 ` [PATCH 11/16] x86/asm/64: Use a percpu trampoline stack for IDT entries Andy Lutomirski
@ 2017-11-20 17:07 ` Andy Lutomirski
  2017-11-20 17:07 ` [PATCH 13/16] x86/entry/64: Create a percpu SYSCALL entry trampoline Andy Lutomirski
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2017-11-20 17:07 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

By itself, this is useless.  It gives us the ability to run some final
code before exit that cannot run on the kernel stack.  This could
include a CR3 switch a la KAISER or some kernel stack erasing, for
example.  (Or even weird things like *changing* which kernel stack
gets used as an ASLR-strengthening mechanism.)

The SYSRET32 path is not covered yet.  It could be in the future or
we could just ignore it and force the slow path if needed.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64.S | 55 +++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 51 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9bd49bd79e9b..5b1990d1258a 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -326,8 +326,24 @@ syscall_return_via_sysret:
 	popq	%rsi	/* skip rcx */
 	popq	%rdx
 	popq	%rsi
+
+	/*
+	 * Now all regs are restored except RSP and RDI.
+	 * Save old stack pointer and switch to trampoline stack.
+	 */
+	movq	%rsp, %rdi
+	movq	PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
+
+	pushq	RSP-RDI(%rdi)	/* RSP */
+	pushq	(%rdi)		/* RDI */
+
+	/*
+	 * We are on the trampoline stack.  All regs except RDI are live.
+	 * We can do future final exit work right here.
+	 */
+
 	popq	%rdi
-	movq	RSP-ORIG_RAX(%rsp), %rsp
+	popq	%rsp
 	USERGS_SYSRET64
 END(entry_SYSCALL_64)
 
@@ -634,10 +650,41 @@ GLOBAL(swapgs_restore_regs_and_return_to_usermode)
 	ud2
 1:
 #endif
-	SWAPGS
 	POP_EXTRA_REGS
-	POP_C_REGS
-	addq	$8, %rsp	/* skip regs->orig_ax */
+	popq	%r11
+	popq	%r10
+	popq	%r9
+	popq	%r8
+	popq	%rax
+	popq	%rcx
+	popq	%rdx
+	popq	%rsi
+
+	/*
+	 * The stack is now user RDI, orig_ax, RIP, CS, EFLAGS, RSP, SS.
+	 * Save old stack pointer and switch to trampoline stack.
+	 */
+	movq	%rsp, %rdi
+	movq	PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
+
+	/* Copy the IRET frame to the trampoline stack. */
+	pushq	6*8(%rdi)	/* SS */
+	pushq	5*8(%rdi)	/* RSP */
+	pushq	4*8(%rdi)	/* EFLAGS */
+	pushq	3*8(%rdi)	/* CS */
+	pushq	2*8(%rdi)	/* RIP */
+
+	/* Push user RDI on the trampoline stack. */
+	pushq	(%rdi)
+
+	/*
+	 * We are on the trampoline stack.  All regs except RDI are live.
+	 * We can do future final exit work right here.
+	 */
+
+	/* Restore RDI. */
+	popq	%rdi
+	SWAPGS
 	INTERRUPT_RETURN
 
 
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 13/16] x86/entry/64: Create a percpu SYSCALL entry trampoline
  2017-11-20 17:07 [PATCH 00/16] Entry stuff, in decent shape now Andy Lutomirski
                   ` (11 preceding siblings ...)
  2017-11-20 17:07 ` [PATCH 12/16] x86/asm/64: Return to userspace from the trampoline stack Andy Lutomirski
@ 2017-11-20 17:07 ` Andy Lutomirski
  2017-11-21  2:34   ` Josh Poimboeuf
  2017-11-20 17:07 ` [PATCH 14/16] x86/irq: Remove an old outdated comment about context tracking races Andy Lutomirski
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 52+ messages in thread
From: Andy Lutomirski @ 2017-11-20 17:07 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

Handling SYSCALL is tricky: the SYSCALL handler is entered with every
single register (except FLAGS), including RSP, live.  It somehow needs
to set RSP to point to a valid stack, which means it needs to save the
user RSP somewhere and find its own stack pointer.  The canonical way
to do this is with SWAPGS, which lets us access percpu data using the
%gs prefix.

With KAISER-like pagetable switching, this is problematic.  Without a
scratch register, switching CR3 is impossible, so %gs-based percpu
memory would need to be mapped in the user pagetables.  Doing that
without information leaks is difficult or impossible.

Instead, use a different sneaky trick.  Map a copy of the first part
of the SYSCALL asm at a different address for each CPU.  Now RIP
varies depending on the CPU, so we can use RIP-relative memory access
to access percpu memory.  By putting the relevant information (one
scratch slot and the stack address) at a constant offset relative to
RIP, we can make SYSCALL work without relying on %gs.
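The address arithmetic behind this trick can be illustrated in userspace C. This is a sketch only: the layout constants and fixmap base below are hypothetical stand-ins for struct cpu_entry_area and the real fixmap, not the kernel's actual values:

```c
#include <assert.h>
#include <stdint.h>

#define AREA_SIZE         0x5000ULL  /* sizeof(struct cpu_entry_area), hypothetical */
#define TRAMPOLINE_OFFSET 0x3000ULL  /* offsetof(..., entry_trampoline), hypothetical */
#define FIXMAP_TOP        0xffffffffff578000ULL  /* hypothetical fixmap address */

/* Each CPU gets its own fixmap alias of the (single) trampoline page. */
static uint64_t cpu_entry_area_base(int cpu)
{
	return FIXMAP_TOP - (uint64_t)cpu * AREA_SIZE;
}

/* The per-CPU MSR_LSTAR value: the symbol's offset within the trampoline
 * page, relocated into this CPU's alias.  This is the arithmetic
 * syscall_init() performs in the patch. */
static uint64_t lstar_for_cpu(int cpu, uint64_t sym_offset)
{
	return cpu_entry_area_base(cpu) + TRAMPOLINE_OFFSET + sym_offset;
}

/* What the CPU_ENTRY_AREA asm macro computes: take a %rip-relative
 * reference to the trampoline start (so it resolves to the address of
 * *this CPU's* copy at runtime), subtract the trampoline's offset within
 * cpu_entry_area, and the per-CPU area base falls out.  No %gs needed. */
static uint64_t area_base_from_rip(uint64_t trampoline_rip)
{
	return trampoline_rip - TRAMPOLINE_OFFSET;
}
```

Because RIP differs per CPU but the offsets within the area are constants, the same instruction bytes find the right per-CPU data on every CPU.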

A nice thing about this approach is that we can easily switch it on
and off if we want pagetable switching to be configurable.

The compat variant of SYSCALL doesn't have this problem in the first
place -- there are plenty of scratch registers, since we don't care
about preserving r8-r15.  This patch therefore doesn't touch SYSCALL32
at all.

XXX: Whenever we settle how KAISER gets turned on and off, we should do
the same to this.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64.S     | 61 +++++++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/fixmap.h |  2 ++
 arch/x86/kernel/asm-offsets.c |  1 +
 arch/x86/kernel/cpu/common.c  | 12 ++++++++-
 arch/x86/kernel/vmlinux.lds.S | 10 +++++++
 5 files changed, 85 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 5b1990d1258a..a0da30f6c7e0 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -136,6 +136,67 @@ END(native_usergs_sysret64)
  * with them due to bugs in both AMD and Intel CPUs.
  */
 
+	.pushsection .entry_trampoline, "ax"
+
+/*
+ * The code in here gets remapped into cpu_entry_area's trampoline.  This means
+ * that the assembler and linker have the wrong idea as to where this code
+ * lives (and, in fact, it's mapped more than once, so it's not even at a
+ * fixed address).  So we can't reference any symbols outside the entry
+ * trampoline and expect it to work.
+ *
+ * Instead, we carefully abuse %rip-relative addressing.
+ * _entry_trampoline(%rip) refers to the start of the remapped entry
+ * trampoline.  We can thus find cpu_entry_area with this macro:
+ */
+
+#define CPU_ENTRY_AREA \
+	_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
+
+/* The top word of the SYSENTER stack is hot and is usable as scratch space. */
+#define RSP_SCRATCH CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \
+	SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA
+
+ENTRY(entry_SYSCALL_64_trampoline)
+	UNWIND_HINT_EMPTY
+	swapgs
+
+	/* Stash the user RSP. */
+	movq	%rsp, RSP_SCRATCH
+
+	/* Load the top of the task stack into RSP */
+	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
+
+	/* Start building the simulated IRET frame. */
+	pushq	$__USER_DS			/* pt_regs->ss */
+	pushq	RSP_SCRATCH			/* pt_regs->sp */
+	pushq	%r11				/* pt_regs->flags */
+	pushq	$__USER_CS			/* pt_regs->cs */
+	pushq	%rcx				/* pt_regs->ip */
+
+	/* Save RDI, since we need a scratch register. */
+	pushq	%rdi
+
+	/*
+	 * x86 lacks a near absolute jump, and we can't jump to the real
+	 * entry text with a relative jump, so we use a double trampoline.
+	 */
+	movq	$entry_SYSCALL_64_stage2, %rdi
+	jmp	*%rdi
+END(entry_SYSCALL_64_trampoline)
+
+	.popsection
+
+ENTRY(entry_SYSCALL_64_stage2)
+	/*
+	 * Rather than polluting the normal SYSCALL path with stack switching
+	 * nonsense, fix up our register state to match its expectations.
+	 */
+	UNWIND_HINT_EMPTY
+	popq %rdi
+	jmp entry_SYSCALL_64_after_hwframe
+END(entry_SYSCALL_64_stage2)
+
 ENTRY(entry_SYSCALL_64)
 	UNWIND_HINT_EMPTY
 	/*
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index b512882f9b0e..f3239c88179d 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -59,6 +59,8 @@ struct cpu_entry_area
 	 * of the TSS region.
 	 */
 	struct tss_struct tss;
+
+	char entry_trampoline[PAGE_SIZE];
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 55858b277cf6..61b1af88ac07 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -101,4 +101,5 @@ void common(void) {
 
 	/* Layout info for cpu_entry_area */
 	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
+	OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
 }
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index ddf99d50c26c..c15643393049 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -507,6 +507,8 @@ DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
 static inline void setup_cpu_entry_area(int cpu)
 {
 #ifdef CONFIG_X86_64
+	extern char _entry_trampoline[];
+
 	/* On 64-bit systems, we use a read-only fixmap GDT. */
 	pgprot_t gdt_prot = PAGE_KERNEL_RO;
 #else
@@ -554,6 +556,11 @@ static inline void setup_cpu_entry_area(int cpu)
 #ifdef CONFIG_X86_32
 	this_cpu_write(cpu_entry_area, get_cpu_entry_area(cpu));
 #endif
+
+#ifdef CONFIG_X86_64
+	__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
+		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
+#endif
 }
 
 /* Load the original GDT from the per-cpu structure */
@@ -1418,10 +1425,13 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
+	extern char _entry_trampoline[];
+	extern char entry_SYSCALL_64_trampoline[];
+
 	int cpu = smp_processor_id();
 
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
-	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
+	wrmsrl(MSR_LSTAR, (unsigned long)get_cpu_entry_area(smp_processor_id())->entry_trampoline + (entry_SYSCALL_64_trampoline - _entry_trampoline));
 
 #ifdef CONFIG_IA32_EMULATION
 	wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index a4009fb9be87..2738cfb6c8c8 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -107,6 +107,16 @@ SECTIONS
 		SOFTIRQENTRY_TEXT
 		*(.fixup)
 		*(.gnu.warning)
+
+#ifdef CONFIG_X86_64
+		/* Entry trampoline */
+		. = ALIGN(PAGE_SIZE);
+		_entry_trampoline = .;
+		*(.entry_trampoline)
+		. = ALIGN(PAGE_SIZE);
+		ASSERT(. - _entry_trampoline == PAGE_SIZE, "entry trampoline is too big");
+#endif
+
 		/* End of text section */
 		_etext = .;
 	} :text = 0x9090
-- 
2.13.6

* [PATCH 14/16] x86/irq: Remove an old outdated comment about context tracking races
  2017-11-20 17:07 [PATCH 00/16] Entry stuff, in decent shape now Andy Lutomirski
                   ` (12 preceding siblings ...)
  2017-11-20 17:07 ` [PATCH 13/16] x86/entry/64: Create a percpu SYSCALL entry trampoline Andy Lutomirski
@ 2017-11-20 17:07 ` Andy Lutomirski
  2017-11-21  6:25   ` Ingo Molnar
  2017-11-20 17:07 ` [PATCH 15/16] x86/irq/64: In the stack overflow warning, print the offending IP Andy Lutomirski
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 52+ messages in thread
From: Andy Lutomirski @ 2017-11-20 17:07 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

That race has been fixed and the code cleaned up for a while now.
---
 arch/x86/kernel/irq.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 49cfd9fe7589..68e1867cca80 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -219,18 +219,6 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
 	/* high bit used in ret_from_ code  */
 	unsigned vector = ~regs->orig_ax;
 
-	/*
-	 * NB: Unlike exception entries, IRQ entries do not reliably
-	 * handle context tracking in the low-level entry code.  This is
-	 * because syscall entries execute briefly with IRQs on before
-	 * updating context tracking state, so we can take an IRQ from
-	 * kernel mode with CONTEXT_USER.  The low-level entry code only
-	 * updates the context if we came from user mode, so we won't
-	 * switch to CONTEXT_KERNEL.  We'll fix that once the syscall
-	 * code is cleaned up enough that we can cleanly defer enabling
-	 * IRQs.
-	 */
-
 	entering_irq();
 
 	/* entering_irq() tells RCU that we're not quiescent.  Check it. */
-- 
2.13.6

* [PATCH 15/16] x86/irq/64: In the stack overflow warning, print the offending IP
  2017-11-20 17:07 [PATCH 00/16] Entry stuff, in decent shape now Andy Lutomirski
                   ` (13 preceding siblings ...)
  2017-11-20 17:07 ` [PATCH 14/16] x86/irq: Remove an old outdated comment about context tracking races Andy Lutomirski
@ 2017-11-20 17:07 ` Andy Lutomirski
  2017-11-21  6:26   ` Ingo Molnar
  2017-11-20 17:07 ` [PATCH 16/16] x86/entry/64: Move the IST stacks into cpu_entry_area Andy Lutomirski
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 52+ messages in thread
From: Andy Lutomirski @ 2017-11-20 17:07 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

In case something goes wrong with the unwind (not unlikely in the case of
an overflow), print the offending IP at which we detected the overflow.
---
 arch/x86/kernel/irq_64.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
index 020efbf5786b..d86e344f5b3d 100644
--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -57,10 +57,10 @@ static inline void stack_overflow_check(struct pt_regs *regs)
 	if (regs->sp >= estack_top && regs->sp <= estack_bottom)
 		return;
 
-	WARN_ONCE(1, "do_IRQ(): %s has overflown the kernel stack (cur:%Lx,sp:%lx,irq stk top-bottom:%Lx-%Lx,exception stk top-bottom:%Lx-%Lx)\n",
+	WARN_ONCE(1, "do_IRQ(): %s has overflown the kernel stack (cur:%Lx,sp:%lx,irq stk top-bottom:%Lx-%Lx,exception stk top-bottom:%Lx-%Lx,ip:%pF)\n",
 		current->comm, curbase, regs->sp,
 		irq_stack_top, irq_stack_bottom,
-		estack_top, estack_bottom);
+		estack_top, estack_bottom, (void *)regs->ip);
 
 	if (sysctl_panic_on_stackoverflow)
 		panic("low stack detected by irq handler - check messages\n");
-- 
2.13.6
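The check being augmented has roughly this shape (a simplified userspace sketch of stack_overflow_check(); the margin constant and signatures are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STACK_MARGIN 128  /* illustrative guard margin above the stack base */

/* Return true if sp is safely inside [base + margin, top]. */
static bool sp_in_stack(uint64_t sp, uint64_t base, uint64_t top)
{
	return sp >= base + STACK_MARGIN && sp <= top;
}

/* Simplified shape of the check: warn (here, return true) only when sp
 * falls outside every known-good stack region for this context. */
static bool should_warn(uint64_t sp,
			uint64_t task_base, uint64_t task_top,
			uint64_t irq_base, uint64_t irq_top)
{
	if (sp_in_stack(sp, task_base, task_top))
		return false;
	if (sp_in_stack(sp, irq_base, irq_top))
		return false;
	return true;
}
```

The patch's point is that when this fires, regs->ip is often the most reliable clue, since a genuinely overflowed stack can defeat the unwinder.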

* [PATCH 16/16] x86/entry/64: Move the IST stacks into cpu_entry_area
  2017-11-20 17:07 [PATCH 00/16] Entry stuff, in decent shape now Andy Lutomirski
                   ` (14 preceding siblings ...)
  2017-11-20 17:07 ` [PATCH 15/16] x86/irq/64: In the stack overflow warning, print the offending IP Andy Lutomirski
@ 2017-11-20 17:07 ` Andy Lutomirski
  2017-11-21  7:38   ` Ingo Molnar
  2017-11-20 18:48 ` [PATCH 00/16] Entry stuff, in decent shape now Andy Lutomirski
  2017-11-21  7:33 ` Ingo Molnar
  17 siblings, 1 reply; 52+ messages in thread
From: Andy Lutomirski @ 2017-11-20 17:07 UTC (permalink / raw)
  To: X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf, Andy Lutomirski

The IST stacks are needed when an IST exception occurs and are
accessed before any kernel code at all runs.  Move them into
cpu_entry_area.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/fixmap.h | 10 ++++++++++
 arch/x86/kernel/cpu/common.c  | 40 +++++++++++++++++++++++++---------------
 2 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index f3239c88179d..00e98bbe2448 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -61,6 +61,16 @@ struct cpu_entry_area
 	struct tss_struct tss;
 
 	char entry_trampoline[PAGE_SIZE];
+
+#ifdef CONFIG_X86_64
+	/*
+	 * Exception stacks used for IST entries.
+	 *
+	 * In the future, this should have a separate slot for each stack
+	 * with guard pages between them.
+	 */
+	char exception_stacks[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ];
+#endif
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c15643393049..912759b2cef4 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -503,6 +503,22 @@ static void set_percpu_fixmap_pages(int fixmap_index, void *ptr, int pages,
 DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
 #endif
 
+#ifdef CONFIG_X86_64
+/*
+ * Special IST stacks which the CPU switches to when it calls
+ * an IST-marked descriptor entry. Up to 7 stacks (hardware
+ * limit), all of them are 4K, except the debug stack which
+ * is 8K.
+ */
+static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
+	  [0 ... N_EXCEPTION_STACKS - 1]	= EXCEPTION_STKSZ,
+	  [DEBUG_STACK - 1]			= DEBUG_STKSZ
+};
+
+static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
+	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
+#endif
+
 /* Setup the fixmap mappings only once per-processor */
 static inline void setup_cpu_entry_area(int cpu)
 {
@@ -558,6 +574,14 @@ static inline void setup_cpu_entry_area(int cpu)
 #endif
 
 #ifdef CONFIG_X86_64
+	BUILD_BUG_ON(sizeof(exception_stacks) % PAGE_SIZE != 0);
+	BUILD_BUG_ON(sizeof(exception_stacks) !=
+		     sizeof(((struct cpu_entry_area *)0)->exception_stacks));
+	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, exception_stacks),
+				&per_cpu(exception_stacks, cpu),
+				sizeof(exception_stacks) / PAGE_SIZE,
+				PAGE_KERNEL);
+
 	__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
 		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
 #endif
@@ -1408,20 +1432,6 @@ DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
 DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
 EXPORT_PER_CPU_SYMBOL(__preempt_count);
 
-/*
- * Special IST stacks which the CPU switches to when it calls
- * an IST-marked descriptor entry. Up to 7 stacks (hardware
- * limit), all of them are 4K, except the debug stack which
- * is 8K.
- */
-static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
-	  [0 ... N_EXCEPTION_STACKS - 1]	= EXCEPTION_STKSZ,
-	  [DEBUG_STACK - 1]			= DEBUG_STKSZ
-};
-
-static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
-	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
-
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
@@ -1627,7 +1637,7 @@ void cpu_init(void)
 	 * set up and load the per-CPU TSS
 	 */
 	if (!oist->ist[0]) {
-		char *estacks = per_cpu(exception_stacks, cpu);
+		char *estacks = get_cpu_entry_area(cpu)->exception_stacks;
 
 		for (v = 0; v < N_EXCEPTION_STACKS; v++) {
 			estacks += exception_stack_sizes[v];
-- 
2.13.6
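The packed-stack sizing that patch 16 carries over can be sketched in userspace C. This is illustrative only: the stack count and the debug-stack index below are hypothetical, and the range designator is a GNU extension, used here because the kernel source uses the same idiom:

```c
#include <assert.h>
#include <stddef.h>

#define N_EXCEPTION_STACKS 4     /* illustrative; the real value is arch-defined */
#define EXCEPTION_STKSZ    4096
#define DEBUG_STKSZ        8192
#define DEBUG_STACK        2     /* illustrative 1-based index of the debug stack */

/* All stacks are EXCEPTION_STKSZ except the debug stack (GNU
 * designated-range initializer, as in the kernel source). */
static const unsigned int stack_sizes[N_EXCEPTION_STACKS] = {
	[0 ... N_EXCEPTION_STACKS - 1]	= EXCEPTION_STKSZ,
	[DEBUG_STACK - 1]		= DEBUG_STKSZ,
};

/* Mirror the cpu_init() loop: walk upward through the packed stacks and
 * record the top of each one, as loaded into the TSS IST slots. */
static void compute_ist_tops(char *estacks, char *tops[N_EXCEPTION_STACKS])
{
	for (int v = 0; v < N_EXCEPTION_STACKS; v++) {
		estacks += stack_sizes[v];
		tops[v] = estacks;
	}
}
```

The total, (N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ, is exactly the size of the exception_stacks member the patch adds to cpu_entry_area, which the BUILD_BUG_ONs verify.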

* Re: [PATCH 00/16] Entry stuff, in decent shape now
  2017-11-20 17:07 [PATCH 00/16] Entry stuff, in decent shape now Andy Lutomirski
                   ` (15 preceding siblings ...)
  2017-11-20 17:07 ` [PATCH 16/16] x86/entry/64: Move the IST stacks into cpu_entry_area Andy Lutomirski
@ 2017-11-20 18:48 ` Andy Lutomirski
  2017-11-21  7:33 ` Ingo Molnar
  17 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2017-11-20 18:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf

On Mon, Nov 20, 2017 at 9:07 AM, Andy Lutomirski <luto@kernel.org> wrote:
> This sets up stack switching, including for SYSCALL.  I think it's
> in decent shape.

>
>  - 32-bit kernels are failing the sigreturn_32 test.  But they're also
>    failing without the patches, so I'm not sure this is a bug in the
>    series per se.  Needs further investigation.  (Off the top of my head,
>    this could be further fallout from Thomas's IDT rework.)

Hmm, I lied.  They *were* failing, but it was fixed by:

commit d0cd64b02aa854d68ce517cb7da1fe4e4fff2653
Author: Yonghong Song <yhs@fb.com>
Date:   Wed Nov 8 11:28:45 2017 -0800

    x86/idt: Remove X86_TRAP_BP initialization in idt_setup_traps()

so never mind.

I'd love to get the x86 selftests added to 0day's pile.

* Re: [PATCH 02/16] x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
  2017-11-20 17:07 ` [PATCH 02/16] x86/dumpstack: Add get_stack_info() support for " Andy Lutomirski
@ 2017-11-20 20:42   ` Josh Poimboeuf
  2017-11-20 20:46     ` Andy Lutomirski
  0 siblings, 1 reply; 52+ messages in thread
From: Josh Poimboeuf @ 2017-11-20 20:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds

On Mon, Nov 20, 2017 at 09:07:33AM -0800, Andy Lutomirski wrote:
> +bool in_SYSENTER_stack(unsigned long *stack, struct stack_info *info)

Can you make it lowercase for consistency with the other in_*_stack()
functions?  For example, in_irq_stack() is all lowercase even though
"IRQ" is normally written in uppercase.

But also, I'm wondering whether this get_stack_info() support is even
really needed.

As currently written, the trampoline code doesn't have any ORC data
associated with it.  So the unwinder would never have the need to
actually read the SYSENTER stack.

You _could_ add an UNWIND_HINT_IRET_REGS annotation after the simulated
iret frame is written, which would allow the unwinder to dump those regs
when unwinding from an NMI.

But there's only a tiny window where that would be possible: only a few
instructions.  I'm not sure that would be worth the effort, unless we
got to the point where we expect to have 100% unwinder coverage.  But
that's currently unrealistic anyway because of generated code and
runtime patching.

-- 
Josh

* Re: [PATCH 02/16] x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
  2017-11-20 20:42   ` Josh Poimboeuf
@ 2017-11-20 20:46     ` Andy Lutomirski
  2017-11-20 21:00       ` Josh Poimboeuf
  0 siblings, 1 reply; 52+ messages in thread
From: Andy Lutomirski @ 2017-11-20 20:46 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Dave Hansen, Linus Torvalds

On Mon, Nov 20, 2017 at 12:42 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Mon, Nov 20, 2017 at 09:07:33AM -0800, Andy Lutomirski wrote:
>> +bool in_SYSENTER_stack(unsigned long *stack, struct stack_info *info)
>
> Can you make it lowercase for consistency with the other in_*_stack()
> functions?  For example, in_irq_stack() is all lowercase even though
> "IRQ" is normally written in uppercase.
>
> But also, I'm wondering whether this get_stack_info() support is even
> really needed.
>
> As currently written, the trampoline code doesn't have any ORC data
> associated with it.  So the unwinder would never have the need to
> actually read the SYSENTER stack.
>
> You _could_ add an UNWIND_HINT_IRET_REGS annotation after the simulated
> iret frame is written, which would allow the unwinder to dump those regs
> when unwinding from an NMI.

There's some ORC data in the non-trampoline  SYSENTER path but, more
importantly, the OOPS unwinder will just bail without this patch.
With the patch, we get a valid unwind, except that everything has a ?
in front.

>
> But there's only a tiny window where that would be possible: only a few
> instructions.  I'm not sure that would be worth the effort, unless we
> got to the point where we expect to have 100% unwinder coverage.  But
> that's currently unrealistic anyway because of generated code and
> runtime patching.

I tripped it myself several times when debugging this code.

>
> --
> Josh

* Re: [PATCH 02/16] x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
  2017-11-20 20:46     ` Andy Lutomirski
@ 2017-11-20 21:00       ` Josh Poimboeuf
  2017-11-20 21:07         ` Andy Lutomirski
  0 siblings, 1 reply; 52+ messages in thread
From: Josh Poimboeuf @ 2017-11-20 21:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds

On Mon, Nov 20, 2017 at 12:46:13PM -0800, Andy Lutomirski wrote:
> On Mon, Nov 20, 2017 at 12:42 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Mon, Nov 20, 2017 at 09:07:33AM -0800, Andy Lutomirski wrote:
> >> +bool in_SYSENTER_stack(unsigned long *stack, struct stack_info *info)
> >
> > Can you make it lowercase for consistency with the other in_*_stack()
> > functions?  For example, in_irq_stack() is all lowercase even though
> > "IRQ" is normally written in uppercase.
> >
> > But also, I'm wondering whether this get_stack_info() support is even
> > really needed.
> >
> > As currently written, the trampoline code doesn't have any ORC data
> > associated with it.  So the unwinder would never have the need to
> > actually read the SYSENTER stack.
> >
> > You _could_ add an UNWIND_HINT_IRET_REGS annotation after the simulated
> > iret frame is written, which would allow the unwinder to dump those regs
> > when unwinding from an NMI.
> 
> There's some ORC data in the non-trampoline  SYSENTER path

But that's *after* the stack switch to the real kernel stack, right?

> but, more importantly, the OOPS unwinder will just bail without this
> patch.  With the patch, we get a valid unwind, except that everything
> has a ?  in front.

Hm.  I can't even fathom how that's possible.  Are you talking about the
"unwind from NMI to SYSENTER stack" path?  Or any unwind to a syscall?
Either way I'm baffled...  If the unwinder only encounters the SYSENTER
stack at the end, how could that cause everything beforehand to have a
question mark?

> > But there's only a tiny window where that would be possible: only a few
> > instructions.  I'm not sure that would be worth the effort, unless we
> > got to the point where we expect to have 100% unwinder coverage.  But
> > that's currently unrealistic anyway because of generated code and
> > runtime patching.
> 
> I tripped it myself several times when debugging this code.

Again I don't see how this patch would help if there's no ORC data for
the code which uses the SYSENTER stack.  I must be missing something.

-- 
Josh

* Re: [PATCH 02/16] x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
  2017-11-20 21:00       ` Josh Poimboeuf
@ 2017-11-20 21:07         ` Andy Lutomirski
  2017-11-20 21:27           ` Josh Poimboeuf
  0 siblings, 1 reply; 52+ messages in thread
From: Andy Lutomirski @ 2017-11-20 21:07 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Dave Hansen, Linus Torvalds

On Mon, Nov 20, 2017 at 1:00 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Mon, Nov 20, 2017 at 12:46:13PM -0800, Andy Lutomirski wrote:
>> On Mon, Nov 20, 2017 at 12:42 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > On Mon, Nov 20, 2017 at 09:07:33AM -0800, Andy Lutomirski wrote:
>> >> +bool in_SYSENTER_stack(unsigned long *stack, struct stack_info *info)
>> >
>> > Can you make it lowercase for consistency with the other in_*_stack()
>> > functions?  For example, in_irq_stack() is all lowercase even though
>> > "IRQ" is normally written in uppercase.
>> >
>> > But also, I'm wondering whether this get_stack_info() support is even
>> > really needed.
>> >
>> > As currently written, the trampoline code doesn't have any ORC data
>> > associated with it.  So the unwinder would never have the need to
>> > actually read the SYSENTER stack.
>> >
>> > You _could_ add an UNWIND_HINT_IRET_REGS annotation after the simulated
>> > iret frame is written, which would allow the unwinder to dump those regs
>> > when unwinding from an NMI.
>>
>> There's some ORC data in the non-trampoline  SYSENTER path
>
> But that's *after* the stack switch to the real kernel stack, right?

Hmm, maybe you're right.

>
>> but, more importantly, the OOPS unwinder will just bail without this
>> patch.  With the patch, we get a valid unwind, except that everything
>> has a ?  in front.
>
> Hm.  I can't even fathom how that's possible.  Are you talking about the
> "unwind from NMI to SYSENTER stack" path?  Or any unwind to a syscall?
> Either way I'm baffled...  If the unwinder only encounters the SYSENTER
> stack at the end, how could that cause everything beforehand to have a
> question mark?

I mean that, if I put a ud2 or other bug in the code that runs on the
SYSENTER stack, without this patch, I get a totally blank call trace.

>
>> > But there's only a tiny window where that would be possible: only a few
>> > instructions.  I'm not sure that would be worth the effort, unless we
>> > got to the point where we expect to have 100% unwinder coverage.  But
>> > that's currently unrealistic anyway because of generated code and
>> > runtime patching.
>>
>> I tripped it myself several times when debugging this code.
>
> Again I don't see how this patch would help if there's no ORC data for
> the code which uses the SYSENTER stack.  I must be missing something.

* Re: [PATCH 02/16] x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
  2017-11-20 21:07         ` Andy Lutomirski
@ 2017-11-20 21:27           ` Josh Poimboeuf
  2017-11-20 21:30             ` Andy Lutomirski
  0 siblings, 1 reply; 52+ messages in thread
From: Josh Poimboeuf @ 2017-11-20 21:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds

On Mon, Nov 20, 2017 at 01:07:16PM -0800, Andy Lutomirski wrote:
> >> but, more importantly, the OOPS unwinder will just bail without this
> >> patch.  With the patch, we get a valid unwind, except that everything
> >> has a ?  in front.
> >
> > Hm.  I can't even fathom how that's possible.  Are you talking about the
> > "unwind from NMI to SYSENTER stack" path?  Or any unwind to a syscall?
> > Either way I'm baffled...  If the unwinder only encounters the SYSENTER
> > stack at the end, how could that cause everything beforehand to have a
> > question mark?
> 
> I mean that, if I put a ud2 or other bug in the code that runs on the
> SYSENTER stack, without this patch, I get a totally blank call trace.

I would expect a blank call trace either way...

In fact I just added a ud1 after your RDI save, and got a blank call
trace.

Also the RIP printout isn't very helpful, I guess kallsyms doesn't know
about the trampoline?

  [    2.816226] invalid opcode: 0000 [#10] PREEMPT SMP
  [    2.816968] Modules linked in:
  [    2.817097] CPU: 3 PID: 167 Comm: modprobe Tainted: G      D W       4.14.0+ #6
  [    2.817097] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
  [    2.817097] task: ffff88013388c000 task.stack: ffffc9000119c000
  [    2.817097] RIP: 0010:0xffffffffff54a01f
  [    2.817097] RSP: 0018:ffffc9000119ffd0 EFLAGS: 00010046
  [    2.817097] RAX: 000000000000000c RBX: 0000000000000001 RCX: 00007f16fdcf1c89
  [    2.817097] RDX: 00007ffc0b80f030 RSI: 0000561065557480 RDI: 0000000000000000
  [    2.817097] RBP: 0000561065554040 R08: 0000000000000001 R09: 00007ffc0b80f059
  [    2.817097] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000009
  [    2.817097] R13: 00007f16fdcd9190 R14: 0000000000000001 R15: 0000000000001000
  [    2.817097] FS:  0000000000000000(0000) GS:ffff88013ac00000(0000) knlGS:0000000000000000
  [    2.817097] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [    2.817097] CR2: 00007f16fdefe0c0 CR3: 0000000133c79004 CR4: 00000000001606e0
  [    2.817097] Call Trace:
  [    2.817097] Code: 00 00 00 00 00 00 00 00 00 00 00 00 0f 01 f8 48 89 25 f6 d1 ff ff 48 8b 25 03 d2 ff ff 6a 2b ff 35 e7 d1 ff ff 41 53 6a 33 51 57 <0f> b9 48 c7 c7 90 6f 92 81 ff e7 90 90 90 90 90 90 90 90 90 90 
  [    2.817097] RIP: 0xffffffffff54a01f RSP: ffffc9000119ffd0
  [    2.817097] ---[ end trace 215b2a1d93232ed1 ]---

-- 
Josh

* Re: [PATCH 02/16] x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
  2017-11-20 21:27           ` Josh Poimboeuf
@ 2017-11-20 21:30             ` Andy Lutomirski
  2017-11-20 21:55               ` Josh Poimboeuf
  0 siblings, 1 reply; 52+ messages in thread
From: Andy Lutomirski @ 2017-11-20 21:30 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Dave Hansen, Linus Torvalds

On Mon, Nov 20, 2017 at 1:27 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Mon, Nov 20, 2017 at 01:07:16PM -0800, Andy Lutomirski wrote:
>> >> but, more importantly, the OOPS unwinder will just bail without this
>> >> patch.  With the patch, we get a valid unwind, except that everything
>> >> has a ?  in front.
>> >
>> > Hm.  I can't even fathom how that's possible.  Are you talking about the
>> > "unwind from NMI to SYSENTER stack" path?  Or any unwind to a syscall?
>> > Either way I'm baffled...  If the unwinder only encounters the SYSENTER
>> > stack at the end, how could that cause everything beforehand to have a
>> > question mark?
>>
>> I mean that, if I put a ud2 or other bug in the code that runs on the
>> SYSENTER stack, without this patch, I get a totally blank call trace.
>
> I would expect a blank call trace either way...

Try making sync_regs use a few kB of stack space or, better yet, call
a non-inlined function that uses too much stack.

>
> In fact I just added a ud1 after your RDI save, and got a blank call
> trace.
>
> Also the RIP printout isn't very helpful, I guess kallsyms doesn't know
> about the trampoline?

Um, right.  I should fix that.

* Re: [PATCH 01/16] x86/asm/64: Allocate and enable the SYSENTER stack
  2017-11-20 17:07 ` [PATCH 01/16] x86/asm/64: Allocate and enable the SYSENTER stack Andy Lutomirski
@ 2017-11-20 21:55   ` Thomas Gleixner
  2017-11-21 10:47   ` Borislav Petkov
  1 sibling, 0 replies; 52+ messages in thread
From: Thomas Gleixner @ 2017-11-20 21:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf

On Mon, 20 Nov 2017, Andy Lutomirski wrote:

> This will simplify future changes that want scratch variables early in
> the SYSENTER handler -- they'll be able to spill registers to the
> stack.  It also lets us get rid of a SWAPGS_UNSAFE_STACK user.
> 
> This does not depend on CONFIG_IA32_EMULATION because we'll want the
> stack space even without IA32 emulation.
> 
> As far as I can tell, the reason that this wasn't done from day 1 is
> that we use IST for #DB and #BP, which is IMO rather nasty and causes
> a lot more problems than it solves.  But, since #DB uses IST, we don't
> actually need a real stack for SYSENTER (because SYSENTER with TF set
> will invoke #DB on the IST stack rather than the SYSENTER stack).
> I want to remove IST usage from these vectors some day, and this patch
> is a prerequisite for that as well.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

* Re: [PATCH 02/16] x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
  2017-11-20 21:30             ` Andy Lutomirski
@ 2017-11-20 21:55               ` Josh Poimboeuf
  2017-11-21  1:39                 ` Andy Lutomirski
  0 siblings, 1 reply; 52+ messages in thread
From: Josh Poimboeuf @ 2017-11-20 21:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds

On Mon, Nov 20, 2017 at 01:30:12PM -0800, Andy Lutomirski wrote:
> On Mon, Nov 20, 2017 at 1:27 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Mon, Nov 20, 2017 at 01:07:16PM -0800, Andy Lutomirski wrote:
> >> >> but, more importantly, the OOPS unwinder will just bail without this
> >> >> patch.  With the patch, we get a valid unwind, except that everything
> >> >> has a ?  in front.
> >> >
> >> > Hm.  I can't even fathom how that's possible.  Are you talking about the
> >> > "unwind from NMI to SYSENTER stack" path?  Or any unwind to a syscall?
> >> > Either way I'm baffled...  If the unwinder only encounters the SYSENTER
> >> > stack at the end, how could that cause everything beforehand to have a
> >> > question mark?
> >>
> >> I mean that, if I put a ud2 or other bug in the code that runs on the
> >> SYSENTER stack, without this patch, I get a totally blank call trace.
> >
> > I would expect a blank call trace either way...
> 
> Try making sync_regs use a few kB of stack space or, better yet, call
> a non-inlined function that uses too much stack.

You mean overflow the exception stack?  I still don't see how that would
do it.

If you could show a specific example, with splats from before/after,
that would be helpful.  Because I still have no idea how this patch
could possibly help.

-- 
Josh


* Re: [PATCH 03/16] x86/gdt: Put per-cpu GDT remaps in ascending order
  2017-11-20 17:07 ` [PATCH 03/16] x86/gdt: Put per-cpu GDT remaps in ascending order Andy Lutomirski
@ 2017-11-20 21:56   ` Thomas Gleixner
  2017-11-21 17:16   ` Borislav Petkov
  1 sibling, 0 replies; 52+ messages in thread
From: Thomas Gleixner @ 2017-11-20 21:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf

On Mon, 20 Nov 2017, Andy Lutomirski wrote:

> We currently have CPU 0's GDT at the top of the GDT range and
> higher-numbered CPUs at lower addresses.  This happens because the
> fixmap is upside down (index 0 is the top of the fixmap).
> 
> Flip it so that GDTs are in ascending order by virtual address.
> This will simplify a future patch that will generalize the GDT
> remap to contain multiple pages.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>


* Re: [PATCH 04/16] x86/fixmap: Generalize the GDT fixmap mechanism
  2017-11-20 17:07 ` [PATCH 04/16] x86/fixmap: Generalize the GDT fixmap mechanism Andy Lutomirski
@ 2017-11-20 22:01   ` Thomas Gleixner
  2017-11-21  1:21     ` Andy Lutomirski
  0 siblings, 1 reply; 52+ messages in thread
From: Thomas Gleixner @ 2017-11-20 22:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf

On Mon, 20 Nov 2017, Andy Lutomirski wrote:

Just a few nits.

>  /* Provide the fixmap address of the remapped GDT */
>  static inline struct desc_struct *get_cpu_gdt_ro(int cpu)
>  {
> -	unsigned int idx = get_cpu_gdt_ro_index(cpu);
> -	return (struct desc_struct *)__fix_to_virt(idx);
> +	return (struct desc_struct *)&get_cpu_entry_area(cpu)->gdt;
>  }
>  
>  /* Provide the current read-only GDT */
> diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
> index b0c505fe9a95..5ff8e251772a 100644
> --- a/arch/x86/include/asm/fixmap.h
> +++ b/arch/x86/include/asm/fixmap.h
> @@ -44,6 +44,17 @@ extern unsigned long __FIXADDR_TOP;
>  			 PAGE_SIZE)
>  #endif
>  
> +/*
> + * cpu_entry_area is a percpu region in the fixmap that contains things
> + * need by the CPU and early entry/exit code.  Real types aren't used here

s/need/needed/

> + * to avoid circular header dependencies.

:(

> + */
> +struct cpu_entry_area
> +{
> +	char gdt[PAGE_SIZE];
> +};
> +
> +#define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)

> +static inline unsigned int __get_cpu_entry_area_page_index(int cpu, int page)
> +{
> +	BUILD_BUG_ON(sizeof(struct cpu_entry_area) % PAGE_SIZE != 0);
> +
> +	return FIX_CPU_ENTRY_AREA_BOTTOM - cpu*CPU_ENTRY_AREA_PAGES - page;
> +}
> +
> +#define __get_cpu_entry_area_offset_index(cpu, offset) ({		\
> +	BUILD_BUG_ON(offset % PAGE_SIZE != 0);				\
> +	__get_cpu_entry_area_page_index(cpu, offset / PAGE_SIZE);	\
> +	})
> +
> +#define get_cpu_entry_area_index(cpu, field)				\
> +	__get_cpu_entry_area_offset_index((cpu), offsetof(struct cpu_entry_area, field))

Any reason why those need to be macros?

> +static inline struct cpu_entry_area *get_cpu_entry_area(int cpu)
> +{
> +	return (struct cpu_entry_area *)
> +		__fix_to_virt(__get_cpu_entry_area_page_index(cpu, 0));
> +}
> +

Thanks,

	tglx


* Re: [PATCH 05/16] x86/asm: Fix assumptions that the HW TSS is at the beginning of cpu_tss
  2017-11-20 17:07 ` [PATCH 05/16] x86/asm: Fix assumptions that the HW TSS is at the beginning of cpu_tss Andy Lutomirski
@ 2017-11-20 22:22   ` Thomas Gleixner
  0 siblings, 0 replies; 52+ messages in thread
From: Thomas Gleixner @ 2017-11-20 22:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf

On Mon, 20 Nov 2017, Andy Lutomirski wrote:

> I'm going to move SYSENTER_stack to the beginning of cpu_tss to help
> detect overflow.  Before this can happen, I need to fix several code
> paths that hardcode assumptions about the old layout.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>


* Re: [PATCH 07/16] x86/asm: Move SYSENTER_stack to the beginning of struct tss_struct
  2017-11-20 17:07 ` [PATCH 07/16] x86/asm: Move SYSENTER_stack to the beginning of struct tss_struct Andy Lutomirski
@ 2017-11-20 23:37   ` Thomas Gleixner
  2017-11-21  1:25     ` Andy Lutomirski
  0 siblings, 1 reply; 52+ messages in thread
From: Thomas Gleixner @ 2017-11-20 23:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf

On Mon, 20 Nov 2017, Andy Lutomirski wrote:
>  struct tss_struct {
>  	/*
> -	 * The hardware state:
> +	 * Space for the temporary SYSENTER stack.  Used for the entry
> +	 * trampoline as well.  Size it such that tss_struct ends up
> +	 * as a multiple of PAGE_SIZE.  This calculation assumes that
> +	 * io_bitmap is a multiple of PAGE_SIZE (8192 bytes) plus one
> +	 * long.

I don't see how sizeof(tss_struct) is a multiple of PAGE_SIZE

canary	    	=    8
stack		=  512
hw_tss	    	=  104
io bitmap	= 8200
-------------------------
		  8824

The alignment is what blows it up to 3 * PAGE_SIZE

> +	 */
> +	unsigned long		SYSENTER_stack_canary;
> +	unsigned long		SYSENTER_stack[64];
> +
> +	/*
> +	 * The fixed hardware portion.  This must not cross a page boundary
> +	 * at risk of violating the SDM's advice and potentially triggering
> +	 * errata.
>  	 */
>  	struct x86_hw_tss	x86_tss;
>  
> @@ -338,15 +350,9 @@ struct tss_struct {
>  	 * be within the limit.
>  	 */
>  	unsigned long		io_bitmap[IO_BITMAP_LONGS + 1];
> +} __attribute__((__aligned__(PAGE_SIZE)));
>  

Thanks,

	tglx


* Re: [PATCH 04/16] x86/fixmap: Generalize the GDT fixmap mechanism
  2017-11-20 22:01   ` Thomas Gleixner
@ 2017-11-21  1:21     ` Andy Lutomirski
  2017-11-21  8:29       ` Thomas Gleixner
  0 siblings, 1 reply; 52+ messages in thread
From: Andy Lutomirski @ 2017-11-21  1:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Dave Hansen, Linus Torvalds, Josh Poimboeuf

On Mon, Nov 20, 2017 at 2:01 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Mon, 20 Nov 2017, Andy Lutomirski wrote:
>
> Just a few nits.
>
>>  /* Provide the fixmap address of the remapped GDT */
>>  static inline struct desc_struct *get_cpu_gdt_ro(int cpu)
>>  {
>> -     unsigned int idx = get_cpu_gdt_ro_index(cpu);
>> -     return (struct desc_struct *)__fix_to_virt(idx);
>> +     return (struct desc_struct *)&get_cpu_entry_area(cpu)->gdt;
>>  }
>>
>>  /* Provide the current read-only GDT */
>> diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
>> index b0c505fe9a95..5ff8e251772a 100644
>> --- a/arch/x86/include/asm/fixmap.h
>> +++ b/arch/x86/include/asm/fixmap.h
>> @@ -44,6 +44,17 @@ extern unsigned long __FIXADDR_TOP;
>>                        PAGE_SIZE)
>>  #endif
>>
>> +/*
>> + * cpu_entry_area is a percpu region in the fixmap that contains things
>> + * need by the CPU and early entry/exit code.  Real types aren't used here
>
> s/need/needed/

Yeah, fixed

>
>> + * to avoid circular header dependencies.
>
> :(

Hmm.  I could probably fix this, but it involves (at least) moving a
struct definition and adding several new includes, and I'm not sure
it'll actually converge to something working.

>
>> + */
>> +struct cpu_entry_area
>> +{
>> +     char gdt[PAGE_SIZE];
>> +};
>> +
>> +#define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
>
>> +static inline unsigned int __get_cpu_entry_area_page_index(int cpu, int page)
>> +{
>> +     BUILD_BUG_ON(sizeof(struct cpu_entry_area) % PAGE_SIZE != 0);
>> +
>> +     return FIX_CPU_ENTRY_AREA_BOTTOM - cpu*CPU_ENTRY_AREA_PAGES - page;
>> +}
>> +
>> +#define __get_cpu_entry_area_offset_index(cpu, offset) ({            \
>> +     BUILD_BUG_ON(offset % PAGE_SIZE != 0);                          \
>> +     __get_cpu_entry_area_page_index(cpu, offset / PAGE_SIZE);       \
>> +     })
>> +
>> +#define get_cpu_entry_area_index(cpu, field)                         \
>> +     __get_cpu_entry_area_offset_index((cpu), offsetof(struct cpu_entry_area, field))
>
> Any reason why those need to be macros?

The former is a macro because I doubt that BUILD_BUG_ON is valid in
that context in a function.  The latter is a macro because 'field' is
a name, not a value.

>
>> +static inline struct cpu_entry_area *get_cpu_entry_area(int cpu)
>> +{
>> +     return (struct cpu_entry_area *)
>> +             __fix_to_virt(__get_cpu_entry_area_page_index(cpu, 0));
>> +}
>> +
>
> Thanks,
>
>         tglx
>


* Re: [PATCH 07/16] x86/asm: Move SYSENTER_stack to the beginning of struct tss_struct
  2017-11-20 23:37   ` Thomas Gleixner
@ 2017-11-21  1:25     ` Andy Lutomirski
  0 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2017-11-21  1:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Dave Hansen, Linus Torvalds, Josh Poimboeuf

On Mon, Nov 20, 2017 at 3:37 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Mon, 20 Nov 2017, Andy Lutomirski wrote:
>>  struct tss_struct {
>>       /*
>> -      * The hardware state:
>> +      * Space for the temporary SYSENTER stack.  Used for the entry
>> +      * trampoline as well.  Size it such that tss_struct ends up
>> +      * as a multiple of PAGE_SIZE.  This calculation assumes that
>> +      * io_bitmap is a multiple of PAGE_SIZE (8192 bytes) plus one
>> +      * long.
>
> I don't see how sizeof(tss_struct) is a multiple of PAGE_SIZE
>
> canary          =    8
> stack           =  512
> hw_tss          =  104
> io bitmap       = 8200
> -------------------------
>                   8824
>
> The alignment is what blows it up to 3 * PAGE_SIZE

Whoops!  That *was* correct in the RFC version, but I changed
it and failed to fix the comment.


* Re: [PATCH 02/16] x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
  2017-11-20 21:55               ` Josh Poimboeuf
@ 2017-11-21  1:39                 ` Andy Lutomirski
  2017-11-21  2:29                   ` Josh Poimboeuf
  0 siblings, 1 reply; 52+ messages in thread
From: Andy Lutomirski @ 2017-11-21  1:39 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Dave Hansen, Linus Torvalds

On Mon, Nov 20, 2017 at 1:55 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Mon, Nov 20, 2017 at 01:30:12PM -0800, Andy Lutomirski wrote:
>> On Mon, Nov 20, 2017 at 1:27 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > On Mon, Nov 20, 2017 at 01:07:16PM -0800, Andy Lutomirski wrote:
>> >> >> but, more importantly, the OOPS unwinder will just bail without this
>> >> >> patch.  With the patch, we get a valid unwind, except that everything
>> >> >> has a ?  in front.
>> >> >
>> >> > Hm.  I can't even fathom how that's possible.  Are you talking about the
>> >> > "unwind from NMI to SYSENTER stack" path?  Or any unwind to a syscall?
>> >> > Either way I'm baffled...  If the unwinder only encounters the SYSENTER
>> >> > stack at the end, how could that cause everything beforehand to have a
>> >> > question mark?
>> >>
>> >> I mean that, if I put a ud2 or other bug in the code that runs on the
>> >> SYSENTER stack, without this patch, I get a totally blank call trace.
>> >
>> > I would expect a blank call trace either way...
>>
>> Try making sync_regs use a few kB of stack space or, better yet, call
>> a non-inlined function that uses too much stack.
>
> You mean overflow the exception stack?  I still don't see how that would
> do it.
>
> If you could show a specific example, with splats from before/after,
> that would be helpful.  Because I still have no idea how this patch
> could possibly help.

I added BUG() to sync_regs().  With the patch, I get:

[    4.211553] PANIC: double fault, error_code: 0x0
[    4.212113] CPU: 0 PID: 1 Comm: sh Not tainted 4.14.0+ #920
[    4.212741] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.10.2-1.fc26 04/01/2014
[    4.213536] task: ffff88001aa18000 task.stack: ffff88001aa20000
[    4.214059] RIP: 0010:do_error_trap+0x33/0x1c0
[    4.214449] RSP: 0000:ffffffffff1b8f78 EFLAGS: 00010096
[    4.214934] RAX: dffffc0000000000 RBX: ffffffffff1b8f90 RCX: 0000000000000006
[    4.215554] RDX: ffffffff82048b20 RSI: 0000000000000000 RDI: ffffffffff1b9110
[    4.216176] RBP: ffffffffff1b9088 R08: 0000000000000004 R09: 0000000000000000
[    4.216793] R10: 0000000000000000 R11: fffffbffffe3723f R12: 0000000000000006
[    4.217419] R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000000
[    4.218046] FS:  0000000000000000(0000) GS:ffff88001ae00000(0000)
knlGS:0000000000000000
[    4.218775] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    4.219280] CR2: ffffffffff1b8f68 CR3: 00000000193da002 CR4: 00000000003606f0
[    4.219931] Call Trace:
[    4.220156]  <SYSENTER>
[    4.220383]  ? async_page_fault+0x36/0x60
[    4.220768]  ? invalid_op+0x22/0x40
[    4.221087]  ? async_page_fault+0x36/0x60
[    4.221442]  ? sync_regs+0x3c/0x40
[    4.221745]  ? sync_regs+0x2e/0x40
[    4.222051]  ? error_entry+0x6c/0xd0
[    4.222395]  ? async_page_fault+0x36/0x60
[    4.222748]  </SYSENTER>
[    4.223014] Code: 00 00 00 fc ff df 41 55 41 54 49 89 f7 55 53 48
89 fd 48 81 c7 88 00 00 00 49 89 cc 45 89 c6 48 81 ec d8 00 00 00 48
8d 5c 24 18 <48> 89 14 24 48 c7 44 24 18 b3 8a b5 41 48 c7 44 24 20 b8
5e 4d

Without the patch, I get:

[    3.962218] PANIC: double fault, error_code: 0x0
[    3.962728] CPU: 0 PID: 1 Comm: sh Not tainted 4.14.0+ #921
[    3.963225] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.10.2-1.fc26 04/01/2014
[    3.963998] task: ffff88001aa18000 task.stack: ffff88001aa20000
[    3.964526] RIP: 0010:do_error_trap+0x33/0x1c0
[    3.964947] RSP: 0000:ffffffffff1b8f78 EFLAGS: 00010096
[    3.965439] RAX: dffffc0000000000 RBX: ffffffffff1b8f90 RCX: 0000000000000006
[    3.966067] RDX: ffffffff82048b20 RSI: 0000000000000000 RDI: ffffffffff1b9110
[    3.966696] RBP: ffffffffff1b9088 R08: 0000000000000004 R09: 0000000000000000
[    3.967316] R10: 0000000000000000 R11: fffffbffffe3723f R12: 0000000000000006
[    3.967939] R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000000
[    3.968565] FS:  0000000000000000(0000) GS:ffff88001ae00000(0000)
knlGS:0000000000000000
[    3.969272] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    3.969776] CR2: ffffffffff1b8f68 CR3: 0000000019308002 CR4: 00000000003606f0
[    3.970406] Call Trace:
[    3.970632] Code: 00 00 00 fc ff df 41 55 41 54 49 89 f7 55 53 48
89 fd 48 81 c7 88 00 00 00 49 89 cc 45 89 c6 48 81 ec d8 00 00 00 48
8d 5c 24 18 <48> 89 14 24 48 c7 44 24 18 b3 8a b5 41 48 c7 44 24 20 b8
5e 4d

I prefer the former :)

>
> --
> Josh


* Re: [PATCH 02/16] x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
  2017-11-21  1:39                 ` Andy Lutomirski
@ 2017-11-21  2:29                   ` Josh Poimboeuf
  0 siblings, 0 replies; 52+ messages in thread
From: Josh Poimboeuf @ 2017-11-21  2:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds

On Mon, Nov 20, 2017 at 05:39:34PM -0800, Andy Lutomirski wrote:
> On Mon, Nov 20, 2017 at 1:55 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Mon, Nov 20, 2017 at 01:30:12PM -0800, Andy Lutomirski wrote:
> >> On Mon, Nov 20, 2017 at 1:27 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > On Mon, Nov 20, 2017 at 01:07:16PM -0800, Andy Lutomirski wrote:
> >> >> >> but, more importantly, the OOPS unwinder will just bail without this
> >> >> >> patch.  With the patch, we get a valid unwind, except that everything
> >> >> >> has a ?  in front.
> >> >> >
> >> >> > Hm.  I can't even fathom how that's possible.  Are you talking about the
> >> >> > "unwind from NMI to SYSENTER stack" path?  Or any unwind to a syscall?
> >> >> > Either way I'm baffled...  If the unwinder only encounters the SYSENTER
> >> >> > stack at the end, how could that cause everything beforehand to have a
> >> >> > question mark?
> >> >>
> >> >> I mean that, if I put a ud2 or other bug in the code that runs on the
> >> >> SYSENTER stack, without this patch, I get a totally blank call trace.
> >> >
> >> > I would expect a blank call trace either way...
> >>
> >> Try making sync_regs use a few kB of stack space or, better yet, call
> >> a non-inlined function that uses too much stack.
> >
> > You mean overflow the exception stack?  I still don't see how that would
> > do it.
> >
> > If you could show a specific example, with splats from before/after,
> > that would be helpful.  Because I still have no idea how this patch
> > could possibly help.
> 
> I added BUG() to sync_regs().  With the patch, I get:
> 
> [    4.211553] PANIC: double fault, error_code: 0x0
> [    4.212113] CPU: 0 PID: 1 Comm: sh Not tainted 4.14.0+ #920
> [    4.212741] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.10.2-1.fc26 04/01/2014
> [    4.213536] task: ffff88001aa18000 task.stack: ffff88001aa20000
> [    4.214059] RIP: 0010:do_error_trap+0x33/0x1c0
> [    4.214449] RSP: 0000:ffffffffff1b8f78 EFLAGS: 00010096
> [    4.214934] RAX: dffffc0000000000 RBX: ffffffffff1b8f90 RCX: 0000000000000006
> [    4.215554] RDX: ffffffff82048b20 RSI: 0000000000000000 RDI: ffffffffff1b9110
> [    4.216176] RBP: ffffffffff1b9088 R08: 0000000000000004 R09: 0000000000000000
> [    4.216793] R10: 0000000000000000 R11: fffffbffffe3723f R12: 0000000000000006
> [    4.217419] R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000000
> [    4.218046] FS:  0000000000000000(0000) GS:ffff88001ae00000(0000)
> knlGS:0000000000000000
> [    4.218775] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    4.219280] CR2: ffffffffff1b8f68 CR3: 00000000193da002 CR4: 00000000003606f0
> [    4.219931] Call Trace:
> [    4.220156]  <SYSENTER>
> [    4.220383]  ? async_page_fault+0x36/0x60
> [    4.220768]  ? invalid_op+0x22/0x40
> [    4.221087]  ? async_page_fault+0x36/0x60
> [    4.221442]  ? sync_regs+0x3c/0x40
> [    4.221745]  ? sync_regs+0x2e/0x40
> [    4.222051]  ? error_entry+0x6c/0xd0
> [    4.222395]  ? async_page_fault+0x36/0x60
> [    4.222748]  </SYSENTER>

Ah, page fault.  I thought you were talking about an NMI.  I get it now.

Did it overflow the stack?  I think that would explain the question
marks.

-- 
Josh


* Re: [PATCH 13/16] x86/entry/64: Create a percpu SYSCALL entry trampoline
  2017-11-20 17:07 ` [PATCH 13/16] x86/entry/64: Create a percpu SYSCALL entry trampoline Andy Lutomirski
@ 2017-11-21  2:34   ` Josh Poimboeuf
  2017-11-21  3:20     ` Andy Lutomirski
  0 siblings, 1 reply; 52+ messages in thread
From: Josh Poimboeuf @ 2017-11-21  2:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds

On Mon, Nov 20, 2017 at 09:07:44AM -0800, Andy Lutomirski wrote:
> +	/* Save RDI, since we need a scratch register. */
> +	pushq	%rdi
> +
> +	/*
> +	 * x86 lacks a near absolute jump, and we can't jump to the real
> +	 * entry text with a relative jump, so we use a double trampoline.
> +	 */
> +	movq	$entry_SYSCALL_64_stage2, %rdi
> +	jmp	*%rdi
> +END(entry_SYSCALL_64_trampoline)
> +
> +	.popsection
> +
> +ENTRY(entry_SYSCALL_64_stage2)
> +	/*
> +	 * Rather than polluting the normal SYSCALL path with stack switching
> +	 * nonsense, fix up our register state to match its expectations.
> +	 */
> +	UNWIND_HINT_EMPTY
> +	popq %rdi
> +	jmp entry_SYSCALL_64_after_hwframe
> +END(entry_SYSCALL_64_stage2)

Is there a reason why you couldn't just do the following?

	pushq	$entry_SYSCALL_64_after_hwframe
	ret

Then you wouldn't need the 2nd trampoline and the %rdi clobber.

-- 
Josh


* Re: [PATCH 13/16] x86/entry/64: Create a percpu SYSCALL entry trampoline
  2017-11-21  2:34   ` Josh Poimboeuf
@ 2017-11-21  3:20     ` Andy Lutomirski
  0 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2017-11-21  3:20 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Dave Hansen, Linus Torvalds

On Mon, Nov 20, 2017 at 6:34 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Mon, Nov 20, 2017 at 09:07:44AM -0800, Andy Lutomirski wrote:
>> +     /* Save RDI, since we need a scratch register. */
>> +     pushq   %rdi
>> +
>> +     /*
>> +      * x86 lacks a near absolute jump, and we can't jump to the real
>> +      * entry text with a relative jump, so we use a double trampoline.
>> +      */
>> +     movq    $entry_SYSCALL_64_stage2, %rdi
>> +     jmp     *%rdi
>> +END(entry_SYSCALL_64_trampoline)
>> +
>> +     .popsection
>> +
>> +ENTRY(entry_SYSCALL_64_stage2)
>> +     /*
>> +      * Rather than polluting the normal SYSCALL path with stack switching
>> +      * nonsense, fix up our register state to match its expectations.
>> +      */
>> +     UNWIND_HINT_EMPTY
>> +     popq %rdi
>> +     jmp entry_SYSCALL_64_after_hwframe
>> +END(entry_SYSCALL_64_stage2)
>
> Is there a reason why you couldn't just do the following?
>
>         pushq   $entry_SYSCALL_64_after_hwframe
>         ret
>
> Then you wouldn't need the 2nd trampoline and the %rdi clobber.
>

Good suggestion.  Thanks!

> --
> Josh


* Re: [PATCH 14/16] x86/irq: Remove an old outdated comment about context tracking races
  2017-11-20 17:07 ` [PATCH 14/16] x86/irq: Remove an old outdated comment about context tracking races Andy Lutomirski
@ 2017-11-21  6:25   ` Ingo Molnar
  0 siblings, 0 replies; 52+ messages in thread
From: Ingo Molnar @ 2017-11-21  6:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf


* Andy Lutomirski <luto@kernel.org> wrote:

> That race has been fixed and code cleaned up for a while now.

JFYI, I have added your SOB here, which I assume you just forgot to include.

Thanks,

	Ingo


* Re: [PATCH 15/16] x86/irq/64: In the stack overflow warning, print the offending IP
  2017-11-20 17:07 ` [PATCH 15/16] x86/irq/64: In the stack overflow warning, print the offending IP Andy Lutomirski
@ 2017-11-21  6:26   ` Ingo Molnar
  0 siblings, 0 replies; 52+ messages in thread
From: Ingo Molnar @ 2017-11-21  6:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf


* Andy Lutomirski <luto@kernel.org> wrote:

> In case something goes wrong with unwind (not unlikely in case of
> overflow), print the offending IP where we detected the overflow.

I have added a SOB here as well.

Thanks,

	Ingo


* Re: [PATCH 11/16] x86/asm/64: Use a percpu trampoline stack for IDT entries
  2017-11-20 17:07 ` [PATCH 11/16] x86/asm/64: Use a percpu trampoline stack for IDT entries Andy Lutomirski
@ 2017-11-21  7:20   ` Ingo Molnar
  2017-11-21 15:36     ` Andy Lutomirski
  2017-11-21 18:57   ` Dave Hansen
  1 sibling, 1 reply; 52+ messages in thread
From: Ingo Molnar @ 2017-11-21  7:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf


* Andy Lutomirski <luto@kernel.org> wrote:

> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 1ea03027a4a9..e4a941be96cf 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c

> -asmlinkage __visible notrace
> +asmlinkage __visible notrace __no_sanitize_address
>  struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s)
>  {
>  	/*
>  	 * This is called from entry_64.S early in handling a fault
>  	 * caused by a bad iret to user mode.  To handle the fault
> -	 * correctly, we want move our stack frame to task_pt_regs
> -	 * and we want to pretend that the exception came from the
> -	 * iret target.
> +	 * correctly, we want move our stack frame to where it would
> +	 * be had we entered directly on the entry stack (rather than
> +	 * just below the IRET frame) and we want to pretend that the
> +	 * exception came from the iret target.
>  	 */
>  	struct bad_iret_stack *new_stack =
> -		container_of(task_pt_regs(current),
> -			     struct bad_iret_stack, regs);
> +		(struct bad_iret_stack *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
>  
>  	/* Copy the IRET target to the new stack. */
>  	memmove(&new_stack->regs.ip, (void *)s->regs.sp, 5*8);

So the addition of the __no_sanitize_address attribute to fixup_bad_iret() made 
the 64-bit allyesconfig/allmodconfig kernels fail the build:

 In file included from ./include/linux/bitmap.h:9:0,
 arch/x86/kernel/traps.c: In function ‘fixup_bad_iret’:
 ./include/linux/string.h:344:24: error: inlining failed in call to always_inline 
 ‘memmove’: function attribute mismatch
  __FORTIFY_INLINE void *memmove(void *p, const void *q, __kernel_size_t size)
 arch/x86/kernel/traps.c:645:2: error: called from here
 In file included from ./include/linux/bitmap.h:9:0,
 ./include/linux/string.h:344:24: error: inlining failed in call to always_inline 
 ‘memmove’: function attribute mismatch
  __FORTIFY_INLINE void *memmove(void *p, const void *q, __kernel_size_t size)
 arch/x86/kernel/traps.c:642:2: error: called from here
 scripts/Makefile.build:314: recipe for target 'arch/x86/kernel/traps.o' failed

Thanks,

	Ingo


* Re: [PATCH 00/16] Entry stuff, in decent shape now
  2017-11-20 17:07 [PATCH 00/16] Entry stuff, in decent shape now Andy Lutomirski
                   ` (16 preceding siblings ...)
  2017-11-20 18:48 ` [PATCH 00/16] Entry stuff, in decent shape now Andy Lutomirski
@ 2017-11-21  7:33 ` Ingo Molnar
  2017-11-21 15:59   ` Andy Lutomirski
  17 siblings, 1 reply; 52+ messages in thread
From: Ingo Molnar @ 2017-11-21  7:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf


* Andy Lutomirski <luto@kernel.org> wrote:

> This sets up stack switching, including for SYSCALL.  I think it's
> in decent shape.
> 
> Known issues:
>  - KASAN is likely to be busted.  This could be fixed either by teaching
>    KASAN that cpu_entry_area contains valid stacks (I have no clue how
>    to go about doing this) or by rigging up the IST entry code to switch
>    RSP to point to the direct-mapped copy of the stacks before calling
>    into non-KASAN-excluded C code.
> 
>  - 32-bit kernels are failing the sigreturn_32 test.  But they're also
>    failing without the patches, so I'm not sure this is a bug in the
>    series per se.  Needs further investigation.  (Off the top of my head,
>    this could be further fallout from Thomas's IDT rework.)
> 
>  - I think we're going to want a way to turn the stack switching on and
>    off either at boot time or at runtime.  It should be fairly straightforward
>    to make it work.
> 
>  - I think the ORC unwinder isn't so good at dealing with stack overflows.
>    It bails too early (I think), resulting in lots of ? entries.  This
>    isn't a regression with this series -- it's just something that could
>    be improved.

Another problem I just found: IRQ tracing appears busted on 64-bit kernels - with 
lockdep enabled I get this boot warning:

 [    4.309026] WARNING: CPU: 10 PID: 222 at kernel/locking/lockdep.c:3924 check_flags.part.45+0x1a5/0x1b0
 ...
 [    4.309026] possible reason: unannotated irqs-off.

That's on a x86-64 defconfig-ish kernel with CONFIG_PROVE_LOCKING=y, running on an 
AMD system. Full splat below.

Thanks,

	Ingo

[    4.272197] LVT offset 0 assigned for vector 0x400
[    4.278486] perf: AMD IBS detected (0x000000ff)
[    4.284786] kvm: Nested Virtualization enabled
[    4.289447] kvm: Nested Paging enabled
[    4.308496] DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)
[    4.308512] ------------[ cut here ]------------
[    4.309026] WARNING: CPU: 10 PID: 222 at kernel/locking/lockdep.c:3924 check_flags.part.45+0x1a5/0x1b0
[    4.309026] Modules linked in:
[    4.309026] CPU: 10 PID: 222 Comm: modprobe Not tainted 4.14.0-01345-g9490674-dirty #1
[    4.309026] Hardware name: Supermicro H8DG6/H8DGi/H8DG6/H8DGi, BIOS 2.0b       03/01/2012
[    4.309026] task: ffff880814b08000 task.stack: ffffc90007dcc000
[    4.309026] RIP: 0010:check_flags.part.45+0x1a5/0x1b0
[    4.309026] RSP: 0018:ffffc90007dcfeb0 EFLAGS: 00010082
[    4.309026] RAX: 000000000000002e RBX: ffff880814b08000 RCX: 0000000000000000
[    4.309026] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff81152e76
[    4.309026] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[    4.309026] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[    4.309026] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
[    4.309026] FS:  0000000000000000(0000) GS:ffff880817c80000(0000) knlGS:0000000000000000
[    4.309026] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    4.309026] CR2: 00007f8badc39218 CR3: 000000081584b000 CR4: 00000000000406e0
[    4.309026] Call Trace:
[    4.309026]  lock_acquire+0x11a/0x1d0
[    4.309026]  vtime_user_exit+0x3c/0xa0
[    4.309026]  ? __context_tracking_exit.part.4+0x45/0x130
[    4.309026]  __context_tracking_exit.part.4+0x45/0x130
[    4.309026]  do_syscall_64+0x13f/0x220
[    4.309026]  entry_SYSCALL64_slow_path+0x25/0x25
[    4.309026] RIP: 0033:0x7f8bada2e42a
[    4.309026] RSP: 002b:00007fffba194e38 EFLAGS: 00000246 ORIG_RAX: 000000000000000c
[    4.309026] RAX: ffffffffffffffda RBX: 0000000000400040 RCX: 00007f8bada2e42a
[    4.309026] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[    4.309026] RBP: 0000000000000009 R08: 000000000000037f R09: 0000000000000064
[    4.309026] R10: 00000000178bfbff R11: 0000000000000246 R12: 00007f8bada19ce0
[    4.309026] R13: 0000000000000000 R14: 0000000000403308 R15: 0000000000001000
[    4.309026] Code: c6 9c f6 ed 81 48 c7 c7 e2 88 ed 81 e8 a0 32 01 00 0f ff e9 ef fe ff ff 48 c7 c6 9d f6 ed 81 48 c7 c7 e2 88 ed 81 e8 86 32 01 00 <0f> ff e9 5f ff ff ff 0f 1f 40 00 41 57 41 56 41 55 41 54 55 53 
[    4.309026] ---[ end trace bdfbcef9b01b1cbb ]---
[    4.309026] possible reason: unannotated irqs-off.
[    4.309026] irq event stamp: 445
[    4.309026] hardirqs last  enabled at (445): [<ffffffff81a4ae23>] swapgs_restore_regs_and_return_to_usermode+0x0/0x3c
[    4.309026] hardirqs last disabled at (444): [<ffffffff81a4bac6>] error_exit+0x6/0x20
[    4.309026] softirqs last  enabled at (24): [<ffffffff81a4e6f7>] __do_softirq+0x3b7/0x468
[    4.309026] softirqs last disabled at (7): [<ffffffff810ebec4>] irq_exit+0xc4/0xd0
[    4.572172] audit: initializing netlink subsys (disabled)
[    4.577846] audit: type=2000 audit(1511245386.576:1): state=initialized audit_enabled=0 res=1
[    4.578742] workingset: timestamp_bits=53 max_order=24 bucket_order=0
[    4.582656] SELinux:  Registering netfilter hooks
[    4.606349] NET: Registered protocol family 38

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 16/16] x86/entry/64: Move the IST stacks into cpu_entry_area
  2017-11-20 17:07 ` [PATCH 16/16] x86/entry/64: Move the IST stacks into cpu_entry_area Andy Lutomirski
@ 2017-11-21  7:38   ` Ingo Molnar
  2017-11-21 14:45     ` Andrey Ryabinin
  2017-11-21 15:33     ` Andy Lutomirski
  0 siblings, 2 replies; 52+ messages in thread
From: Ingo Molnar @ 2017-11-21  7:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf


* Andy Lutomirski <luto@kernel.org> wrote:

>  /* May not be marked __init: used by software suspend */
>  void syscall_init(void)
>  {
> @@ -1627,7 +1637,7 @@ void cpu_init(void)
>  	 * set up and load the per-CPU TSS
>  	 */
>  	if (!oist->ist[0]) {
> -		char *estacks = per_cpu(exception_stacks, cpu);
> +		char *estacks = get_cpu_entry_area(cpu)->exception_stacks;
>  
>  		for (v = 0; v < N_EXCEPTION_STACKS; v++) {
>  			estacks += exception_stack_sizes[v];

This generates a new build warning:

 /home/mingo/tip/arch/x86/kernel/cpu/common.c: In function ‘syscall_init’:
 /home/mingo/tip/arch/x86/kernel/cpu/common.c:1443:6: warning: unused variable ‘cpu’ [-Wunused-variable]
   int cpu = smp_processor_id();

because 'cpu' is now unused in the !CONFIG_IA32_EMULATION part.

The naive fix is something like the patch below, untested.

Thanks,

	Ingo

 arch/x86/kernel/cpu/common.c | 30 ++++++++++++++++--------------
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c7f3a0a19dce..557bad9f4179 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1440,24 +1440,26 @@ void syscall_init(void)
 	extern char _entry_trampoline[];
 	extern char entry_SYSCALL_64_trampoline[];
 
-	int cpu = smp_processor_id();
-
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
 	wrmsrl(MSR_LSTAR, (unsigned long)get_cpu_entry_area(smp_processor_id())->entry_trampoline + (entry_SYSCALL_64_trampoline - _entry_trampoline));
 
 #ifdef CONFIG_IA32_EMULATION
-	wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
-	/*
-	 * This only works on Intel CPUs.
-	 * On AMD CPUs these MSRs are 32-bit, CPU truncates MSR_IA32_SYSENTER_EIP.
-	 * This does not cause SYSENTER to jump to the wrong location, because
-	 * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
-	 */
-	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
-	wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
-		    (unsigned long)&get_cpu_entry_area(cpu)->tss +
-		    offsetofend(struct tss_struct, SYSENTER_stack));
-	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
+	{
+		int cpu = smp_processor_id();
+
+		wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
+		/*
+		 * This only works on Intel CPUs.
+		 * On AMD CPUs these MSRs are 32-bit, CPU truncates MSR_IA32_SYSENTER_EIP.
+		 * This does not cause SYSENTER to jump to the wrong location, because
+		 * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
+		 */
+		wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
+		wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
+			    (unsigned long)&get_cpu_entry_area(cpu)->tss +
+			    offsetofend(struct tss_struct, SYSENTER_stack));
+		wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
+	}
 #else
 	wrmsrl(MSR_CSTAR, (unsigned long)ignore_sysret);
 	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH 04/16] x86/fixmap: Generalize the GDT fixmap mechanism
  2017-11-21  1:21     ` Andy Lutomirski
@ 2017-11-21  8:29       ` Thomas Gleixner
  0 siblings, 0 replies; 52+ messages in thread
From: Thomas Gleixner @ 2017-11-21  8:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf

On Mon, 20 Nov 2017, Andy Lutomirski wrote:
> On Mon, Nov 20, 2017 at 2:01 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Mon, 20 Nov 2017, Andy Lutomirski wrote:
> >> + * to avoid circular header dependencies.
> >
> > :(
> 
> Hmm.  I could probably fix this, but it involves (at least) moving a
> struct definition and adding several new includes, and I'm not sure
> it'll actually converge to something working.

Yeah, it's include hell. Looked at it and it's major churn.

> >> + */
> >> +struct cpu_entry_area
> >> +{
> >> +     char gdt[PAGE_SIZE];
> >> +};
> >> +
> >> +#define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
> >
> >> +static inline unsigned int __get_cpu_entry_area_page_index(int cpu, int page)
> >> +{
> >> +     BUILD_BUG_ON(sizeof(struct cpu_entry_area) % PAGE_SIZE != 0);
> >> +
> >> +     return FIX_CPU_ENTRY_AREA_BOTTOM - cpu*CPU_ENTRY_AREA_PAGES - page;
> >> +}
> >> +
> >> +#define __get_cpu_entry_area_offset_index(cpu, offset) ({            \
> >> +     BUILD_BUG_ON(offset % PAGE_SIZE != 0);                          \
> >> +     __get_cpu_entry_area_page_index(cpu, offset / PAGE_SIZE);       \
> >> +     })
> >> +
> >> +#define get_cpu_entry_area_index(cpu, field)                         \
> >> +     __get_cpu_entry_area_offset_index((cpu), offsetof(struct cpu_entry_area, field))
> >
> > Any reason why those need to be macros?
> 
> The former is a macro because I doubt that BUILD_BUG_ON is valid in
> that context in a function.

Fair enough.

> The latter is a macro because 'field' is a name, not a value.

Bah. right. 

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 01/16] x86/asm/64: Allocate and enable the SYSENTER stack
  2017-11-20 17:07 ` [PATCH 01/16] x86/asm/64: Allocate and enable the SYSENTER stack Andy Lutomirski
  2017-11-20 21:55   ` Thomas Gleixner
@ 2017-11-21 10:47   ` Borislav Petkov
  1 sibling, 0 replies; 52+ messages in thread
From: Borislav Petkov @ 2017-11-21 10:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Brian Gerst, Dave Hansen, Linus Torvalds,
	Josh Poimboeuf

On Mon, Nov 20, 2017 at 09:07:32AM -0800, Andy Lutomirski wrote:
> This will simplify future changes that want scratch variables early in
> the SYSENTER handler -- they'll be able to spill registers to the
> stack.  It also lets us get rid of a SWAPGS_UNSAFE_STACK user.
> 
> This does not depend on CONFIG_IA32_EMULATION because we'll want the
> stack space even without IA32 emulation.
> 
> As far as I can tell, the reason that this wasn't done from day 1 is
> that we use IST for #DB and #BP, which is IMO rather nasty and causes
> a lot more problems than it solves.  But, since #DB uses IST, we don't
> actually need a real stack for SYSENTER (because SYSENTER with TF set
> will invoke #DB on the IST stack rather than the SYSENTER stack).
> I want to remove IST usage from these vectors some day, and this patch
> is a prerequisite for that as well.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/entry/entry_64_compat.S | 2 +-
>  arch/x86/include/asm/processor.h | 3 ---
>  arch/x86/kernel/asm-offsets.c    | 5 +++++
>  arch/x86/kernel/asm-offsets_32.c | 5 -----
>  arch/x86/kernel/cpu/common.c     | 4 +++-
>  arch/x86/kernel/process.c        | 2 --
>  arch/x86/kernel/traps.c          | 3 +--
>  7 files changed, 10 insertions(+), 14 deletions(-)

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 16/16] x86/entry/64: Move the IST stacks into cpu_entry_area
  2017-11-21  7:38   ` Ingo Molnar
@ 2017-11-21 14:45     ` Andrey Ryabinin
  2017-11-23 15:25       ` Andy Lutomirski
  2017-11-21 15:33     ` Andy Lutomirski
  1 sibling, 1 reply; 52+ messages in thread
From: Andrey Ryabinin @ 2017-11-21 14:45 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, linux-kernel, Brian Gerst, Dave Hansen,
	Linus Torvalds, Josh Poimboeuf



On 11/21/2017 10:38 AM, Ingo Molnar wrote:
> 
> * Andy Lutomirski <luto@kernel.org> wrote:
> 
>>  /* May not be marked __init: used by software suspend */
>>  void syscall_init(void)
>>  {
>> @@ -1627,7 +1637,7 @@ void cpu_init(void)
>>  	 * set up and load the per-CPU TSS
>>  	 */
>>  	if (!oist->ist[0]) {
>> -		char *estacks = per_cpu(exception_stacks, cpu);
>> +		char *estacks = get_cpu_entry_area(cpu)->exception_stacks;
>>  
>>  		for (v = 0; v < N_EXCEPTION_STACKS; v++) {
>>  			estacks += exception_stack_sizes[v];
> 
> This generates a new build warning:
> 
>  /home/mingo/tip/arch/x86/kernel/cpu/common.c: In function ‘syscall_init’:
>  /home/mingo/tip/arch/x86/kernel/cpu/common.c:1443:6: warning: unused variable ‘cpu’ [-Wunused-variable]
>    int cpu = smp_processor_id();
> 
> because 'cpu' is now unused in the !CONFIG_IA32_EMULATION part.
> 
> The naive fix is something like the patch below, untested.
> 
> Thanks,
> 
> 	Ingo
> 
>  arch/x86/kernel/cpu/common.c | 30 ++++++++++++++++--------------
>  1 file changed, 16 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index c7f3a0a19dce..557bad9f4179 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -1440,24 +1440,26 @@ void syscall_init(void)
>  	extern char _entry_trampoline[];
>  	extern char entry_SYSCALL_64_trampoline[];
>  
> -	int cpu = smp_processor_id();
> -
>  	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
>  	wrmsrl(MSR_LSTAR, (unsigned long)get_cpu_entry_area(smp_processor_id())->entry_trampoline + (entry_SYSCALL_64_trampoline - _entry_trampoline));

             It would be better to use 'cpu' variable here  ^^^^^^^^^^^^^^^^^
>  
>  #ifdef CONFIG_IA32_EMULATION
> -	wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
> -	/*
> -	 * This only works on Intel CPUs.
> -	 * On AMD CPUs these MSRs are 32-bit, CPU truncates MSR_IA32_SYSENTER_EIP.
> -	 * This does not cause SYSENTER to jump to the wrong location, because
> -	 * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
> -	 */
> -	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
> -	wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
> -		    (unsigned long)&get_cpu_entry_area(cpu)->tss +
> -		    offsetofend(struct tss_struct, SYSENTER_stack));
> -	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
> +	{
> +		int cpu = smp_processor_id();
> +
> +		wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
> +		/*
> +		 * This only works on Intel CPUs.
> +		 * On AMD CPUs these MSRs are 32-bit, CPU truncates MSR_IA32_SYSENTER_EIP.
> +		 * This does not cause SYSENTER to jump to the wrong location, because
> +		 * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
> +		 */
> +		wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
> +		wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
> +			    (unsigned long)&get_cpu_entry_area(cpu)->tss +
> +			    offsetofend(struct tss_struct, SYSENTER_stack));
> +		wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
> +	}
>  #else
>  	wrmsrl(MSR_CSTAR, (unsigned long)ignore_sysret);
>  	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 16/16] x86/entry/64: Move the IST stacks into cpu_entry_area
  2017-11-21  7:38   ` Ingo Molnar
  2017-11-21 14:45     ` Andrey Ryabinin
@ 2017-11-21 15:33     ` Andy Lutomirski
  1 sibling, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2017-11-21 15:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Dave Hansen, Linus Torvalds, Josh Poimboeuf

On Mon, Nov 20, 2017 at 11:38 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@kernel.org> wrote:
>
>>  /* May not be marked __init: used by software suspend */
>>  void syscall_init(void)
>>  {
>> @@ -1627,7 +1637,7 @@ void cpu_init(void)
>>        * set up and load the per-CPU TSS
>>        */
>>       if (!oist->ist[0]) {
>> -             char *estacks = per_cpu(exception_stacks, cpu);
>> +             char *estacks = get_cpu_entry_area(cpu)->exception_stacks;
>>
>>               for (v = 0; v < N_EXCEPTION_STACKS; v++) {
>>                       estacks += exception_stack_sizes[v];
>
> This generates a new build warning:
>
>  /home/mingo/tip/arch/x86/kernel/cpu/common.c: In function ‘syscall_init’:
>  /home/mingo/tip/arch/x86/kernel/cpu/common.c:1443:6: warning: unused variable ‘cpu’ [-Wunused-variable]
>    int cpu = smp_processor_id();
>
> because 'cpu' is now unused in the !CONFIG_IA32_EMULATION part.
>
> The naive fix is something like the patch below, untested.

I don't think I get this warning (at least not on my latest tree), but
there's an obvious cleanup to the earlier SYSCALL trampoline patch
that ensures 'cpu' is definitely used.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 11/16] x86/asm/64: Use a percpu trampoline stack for IDT entries
  2017-11-21  7:20   ` Ingo Molnar
@ 2017-11-21 15:36     ` Andy Lutomirski
  0 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2017-11-21 15:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Dave Hansen, Linus Torvalds, Josh Poimboeuf

On Mon, Nov 20, 2017 at 11:20 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@kernel.org> wrote:
>
>> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
>> index 1ea03027a4a9..e4a941be96cf 100644
>> --- a/arch/x86/kernel/traps.c
>> +++ b/arch/x86/kernel/traps.c
>
>> -asmlinkage __visible notrace
>> +asmlinkage __visible notrace __no_sanitize_address
>>  struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s)
>>  {
>>       /*
>>        * This is called from entry_64.S early in handling a fault
>>        * caused by a bad iret to user mode.  To handle the fault
>> -      * correctly, we want move our stack frame to task_pt_regs
>> -      * and we want to pretend that the exception came from the
>> -      * iret target.
>> +      * correctly, we want move our stack frame to where it would
>> +      * be had we entered directly on the entry stack (rather than
>> +      * just below the IRET frame) and we want to pretend that the
>> +      * exception came from the iret target.
>>        */
>>       struct bad_iret_stack *new_stack =
>> -             container_of(task_pt_regs(current),
>> -                          struct bad_iret_stack, regs);
>> +             (struct bad_iret_stack *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
>>
>>       /* Copy the IRET target to the new stack. */
>>       memmove(&new_stack->regs.ip, (void *)s->regs.sp, 5*8);
>
> So the addition of the __no_sanitize_address attribute to fixup_bad_iret() made
> the 64-bit allyesconfig/allmodconfig kernels fail the build:

Yeah, I *hate* the way this works.  Apparently getting
__no_sanitize_address to be fully reliable involves just building the
whole file without KASAN.

That being said, I have this fixed much better in my latest tree: I
fixed up KASAN so that I don't need to turn it off for these helpers.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 00/16] Entry stuff, in decent shape now
  2017-11-21  7:33 ` Ingo Molnar
@ 2017-11-21 15:59   ` Andy Lutomirski
  2017-11-21 16:12     ` Andy Lutomirski
  0 siblings, 1 reply; 52+ messages in thread
From: Andy Lutomirski @ 2017-11-21 15:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Dave Hansen, Linus Torvalds, Josh Poimboeuf

On Mon, Nov 20, 2017 at 11:33 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@kernel.org> wrote:
>
>> This sets up stack switching, including for SYSCALL.  I think it's
>> in decent shape.
>>
>> Known issues:
>>  - KASAN is likely to be busted.  This could be fixed either by teaching
>>    KASAN that cpu_entry_area contains valid stacks (I have no clue how
>>    to go about doing this) or by rigging up the IST entry code to switch
>>    RSP to point to the direct-mapped copy of the stacks before calling
>>    into non-KASAN-excluded C code.
>>
>>  - 32-bit kernels are failing the sigreturn_32 test.  But they're also
>>    failing without the patches, so I'm not sure this is a bug in the
>>    series per se.  Needs further investigation.  (Off the top of my head,
>>    this could be further fallout from Thomas's IDT rework.)
>>
>>  - I think we're going to want a way to turn the stack switching on and
>>    off either at boot time or at runtime.  It should be fairly straightforward
>>    to make it work.
>>
>>  - I think the ORC unwinder isn't so good at dealing with stack overflows.
>>    It bails too early (I think), resulting in lots of ? entries.  This
>>    isn't a regression with this series -- it's just something that could
>>    be improved.
>
> Another problem I just found: IRQ tracing appears busted on 64-bit kernels - with
> lockdep enabled I get this boot warning:
>
>  [    4.309026] WARNING: CPU: 10 PID: 222 at kernel/locking/lockdep.c:3924 check_flags.part.45+0x1a5/0x1b0
>  ...
>  [    4.309026] possible reason: unannotated irqs-off.
>
> That's on a x86-64 defconfig-ish kernel with CONFIG_PROVE_LOCKING=y, running on an
> AMD system. Full splat below.
>
> Thanks,
>
>         Ingo
>
> [    4.272197] LVT offset 0 assigned for vector 0x400
> [    4.278486] perf: AMD IBS detected (0x000000ff)
> [    4.284786] kvm: Nested Virtualization enabled
> [    4.289447] kvm: Nested Paging enabled
> [    4.308496] DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)
> [    4.308512] ------------[ cut here ]------------
> [    4.309026] WARNING: CPU: 10 PID: 222 at kernel/locking/lockdep.c:3924 check_flags.part.45+0x1a5/0x1b0
> [    4.309026] Modules linked in:
> [    4.309026] CPU: 10 PID: 222 Comm: modprobe Not tainted 4.14.0-01345-g9490674-dirty #1
> [    4.309026] Hardware name: Supermicro H8DG6/H8DGi/H8DG6/H8DGi, BIOS 2.0b       03/01/2012
> [    4.309026] task: ffff880814b08000 task.stack: ffffc90007dcc000
> [    4.309026] RIP: 0010:check_flags.part.45+0x1a5/0x1b0
> [    4.309026] RSP: 0018:ffffc90007dcfeb0 EFLAGS: 00010082
> [    4.309026] RAX: 000000000000002e RBX: ffff880814b08000 RCX: 0000000000000000
> [    4.309026] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff81152e76
> [    4.309026] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
> [    4.309026] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [    4.309026] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
> [    4.309026] FS:  0000000000000000(0000) GS:ffff880817c80000(0000) knlGS:0000000000000000
> [    4.309026] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    4.309026] CR2: 00007f8badc39218 CR3: 000000081584b000 CR4: 00000000000406e0
> [    4.309026] Call Trace:
> [    4.309026]  lock_acquire+0x11a/0x1d0
> [    4.309026]  vtime_user_exit+0x3c/0xa0
> [    4.309026]  ? __context_tracking_exit.part.4+0x45/0x130
> [    4.309026]  __context_tracking_exit.part.4+0x45/0x130
> [    4.309026]  do_syscall_64+0x13f/0x220
> [    4.309026]  entry_SYSCALL64_slow_path+0x25/0x25

I'm not reproducing this.  On quick inspection, the only potential
issue I see is that native_load_gs_index is missing IRQ tracing
annotations, but I don't see why this series would have any particular
effect on that.

Do you see it on my latest tree?

--Andy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 00/16] Entry stuff, in decent shape now
  2017-11-21 15:59   ` Andy Lutomirski
@ 2017-11-21 16:12     ` Andy Lutomirski
  0 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2017-11-21 16:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, X86 ML, Borislav Petkov, linux-kernel, Brian Gerst,
	Dave Hansen, Linus Torvalds, Josh Poimboeuf

On Tue, Nov 21, 2017 at 7:59 AM, Andy Lutomirski <luto@kernel.org> wrote:
> On Mon, Nov 20, 2017 at 11:33 PM, Ingo Molnar <mingo@kernel.org> wrote:
>>
>> * Andy Lutomirski <luto@kernel.org> wrote:
>>
>>> This sets up stack switching, including for SYSCALL.  I think it's
>>> in decent shape.
>>>
>>> Known issues:
>>>  - KASAN is likely to be busted.  This could be fixed either by teaching
>>>    KASAN that cpu_entry_area contains valid stacks (I have no clue how
>>>    to go about doing this) or by rigging up the IST entry code to switch
>>>    RSP to point to the direct-mapped copy of the stacks before calling
>>>    into non-KASAN-excluded C code.
>>>
>>>  - 32-bit kernels are failing the sigreturn_32 test.  But they're also
>>>    failing without the patches, so I'm not sure this is a bug in the
>>>    series per se.  Needs further investigation.  (Off the top of my head,
>>>    this could be further fallout from Thomas's IDT rework.)
>>>
>>>  - I think we're going to want a way to turn the stack switching on and
>>>    off either at boot time or at runtime.  It should be fairly straightforward
>>>    to make it work.
>>>
>>>  - I think the ORC unwinder isn't so good at dealing with stack overflows.
>>>    It bails too early (I think), resulting in lots of ? entries.  This
>>>    isn't a regression with this series -- it's just something that could
>>>    be improved.
>>
>> Another problem I just found: IRQ tracing appears busted on 64-bit kernels - with
>> lockdep enabled I get this boot warning:
>>
>>  [    4.309026] WARNING: CPU: 10 PID: 222 at kernel/locking/lockdep.c:3924 check_flags.part.45+0x1a5/0x1b0
>>  ...
>>  [    4.309026] possible reason: unannotated irqs-off.
>>
>> That's on a x86-64 defconfig-ish kernel with CONFIG_PROVE_LOCKING=y, running on an
>> AMD system. Full splat below.
>>
>> Thanks,
>>
>>         Ingo
>>
>> [ ... full lockdep splat trimmed; quoted in full earlier in the thread ... ]
>
> I'm not reproducing this.  On quick inspection, the only potential
> issue I see is that native_gs_load_index is missing IRQ tracing
> annotations, but I don't see why this series would have any particular
> effect on that.

Never mind, I reproduced it.

>
> Do you see it on my latest tree?
>
> --Andy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 03/16] x86/gdt: Put per-cpu GDT remaps in ascending order
  2017-11-20 17:07 ` [PATCH 03/16] x86/gdt: Put per-cpu GDT remaps in ascending order Andy Lutomirski
  2017-11-20 21:56   ` Thomas Gleixner
@ 2017-11-21 17:16   ` Borislav Petkov
  1 sibling, 0 replies; 52+ messages in thread
From: Borislav Petkov @ 2017-11-21 17:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Brian Gerst, Dave Hansen, Linus Torvalds,
	Josh Poimboeuf

On Mon, Nov 20, 2017 at 09:07:34AM -0800, Andy Lutomirski wrote:
> We currently have CPU 0's GDT at the top of the GDT range and
> higher-numbered CPUs at lower addreses.  This happens because the

addresses

> fixmap is upside down (index 0 is the top of the fixmap).
> 
> Flip it so that GDTs are in ascending order by virtual address.
> This will simplify a future patch that will generalize the GDT
> remap to contain multiple pages.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/include/asm/desc.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
> index 4011cb03ef08..95cd95eb7285 100644
> --- a/arch/x86/include/asm/desc.h
> +++ b/arch/x86/include/asm/desc.h
> @@ -63,7 +63,7 @@ static inline struct desc_struct *get_current_gdt_rw(void)
>  /* Get the fixmap index for a specific processor */
>  static inline unsigned int get_cpu_gdt_ro_index(int cpu)
>  {
> -	return FIX_GDT_REMAP_BEGIN + cpu;
> +	return FIX_GDT_REMAP_END - cpu;
>  }
>  
>  /* Provide the fixmap address of the remapped GDT */
> -- 

Other than the typo above:

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 11/16] x86/asm/64: Use a percpu trampoline stack for IDT entries
  2017-11-20 17:07 ` [PATCH 11/16] x86/asm/64: Use a percpu trampoline stack for IDT entries Andy Lutomirski
  2017-11-21  7:20   ` Ingo Molnar
@ 2017-11-21 18:57   ` Dave Hansen
  2017-11-22  3:45     ` Andy Lutomirski
  1 sibling, 1 reply; 52+ messages in thread
From: Dave Hansen @ 2017-11-21 18:57 UTC (permalink / raw)
  To: Andy Lutomirski, X86 ML
  Cc: Borislav Petkov, linux-kernel, Brian Gerst, Linus Torvalds,
	Josh Poimboeuf

On 11/20/2017 09:07 AM, Andy Lutomirski wrote:
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -560,6 +560,14 @@ END(irq_entries_start)
>  	.macro interrupt func
>  	cld
>  	ALLOC_PT_GPREGS_ON_STACK
> +
> +	testb	$3, CS(%rsp)
> +	jz	1f
> +	SWAPGS
> +	call	switch_to_thread_stack
> +	SWAPGS
> +1:

This looks really weird to me.  SWAPGS, switch stack, and SWAPGS again?

Is this so that we can use some per-cpu data in switch_to_thread_stack,
and then put GS back so that the normal, non-trampoline entry code can
do its normal SWAPGS voodoo?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 11/16] x86/asm/64: Use a percpu trampoline stack for IDT entries
  2017-11-21 18:57   ` Dave Hansen
@ 2017-11-22  3:45     ` Andy Lutomirski
  0 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2017-11-22  3:45 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, linux-kernel,
	Brian Gerst, Linus Torvalds, Josh Poimboeuf

On Tue, Nov 21, 2017 at 10:57 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 11/20/2017 09:07 AM, Andy Lutomirski wrote:
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -560,6 +560,14 @@ END(irq_entries_start)
>>       .macro interrupt func
>>       cld
>>       ALLOC_PT_GPREGS_ON_STACK
>> +
>> +     testb   $3, CS(%rsp)
>> +     jz      1f
>> +     SWAPGS
>> +     call    switch_to_thread_stack
>> +     SWAPGS
>> +1:
>
> This looks really weird to me.  SWAPGS, switch stack, and SWAPGS again?
>
> Is this so that we can use some per-cpu data in switch_to_thread_stack,
> and then put GS back so that the normal, non-trampoline entry code can
> do its normal SWAPGS voodoo?
>
>

Yes.  I was trying to avoid totally restructuring error_entry here.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 16/16] x86/entry/64: Move the IST stacks into cpu_entry_area
  2017-11-21 14:45     ` Andrey Ryabinin
@ 2017-11-23 15:25       ` Andy Lutomirski
  0 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2017-11-23 15:25 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Ingo Molnar, Andy Lutomirski, X86 ML, Borislav Petkov,
	linux-kernel, Brian Gerst, Dave Hansen, Linus Torvalds,
	Josh Poimboeuf

On Tue, Nov 21, 2017 at 6:45 AM, Andrey Ryabinin
<aryabinin@virtuozzo.com> wrote:
>
>
> On 11/21/2017 10:38 AM, Ingo Molnar wrote:
>>
>> * Andy Lutomirski <luto@kernel.org> wrote:
>>
>>>  /* May not be marked __init: used by software suspend */
>>>  void syscall_init(void)
>>>  {
>>> @@ -1627,7 +1637,7 @@ void cpu_init(void)
>>>       * set up and load the per-CPU TSS
>>>       */
>>>      if (!oist->ist[0]) {
>>> -            char *estacks = per_cpu(exception_stacks, cpu);
>>> +            char *estacks = get_cpu_entry_area(cpu)->exception_stacks;
>>>
>>>              for (v = 0; v < N_EXCEPTION_STACKS; v++) {
>>>                      estacks += exception_stack_sizes[v];
>>
>> This generates a new build warning:
>>
>>  /home/mingo/tip/arch/x86/kernel/cpu/common.c: In function ‘syscall_init’:
>>  /home/mingo/tip/arch/x86/kernel/cpu/common.c:1443:6: warning: unused variable ‘cpu’ [-Wunused-variable]
>>    int cpu = smp_processor_id();
>>
>> because 'cpu' is now unused in the !CONFIG_IA32_EMULATION part.
>>
>> The naive fix is something like the patch below, untested.
>>
>> Thanks,
>>
>>       Ingo
>>
>>  arch/x86/kernel/cpu/common.c | 30 ++++++++++++++++--------------
>>  1 file changed, 16 insertions(+), 14 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
>> index c7f3a0a19dce..557bad9f4179 100644
>> --- a/arch/x86/kernel/cpu/common.c
>> +++ b/arch/x86/kernel/cpu/common.c
>> @@ -1440,24 +1440,26 @@ void syscall_init(void)
>>       extern char _entry_trampoline[];
>>       extern char entry_SYSCALL_64_trampoline[];
>>
>> -     int cpu = smp_processor_id();
>> -
>>       wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
>>       wrmsrl(MSR_LSTAR, (unsigned long)get_cpu_entry_area(smp_processor_id())->entry_trampoline + (entry_SYSCALL_64_trampoline - _entry_trampoline));
>
>              It would be better to use 'cpu' variable here  ^^^^^^^^^^^^^^^^^

That's what I did :)

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2017-11-23 15:25 UTC | newest]

Thread overview: 52+ messages
2017-11-20 17:07 [PATCH 00/16] Entry stuff, in decent shape now Andy Lutomirski
2017-11-20 17:07 ` [PATCH 01/16] x86/asm/64: Allocate and enable the SYSENTER stack Andy Lutomirski
2017-11-20 21:55   ` Thomas Gleixner
2017-11-21 10:47   ` Borislav Petkov
2017-11-20 17:07 ` [PATCH 02/16] x86/dumpstack: Add get_stack_info() support for " Andy Lutomirski
2017-11-20 20:42   ` Josh Poimboeuf
2017-11-20 20:46     ` Andy Lutomirski
2017-11-20 21:00       ` Josh Poimboeuf
2017-11-20 21:07         ` Andy Lutomirski
2017-11-20 21:27           ` Josh Poimboeuf
2017-11-20 21:30             ` Andy Lutomirski
2017-11-20 21:55               ` Josh Poimboeuf
2017-11-21  1:39                 ` Andy Lutomirski
2017-11-21  2:29                   ` Josh Poimboeuf
2017-11-20 17:07 ` [PATCH 03/16] x86/gdt: Put per-cpu GDT remaps in ascending order Andy Lutomirski
2017-11-20 21:56   ` Thomas Gleixner
2017-11-21 17:16   ` Borislav Petkov
2017-11-20 17:07 ` [PATCH 04/16] x86/fixmap: Generalize the GDT fixmap mechanism Andy Lutomirski
2017-11-20 22:01   ` Thomas Gleixner
2017-11-21  1:21     ` Andy Lutomirski
2017-11-21  8:29       ` Thomas Gleixner
2017-11-20 17:07 ` [PATCH 05/16] x86/asm: Fix assumptions that the HW TSS is at the beginning of cpu_tss Andy Lutomirski
2017-11-20 22:22   ` Thomas Gleixner
2017-11-20 17:07 ` [PATCH 06/16] x86/dumpstack: Handle stack overflow on all stacks Andy Lutomirski
2017-11-20 17:07 ` [PATCH 07/16] x86/asm: Move SYSENTER_stack to the beginning of struct tss_struct Andy Lutomirski
2017-11-20 23:37   ` Thomas Gleixner
2017-11-21  1:25     ` Andy Lutomirski
2017-11-20 17:07 ` [PATCH 08/16] x86/asm: Remap the TSS into the cpu entry area Andy Lutomirski
2017-11-20 17:07 ` [PATCH 09/16] x86/asm/64: Separate cpu_current_top_of_stack from TSS.sp0 Andy Lutomirski
2017-11-20 17:07 ` [PATCH 10/16] x86/espfix/64: Stop assuming that pt_regs is on the entry stack Andy Lutomirski
2017-11-20 17:07 ` [PATCH 11/16] x86/asm/64: Use a percpu trampoline stack for IDT entries Andy Lutomirski
2017-11-21  7:20   ` Ingo Molnar
2017-11-21 15:36     ` Andy Lutomirski
2017-11-21 18:57   ` Dave Hansen
2017-11-22  3:45     ` Andy Lutomirski
2017-11-20 17:07 ` [PATCH 12/16] x86/asm/64: Return to userspace from the trampoline stack Andy Lutomirski
2017-11-20 17:07 ` [PATCH 13/16] x86/entry/64: Create a percpu SYSCALL entry trampoline Andy Lutomirski
2017-11-21  2:34   ` Josh Poimboeuf
2017-11-21  3:20     ` Andy Lutomirski
2017-11-20 17:07 ` [PATCH 14/16] x86/irq: Remove an old outdated comment about context tracking races Andy Lutomirski
2017-11-21  6:25   ` Ingo Molnar
2017-11-20 17:07 ` [PATCH 15/16] x86/irq/64: In the stack overflow warning, print the offending IP Andy Lutomirski
2017-11-21  6:26   ` Ingo Molnar
2017-11-20 17:07 ` [PATCH 16/16] x86/entry/64: Move the IST stacks into cpu_entry_area Andy Lutomirski
2017-11-21  7:38   ` Ingo Molnar
2017-11-21 14:45     ` Andrey Ryabinin
2017-11-23 15:25       ` Andy Lutomirski
2017-11-21 15:33     ` Andy Lutomirski
2017-11-20 18:48 ` [PATCH 00/16] Entry stuff, in decent shape now Andy Lutomirski
2017-11-21  7:33 ` Ingo Molnar
2017-11-21 15:59   ` Andy Lutomirski
2017-11-21 16:12     ` Andy Lutomirski
