* [PATCH v2 0/3] x86/pti: Get rid of entry trampolines and add some docs
@ 2018-09-03 22:59 Andy Lutomirski
  2018-09-03 22:59 ` [PATCH v2 1/3] x86/entry/64: Document idtentry Andy Lutomirski
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Andy Lutomirski @ 2018-09-03 22:59 UTC (permalink / raw)
  To: x86
  Cc: Borislav Petkov, LKML, Dave Hansen, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Linus Torvalds,
	Josh Poimboeuf, Joerg Roedel, Jiri Olsa, Andi Kleen,
	Peter Zijlstra, Andy Lutomirski

This gets rid of entry trampolines.  It's more or less the same as
the RFC version, except that I rebased it to v4.19-rc1 due to
massive conflicts with some perf changes.  I have *not* reverted all
of the perf support for entry trampolines -- I leave that to the
perf crew, if needed.

Changes from v1: Get rid of the rsp_scratch macro

Andy Lutomirski (3):
  x86/entry/64: Document idtentry
  x86/entry/64: Use the TSS sp2 slot for SYSCALL/SYSRET scratch space
  x86/pti/64: Remove the SYSCALL64 entry trampoline

 arch/x86/entry/entry_64.S             | 116 ++++++++++----------------
 arch/x86/include/asm/cpu_entry_area.h |   2 -
 arch/x86/include/asm/processor.h      |   6 ++
 arch/x86/include/asm/sections.h       |   1 -
 arch/x86/kernel/asm-offsets.c         |   5 +-
 arch/x86/kernel/cpu/common.c          |  11 +--
 arch/x86/kernel/kprobes/core.c        |  10 +--
 arch/x86/kernel/process_64.c          |   2 -
 arch/x86/kernel/traps.c               |   4 +
 arch/x86/kernel/vmlinux.lds.S         |  10 ---
 arch/x86/mm/cpu_entry_area.c          |  36 --------
 arch/x86/mm/pti.c                     |  33 +++++++-
 arch/x86/xen/xen-asm_64.S             |   8 +-
 13 files changed, 95 insertions(+), 149 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 1/3] x86/entry/64: Document idtentry
  2018-09-03 22:59 [PATCH v2 0/3] x86/pti: Get rid of entry trampolines and add some docs Andy Lutomirski
@ 2018-09-03 22:59 ` Andy Lutomirski
  2018-09-06  9:50   ` Borislav Petkov
  2018-09-08  9:33   ` [tip:x86/pti] " tip-bot for Andy Lutomirski
  2018-09-03 22:59 ` [PATCH v2 2/3] x86/entry/64: Use the TSS sp2 slot for SYSCALL/SYSRET scratch space Andy Lutomirski
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 24+ messages in thread
From: Andy Lutomirski @ 2018-09-03 22:59 UTC (permalink / raw)
  To: x86
  Cc: Borislav Petkov, LKML, Dave Hansen, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Linus Torvalds,
	Josh Poimboeuf, Joerg Roedel, Jiri Olsa, Andi Kleen,
	Peter Zijlstra, Andy Lutomirski

The idtentry macro is complicated and magical.  Document what it
does to help future readers and to allow future patches to adjust
the code and docs at the same time.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64.S | 35 +++++++++++++++++++++++++++++++++++
 arch/x86/kernel/traps.c   |  4 ++++
 2 files changed, 39 insertions(+)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 957dfb693ecc..1796c42e08af 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -900,6 +900,41 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
  */
 #define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss_rw) + (TSS_ist + ((x) - 1) * 8)
 
+/**
+ * idtentry - Generate an IDT entry stub
+ * @sym: Name of the generated entry point
+ * @do_sym: C function to be called
+ * @has_error_code: True if this IDT vector has an error code on the stack
+ * @paranoid: non-zero means that this vector may be invoked from kernel
+ *            mode with user GSBASE and/or user CR3.  2 is special -- see below.
+ * @shift_ist: Set to an IST index if entries from kernel mode should
+ *             decrement the IST stack so that nested entries get a fresh
+ *             stack.  (This is for #DB, which has a nasty habit of
+ *             recursing.)
+ *
+ * idtentry generates an IDT stub that sets up a usable kernel context,
+ * creates struct pt_regs, and calls @do_sym.  The stub has the following
+ * special behaviors:
+ *
+ * On an entry from user mode, the stub switches from the trampoline or
+ * IST stack to the normal thread stack.  On an exit to user mode, the
+ * normal exit-to-usermode path is invoked.
+ *
+ * On an exit to kernel mode, if paranoid == 0, we check for preemption,
+ * whereas we omit the preemption check if @paranoid != 0.  This is purely
+ * because the implementation is simpler this way.  The kernel only needs
+ * to check for asynchronous kernel preemption when IRQ handlers return.
+ *
+ * If @paranoid == 0, then the stub will handle IRET faults by pretending
+ * that the fault came from user mode.  It will handle gs_change faults by
+ * pretending that the fault happened with kernel GSBASE.  Since this handling
+ * is omitted for @paranoid != 0, the #GP, #SS, and #NP stubs must have
+ * @paranoid == 0.  This special handling will do the wrong thing for
+ * espfix-induced #DF on IRET, so #DF must not use @paranoid == 0.
+ *
+ * @paranoid == 2 is special: the stub will never switch stacks.  This is for
+ * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
+ */
 .macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
 ENTRY(\sym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index e6db475164ed..1a90821c0b74 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -383,6 +383,10 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 		 * we won't enable interupts or schedule before we invoke
 		 * general_protection, so nothing will clobber the stack
 		 * frame we just set up.
+		 *
+		 * We will enter general_protection with kernel GSBASE,
+		 * which is what the stub expects, given that the faulting
+		 * RIP will be the IRET instruction.
 		 */
 		regs->ip = (unsigned long)general_protection;
 		regs->sp = (unsigned long)&gpregs->orig_ax;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 2/3] x86/entry/64: Use the TSS sp2 slot for SYSCALL/SYSRET scratch space
  2018-09-03 22:59 [PATCH v2 0/3] x86/pti: Get rid of entry trampolines and add some docs Andy Lutomirski
  2018-09-03 22:59 ` [PATCH v2 1/3] x86/entry/64: Document idtentry Andy Lutomirski
@ 2018-09-03 22:59 ` Andy Lutomirski
  2018-09-07  8:00   ` Borislav Petkov
  2018-09-08  9:34   ` [tip:x86/pti] " tip-bot for Andy Lutomirski
  2018-09-03 22:59 ` [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline Andy Lutomirski
  2018-09-04  3:43 ` [PATCH v2 0/3] x86/pti: Get rid of entry trampolines and add some docs Linus Torvalds
  3 siblings, 2 replies; 24+ messages in thread
From: Andy Lutomirski @ 2018-09-03 22:59 UTC (permalink / raw)
  To: x86
  Cc: Borislav Petkov, LKML, Dave Hansen, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Linus Torvalds,
	Josh Poimboeuf, Joerg Roedel, Jiri Olsa, Andi Kleen,
	Peter Zijlstra, Andy Lutomirski

In the non-trampoline SYSCALL64 path, we use a percpu variable to
temporarily store the user RSP value.  Instead of a separate
variable, use the otherwise unused sp2 slot in the TSS.  This will
improve cache locality, as the sp1 slot is already used in the same
code to find the kernel stack.  It will also simplify a future
change to make the non-trampoline path work in PTI mode.
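
As a side note, the TSS_sp2 constant used from assembly comes out of the
asm-offsets machinery.  The stand-alone sketch below only illustrates the
underlying offsetof() idea; struct hw_tss_sketch is a made-up stand-in
(its offsets will not match the real, packed x86_hw_tss layout), and the
real kernel emits these constants via asm markers in asm-offsets.c rather
than by printing them.

#include <stddef.h>
#include <stdio.h>

struct hw_tss_sketch {			/* stand-in for struct x86_hw_tss */
	unsigned int		reserved;
	unsigned long long	sp0, sp1, sp2;
};

int main(void)
{
	/* Emit assembler-visible names for the stack slot offsets. */
	printf("#define TSS_sp1 %zu\n", offsetof(struct hw_tss_sketch, sp1));
	printf("#define TSS_sp2 %zu\n", offsetof(struct hw_tss_sketch, sp2));
	return 0;
}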

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64.S        | 16 +++++++++-------
 arch/x86/include/asm/processor.h |  6 ++++++
 arch/x86/kernel/asm-offsets.c    |  3 ++-
 arch/x86/kernel/process_64.c     |  2 --
 arch/x86/xen/xen-asm_64.S        |  8 +++++---
 5 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1796c42e08af..17ae40602539 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -215,18 +215,20 @@ ENTRY(entry_SYSCALL_64)
 	/*
 	 * This path is only taken when PAGE_TABLE_ISOLATION is disabled so it
 	 * is not required to switch CR3.
+	 *
+	 * tss.sp2 is scratch space.
 	 */
-	movq	%rsp, PER_CPU_VAR(rsp_scratch)
+	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/* Construct struct pt_regs on stack */
-	pushq	$__USER_DS			/* pt_regs->ss */
-	pushq	PER_CPU_VAR(rsp_scratch)	/* pt_regs->sp */
-	pushq	%r11				/* pt_regs->flags */
-	pushq	$__USER_CS			/* pt_regs->cs */
-	pushq	%rcx				/* pt_regs->ip */
+	pushq	$__USER_DS				/* pt_regs->ss */
+	pushq	PER_CPU_VAR(cpu_tss_rw + TSS_sp2)	/* pt_regs->sp */
+	pushq	%r11					/* pt_regs->flags */
+	pushq	$__USER_CS				/* pt_regs->cs */
+	pushq	%rcx					/* pt_regs->ip */
 GLOBAL(entry_SYSCALL_64_after_hwframe)
-	pushq	%rax				/* pt_regs->orig_ax */
+	pushq	%rax					/* pt_regs->orig_ax */
 
 	PUSH_AND_CLEAR_REGS rax=$-ENOSYS
 
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index c24297268ebc..8433d76bc37b 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -313,7 +313,13 @@ struct x86_hw_tss {
 	 */
 	u64			sp1;
 
+	/*
+	 * Since Linux does not use ring 2, the 'sp2' slot is unused by
+	 * hardware.  entry_SYSCALL_64 uses it as scratch space to stash
+	 * the user RSP value.
+	 */
 	u64			sp2;
+
 	u64			reserved2;
 	u64			ist[7];
 	u32			reserved3;
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 01de31db300d..fc2e90d3429a 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -105,7 +105,8 @@ void common(void) {
 	DEFINE(SIZEOF_entry_stack, sizeof(struct entry_stack));
 	DEFINE(MASK_entry_stack, (~(sizeof(struct entry_stack) - 1)));
 
-	/* Offset for sp0 and sp1 into the tss_struct */
+	/* Offset for fields in tss_struct */
 	OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
 	OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
+	OFFSET(TSS_sp2, tss_struct, x86_tss.sp2);
 }
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index a451bc374b9b..0fa7aa19f09e 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -59,8 +59,6 @@
 #include <asm/unistd_32_ia32.h>
 #endif
 
-__visible DEFINE_PER_CPU(unsigned long, rsp_scratch);
-
 /* Prints also some state that isn't saved in the pt_regs */
 void __show_regs(struct pt_regs *regs, int all)
 {
diff --git a/arch/x86/xen/xen-asm_64.S b/arch/x86/xen/xen-asm_64.S
index 417b339e5c8e..bb1c2da0381d 100644
--- a/arch/x86/xen/xen-asm_64.S
+++ b/arch/x86/xen/xen-asm_64.S
@@ -91,13 +91,15 @@ ENTRY(xen_iret)
 ENTRY(xen_sysret64)
 	/*
 	 * We're already on the usermode stack at this point, but
-	 * still with the kernel gs, so we can easily switch back
+	 * still with the kernel gs, so we can easily switch back.
+	 *
+	 * tss.sp2 is scratch space.
 	 */
-	movq %rsp, PER_CPU_VAR(rsp_scratch)
+	movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
 	movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	pushq $__USER_DS
-	pushq PER_CPU_VAR(rsp_scratch)
+	pushq PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
 	pushq %r11
 	pushq $__USER_CS
 	pushq %rcx
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline
  2018-09-03 22:59 [PATCH v2 0/3] x86/pti: Get rid of entry trampolines and add some docs Andy Lutomirski
  2018-09-03 22:59 ` [PATCH v2 1/3] x86/entry/64: Document idtentry Andy Lutomirski
  2018-09-03 22:59 ` [PATCH v2 2/3] x86/entry/64: Use the TSS sp2 slot for SYSCALL/SYSRET scratch space Andy Lutomirski
@ 2018-09-03 22:59 ` Andy Lutomirski
  2018-09-04  7:04   ` Peter Zijlstra
                     ` (6 more replies)
  2018-09-04  3:43 ` [PATCH v2 0/3] x86/pti: Get rid of entry trampolines and add some docs Linus Torvalds
  3 siblings, 7 replies; 24+ messages in thread
From: Andy Lutomirski @ 2018-09-03 22:59 UTC (permalink / raw)
  To: x86
  Cc: Borislav Petkov, LKML, Dave Hansen, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Linus Torvalds,
	Josh Poimboeuf, Joerg Roedel, Jiri Olsa, Andi Kleen,
	Peter Zijlstra, Andy Lutomirski

The SYSCALL64 trampoline has a couple of nice properties:

 - The usual sequence of SWAPGS followed by two GS-relative accesses to
   set up RSP is somewhat slow because the GS-relative accesses need
   to wait for SWAPGS to finish.  The trampoline approach allows
   RIP-relative accesses to set up RSP, which avoids the stall.

 - The trampoline avoids any percpu access before CR3 is set up,
   which means that no percpu memory needs to be mapped in the user
   page tables.  This prevents using Meltdown to read any percpu memory
   outside the cpu_entry_area and prevents using timing leaks
   to directly locate the percpu areas.

The downsides of using a trampoline may outweigh the upsides, however.
It adds an extra non-contiguous I$ cache line to system calls, and it
forces an indirect jump to transfer control back to the normal kernel
text after CR3 is set up.  The latter is because x86 lacks a 64-bit
direct jump instruction that could jump from the trampoline to the entry
text.  With retpolines enabled, the indirect jump is extremely slow.

This patch changes the code to map the percpu TSS into the user page
tables to allow the non-trampoline SYSCALL64 path to work under PTI.
This does not add a new direct information leak, since the TSS is
readable by Meltdown from the cpu_entry_area alias regardless.  It
does allow a timing attack to locate the percpu area, but KASLR is
more or less a lost cause against local attack on CPUs vulnerable to
Meltdown regardless.  As far as I'm concerned, on current hardware,
KASLR is only useful to mitigate remote attacks that try to attack
the kernel without first gaining RCE against a vulnerable user
process.

On Skylake, with CONFIG_RETPOLINE=y and KPTI on, this reduces
syscall overhead from ~237ns to ~228ns.
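
Numbers in that ballpark can be reproduced with a trivial user-space loop
along the lines of the sketch below.  This is only an illustration, not
the benchmark actually used here; the choice of getppid(), the clock
source, and the iteration count are arbitrary.

#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	struct timespec a, b;
	const long iters = 10 * 1000 * 1000;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (i = 0; i < iters; i++)
		getppid();		/* no vDSO fast path: real kernel entry */
	clock_gettime(CLOCK_MONOTONIC, &b);

	printf("%.1f ns per syscall\n",
	       ((b.tv_sec - a.tv_sec) * 1e9 +
		(b.tv_nsec - a.tv_nsec)) / iters);
	return 0;
}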

There is a possible alternative approach: we could instead move the
trampoline within 2G of the entry text and make a separate copy for
each CPU.  Then we could use a direct jump to rejoin the normal
entry path.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/entry/entry_64.S             | 69 +--------------------------
 arch/x86/include/asm/cpu_entry_area.h |  2 -
 arch/x86/include/asm/sections.h       |  1 -
 arch/x86/kernel/asm-offsets.c         |  2 -
 arch/x86/kernel/cpu/common.c          | 11 +----
 arch/x86/kernel/kprobes/core.c        | 10 +---
 arch/x86/kernel/vmlinux.lds.S         | 10 ----
 arch/x86/mm/cpu_entry_area.c          | 36 --------------
 arch/x86/mm/pti.c                     | 33 ++++++++++++-
 9 files changed, 36 insertions(+), 138 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 17ae40602539..d2f87ca119a5 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -142,67 +142,6 @@ END(native_usergs_sysret64)
  * with them due to bugs in both AMD and Intel CPUs.
  */
 
-	.pushsection .entry_trampoline, "ax"
-
-/*
- * The code in here gets remapped into cpu_entry_area's trampoline.  This means
- * that the assembler and linker have the wrong idea as to where this code
- * lives (and, in fact, it's mapped more than once, so it's not even at a
- * fixed address).  So we can't reference any symbols outside the entry
- * trampoline and expect it to work.
- *
- * Instead, we carefully abuse %rip-relative addressing.
- * _entry_trampoline(%rip) refers to the start of the remapped) entry
- * trampoline.  We can thus find cpu_entry_area with this macro:
- */
-
-#define CPU_ENTRY_AREA \
-	_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
-
-/* The top word of the SYSENTER stack is hot and is usable as scratch space. */
-#define RSP_SCRATCH	CPU_ENTRY_AREA_entry_stack + \
-			SIZEOF_entry_stack - 8 + CPU_ENTRY_AREA
-
-ENTRY(entry_SYSCALL_64_trampoline)
-	UNWIND_HINT_EMPTY
-	swapgs
-
-	/* Stash the user RSP. */
-	movq	%rsp, RSP_SCRATCH
-
-	/* Note: using %rsp as a scratch reg. */
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
-
-	/* Load the top of the task stack into RSP */
-	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
-
-	/* Start building the simulated IRET frame. */
-	pushq	$__USER_DS			/* pt_regs->ss */
-	pushq	RSP_SCRATCH			/* pt_regs->sp */
-	pushq	%r11				/* pt_regs->flags */
-	pushq	$__USER_CS			/* pt_regs->cs */
-	pushq	%rcx				/* pt_regs->ip */
-
-	/*
-	 * x86 lacks a near absolute jump, and we can't jump to the real
-	 * entry text with a relative jump.  We could push the target
-	 * address and then use retq, but this destroys the pipeline on
-	 * many CPUs (wasting over 20 cycles on Sandy Bridge).  Instead,
-	 * spill RDI and restore it in a second-stage trampoline.
-	 */
-	pushq	%rdi
-	movq	$entry_SYSCALL_64_stage2, %rdi
-	JMP_NOSPEC %rdi
-END(entry_SYSCALL_64_trampoline)
-
-	.popsection
-
-ENTRY(entry_SYSCALL_64_stage2)
-	UNWIND_HINT_EMPTY
-	popq	%rdi
-	jmp	entry_SYSCALL_64_after_hwframe
-END(entry_SYSCALL_64_stage2)
-
 ENTRY(entry_SYSCALL_64)
 	UNWIND_HINT_EMPTY
 	/*
@@ -212,13 +151,9 @@ ENTRY(entry_SYSCALL_64)
 	 */
 
 	swapgs
-	/*
-	 * This path is only taken when PAGE_TABLE_ISOLATION is disabled so it
-	 * is not required to switch CR3.
-	 *
-	 * tss.sp2 is scratch space.
-	 */
+	/* tss.sp2 is scratch space. */
 	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/* Construct struct pt_regs on stack */
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 4a7884b8dca5..29c706415443 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -30,8 +30,6 @@ struct cpu_entry_area {
 	 */
 	struct tss_struct tss;
 
-	char entry_trampoline[PAGE_SIZE];
-
 #ifdef CONFIG_X86_64
 	/*
 	 * Exception stacks used for IST entries.
diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
index 4a911a382ade..8ea1cfdbeabc 100644
--- a/arch/x86/include/asm/sections.h
+++ b/arch/x86/include/asm/sections.h
@@ -11,7 +11,6 @@ extern char __end_rodata_aligned[];
 
 #if defined(CONFIG_X86_64)
 extern char __end_rodata_hpage_align[];
-extern char __entry_trampoline_start[], __entry_trampoline_end[];
 #endif
 
 #endif	/* _ASM_X86_SECTIONS_H */
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index fc2e90d3429a..083c01309027 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -99,8 +99,6 @@ void common(void) {
 	OFFSET(TLB_STATE_user_pcid_flush_mask, tlb_state, user_pcid_flush_mask);
 
 	/* Layout info for cpu_entry_area */
-	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
-	OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
 	OFFSET(CPU_ENTRY_AREA_entry_stack, cpu_entry_area, entry_stack_page);
 	DEFINE(SIZEOF_entry_stack, sizeof(struct entry_stack));
 	DEFINE(MASK_entry_stack, (~(sizeof(struct entry_stack) - 1)));
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 84dee5ab745a..83068258c856 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1530,19 +1530,10 @@ EXPORT_PER_CPU_SYMBOL(__preempt_count);
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
-	extern char _entry_trampoline[];
-	extern char entry_SYSCALL_64_trampoline[];
-
 	int cpu = smp_processor_id();
-	unsigned long SYSCALL64_entry_trampoline =
-		(unsigned long)get_cpu_entry_area(cpu)->entry_trampoline +
-		(entry_SYSCALL_64_trampoline - _entry_trampoline);
 
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
-	if (static_cpu_has(X86_FEATURE_PTI))
-		wrmsrl(MSR_LSTAR, SYSCALL64_entry_trampoline);
-	else
-		wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
+	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
 
 #ifdef CONFIG_IA32_EMULATION
 	wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index b0d1e81c96bb..f802cf5b4478 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -1066,18 +1066,10 @@ NOKPROBE_SYMBOL(kprobe_exceptions_notify);
 
 bool arch_within_kprobe_blacklist(unsigned long addr)
 {
-	bool is_in_entry_trampoline_section = false;
-
-#ifdef CONFIG_X86_64
-	is_in_entry_trampoline_section =
-		(addr >= (unsigned long)__entry_trampoline_start &&
-		 addr < (unsigned long)__entry_trampoline_end);
-#endif
 	return  (addr >= (unsigned long)__kprobes_text_start &&
 		 addr < (unsigned long)__kprobes_text_end) ||
 		(addr >= (unsigned long)__entry_text_start &&
-		 addr < (unsigned long)__entry_text_end) ||
-		is_in_entry_trampoline_section;
+		 addr < (unsigned long)__entry_text_end);
 }
 
 int __init arch_init_kprobes(void)
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 8bde0a419f86..9c77d2df9c27 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -118,16 +118,6 @@ SECTIONS
 		*(.fixup)
 		*(.gnu.warning)
 
-#ifdef CONFIG_X86_64
-		. = ALIGN(PAGE_SIZE);
-		__entry_trampoline_start = .;
-		_entry_trampoline = .;
-		*(.entry_trampoline)
-		. = ALIGN(PAGE_SIZE);
-		__entry_trampoline_end = .;
-		ASSERT(. - _entry_trampoline == PAGE_SIZE, "entry trampoline is too big");
-#endif
-
 #ifdef CONFIG_RETPOLINE
 		__indirect_thunk_start = .;
 		*(.text.__x86.indirect_thunk)
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index 076ebdce9bd4..12d7e7fb4efd 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -15,7 +15,6 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(struct entry_stack_page, entry_stack_storage)
 #ifdef CONFIG_X86_64
 static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
 	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
-static DEFINE_PER_CPU(struct kcore_list, kcore_entry_trampoline);
 #endif
 
 struct cpu_entry_area *get_cpu_entry_area(int cpu)
@@ -83,8 +82,6 @@ static void percpu_setup_debug_store(int cpu)
 static void __init setup_cpu_entry_area(int cpu)
 {
 #ifdef CONFIG_X86_64
-	extern char _entry_trampoline[];
-
 	/* On 64-bit systems, we use a read-only fixmap GDT and TSS. */
 	pgprot_t gdt_prot = PAGE_KERNEL_RO;
 	pgprot_t tss_prot = PAGE_KERNEL_RO;
@@ -146,43 +143,10 @@ static void __init setup_cpu_entry_area(int cpu)
 	cea_map_percpu_pages(&get_cpu_entry_area(cpu)->exception_stacks,
 			     &per_cpu(exception_stacks, cpu),
 			     sizeof(exception_stacks) / PAGE_SIZE, PAGE_KERNEL);
-
-	cea_set_pte(&get_cpu_entry_area(cpu)->entry_trampoline,
-		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
-	/*
-	 * The cpu_entry_area alias addresses are not in the kernel binary
-	 * so they do not show up in /proc/kcore normally.  This adds entries
-	 * for them manually.
-	 */
-	kclist_add_remap(&per_cpu(kcore_entry_trampoline, cpu),
-			 _entry_trampoline,
-			 &get_cpu_entry_area(cpu)->entry_trampoline, PAGE_SIZE);
 #endif
 	percpu_setup_debug_store(cpu);
 }
 
-#ifdef CONFIG_X86_64
-int arch_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
-		     char *name)
-{
-	unsigned int cpu, ncpu = 0;
-
-	if (symnum >= num_possible_cpus())
-		return -EINVAL;
-
-	for_each_possible_cpu(cpu) {
-		if (ncpu++ >= symnum)
-			break;
-	}
-
-	*value = (unsigned long)&get_cpu_entry_area(cpu)->entry_trampoline;
-	*type = 't';
-	strlcpy(name, "__entry_SYSCALL_64_trampoline", KSYM_NAME_LEN);
-
-	return 0;
-}
-#endif
-
 static __init void setup_cpu_entry_area_ptes(void)
 {
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 31341ae7309f..7e79154846c8 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -434,11 +434,42 @@ static void __init pti_clone_p4d(unsigned long addr)
 }
 
 /*
- * Clone the CPU_ENTRY_AREA into the user space visible page table.
+ * Clone the CPU_ENTRY_AREA and associated data into the user space visible
+ * page table.
  */
 static void __init pti_clone_user_shared(void)
 {
+	unsigned cpu;
+
 	pti_clone_p4d(CPU_ENTRY_AREA_BASE);
+
+	for_each_possible_cpu(cpu) {
+		/*
+		 * The SYSCALL64 entry code needs to be able to find the
+		 * thread stack and needs one word of scratch space in which
+		 * to spill a register.  All of this lives in the TSS, in
+		 * the sp1 and sp2 slots.
+		 *
+		 * This is done for all possible CPUs during boot to ensure
+		 * that it's propagated to all mms.  If we were to add one of
+		 * these mappings during CPU hotplug, we would need to take
+		 * some measure to make sure that every mm that subsequently
+		 * ran on that CPU would have the relevant PGD entry in its
+		 * pagetables.  The usual vmalloc_fault() mechanism would not
+		 * work for page faults taken in entry_SYSCALL_64 before RSP
+		 * is set up.
+		 */
+
+		unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu);
+		phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
+		pte_t *target_pte;
+
+		target_pte = pti_user_pagetable_walk_pte(va);
+		if (WARN_ON(!target_pte))
+			return;
+
+		*target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL);
+	}
 }
 
 #else /* CONFIG_X86_64 */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 0/3] x86/pti: Get rid of entry trampolines and add some docs
  2018-09-03 22:59 [PATCH v2 0/3] x86/pti: Get rid of entry trampolines and add some docs Andy Lutomirski
                   ` (2 preceding siblings ...)
  2018-09-03 22:59 ` [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline Andy Lutomirski
@ 2018-09-04  3:43 ` Linus Torvalds
  3 siblings, 0 replies; 24+ messages in thread
From: Linus Torvalds @ 2018-09-04  3:43 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: the arch/x86 maintainers, Borislav Petkov,
	Linux Kernel Mailing List, Dave Hansen, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Josh Poimboeuf,
	Joerg Roedel, Jiri Olsa, Andi Kleen, Peter Zijlstra

On Mon, Sep 3, 2018 at 3:59 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> Changes from v1: Get rid of the rsp_scratch macro

I can't see anything wrong with this.

            Linus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline
  2018-09-03 22:59 ` [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline Andy Lutomirski
@ 2018-09-04  7:04   ` Peter Zijlstra
  2018-09-05 21:31     ` Andy Lutomirski
  2018-09-07  9:35   ` Borislav Petkov
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2018-09-04  7:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Borislav Petkov, LKML, Dave Hansen, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Linus Torvalds,
	Josh Poimboeuf, Joerg Roedel, Jiri Olsa, Andi Kleen

On Mon, Sep 03, 2018 at 03:59:44PM -0700, Andy Lutomirski wrote:
> The SYSCALL64 trampoline has a couple of nice properties:
> 
...
> 
> There is a possible alternative approach: we could instead move the
> trampoline within 2G of the entry text and make a separate copy for
> each CPU.  Then we could use a direct jump to rejoin the normal
> entry path.

Can we have a few words on why this solution and not this alternative? I
mean, you raise the possibility, but then surely you chose not to
implement that. Might as well share that with us.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline
  2018-09-04  7:04   ` Peter Zijlstra
@ 2018-09-05 21:31     ` Andy Lutomirski
  2018-09-07 12:36       ` Peter Zijlstra
  2018-09-07 19:54       ` Thomas Gleixner
  0 siblings, 2 replies; 24+ messages in thread
From: Andy Lutomirski @ 2018-09-05 21:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, LKML, Dave Hansen,
	Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Linus Torvalds, Josh Poimboeuf, Joerg Roedel, Jiri Olsa,
	Andi Kleen

On Tue, Sep 4, 2018 at 12:04 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Sep 03, 2018 at 03:59:44PM -0700, Andy Lutomirski wrote:
>> The SYSCALL64 trampoline has a couple of nice properties:
>>
...
>>
>> There is a possible alternative approach: we could instead move the
>> trampoline within 2G of the entry text and make a separate copy for
>> each CPU.  Then we could use a direct jump to rejoin the normal
>> entry path.
>
> Can we have a few words on why this solution and not this alternative? I
> mean, you raise the possibility, but then surely you chose not to
> implement that. Might as well share that with us.

I can give some pros and cons.  With the other approach:

 - We avoid a pipeline stall.
 - We execute from an extra page and read from another extra page
during the syscall.  (The latter is because we need to use a relative
addressing mode to find sp1 -- it's the same *cacheline* we'd use
anyway, but we're accessing it using an alias, so it's an extra TLB
entry.)
 - We use more memory.  This would be one page per CPU for a simple
implementation and 64-ish bytes per CPU or one page per node for a
more complex implementation.
 - More code complexity.

I'm not convinced this is a good tradeoff.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 1/3] x86/entry/64: Document idtentry
  2018-09-03 22:59 ` [PATCH v2 1/3] x86/entry/64: Document idtentry Andy Lutomirski
@ 2018-09-06  9:50   ` Borislav Petkov
  2018-09-08  9:33   ` [tip:x86/pti] " tip-bot for Andy Lutomirski
  1 sibling, 0 replies; 24+ messages in thread
From: Borislav Petkov @ 2018-09-06  9:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, LKML, Dave Hansen, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Linus Torvalds, Josh Poimboeuf,
	Joerg Roedel, Jiri Olsa, Andi Kleen, Peter Zijlstra

On Mon, Sep 03, 2018 at 03:59:42PM -0700, Andy Lutomirski wrote:
> The idtentry macro is complicated and magical.  Document what it
> does to help future readers and to allow future patches to adjust
> the code and docs at the same time.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/entry/entry_64.S | 35 +++++++++++++++++++++++++++++++++++
>  arch/x86/kernel/traps.c   |  4 ++++
>  2 files changed, 39 insertions(+)
> 
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 957dfb693ecc..1796c42e08af 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -900,6 +900,41 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
>   */
>  #define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss_rw) + (TSS_ist + ((x) - 1) * 8)
>  
> +/**
> + * idtentry - Generate an IDT entry stub
> + * @sym: Name of the generated entry point
> + * @do_sym: C function to be called
> + * @has_error_code: True if this IDT vector has an error code on the stack
> + * @paranoid: non-zero means that this vector may be invoked from kernel
> + *            mode with user GSBASE and/or user CR3.  2 is special -- see below.
> + * @shift_ist: Set to an IST index if entries from kernel mode should
> + *             decrement the IST stack so that nested entries get a fresh
> + *             stack.  (This is for #DB, which has a nasty habit of
> + *             recursing.)
> + *
> + * idtentry generates an IDT stub that sets up a usable kernel context,
> + * creates struct pt_regs, and calls @do_sym.  The stub has the following
> + * special behaviors:
> + *
> + * On an entry from user mode, the stub switches from the trampoline or
> + * IST stack to the normal thread stack.  On an exit to user mode, the
> + * normal exit-to-usermode path is invoked.
> + *
> + * On an exit to kernel mode, if paranoid == 0, we check for preemption,

				   @paranoid

Otherwise, documenting our entry maze is always a good idea!

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 2/3] x86/entry/64: Use the TSS sp2 slot for SYSCALL/SYSRET scratch space
  2018-09-03 22:59 ` [PATCH v2 2/3] x86/entry/64: Use the TSS sp2 slot for SYSCALL/SYSRET scratch space Andy Lutomirski
@ 2018-09-07  8:00   ` Borislav Petkov
  2018-09-08  9:34   ` [tip:x86/pti] " tip-bot for Andy Lutomirski
  1 sibling, 0 replies; 24+ messages in thread
From: Borislav Petkov @ 2018-09-07  8:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, LKML, Dave Hansen, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Linus Torvalds, Josh Poimboeuf,
	Joerg Roedel, Jiri Olsa, Andi Kleen, Peter Zijlstra

On Mon, Sep 03, 2018 at 03:59:43PM -0700, Andy Lutomirski wrote:
> In the non-trampoline SYSCALL64 path, we use a percpu variable to
> temporarily store the user RSP value.  Instead of a separate
> variable, use the otherwise unused sp2 slot in the TSS.  This will
> improve cache locality, as the sp1 slot is already used in the same
> code to find the kernel stack.  It will also simplify a future
> change to make the non-trampoline path work in PTI mode.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/entry/entry_64.S        | 16 +++++++++-------
>  arch/x86/include/asm/processor.h |  6 ++++++
>  arch/x86/kernel/asm-offsets.c    |  3 ++-
>  arch/x86/kernel/process_64.c     |  2 --
>  arch/x86/xen/xen-asm_64.S        |  8 +++++---
>  5 files changed, 22 insertions(+), 13 deletions(-)

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline
  2018-09-03 22:59 ` [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline Andy Lutomirski
  2018-09-04  7:04   ` Peter Zijlstra
@ 2018-09-07  9:35   ` Borislav Petkov
  2018-09-07 16:40   ` Josh Poimboeuf
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Borislav Petkov @ 2018-09-07  9:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, LKML, Dave Hansen, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Linus Torvalds, Josh Poimboeuf,
	Joerg Roedel, Jiri Olsa, Andi Kleen, Peter Zijlstra

On Mon, Sep 03, 2018 at 03:59:44PM -0700, Andy Lutomirski wrote:
> The SYSCALL64 trampoline has a couple of nice properties:
> 
>  - The usual sequence of SWAPGS followed by two GS-relative accesses to
>    set up RSP is somewhat slow because the GS-relative accesses need
>    to wait for SWAPGS to finish.  The trampoline approach allows
>    RIP-relative accesses to set up RSP, which avoids the stall.

...

> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index 31341ae7309f..7e79154846c8 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -434,11 +434,42 @@ static void __init pti_clone_p4d(unsigned long addr)
>  }
>  
>  /*
> - * Clone the CPU_ENTRY_AREA into the user space visible page table.
> + * Clone the CPU_ENTRY_AREA and associated data into the user space visible
> + * page table.
>   */
>  static void __init pti_clone_user_shared(void)
>  {
> +	unsigned cpu;

Make that

	unsigned int cpu;

Otherwise, patches removing complex code are always good!

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline
  2018-09-05 21:31     ` Andy Lutomirski
@ 2018-09-07 12:36       ` Peter Zijlstra
  2018-09-07 19:54       ` Thomas Gleixner
  1 sibling, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2018-09-07 12:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, Borislav Petkov, LKML, Dave Hansen, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Linus Torvalds,
	Josh Poimboeuf, Joerg Roedel, Jiri Olsa, Andi Kleen

On Wed, Sep 05, 2018 at 02:31:28PM -0700, Andy Lutomirski wrote:
> On Tue, Sep 4, 2018 at 12:04 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, Sep 03, 2018 at 03:59:44PM -0700, Andy Lutomirski wrote:

> >> There is a possible alternative approach: we could instead move the
> >> trampoline within 2G of the entry text and make a separate copy for
> >> each CPU.  Then we could use a direct jump to rejoin the normal
> >> entry path.
> >
> > Can we have a few words on why this solution and not this alternative? I
> > mean, you raise the possibility, but then surely you chose not to
> > implement that. Might as well share that with us.
> 
> I can give some pros and cons.  With the other approach:
> 
>  - We avoid a pipeline stall.
>  - We execute from an extra page and read from another extra page
> during the syscall.  (The latter is because we need to use a relative
> addressing mode to find sp1 -- it's the same *cacheline* we'd use
> anyway, but we're accessing it using an alias, so it's an extra TLB
> entry.)
>  - We use more memory.  This would be one page per CPU for a simple
> implementation and 64-ish bytes per CPU or one page per node for a
> more complex implementation.
>  - More code complexity.
> 
> I'm not convinced this is a good tradeoff.

Fair enough, thanks!

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline
  2018-09-03 22:59 ` [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline Andy Lutomirski
  2018-09-04  7:04   ` Peter Zijlstra
  2018-09-07  9:35   ` Borislav Petkov
@ 2018-09-07 16:40   ` Josh Poimboeuf
  2018-09-08  4:35     ` Andy Lutomirski
  2018-09-08  9:35   ` [tip:x86/pti] " tip-bot for Andy Lutomirski
                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Josh Poimboeuf @ 2018-09-07 16:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Borislav Petkov, LKML, Dave Hansen, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Linus Torvalds,
	Joerg Roedel, Jiri Olsa, Andi Kleen, Peter Zijlstra

On Mon, Sep 03, 2018 at 03:59:44PM -0700, Andy Lutomirski wrote:
> The SYSCALL64 trampoline has a couple of nice properties:
> 
...
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

The following commit should also be reverted:

  4d99e4136580 ("perf machine: Workaround missing maps for x86 PTI entry trampolines")

-- 
Josh

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline
  2018-09-05 21:31     ` Andy Lutomirski
  2018-09-07 12:36       ` Peter Zijlstra
@ 2018-09-07 19:54       ` Thomas Gleixner
  2018-09-08  0:04         ` Linus Torvalds
  1 sibling, 1 reply; 24+ messages in thread
From: Thomas Gleixner @ 2018-09-07 19:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, X86 ML, Borislav Petkov, LKML, Dave Hansen,
	Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Linus Torvalds, Josh Poimboeuf, Joerg Roedel, Jiri Olsa,
	Andi Kleen

On Wed, 5 Sep 2018, Andy Lutomirski wrote:
> On Tue, Sep 4, 2018 at 12:04 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > Can we have a few words on why this solution and not this alternative? I
> > mean, you raise the possibility, but then surely you chose not to
> > implement that. Might as well share that with us.
> 
> I can give some pros and cons.  With the other approach:
> 
>  - We avoid a pipeline stall.

Which is good.

>  - We execute from an extra page and read from another extra page
> during the syscall.  (The latter is because we need to use a relative
> addressing mode to find sp1 -- it's the same *cacheline* we'd use
> anyway, but we're accessing it using an alias, so it's an extra TLB
> entry.)

Ok, but is this really an issue with PTI?

>  - We use more memory.  This would be one page per CPU for a simple
> implementation and 64-ish bytes per CPU or one page per node for a
> more complex implementation.

That's the least interesting argument really.

>  - More code complexity.

Ok, but how much complex code is that?

> I'm not convinced this is a good tradeoff.

Well, the real question here is whether this has any advantage vs. the
percpu area exposure?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline
  2018-09-07 19:54       ` Thomas Gleixner
@ 2018-09-08  0:04         ` Linus Torvalds
  2018-09-08  4:32           ` Andy Lutomirski
  2018-09-08  6:33           ` Thomas Gleixner
  0 siblings, 2 replies; 24+ messages in thread
From: Linus Torvalds @ 2018-09-08  0:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Lutomirski, Peter Zijlstra, the arch/x86 maintainers,
	Borislav Petkov, Linux Kernel Mailing List, Dave Hansen,
	Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Joerg Roedel, Jiri Olsa, Andi Kleen

On Fri, Sep 7, 2018 at 12:54 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> >  - We execute from an extra page and read from another extra page
> > during the syscall.  (The latter is because we need to use a relative
> > addressing mode to find sp1 -- it's the same *cacheline* we'd use
> > anyway, but we're accessing it using an alias, so it's an extra TLB
> > entry.)
>
> Ok, but is this really an issue with PTI?

I'd expect it to be *more* of an issue with PTI, since you're already
wasting TLB entries due to the whole "two different page tables".

Sure, address space ID's save you from reloading them all the time,
but don't help with capacity.

But yeah, in the sense of "with PTI, all kernel entries are slow
anyway, so none of this matters" is probably correct in a very real
sense.

That said, the real reason I like Andy's patch series is that I think
it's simpler than the alternatives (including the current setup). No
subtle mappings, no nothing. It removes a lot more lines than it adds,
and half the lines that it *does* add are comments.

Virtual mapping tricks may be cool, but in the end, not having to use
them is better still, I think.

              Linus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline
  2018-09-08  0:04         ` Linus Torvalds
@ 2018-09-08  4:32           ` Andy Lutomirski
  2018-09-08  6:36             ` Thomas Gleixner
  2018-09-08  6:33           ` Thomas Gleixner
  1 sibling, 1 reply; 24+ messages in thread
From: Andy Lutomirski @ 2018-09-08  4:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Andrew Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, Borislav Petkov,
	Linux Kernel Mailing List, Dave Hansen, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Josh Poimboeuf,
	Joerg Roedel, Jiri Olsa, Andi Kleen

On Fri, Sep 7, 2018 at 5:04 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Sep 7, 2018 at 12:54 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> >  - We execute from an extra page and read from another extra page
>> > during the syscall.  (The latter is because we need to use a relative
>> > addressing mode to find sp1 -- it's the same *cacheline* we'd use
>> > anyway, but we're accessing it using an alias, so it's an extra TLB
>> > entry.)
>>
>> Ok, but is this really an issue with PTI?
>
> I'd expect it to be *more* of an issue with PTI, since you're already
> wasting TLB entries due to the whole "two different page tables".
>
> Sure, address space ID's save you from reloading them all the time,
> but don't help with capacity.
>
> But yeah, in the sense of "with PTI, all kernel entries are slow
> anyway, so none of this matters" is probably correct in a very real
> sense.
>
> That said, the real reason I like Andy's patch series is that I think
> it's simpler than the alternatives (including the current setup). No
> subtle mappings, no nothing. It removes a lot more lines than it adds,
> and half the lines that it *does* add are comments.
>
> Virtual mapping tricks may be cool, but in the end, not having to use
> them is better still, I think.
>

If (and this is a *big* if) all the percpu data is within 2GB of the
entry text, then we could avoid this extra TLB reference by accessing
it directly instead of using an alias.

I suppose the summary is that the retpoline-free trampoline variant is
even more complicated than the code I'm removing in this series, and
that it would be at best a teeny tiny win.  Once all the Spectre dust
settles and we get fixed CPUs, we could consider re-optimizing this.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline
  2018-09-07 16:40   ` Josh Poimboeuf
@ 2018-09-08  4:35     ` Andy Lutomirski
  0 siblings, 0 replies; 24+ messages in thread
From: Andy Lutomirski @ 2018-09-08  4:35 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, LKML, Dave Hansen,
	Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Linus Torvalds, Joerg Roedel, Jiri Olsa, Andi Kleen,
	Peter Zijlstra

On Fri, Sep 7, 2018 at 9:40 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Mon, Sep 03, 2018 at 03:59:44PM -0700, Andy Lutomirski wrote:
>> The SYSCALL64 trampoline has a couple of nice properties:
>>
...
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>
> The following commit should also be reverted:
>
>   4d99e4136580 ("perf machine: Workaround missing maps for x86 PTI entry trampolines")

I think we shouldn't, since the perf folks want new perf versions to
be fully functional on older kernels.  My plan was to let the perf
crew do whatever reversions they felt appropriate.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline
  2018-09-08  0:04         ` Linus Torvalds
  2018-09-08  4:32           ` Andy Lutomirski
@ 2018-09-08  6:33           ` Thomas Gleixner
  1 sibling, 0 replies; 24+ messages in thread
From: Thomas Gleixner @ 2018-09-08  6:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Lutomirski, Peter Zijlstra, the arch/x86 maintainers,
	Borislav Petkov, Linux Kernel Mailing List, Dave Hansen,
	Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Joerg Roedel, Jiri Olsa, Andi Kleen

On Fri, 7 Sep 2018, Linus Torvalds wrote:
> On Fri, Sep 7, 2018 at 12:54 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > >  - We execute from an extra page and read from another extra page
> > > during the syscall.  (The latter is because we need to use a relative
> > > addressing mode to find sp1 -- it's the same *cacheline* we'd use
> > > anyway, but we're accessing it using an alias, so it's an extra TLB
> > > entry.)
> >
> > Ok, but is this really an issue with PTI?
> 
> I'd expect it to be *more* of an issue with PTI, since you're already
> wasting TLB entries due to the whole "two different page tables".
> 
> Sure, address space ID's save you from reloading them all the time,
> but don't help with capacity.
> 
> But yeah, in the sense of "with PTI, all kernel entries are slow
> anyway, so none of this matters" is probably correct in a very real
> sense.

Unfortunately yes. As Andy pointed out, the alternative approach would trade
an avoided pipeline stall for an extra TLB entry.

> That said, the real reason I like Andy's patch series is that I think
> it's simpler than the alternatives (including the current setup). No
> subtle mappings, no nothing. It removes a lot more lines than it adds,
> and half the lines that it *does* add are comments.

I agree and in fact I had the series already queued for merging, I just
wanted to get the full picture and add some of that information to the
changelog.

> Virtual mapping tricks may be cool, but in the end, not having to use
> them is better still, I think.

Sure, unless they provide real value. The possibly smaller leak of address
information is probably not worth it.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline
  2018-09-08  4:32           ` Andy Lutomirski
@ 2018-09-08  6:36             ` Thomas Gleixner
  0 siblings, 0 replies; 24+ messages in thread
From: Thomas Gleixner @ 2018-09-08  6:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Peter Zijlstra, the arch/x86 maintainers,
	Borislav Petkov, Linux Kernel Mailing List, Dave Hansen,
	Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Josh Poimboeuf, Joerg Roedel, Jiri Olsa, Andi Kleen

On Fri, 7 Sep 2018, Andy Lutomirski wrote:
> On Fri, Sep 7, 2018 at 5:04 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > Virtual mapping tricks may be cool, but in the end, not having to use
> > them is better still, I think.
> >
> 
> If (and this is a *big* if) all the percpu data is within 2GB of the
> entry text, then we could avoid this extra TLB reference by accessing
> it directly instead of using an alias.
> 
> I suppose the summary is that the retpoline-free trampoline variant is
> even more complicated than the code I'm removing in this series, and
> that it would be at best a teeny tiny win.  Once all the Spectre dust
> settles and we get fixed CPUs, we could consider re-optimizing this.

That's going to be after my retirement ...



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [tip:x86/pti] x86/entry/64: Document idtentry
  2018-09-03 22:59 ` [PATCH v2 1/3] x86/entry/64: Document idtentry Andy Lutomirski
  2018-09-06  9:50   ` Borislav Petkov
@ 2018-09-08  9:33   ` tip-bot for Andy Lutomirski
  1 sibling, 0 replies; 24+ messages in thread
From: tip-bot for Andy Lutomirski @ 2018-09-08  9:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: ak, luto, mingo, linux-kernel, tglx, dave.hansen, torvalds,
	adrian.hunter, jolsa, joro, alexander.shishkin, jpoimboe, bp,
	acme, hpa, bp, peterz

Commit-ID:  bd7b1f7cbf9cb35dab8e1b99145d07afc5b7a132
Gitweb:     https://git.kernel.org/tip/bd7b1f7cbf9cb35dab8e1b99145d07afc5b7a132
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Mon, 3 Sep 2018 15:59:42 -0700
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Sat, 8 Sep 2018 11:20:11 +0200

x86/entry/64: Document idtentry

The idtentry macro is complicated and magical.  Document what it
does to help future readers and to allow future patches to adjust
the code and docs at the same time.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/6e56c3ad94879e41afe345750bc28ccc0e820ea8.1536015544.git.luto@kernel.org

---
 arch/x86/entry/entry_64.S | 36 ++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/traps.c   |  4 ++++
 2 files changed, 40 insertions(+)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 957dfb693ecc..ce6af4460e9c 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -900,6 +900,42 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
  */
 #define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss_rw) + (TSS_ist + ((x) - 1) * 8)
 
+/**
+ * idtentry - Generate an IDT entry stub
+ * @sym:		Name of the generated entry point
+ * @do_sym: 		C function to be called
+ * @has_error_code: 	True if this IDT vector has an error code on the stack
+ * @paranoid: 		non-zero means that this vector may be invoked from
+ *			kernel mode with user GSBASE and/or user CR3.
+ *			2 is special -- see below.
+ * @shift_ist:		Set to an IST index if entries from kernel mode should
+ *             		decrement the IST stack so that nested entries get a
+ *			fresh stack.  (This is for #DB, which has a nasty habit
+ *             		of recursing.)
+ *
+ * idtentry generates an IDT stub that sets up a usable kernel context,
+ * creates struct pt_regs, and calls @do_sym.  The stub has the following
+ * special behaviors:
+ *
+ * On an entry from user mode, the stub switches from the trampoline or
+ * IST stack to the normal thread stack.  On an exit to user mode, the
+ * normal exit-to-usermode path is invoked.
+ *
+ * On an exit to kernel mode, if @paranoid == 0, we check for preemption,
+ * whereas we omit the preemption check if @paranoid != 0.  This is purely
+ * because the implementation is simpler this way.  The kernel only needs
+ * to check for asynchronous kernel preemption when IRQ handlers return.
+ *
+ * If @paranoid == 0, then the stub will handle IRET faults by pretending
+ * that the fault came from user mode.  It will handle gs_change faults by
+ * pretending that the fault happened with kernel GSBASE.  Since this handling
+ * is omitted for @paranoid != 0, the #GP, #SS, and #NP stubs must have
+ * @paranoid == 0.  This special handling will do the wrong thing for
+ * espfix-induced #DF on IRET, so #DF must not use @paranoid == 0.
+ *
+ * @paranoid == 2 is special: the stub will never switch stacks.  This is for
+ * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
+ */
 .macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
 ENTRY(\sym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index e6db475164ed..1a90821c0b74 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -383,6 +383,10 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 		 * we won't enable interupts or schedule before we invoke
 		 * general_protection, so nothing will clobber the stack
 		 * frame we just set up.
+		 *
+		 * We will enter general_protection with kernel GSBASE,
+		 * which is what the stub expects, given that the faulting
+		 * RIP will be the IRET instruction.
 		 */
 		regs->ip = (unsigned long)general_protection;
 		regs->sp = (unsigned long)&gpregs->orig_ax;

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [tip:x86/pti] x86/entry/64: Use the TSS sp2 slot for SYSCALL/SYSRET scratch space
  2018-09-03 22:59 ` [PATCH v2 2/3] x86/entry/64: Use the TSS sp2 slot for SYSCALL/SYSRET scratch space Andy Lutomirski
  2018-09-07  8:00   ` Borislav Petkov
@ 2018-09-08  9:34   ` tip-bot for Andy Lutomirski
  1 sibling, 0 replies; 24+ messages in thread
From: tip-bot for Andy Lutomirski @ 2018-09-08  9:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, joro, torvalds, acme, peterz, dave.hansen, ak,
	jpoimboe, alexander.shishkin, hpa, bp, bp, mingo, adrian.hunter,
	jolsa, tglx, luto

Commit-ID:  98f05b5138f0a9b56022295cc1387e635b25635d
Gitweb:     https://git.kernel.org/tip/98f05b5138f0a9b56022295cc1387e635b25635d
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Mon, 3 Sep 2018 15:59:43 -0700
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Sat, 8 Sep 2018 11:20:11 +0200

x86/entry/64: Use the TSS sp2 slot for SYSCALL/SYSRET scratch space

In the non-trampoline SYSCALL64 path, a percpu variable is used to
temporarily store the user RSP value.

Instead of a separate variable, use the otherwise unused sp2 slot in the
TSS.  This will improve cache locality, as the sp1 slot is already used in
the same code to find the kernel stack.  It will also simplify a future
change to make the non-trampoline path work in PTI mode.
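
As a rough illustration of why sp2 is free, here is a minimal user-space
sketch (a simplified, hypothetical layout, not the kernel's actual
struct x86_hw_tss) of how asm-offsets-style constants such as TSS_sp2
are derived with offsetof():

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified, hypothetical TSS layout -- for illustration only. */
struct hw_tss_sketch {
	uint32_t reserved1;
	uint64_t sp0;		/* hardware: stack used on entry to ring 0 */
	uint64_t sp1;		/* Linux: top of the current thread stack  */
	uint64_t sp2;		/* unused by hardware (Linux never uses ring 2),
				 * so it is free to hold the user RSP scratch */
} __attribute__((packed));

int main(void)
{
	/* asm-offsets.c emits constants like these for assembly code */
	printf("TSS_sp0=%zu TSS_sp1=%zu TSS_sp2=%zu\n",
	       offsetof(struct hw_tss_sketch, sp0),
	       offsetof(struct hw_tss_sketch, sp1),
	       offsetof(struct hw_tss_sketch, sp2));
	return 0;
}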

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/08e769a0023dbad4bac6f34f3631dbaf8ad59f4f.1536015544.git.luto@kernel.org

---
 arch/x86/entry/entry_64.S        | 16 +++++++++-------
 arch/x86/include/asm/processor.h |  6 ++++++
 arch/x86/kernel/asm-offsets.c    |  3 ++-
 arch/x86/kernel/process_64.c     |  2 --
 arch/x86/xen/xen-asm_64.S        |  8 +++++---
 5 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ce6af4460e9c..7e82e553183a 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -215,18 +215,20 @@ ENTRY(entry_SYSCALL_64)
 	/*
 	 * This path is only taken when PAGE_TABLE_ISOLATION is disabled so it
 	 * is not required to switch CR3.
+	 *
+	 * tss.sp2 is scratch space.
 	 */
-	movq	%rsp, PER_CPU_VAR(rsp_scratch)
+	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/* Construct struct pt_regs on stack */
-	pushq	$__USER_DS			/* pt_regs->ss */
-	pushq	PER_CPU_VAR(rsp_scratch)	/* pt_regs->sp */
-	pushq	%r11				/* pt_regs->flags */
-	pushq	$__USER_CS			/* pt_regs->cs */
-	pushq	%rcx				/* pt_regs->ip */
+	pushq	$__USER_DS				/* pt_regs->ss */
+	pushq	PER_CPU_VAR(cpu_tss_rw + TSS_sp2)	/* pt_regs->sp */
+	pushq	%r11					/* pt_regs->flags */
+	pushq	$__USER_CS				/* pt_regs->cs */
+	pushq	%rcx					/* pt_regs->ip */
 GLOBAL(entry_SYSCALL_64_after_hwframe)
-	pushq	%rax				/* pt_regs->orig_ax */
+	pushq	%rax					/* pt_regs->orig_ax */
 
 	PUSH_AND_CLEAR_REGS rax=$-ENOSYS
 
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index d53c54b842da..b2bb1d691efc 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -315,7 +315,13 @@ struct x86_hw_tss {
 	 */
 	u64			sp1;
 
+	/*
+	 * Since Linux does not use ring 2, the 'sp2' slot is unused by
+	 * hardware.  entry_SYSCALL_64 uses it as scratch space to stash
+	 * the user RSP value.
+	 */
 	u64			sp2;
+
 	u64			reserved2;
 	u64			ist[7];
 	u32			reserved3;
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 01de31db300d..fc2e90d3429a 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -105,7 +105,8 @@ void common(void) {
 	DEFINE(SIZEOF_entry_stack, sizeof(struct entry_stack));
 	DEFINE(MASK_entry_stack, (~(sizeof(struct entry_stack) - 1)));
 
-	/* Offset for sp0 and sp1 into the tss_struct */
+	/* Offset for fields in tss_struct */
 	OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
 	OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
+	OFFSET(TSS_sp2, tss_struct, x86_tss.sp2);
 }
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index a451bc374b9b..0fa7aa19f09e 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -59,8 +59,6 @@
 #include <asm/unistd_32_ia32.h>
 #endif
 
-__visible DEFINE_PER_CPU(unsigned long, rsp_scratch);
-
 /* Prints also some state that isn't saved in the pt_regs */
 void __show_regs(struct pt_regs *regs, int all)
 {
diff --git a/arch/x86/xen/xen-asm_64.S b/arch/x86/xen/xen-asm_64.S
index 417b339e5c8e..bb1c2da0381d 100644
--- a/arch/x86/xen/xen-asm_64.S
+++ b/arch/x86/xen/xen-asm_64.S
@@ -91,13 +91,15 @@ ENTRY(xen_iret)
 ENTRY(xen_sysret64)
 	/*
 	 * We're already on the usermode stack at this point, but
-	 * still with the kernel gs, so we can easily switch back
+	 * still with the kernel gs, so we can easily switch back.
+	 *
+	 * tss.sp2 is scratch space.
 	 */
-	movq %rsp, PER_CPU_VAR(rsp_scratch)
+	movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
 	movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	pushq $__USER_DS
-	pushq PER_CPU_VAR(rsp_scratch)
+	pushq PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
 	pushq %r11
 	pushq $__USER_CS
 	pushq %rcx

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [tip:x86/pti] x86/pti/64: Remove the SYSCALL64 entry trampoline
  2018-09-03 22:59 ` [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline Andy Lutomirski
                     ` (2 preceding siblings ...)
  2018-09-07 16:40   ` Josh Poimboeuf
@ 2018-09-08  9:35   ` tip-bot for Andy Lutomirski
  2018-09-08  9:57   ` tip-bot for Andy Lutomirski
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: tip-bot for Andy Lutomirski @ 2018-09-08  9:35 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: joro, tglx, bp, acme, mingo, peterz, jolsa, dave.hansen, bp,
	alexander.shishkin, hpa, linux-kernel, jpoimboe, torvalds, ak,
	adrian.hunter, luto

Commit-ID:  344347941aba1ec906ff50b4cdb8178c906746f2
Gitweb:     https://git.kernel.org/tip/344347941aba1ec906ff50b4cdb8178c906746f2
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Mon, 3 Sep 2018 15:59:44 -0700
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Sat, 8 Sep 2018 11:20:12 +0200

x86/pti/64: Remove the SYSCALL64 entry trampoline

The SYSCALL64 trampoline has a couple of nice properties:

 - The usual sequence of SWAPGS followed by two GS-relative accesses to
   set up RSP is somewhat slow because the GS-relative accesses need
   to wait for SWAPGS to finish.  The trampoline approach allows
   RIP-relative accesses to set up RSP, which avoids the stall.

 - The trampoline avoids any percpu access before CR3 is set up,
   which means that no percpu memory needs to be mapped in the user
   page tables.  This prevents using Meltdown to read any percpu memory
   outside the cpu_entry_area and prevents using timing leaks
   to directly locate the percpu areas.

The downsides of using a trampoline may outweigh the upsides, however.
It adds an extra non-contiguous I$ cache line to system calls, and it
forces an indirect jump to transfer control back to the normal kernel
text after CR3 is set up.  The latter is because x86 lacks a 64-bit
direct jump instruction that could jump from the trampoline to the entry
text.  With retpolines enabled, the indirect jump is extremely slow.

Change the code to map the percpu TSS into the user page tables to allow
the non-trampoline SYSCALL64 path to work under PTI.  This does not add a
new direct information leak, since the TSS is readable by Meltdown from the
cpu_entry_area alias regardless.  It does allow a timing attack to locate
the percpu area, but KASLR is more or less a lost cause against local
attack on CPUs vulnerable to Meltdown regardless.  As far as I'm concerned,
on current hardware, KASLR is only useful to mitigate remote attacks that
try to attack the kernel without first gaining RCE against a vulnerable
user process.

On Skylake, with CONFIG_RETPOLINE=y and KPTI on, this reduces syscall
overhead from ~237ns to ~228ns.
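
For reference, a minimal user-space sketch (not the actual benchmark
behind the numbers above) of how per-syscall overhead can be estimated
by timing a cheap syscall in a tight loop:

#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	const long iters = 10 * 1000 * 1000;
	struct timespec t0, t1;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++)
		syscall(SYS_getpid);	/* cheap syscall: mostly entry/exit cost */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%.1f ns/syscall\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_nsec - t0.tv_nsec)) / iters);
	return 0;
}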

There is a possible alternative approach: Move the trampoline within 2G of
the entry text and make a separate copy for each CPU.  This would allow a
direct jump to rejoin the normal entry path. There are pros and cons to
this approach:

 + It avoids a pipeline stall
 
 - It executes from an extra page and reads from another extra page during
   the syscall. The latter is because it needs to use a relative
   addressing mode to find sp1 -- it's the same *cacheline*, but accessed
   using an alias, so it's an extra TLB entry.

 - Slightly more memory. This would be one page per CPU for a simple
   implementation and 64-ish bytes per CPU or one page per node for a more
   complex implementation.

 - More code complexity.
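
The 2G figure above comes from the signed 32-bit displacement used by
direct jumps and RIP-relative addressing; a hypothetical helper, for
illustration only, sketching that reachability check:

#include <stdbool.h>
#include <stdint.h>

/* A direct JMP or a RIP-relative reference encodes a signed 32-bit
 * displacement, so the target must lie within +/- 2GB of the code. */
bool within_direct_reach(uint64_t code_va, uint64_t target_va)
{
	int64_t disp = (int64_t)(target_va - code_va);

	return disp >= INT32_MIN && disp <= INT32_MAX;
}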

The current approach is chosen for simplicity and because the alternative
does not provide a benefit significant enough to justify the extra
complexity.

[ tglx: Added the alternative discussion to the changelog ]

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/8c7c6e483612c3e4e10ca89495dc160b1aa66878.1536015544.git.luto@kernel.org

---
 arch/x86/entry/entry_64.S             | 69 +----------------------------------
 arch/x86/include/asm/cpu_entry_area.h |  2 -
 arch/x86/include/asm/sections.h       |  1 -
 arch/x86/kernel/asm-offsets.c         |  2 -
 arch/x86/kernel/cpu/common.c          | 11 +-----
 arch/x86/kernel/kprobes/core.c        | 10 +----
 arch/x86/kernel/vmlinux.lds.S         | 10 -----
 arch/x86/mm/cpu_entry_area.c          | 36 ------------------
 arch/x86/mm/pti.c                     | 33 ++++++++++++++++-
 9 files changed, 36 insertions(+), 138 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 7e82e553183a..0d728142467f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -142,67 +142,6 @@ END(native_usergs_sysret64)
  * with them due to bugs in both AMD and Intel CPUs.
  */
 
-	.pushsection .entry_trampoline, "ax"
-
-/*
- * The code in here gets remapped into cpu_entry_area's trampoline.  This means
- * that the assembler and linker have the wrong idea as to where this code
- * lives (and, in fact, it's mapped more than once, so it's not even at a
- * fixed address).  So we can't reference any symbols outside the entry
- * trampoline and expect it to work.
- *
- * Instead, we carefully abuse %rip-relative addressing.
- * _entry_trampoline(%rip) refers to the start of the remapped) entry
- * trampoline.  We can thus find cpu_entry_area with this macro:
- */
-
-#define CPU_ENTRY_AREA \
-	_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
-
-/* The top word of the SYSENTER stack is hot and is usable as scratch space. */
-#define RSP_SCRATCH	CPU_ENTRY_AREA_entry_stack + \
-			SIZEOF_entry_stack - 8 + CPU_ENTRY_AREA
-
-ENTRY(entry_SYSCALL_64_trampoline)
-	UNWIND_HINT_EMPTY
-	swapgs
-
-	/* Stash the user RSP. */
-	movq	%rsp, RSP_SCRATCH
-
-	/* Note: using %rsp as a scratch reg. */
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
-
-	/* Load the top of the task stack into RSP */
-	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
-
-	/* Start building the simulated IRET frame. */
-	pushq	$__USER_DS			/* pt_regs->ss */
-	pushq	RSP_SCRATCH			/* pt_regs->sp */
-	pushq	%r11				/* pt_regs->flags */
-	pushq	$__USER_CS			/* pt_regs->cs */
-	pushq	%rcx				/* pt_regs->ip */
-
-	/*
-	 * x86 lacks a near absolute jump, and we can't jump to the real
-	 * entry text with a relative jump.  We could push the target
-	 * address and then use retq, but this destroys the pipeline on
-	 * many CPUs (wasting over 20 cycles on Sandy Bridge).  Instead,
-	 * spill RDI and restore it in a second-stage trampoline.
-	 */
-	pushq	%rdi
-	movq	$entry_SYSCALL_64_stage2, %rdi
-	JMP_NOSPEC %rdi
-END(entry_SYSCALL_64_trampoline)
-
-	.popsection
-
-ENTRY(entry_SYSCALL_64_stage2)
-	UNWIND_HINT_EMPTY
-	popq	%rdi
-	jmp	entry_SYSCALL_64_after_hwframe
-END(entry_SYSCALL_64_stage2)
-
 ENTRY(entry_SYSCALL_64)
 	UNWIND_HINT_EMPTY
 	/*
@@ -212,13 +151,9 @@ ENTRY(entry_SYSCALL_64)
 	 */
 
 	swapgs
-	/*
-	 * This path is only taken when PAGE_TABLE_ISOLATION is disabled so it
-	 * is not required to switch CR3.
-	 *
-	 * tss.sp2 is scratch space.
-	 */
+	/* tss.sp2 is scratch space. */
 	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/* Construct struct pt_regs on stack */
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 4a7884b8dca5..29c706415443 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -30,8 +30,6 @@ struct cpu_entry_area {
 	 */
 	struct tss_struct tss;
 
-	char entry_trampoline[PAGE_SIZE];
-
 #ifdef CONFIG_X86_64
 	/*
 	 * Exception stacks used for IST entries.
diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
index 4a911a382ade..8ea1cfdbeabc 100644
--- a/arch/x86/include/asm/sections.h
+++ b/arch/x86/include/asm/sections.h
@@ -11,7 +11,6 @@ extern char __end_rodata_aligned[];
 
 #if defined(CONFIG_X86_64)
 extern char __end_rodata_hpage_align[];
-extern char __entry_trampoline_start[], __entry_trampoline_end[];
 #endif
 
 #endif	/* _ASM_X86_SECTIONS_H */
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index fc2e90d3429a..083c01309027 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -99,8 +99,6 @@ void common(void) {
 	OFFSET(TLB_STATE_user_pcid_flush_mask, tlb_state, user_pcid_flush_mask);
 
 	/* Layout info for cpu_entry_area */
-	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
-	OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
 	OFFSET(CPU_ENTRY_AREA_entry_stack, cpu_entry_area, entry_stack_page);
 	DEFINE(SIZEOF_entry_stack, sizeof(struct entry_stack));
 	DEFINE(MASK_entry_stack, (~(sizeof(struct entry_stack) - 1)));
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 44c4ef3d989b..a3c939656579 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1531,19 +1531,10 @@ EXPORT_PER_CPU_SYMBOL(__preempt_count);
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
-	extern char _entry_trampoline[];
-	extern char entry_SYSCALL_64_trampoline[];
-
 	int cpu = smp_processor_id();
-	unsigned long SYSCALL64_entry_trampoline =
-		(unsigned long)get_cpu_entry_area(cpu)->entry_trampoline +
-		(entry_SYSCALL_64_trampoline - _entry_trampoline);
 
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
-	if (static_cpu_has(X86_FEATURE_PTI))
-		wrmsrl(MSR_LSTAR, SYSCALL64_entry_trampoline);
-	else
-		wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
+	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
 
 #ifdef CONFIG_IA32_EMULATION
 	wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index b0d1e81c96bb..f802cf5b4478 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -1066,18 +1066,10 @@ NOKPROBE_SYMBOL(kprobe_exceptions_notify);
 
 bool arch_within_kprobe_blacklist(unsigned long addr)
 {
-	bool is_in_entry_trampoline_section = false;
-
-#ifdef CONFIG_X86_64
-	is_in_entry_trampoline_section =
-		(addr >= (unsigned long)__entry_trampoline_start &&
-		 addr < (unsigned long)__entry_trampoline_end);
-#endif
 	return  (addr >= (unsigned long)__kprobes_text_start &&
 		 addr < (unsigned long)__kprobes_text_end) ||
 		(addr >= (unsigned long)__entry_text_start &&
-		 addr < (unsigned long)__entry_text_end) ||
-		is_in_entry_trampoline_section;
+		 addr < (unsigned long)__entry_text_end);
 }
 
 int __init arch_init_kprobes(void)
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 8bde0a419f86..9c77d2df9c27 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -118,16 +118,6 @@ SECTIONS
 		*(.fixup)
 		*(.gnu.warning)
 
-#ifdef CONFIG_X86_64
-		. = ALIGN(PAGE_SIZE);
-		__entry_trampoline_start = .;
-		_entry_trampoline = .;
-		*(.entry_trampoline)
-		. = ALIGN(PAGE_SIZE);
-		__entry_trampoline_end = .;
-		ASSERT(. - _entry_trampoline == PAGE_SIZE, "entry trampoline is too big");
-#endif
-
 #ifdef CONFIG_RETPOLINE
 		__indirect_thunk_start = .;
 		*(.text.__x86.indirect_thunk)
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index 076ebdce9bd4..12d7e7fb4efd 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -15,7 +15,6 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(struct entry_stack_page, entry_stack_storage)
 #ifdef CONFIG_X86_64
 static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
 	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
-static DEFINE_PER_CPU(struct kcore_list, kcore_entry_trampoline);
 #endif
 
 struct cpu_entry_area *get_cpu_entry_area(int cpu)
@@ -83,8 +82,6 @@ static void percpu_setup_debug_store(int cpu)
 static void __init setup_cpu_entry_area(int cpu)
 {
 #ifdef CONFIG_X86_64
-	extern char _entry_trampoline[];
-
 	/* On 64-bit systems, we use a read-only fixmap GDT and TSS. */
 	pgprot_t gdt_prot = PAGE_KERNEL_RO;
 	pgprot_t tss_prot = PAGE_KERNEL_RO;
@@ -146,43 +143,10 @@ static void __init setup_cpu_entry_area(int cpu)
 	cea_map_percpu_pages(&get_cpu_entry_area(cpu)->exception_stacks,
 			     &per_cpu(exception_stacks, cpu),
 			     sizeof(exception_stacks) / PAGE_SIZE, PAGE_KERNEL);
-
-	cea_set_pte(&get_cpu_entry_area(cpu)->entry_trampoline,
-		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
-	/*
-	 * The cpu_entry_area alias addresses are not in the kernel binary
-	 * so they do not show up in /proc/kcore normally.  This adds entries
-	 * for them manually.
-	 */
-	kclist_add_remap(&per_cpu(kcore_entry_trampoline, cpu),
-			 _entry_trampoline,
-			 &get_cpu_entry_area(cpu)->entry_trampoline, PAGE_SIZE);
 #endif
 	percpu_setup_debug_store(cpu);
 }
 
-#ifdef CONFIG_X86_64
-int arch_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
-		     char *name)
-{
-	unsigned int cpu, ncpu = 0;
-
-	if (symnum >= num_possible_cpus())
-		return -EINVAL;
-
-	for_each_possible_cpu(cpu) {
-		if (ncpu++ >= symnum)
-			break;
-	}
-
-	*value = (unsigned long)&get_cpu_entry_area(cpu)->entry_trampoline;
-	*type = 't';
-	strlcpy(name, "__entry_SYSCALL_64_trampoline", KSYM_NAME_LEN);
-
-	return 0;
-}
-#endif
-
 static __init void setup_cpu_entry_area_ptes(void)
 {
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index c1fc1ae6b429..6bcf651c8bd6 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -434,11 +434,42 @@ static void __init pti_clone_p4d(unsigned long addr)
 }
 
 /*
- * Clone the CPU_ENTRY_AREA into the user space visible page table.
+ * Clone the CPU_ENTRY_AREA and associated data into the user space visible
+ * page table.
  */
 static void __init pti_clone_user_shared(void)
 {
+	unsigned cpu;
+
 	pti_clone_p4d(CPU_ENTRY_AREA_BASE);
+
+	for_each_possible_cpu(cpu) {
+		/*
+		 * The SYSCALL64 entry code needs to be able to find the
+		 * thread stack and needs one word of scratch space in which
+		 * to spill a register.  All of this lives in the TSS, in
+		 * the sp1 and sp2 slots.
+		 *
+		 * This is done for all possible CPUs during boot to ensure
+		 * that it's propagated to all mms.  If we were to add one of
+		 * these mappings during CPU hotplug, we would need to take
+		 * some measure to make sure that every mm that subsequently
+		 * ran on that CPU would have the relevant PGD entry in its
+		 * pagetables.  The usual vmalloc_fault() mechanism would not
+		 * work for page faults taken in entry_SYSCALL_64 before RSP
+		 * is set up.
+		 */
+
+		unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu);
+		phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
+		pte_t *target_pte;
+
+		target_pte = pti_user_pagetable_walk_pte(va);
+		if (WARN_ON(!target_pte))
+			return;
+
+		*target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL);
+	}
 }
 
 #else /* CONFIG_X86_64 */

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [tip:x86/pti] x86/pti/64: Remove the SYSCALL64 entry trampoline
  2018-09-03 22:59 ` [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline Andy Lutomirski
                     ` (3 preceding siblings ...)
  2018-09-08  9:35   ` [tip:x86/pti] " tip-bot for Andy Lutomirski
@ 2018-09-08  9:57   ` tip-bot for Andy Lutomirski
  2018-09-12 19:33   ` tip-bot for Andy Lutomirski
  2018-09-12 19:36   ` tip-bot for Andy Lutomirski
  6 siblings, 0 replies; 24+ messages in thread
From: tip-bot for Andy Lutomirski @ 2018-09-08  9:57 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: luto, linux-kernel, dave.hansen, torvalds, alexander.shishkin,
	jpoimboe, acme, adrian.hunter, bp, hpa, jolsa, tglx, joro, ak,
	bp, peterz, mingo

Commit-ID:  86635715ee4228ded59f662dab36e9732b9c978f
Gitweb:     https://git.kernel.org/tip/86635715ee4228ded59f662dab36e9732b9c978f
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Mon, 3 Sep 2018 15:59:44 -0700
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Sat, 8 Sep 2018 11:53:16 +0200

x86/pti/64: Remove the SYSCALL64 entry trampoline

The SYSCALL64 trampoline has a couple of nice properties:

 - The usual sequence of SWAPGS followed by two GS-relative accesses to
   set up RSP is somewhat slow because the GS-relative accesses need
   to wait for SWAPGS to finish.  The trampoline approach allows
   RIP-relative accesses to set up RSP, which avoids the stall.

 - The trampoline avoids any percpu access before CR3 is set up,
   which means that no percpu memory needs to be mapped in the user
   page tables.  This prevents using Meltdown to read any percpu memory
   outside the cpu_entry_area and prevents using timing leaks
   to directly locate the percpu areas.

The downsides of using a trampoline may outweigh the upsides, however.
It adds an extra non-contiguous I$ cache line to system calls, and it
forces an indirect jump to transfer control back to the normal kernel
text after CR3 is set up.  The latter is because x86 lacks a 64-bit
direct jump instruction that could jump from the trampoline to the entry
text.  With retpolines enabled, the indirect jump is extremely slow.

Change the code to map the percpu TSS into the user page tables to allow
the non-trampoline SYSCALL64 path to work under PTI.  This does not add a
new direct information leak, since the TSS is readable by Meltdown from the
cpu_entry_area alias regardless.  It does allow a timing attack to locate
the percpu area, but KASLR is more or less a lost cause against local
attack on CPUs vulnerable to Meltdown regardless.  As far as I'm concerned,
on current hardware, KASLR is only useful to mitigate remote attacks that
try to attack the kernel without first gaining RCE against a vulnerable
user process.

On Skylake, with CONFIG_RETPOLINE=y and KPTI on, this reduces syscall
overhead from ~237ns to ~228ns.

There is a possible alternative approach: Move the trampoline within 2G of
the entry text and make a separate copy for each CPU.  This would allow a
direct jump to rejoin the normal entry path. There are pros and cons to
this approach:

 + It avoids a pipeline stall

 - It executes from an extra page and reads from another extra page during
   the syscall. The latter is because it needs to use a relative
   addressing mode to find sp1 -- it's the same *cacheline*, but accessed
   using an alias, so it's an extra TLB entry.

 - Slightly more memory. This would be one page per CPU for a simple
   implementation and 64-ish bytes per CPU or one page per node for a more
   complex implementation.

 - More code complexity.

The current approach is chosen for simplicity and because the alternative
does not provide a benefit significant enough to justify the extra
complexity.

[ tglx: Added the alternative discussion to the changelog ]

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/8c7c6e483612c3e4e10ca89495dc160b1aa66878.1536015544.git.luto@kernel.org
---
 arch/x86/entry/entry_64.S             | 69 +----------------------------------
 arch/x86/include/asm/cpu_entry_area.h |  2 -
 arch/x86/include/asm/sections.h       |  1 -
 arch/x86/kernel/asm-offsets.c         |  2 -
 arch/x86/kernel/cpu/common.c          | 11 +-----
 arch/x86/kernel/kprobes/core.c        | 10 +----
 arch/x86/kernel/vmlinux.lds.S         | 10 -----
 arch/x86/mm/cpu_entry_area.c          | 36 ------------------
 arch/x86/mm/pti.c                     | 33 ++++++++++++++++-
 9 files changed, 36 insertions(+), 138 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 7e82e553183a..0d728142467f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -142,67 +142,6 @@ END(native_usergs_sysret64)
  * with them due to bugs in both AMD and Intel CPUs.
  */
 
-	.pushsection .entry_trampoline, "ax"
-
-/*
- * The code in here gets remapped into cpu_entry_area's trampoline.  This means
- * that the assembler and linker have the wrong idea as to where this code
- * lives (and, in fact, it's mapped more than once, so it's not even at a
- * fixed address).  So we can't reference any symbols outside the entry
- * trampoline and expect it to work.
- *
- * Instead, we carefully abuse %rip-relative addressing.
- * _entry_trampoline(%rip) refers to the start of the remapped) entry
- * trampoline.  We can thus find cpu_entry_area with this macro:
- */
-
-#define CPU_ENTRY_AREA \
-	_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
-
-/* The top word of the SYSENTER stack is hot and is usable as scratch space. */
-#define RSP_SCRATCH	CPU_ENTRY_AREA_entry_stack + \
-			SIZEOF_entry_stack - 8 + CPU_ENTRY_AREA
-
-ENTRY(entry_SYSCALL_64_trampoline)
-	UNWIND_HINT_EMPTY
-	swapgs
-
-	/* Stash the user RSP. */
-	movq	%rsp, RSP_SCRATCH
-
-	/* Note: using %rsp as a scratch reg. */
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
-
-	/* Load the top of the task stack into RSP */
-	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
-
-	/* Start building the simulated IRET frame. */
-	pushq	$__USER_DS			/* pt_regs->ss */
-	pushq	RSP_SCRATCH			/* pt_regs->sp */
-	pushq	%r11				/* pt_regs->flags */
-	pushq	$__USER_CS			/* pt_regs->cs */
-	pushq	%rcx				/* pt_regs->ip */
-
-	/*
-	 * x86 lacks a near absolute jump, and we can't jump to the real
-	 * entry text with a relative jump.  We could push the target
-	 * address and then use retq, but this destroys the pipeline on
-	 * many CPUs (wasting over 20 cycles on Sandy Bridge).  Instead,
-	 * spill RDI and restore it in a second-stage trampoline.
-	 */
-	pushq	%rdi
-	movq	$entry_SYSCALL_64_stage2, %rdi
-	JMP_NOSPEC %rdi
-END(entry_SYSCALL_64_trampoline)
-
-	.popsection
-
-ENTRY(entry_SYSCALL_64_stage2)
-	UNWIND_HINT_EMPTY
-	popq	%rdi
-	jmp	entry_SYSCALL_64_after_hwframe
-END(entry_SYSCALL_64_stage2)
-
 ENTRY(entry_SYSCALL_64)
 	UNWIND_HINT_EMPTY
 	/*
@@ -212,13 +151,9 @@ ENTRY(entry_SYSCALL_64)
 	 */
 
 	swapgs
-	/*
-	 * This path is only taken when PAGE_TABLE_ISOLATION is disabled so it
-	 * is not required to switch CR3.
-	 *
-	 * tss.sp2 is scratch space.
-	 */
+	/* tss.sp2 is scratch space. */
 	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/* Construct struct pt_regs on stack */
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 4a7884b8dca5..29c706415443 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -30,8 +30,6 @@ struct cpu_entry_area {
 	 */
 	struct tss_struct tss;
 
-	char entry_trampoline[PAGE_SIZE];
-
 #ifdef CONFIG_X86_64
 	/*
 	 * Exception stacks used for IST entries.
diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
index 4a911a382ade..8ea1cfdbeabc 100644
--- a/arch/x86/include/asm/sections.h
+++ b/arch/x86/include/asm/sections.h
@@ -11,7 +11,6 @@ extern char __end_rodata_aligned[];
 
 #if defined(CONFIG_X86_64)
 extern char __end_rodata_hpage_align[];
-extern char __entry_trampoline_start[], __entry_trampoline_end[];
 #endif
 
 #endif	/* _ASM_X86_SECTIONS_H */
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index fc2e90d3429a..083c01309027 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -99,8 +99,6 @@ void common(void) {
 	OFFSET(TLB_STATE_user_pcid_flush_mask, tlb_state, user_pcid_flush_mask);
 
 	/* Layout info for cpu_entry_area */
-	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
-	OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
 	OFFSET(CPU_ENTRY_AREA_entry_stack, cpu_entry_area, entry_stack_page);
 	DEFINE(SIZEOF_entry_stack, sizeof(struct entry_stack));
 	DEFINE(MASK_entry_stack, (~(sizeof(struct entry_stack) - 1)));
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 44c4ef3d989b..a3c939656579 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1531,19 +1531,10 @@ EXPORT_PER_CPU_SYMBOL(__preempt_count);
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
-	extern char _entry_trampoline[];
-	extern char entry_SYSCALL_64_trampoline[];
-
 	int cpu = smp_processor_id();
-	unsigned long SYSCALL64_entry_trampoline =
-		(unsigned long)get_cpu_entry_area(cpu)->entry_trampoline +
-		(entry_SYSCALL_64_trampoline - _entry_trampoline);
 
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
-	if (static_cpu_has(X86_FEATURE_PTI))
-		wrmsrl(MSR_LSTAR, SYSCALL64_entry_trampoline);
-	else
-		wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
+	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
 
 #ifdef CONFIG_IA32_EMULATION
 	wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index b0d1e81c96bb..f802cf5b4478 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -1066,18 +1066,10 @@ NOKPROBE_SYMBOL(kprobe_exceptions_notify);
 
 bool arch_within_kprobe_blacklist(unsigned long addr)
 {
-	bool is_in_entry_trampoline_section = false;
-
-#ifdef CONFIG_X86_64
-	is_in_entry_trampoline_section =
-		(addr >= (unsigned long)__entry_trampoline_start &&
-		 addr < (unsigned long)__entry_trampoline_end);
-#endif
 	return  (addr >= (unsigned long)__kprobes_text_start &&
 		 addr < (unsigned long)__kprobes_text_end) ||
 		(addr >= (unsigned long)__entry_text_start &&
-		 addr < (unsigned long)__entry_text_end) ||
-		is_in_entry_trampoline_section;
+		 addr < (unsigned long)__entry_text_end);
 }
 
 int __init arch_init_kprobes(void)
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 8bde0a419f86..9c77d2df9c27 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -118,16 +118,6 @@ SECTIONS
 		*(.fixup)
 		*(.gnu.warning)
 
-#ifdef CONFIG_X86_64
-		. = ALIGN(PAGE_SIZE);
-		__entry_trampoline_start = .;
-		_entry_trampoline = .;
-		*(.entry_trampoline)
-		. = ALIGN(PAGE_SIZE);
-		__entry_trampoline_end = .;
-		ASSERT(. - _entry_trampoline == PAGE_SIZE, "entry trampoline is too big");
-#endif
-
 #ifdef CONFIG_RETPOLINE
 		__indirect_thunk_start = .;
 		*(.text.__x86.indirect_thunk)
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index 076ebdce9bd4..12d7e7fb4efd 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -15,7 +15,6 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(struct entry_stack_page, entry_stack_storage)
 #ifdef CONFIG_X86_64
 static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
 	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
-static DEFINE_PER_CPU(struct kcore_list, kcore_entry_trampoline);
 #endif
 
 struct cpu_entry_area *get_cpu_entry_area(int cpu)
@@ -83,8 +82,6 @@ static void percpu_setup_debug_store(int cpu)
 static void __init setup_cpu_entry_area(int cpu)
 {
 #ifdef CONFIG_X86_64
-	extern char _entry_trampoline[];
-
 	/* On 64-bit systems, we use a read-only fixmap GDT and TSS. */
 	pgprot_t gdt_prot = PAGE_KERNEL_RO;
 	pgprot_t tss_prot = PAGE_KERNEL_RO;
@@ -146,43 +143,10 @@ static void __init setup_cpu_entry_area(int cpu)
 	cea_map_percpu_pages(&get_cpu_entry_area(cpu)->exception_stacks,
 			     &per_cpu(exception_stacks, cpu),
 			     sizeof(exception_stacks) / PAGE_SIZE, PAGE_KERNEL);
-
-	cea_set_pte(&get_cpu_entry_area(cpu)->entry_trampoline,
-		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
-	/*
-	 * The cpu_entry_area alias addresses are not in the kernel binary
-	 * so they do not show up in /proc/kcore normally.  This adds entries
-	 * for them manually.
-	 */
-	kclist_add_remap(&per_cpu(kcore_entry_trampoline, cpu),
-			 _entry_trampoline,
-			 &get_cpu_entry_area(cpu)->entry_trampoline, PAGE_SIZE);
 #endif
 	percpu_setup_debug_store(cpu);
 }
 
-#ifdef CONFIG_X86_64
-int arch_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
-		     char *name)
-{
-	unsigned int cpu, ncpu = 0;
-
-	if (symnum >= num_possible_cpus())
-		return -EINVAL;
-
-	for_each_possible_cpu(cpu) {
-		if (ncpu++ >= symnum)
-			break;
-	}
-
-	*value = (unsigned long)&get_cpu_entry_area(cpu)->entry_trampoline;
-	*type = 't';
-	strlcpy(name, "__entry_SYSCALL_64_trampoline", KSYM_NAME_LEN);
-
-	return 0;
-}
-#endif
-
 static __init void setup_cpu_entry_area_ptes(void)
 {
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index c1fc1ae6b429..4fee5c3003ed 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -434,11 +434,42 @@ static void __init pti_clone_p4d(unsigned long addr)
 }
 
 /*
- * Clone the CPU_ENTRY_AREA into the user space visible page table.
+ * Clone the CPU_ENTRY_AREA and associated data into the user space visible
+ * page table.
  */
 static void __init pti_clone_user_shared(void)
 {
+	unsigned int cpu;
+
 	pti_clone_p4d(CPU_ENTRY_AREA_BASE);
+
+	for_each_possible_cpu(cpu) {
+		/*
+		 * The SYSCALL64 entry code needs to be able to find the
+		 * thread stack and needs one word of scratch space in which
+		 * to spill a register.  All of this lives in the TSS, in
+		 * the sp1 and sp2 slots.
+		 *
+		 * This is done for all possible CPUs during boot to ensure
+		 * that it's propagated to all mms.  If we were to add one of
+		 * these mappings during CPU hotplug, we would need to take
+		 * some measure to make sure that every mm that subsequently
+		 * ran on that CPU would have the relevant PGD entry in its
+		 * pagetables.  The usual vmalloc_fault() mechanism would not
+		 * work for page faults taken in entry_SYSCALL_64 before RSP
+		 * is set up.
+		 */
+
+		unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu);
+		phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
+		pte_t *target_pte;
+
+		target_pte = pti_user_pagetable_walk_pte(va);
+		if (WARN_ON(!target_pte))
+			return;
+
+		*target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL);
+	}
 }
 
 #else /* CONFIG_X86_64 */

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [tip:x86/pti] x86/pti/64: Remove the SYSCALL64 entry trampoline
  2018-09-03 22:59 ` [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline Andy Lutomirski
                     ` (4 preceding siblings ...)
  2018-09-08  9:57   ` tip-bot for Andy Lutomirski
@ 2018-09-12 19:33   ` tip-bot for Andy Lutomirski
  2018-09-12 19:36   ` tip-bot for Andy Lutomirski
  6 siblings, 0 replies; 24+ messages in thread
From: tip-bot for Andy Lutomirski @ 2018-09-12 19:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: bp, jpoimboe, ak, linux-kernel, joro, acme, jolsa, peterz,
	dave.hansen, tglx, hpa, mingo, adrian.hunter, alexander.shishkin,
	luto, bp, torvalds

Commit-ID:  e536a56190d412b0f98dbd4dde608f9f7081bb6d
Gitweb:     https://git.kernel.org/tip/e536a56190d412b0f98dbd4dde608f9f7081bb6d
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Mon, 3 Sep 2018 15:59:44 -0700
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 12 Sep 2018 21:29:09 +0200

x86/pti/64: Remove the SYSCALL64 entry trampoline

The SYSCALL64 trampoline has a couple of nice properties:

 - The usual sequence of SWAPGS followed by two GS-relative accesses to
   set up RSP is somewhat slow because the GS-relative accesses need
   to wait for SWAPGS to finish.  The trampoline approach allows
   RIP-relative accesses to set up RSP, which avoids the stall.

 - The trampoline avoids any percpu access before CR3 is set up,
   which means that no percpu memory needs to be mapped in the user
   page tables.  This prevents using Meltdown to read any percpu memory
   outside the cpu_entry_area and prevents using timing leaks
   to directly locate the percpu areas.

The downsides of using a trampoline may outweigh the upsides, however.
It adds an extra non-contiguous I$ cache line to system calls, and it
forces an indirect jump to transfer control back to the normal kernel
text after CR3 is set up.  The latter is because x86 lacks a 64-bit
direct jump instruction that could jump from the trampoline to the entry
text.  With retpolines enabled, the indirect jump is extremely slow.

Change the code to map the percpu TSS into the user page tables to allow
the non-trampoline SYSCALL64 path to work under PTI.  This does not add a
new direct information leak, since the TSS is readable by Meltdown from the
cpu_entry_area alias regardless.  It does allow a timing attack to locate
the percpu area, but KASLR is more or less a lost cause against local
attack on CPUs vulnerable to Meltdown regardless.  As far as I'm concerned,
on current hardware, KASLR is only useful to mitigate remote attacks that
try to attack the kernel without first gaining RCE against a vulnerable
user process.

On Skylake, with CONFIG_RETPOLINE=y and KPTI on, this reduces syscall
overhead from ~237ns to ~228ns.

There is a possible alternative approach: Move the trampoline within 2G of
the entry text and make a separate copy for each CPU.  This would allow a
direct jump to rejoin the normal entry path. There are pros and cons to
this approach:

 + It avoids a pipeline stall

 - It executes from an extra page and reads from another extra page during
   the syscall. The latter is because it needs to use a relative
   addressing mode to find sp1 -- it's the same *cacheline*, but accessed
   using an alias, so it's an extra TLB entry.

 - Slightly more memory. This would be one page per CPU for a simple
   implementation and 64-ish bytes per CPU or one page per node for a more
   complex implementation.

 - More code complexity.

The current approach is chosen for simplicity and because the alternative
does not provide a benefit significant enough to justify the extra
complexity.

[ tglx: Added the alternative discussion to the changelog ]

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/8c7c6e483612c3e4e10ca89495dc160b1aa66878.1536015544.git.luto@kernel.org
---
 arch/x86/entry/entry_64.S             | 69 +----------------------------------
 arch/x86/include/asm/cpu_entry_area.h |  2 -
 arch/x86/include/asm/sections.h       |  1 -
 arch/x86/kernel/asm-offsets.c         |  2 -
 arch/x86/kernel/cpu/common.c          | 13 +------
 arch/x86/kernel/kprobes/core.c        | 10 +----
 arch/x86/kernel/vmlinux.lds.S         | 10 -----
 arch/x86/mm/cpu_entry_area.c          | 36 ------------------
 arch/x86/mm/pti.c                     | 33 ++++++++++++++++-
 9 files changed, 36 insertions(+), 140 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 7e82e553183a..0d728142467f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -142,67 +142,6 @@ END(native_usergs_sysret64)
  * with them due to bugs in both AMD and Intel CPUs.
  */
 
-	.pushsection .entry_trampoline, "ax"
-
-/*
- * The code in here gets remapped into cpu_entry_area's trampoline.  This means
- * that the assembler and linker have the wrong idea as to where this code
- * lives (and, in fact, it's mapped more than once, so it's not even at a
- * fixed address).  So we can't reference any symbols outside the entry
- * trampoline and expect it to work.
- *
- * Instead, we carefully abuse %rip-relative addressing.
- * _entry_trampoline(%rip) refers to the start of the remapped) entry
- * trampoline.  We can thus find cpu_entry_area with this macro:
- */
-
-#define CPU_ENTRY_AREA \
-	_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
-
-/* The top word of the SYSENTER stack is hot and is usable as scratch space. */
-#define RSP_SCRATCH	CPU_ENTRY_AREA_entry_stack + \
-			SIZEOF_entry_stack - 8 + CPU_ENTRY_AREA
-
-ENTRY(entry_SYSCALL_64_trampoline)
-	UNWIND_HINT_EMPTY
-	swapgs
-
-	/* Stash the user RSP. */
-	movq	%rsp, RSP_SCRATCH
-
-	/* Note: using %rsp as a scratch reg. */
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
-
-	/* Load the top of the task stack into RSP */
-	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
-
-	/* Start building the simulated IRET frame. */
-	pushq	$__USER_DS			/* pt_regs->ss */
-	pushq	RSP_SCRATCH			/* pt_regs->sp */
-	pushq	%r11				/* pt_regs->flags */
-	pushq	$__USER_CS			/* pt_regs->cs */
-	pushq	%rcx				/* pt_regs->ip */
-
-	/*
-	 * x86 lacks a near absolute jump, and we can't jump to the real
-	 * entry text with a relative jump.  We could push the target
-	 * address and then use retq, but this destroys the pipeline on
-	 * many CPUs (wasting over 20 cycles on Sandy Bridge).  Instead,
-	 * spill RDI and restore it in a second-stage trampoline.
-	 */
-	pushq	%rdi
-	movq	$entry_SYSCALL_64_stage2, %rdi
-	JMP_NOSPEC %rdi
-END(entry_SYSCALL_64_trampoline)
-
-	.popsection
-
-ENTRY(entry_SYSCALL_64_stage2)
-	UNWIND_HINT_EMPTY
-	popq	%rdi
-	jmp	entry_SYSCALL_64_after_hwframe
-END(entry_SYSCALL_64_stage2)
-
 ENTRY(entry_SYSCALL_64)
 	UNWIND_HINT_EMPTY
 	/*
@@ -212,13 +151,9 @@ ENTRY(entry_SYSCALL_64)
 	 */
 
 	swapgs
-	/*
-	 * This path is only taken when PAGE_TABLE_ISOLATION is disabled so it
-	 * is not required to switch CR3.
-	 *
-	 * tss.sp2 is scratch space.
-	 */
+	/* tss.sp2 is scratch space. */
 	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/* Construct struct pt_regs on stack */
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 4a7884b8dca5..29c706415443 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -30,8 +30,6 @@ struct cpu_entry_area {
 	 */
 	struct tss_struct tss;
 
-	char entry_trampoline[PAGE_SIZE];
-
 #ifdef CONFIG_X86_64
 	/*
 	 * Exception stacks used for IST entries.
diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
index 4a911a382ade..8ea1cfdbeabc 100644
--- a/arch/x86/include/asm/sections.h
+++ b/arch/x86/include/asm/sections.h
@@ -11,7 +11,6 @@ extern char __end_rodata_aligned[];
 
 #if defined(CONFIG_X86_64)
 extern char __end_rodata_hpage_align[];
-extern char __entry_trampoline_start[], __entry_trampoline_end[];
 #endif
 
 #endif	/* _ASM_X86_SECTIONS_H */
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index fc2e90d3429a..083c01309027 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -99,8 +99,6 @@ void common(void) {
 	OFFSET(TLB_STATE_user_pcid_flush_mask, tlb_state, user_pcid_flush_mask);
 
 	/* Layout info for cpu_entry_area */
-	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
-	OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
 	OFFSET(CPU_ENTRY_AREA_entry_stack, cpu_entry_area, entry_stack_page);
 	DEFINE(SIZEOF_entry_stack, sizeof(struct entry_stack));
 	DEFINE(MASK_entry_stack, (~(sizeof(struct entry_stack) - 1)));
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 44c4ef3d989b..155082ae8b3f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1531,19 +1531,8 @@ EXPORT_PER_CPU_SYMBOL(__preempt_count);
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
-	extern char _entry_trampoline[];
-	extern char entry_SYSCALL_64_trampoline[];
-
-	int cpu = smp_processor_id();
-	unsigned long SYSCALL64_entry_trampoline =
-		(unsigned long)get_cpu_entry_area(cpu)->entry_trampoline +
-		(entry_SYSCALL_64_trampoline - _entry_trampoline);
-
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
-	if (static_cpu_has(X86_FEATURE_PTI))
-		wrmsrl(MSR_LSTAR, SYSCALL64_entry_trampoline);
-	else
-		wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
+	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
 
 #ifdef CONFIG_IA32_EMULATION
 	wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index b0d1e81c96bb..f802cf5b4478 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -1066,18 +1066,10 @@ NOKPROBE_SYMBOL(kprobe_exceptions_notify);
 
 bool arch_within_kprobe_blacklist(unsigned long addr)
 {
-	bool is_in_entry_trampoline_section = false;
-
-#ifdef CONFIG_X86_64
-	is_in_entry_trampoline_section =
-		(addr >= (unsigned long)__entry_trampoline_start &&
-		 addr < (unsigned long)__entry_trampoline_end);
-#endif
 	return  (addr >= (unsigned long)__kprobes_text_start &&
 		 addr < (unsigned long)__kprobes_text_end) ||
 		(addr >= (unsigned long)__entry_text_start &&
-		 addr < (unsigned long)__entry_text_end) ||
-		is_in_entry_trampoline_section;
+		 addr < (unsigned long)__entry_text_end);
 }
 
 int __init arch_init_kprobes(void)
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 8bde0a419f86..9c77d2df9c27 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -118,16 +118,6 @@ SECTIONS
 		*(.fixup)
 		*(.gnu.warning)
 
-#ifdef CONFIG_X86_64
-		. = ALIGN(PAGE_SIZE);
-		__entry_trampoline_start = .;
-		_entry_trampoline = .;
-		*(.entry_trampoline)
-		. = ALIGN(PAGE_SIZE);
-		__entry_trampoline_end = .;
-		ASSERT(. - _entry_trampoline == PAGE_SIZE, "entry trampoline is too big");
-#endif
-
 #ifdef CONFIG_RETPOLINE
 		__indirect_thunk_start = .;
 		*(.text.__x86.indirect_thunk)
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index 076ebdce9bd4..12d7e7fb4efd 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -15,7 +15,6 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(struct entry_stack_page, entry_stack_storage)
 #ifdef CONFIG_X86_64
 static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
 	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
-static DEFINE_PER_CPU(struct kcore_list, kcore_entry_trampoline);
 #endif
 
 struct cpu_entry_area *get_cpu_entry_area(int cpu)
@@ -83,8 +82,6 @@ static void percpu_setup_debug_store(int cpu)
 static void __init setup_cpu_entry_area(int cpu)
 {
 #ifdef CONFIG_X86_64
-	extern char _entry_trampoline[];
-
 	/* On 64-bit systems, we use a read-only fixmap GDT and TSS. */
 	pgprot_t gdt_prot = PAGE_KERNEL_RO;
 	pgprot_t tss_prot = PAGE_KERNEL_RO;
@@ -146,43 +143,10 @@ static void __init setup_cpu_entry_area(int cpu)
 	cea_map_percpu_pages(&get_cpu_entry_area(cpu)->exception_stacks,
 			     &per_cpu(exception_stacks, cpu),
 			     sizeof(exception_stacks) / PAGE_SIZE, PAGE_KERNEL);
-
-	cea_set_pte(&get_cpu_entry_area(cpu)->entry_trampoline,
-		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
-	/*
-	 * The cpu_entry_area alias addresses are not in the kernel binary
-	 * so they do not show up in /proc/kcore normally.  This adds entries
-	 * for them manually.
-	 */
-	kclist_add_remap(&per_cpu(kcore_entry_trampoline, cpu),
-			 _entry_trampoline,
-			 &get_cpu_entry_area(cpu)->entry_trampoline, PAGE_SIZE);
 #endif
 	percpu_setup_debug_store(cpu);
 }
 
-#ifdef CONFIG_X86_64
-int arch_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
-		     char *name)
-{
-	unsigned int cpu, ncpu = 0;
-
-	if (symnum >= num_possible_cpus())
-		return -EINVAL;
-
-	for_each_possible_cpu(cpu) {
-		if (ncpu++ >= symnum)
-			break;
-	}
-
-	*value = (unsigned long)&get_cpu_entry_area(cpu)->entry_trampoline;
-	*type = 't';
-	strlcpy(name, "__entry_SYSCALL_64_trampoline", KSYM_NAME_LEN);
-
-	return 0;
-}
-#endif
-
 static __init void setup_cpu_entry_area_ptes(void)
 {
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index c1fc1ae6b429..4fee5c3003ed 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -434,11 +434,42 @@ static void __init pti_clone_p4d(unsigned long addr)
 }
 
 /*
- * Clone the CPU_ENTRY_AREA into the user space visible page table.
+ * Clone the CPU_ENTRY_AREA and associated data into the user space visible
+ * page table.
  */
 static void __init pti_clone_user_shared(void)
 {
+	unsigned int cpu;
+
 	pti_clone_p4d(CPU_ENTRY_AREA_BASE);
+
+	for_each_possible_cpu(cpu) {
+		/*
+		 * The SYSCALL64 entry code needs to be able to find the
+		 * thread stack and needs one word of scratch space in which
+		 * to spill a register.  All of this lives in the TSS, in
+		 * the sp1 and sp2 slots.
+		 *
+		 * This is done for all possible CPUs during boot to ensure
+		 * that it's propagated to all mms.  If we were to add one of
+		 * these mappings during CPU hotplug, we would need to take
+		 * some measure to make sure that every mm that subsequently
+		 * ran on that CPU would have the relevant PGD entry in its
+		 * pagetables.  The usual vmalloc_fault() mechanism would not
+		 * work for page faults taken in entry_SYSCALL_64 before RSP
+		 * is set up.
+		 */
+
+		unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu);
+		phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
+		pte_t *target_pte;
+
+		target_pte = pti_user_pagetable_walk_pte(va);
+		if (WARN_ON(!target_pte))
+			return;
+
+		*target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL);
+	}
 }
 
 #else /* CONFIG_X86_64 */

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [tip:x86/pti] x86/pti/64: Remove the SYSCALL64 entry trampoline
  2018-09-03 22:59 ` [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline Andy Lutomirski
                     ` (5 preceding siblings ...)
  2018-09-12 19:33   ` tip-bot for Andy Lutomirski
@ 2018-09-12 19:36   ` tip-bot for Andy Lutomirski
  6 siblings, 0 replies; 24+ messages in thread
From: tip-bot for Andy Lutomirski @ 2018-09-12 19:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, alexander.shishkin, luto, joro, mingo, ak, dave.hansen,
	linux-kernel, adrian.hunter, torvalds, hpa, bp, jpoimboe, acme,
	jolsa, bp, peterz

Commit-ID:  bf904d2762ee6fc1e4acfcb0772bbfb4a27ad8a6
Gitweb:     https://git.kernel.org/tip/bf904d2762ee6fc1e4acfcb0772bbfb4a27ad8a6
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Mon, 3 Sep 2018 15:59:44 -0700
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 12 Sep 2018 21:33:53 +0200

x86/pti/64: Remove the SYSCALL64 entry trampoline

The SYSCALL64 trampoline has a couple of nice properties:

 - The usual sequence of SWAPGS followed by two GS-relative accesses to
   set up RSP is somewhat slow because the GS-relative accesses need
   to wait for SWAPGS to finish.  The trampoline approach allows
   RIP-relative accesses to set up RSP, which avoids the stall.

 - The trampoline avoids any percpu access before CR3 is set up,
   which means that no percpu memory needs to be mapped in the user
   page tables.  This prevents using Meltdown to read any percpu memory
   outside the cpu_entry_area and prevents using timing leaks
   to directly locate the percpu areas.

The downsides of using a trampoline may outweigh the upsides, however.
It adds an extra non-contiguous I$ cache line to system calls, and it
forces an indirect jump to transfer control back to the normal kernel
text after CR3 is set up.  The latter is because x86 lacks a 64-bit
direct jump instruction that could jump from the trampoline to the entry
text.  With retpolines enabled, the indirect jump is extremely slow.

Change the code to map the percpu TSS into the user page tables to allow
the non-trampoline SYSCALL64 path to work under PTI.  This does not add a
new direct information leak, since the TSS is readable by Meltdown from the
cpu_entry_area alias regardless.  It does allow a timing attack to locate
the percpu area, but KASLR is more or less a lost cause against local
attacks on CPUs vulnerable to Meltdown anyway.  As far as I'm concerned,
on current hardware, KASLR is only useful to mitigate remote attacks that
try to attack the kernel without first gaining RCE against a vulnerable
user process.
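
To make the shape of the change easier to follow before reading the diffs,
here is a condensed sketch of the resulting pti_clone_user_shared().  It is
lifted from the arch/x86/mm/pti.c hunk below with the long comment trimmed,
and it introduces no names beyond the ones already used there:

static void __init pti_clone_user_shared(void)
{
	unsigned int cpu;

	/* Clone the cpu_entry_area into the user page table, as before. */
	pti_clone_p4d(CPU_ENTRY_AREA_BASE);

	for_each_possible_cpu(cpu) {
		/*
		 * Map each CPU's cpu_tss_rw into the user page tables: the
		 * SYSCALL64 entry code needs the sp1/sp2 slots (thread stack
		 * pointer and one word of scratch space) before it has
		 * switched CR3.
		 */
		unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu);
		phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
		pte_t *target_pte = pti_user_pagetable_walk_pte(va);

		if (WARN_ON(!target_pte))
			return;

		*target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL);
	}
}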

On Skylake, with CONFIG_RETPOLINE=y and KPTI on, this reduces syscall
overhead from ~237ns to ~228ns.

There is a possible alternative approach: Move the trampoline within 2G of
the entry text and make a separate copy for each CPU.  This would allow a
direct jump to rejoin the normal entry path. There are pros and cons to
this approach:

 + It avoids a pipeline stall

 - It executes from an extra page and reads from another extra page during
   the syscall. The latter is because it needs to use a relative
   addressing mode to find sp1 -- it's the same *cacheline*, but accessed
   using an alias, so it's an extra TLB entry.

 - Slightly more memory. This would be one page per CPU for a simple
   implementation and 64-ish bytes per CPU or one page per node for a more
   complex implementation.

 - More code complexity.

The current approach is chosen for simplicity and because the alternative
does not provide a benefit significant enough to justify the added
complexity.

[ tglx: Added the alternative discussion to the changelog ]

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/8c7c6e483612c3e4e10ca89495dc160b1aa66878.1536015544.git.luto@kernel.org
---
 arch/x86/entry/entry_64.S             | 69 +----------------------------------
 arch/x86/include/asm/cpu_entry_area.h |  2 -
 arch/x86/include/asm/sections.h       |  1 -
 arch/x86/kernel/asm-offsets.c         |  2 -
 arch/x86/kernel/cpu/common.c          | 13 +------
 arch/x86/kernel/kprobes/core.c        | 10 +----
 arch/x86/kernel/vmlinux.lds.S         | 10 -----
 arch/x86/mm/cpu_entry_area.c          | 36 ------------------
 arch/x86/mm/pti.c                     | 33 ++++++++++++++++-
 9 files changed, 37 insertions(+), 139 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 7e82e553183a..0d728142467f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -142,67 +142,6 @@ END(native_usergs_sysret64)
  * with them due to bugs in both AMD and Intel CPUs.
  */
 
-	.pushsection .entry_trampoline, "ax"
-
-/*
- * The code in here gets remapped into cpu_entry_area's trampoline.  This means
- * that the assembler and linker have the wrong idea as to where this code
- * lives (and, in fact, it's mapped more than once, so it's not even at a
- * fixed address).  So we can't reference any symbols outside the entry
- * trampoline and expect it to work.
- *
- * Instead, we carefully abuse %rip-relative addressing.
- * _entry_trampoline(%rip) refers to the start of the remapped) entry
- * trampoline.  We can thus find cpu_entry_area with this macro:
- */
-
-#define CPU_ENTRY_AREA \
-	_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
-
-/* The top word of the SYSENTER stack is hot and is usable as scratch space. */
-#define RSP_SCRATCH	CPU_ENTRY_AREA_entry_stack + \
-			SIZEOF_entry_stack - 8 + CPU_ENTRY_AREA
-
-ENTRY(entry_SYSCALL_64_trampoline)
-	UNWIND_HINT_EMPTY
-	swapgs
-
-	/* Stash the user RSP. */
-	movq	%rsp, RSP_SCRATCH
-
-	/* Note: using %rsp as a scratch reg. */
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
-
-	/* Load the top of the task stack into RSP */
-	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
-
-	/* Start building the simulated IRET frame. */
-	pushq	$__USER_DS			/* pt_regs->ss */
-	pushq	RSP_SCRATCH			/* pt_regs->sp */
-	pushq	%r11				/* pt_regs->flags */
-	pushq	$__USER_CS			/* pt_regs->cs */
-	pushq	%rcx				/* pt_regs->ip */
-
-	/*
-	 * x86 lacks a near absolute jump, and we can't jump to the real
-	 * entry text with a relative jump.  We could push the target
-	 * address and then use retq, but this destroys the pipeline on
-	 * many CPUs (wasting over 20 cycles on Sandy Bridge).  Instead,
-	 * spill RDI and restore it in a second-stage trampoline.
-	 */
-	pushq	%rdi
-	movq	$entry_SYSCALL_64_stage2, %rdi
-	JMP_NOSPEC %rdi
-END(entry_SYSCALL_64_trampoline)
-
-	.popsection
-
-ENTRY(entry_SYSCALL_64_stage2)
-	UNWIND_HINT_EMPTY
-	popq	%rdi
-	jmp	entry_SYSCALL_64_after_hwframe
-END(entry_SYSCALL_64_stage2)
-
 ENTRY(entry_SYSCALL_64)
 	UNWIND_HINT_EMPTY
 	/*
@@ -212,13 +151,9 @@ ENTRY(entry_SYSCALL_64)
 	 */
 
 	swapgs
-	/*
-	 * This path is only taken when PAGE_TABLE_ISOLATION is disabled so it
-	 * is not required to switch CR3.
-	 *
-	 * tss.sp2 is scratch space.
-	 */
+	/* tss.sp2 is scratch space. */
 	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/* Construct struct pt_regs on stack */
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 4a7884b8dca5..29c706415443 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -30,8 +30,6 @@ struct cpu_entry_area {
 	 */
 	struct tss_struct tss;
 
-	char entry_trampoline[PAGE_SIZE];
-
 #ifdef CONFIG_X86_64
 	/*
 	 * Exception stacks used for IST entries.
diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
index 4a911a382ade..8ea1cfdbeabc 100644
--- a/arch/x86/include/asm/sections.h
+++ b/arch/x86/include/asm/sections.h
@@ -11,7 +11,6 @@ extern char __end_rodata_aligned[];
 
 #if defined(CONFIG_X86_64)
 extern char __end_rodata_hpage_align[];
-extern char __entry_trampoline_start[], __entry_trampoline_end[];
 #endif
 
 #endif	/* _ASM_X86_SECTIONS_H */
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index fc2e90d3429a..083c01309027 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -99,8 +99,6 @@ void common(void) {
 	OFFSET(TLB_STATE_user_pcid_flush_mask, tlb_state, user_pcid_flush_mask);
 
 	/* Layout info for cpu_entry_area */
-	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
-	OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
 	OFFSET(CPU_ENTRY_AREA_entry_stack, cpu_entry_area, entry_stack_page);
 	DEFINE(SIZEOF_entry_stack, sizeof(struct entry_stack));
 	DEFINE(MASK_entry_stack, (~(sizeof(struct entry_stack) - 1)));
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 44c4ef3d989b..2d5b1fa5f9c6 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1531,19 +1531,10 @@ EXPORT_PER_CPU_SYMBOL(__preempt_count);
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
-	extern char _entry_trampoline[];
-	extern char entry_SYSCALL_64_trampoline[];
-
-	int cpu = smp_processor_id();
-	unsigned long SYSCALL64_entry_trampoline =
-		(unsigned long)get_cpu_entry_area(cpu)->entry_trampoline +
-		(entry_SYSCALL_64_trampoline - _entry_trampoline);
+	int __maybe_unused cpu = smp_processor_id();
 
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
-	if (static_cpu_has(X86_FEATURE_PTI))
-		wrmsrl(MSR_LSTAR, SYSCALL64_entry_trampoline);
-	else
-		wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
+	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
 
 #ifdef CONFIG_IA32_EMULATION
 	wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index b0d1e81c96bb..f802cf5b4478 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -1066,18 +1066,10 @@ NOKPROBE_SYMBOL(kprobe_exceptions_notify);
 
 bool arch_within_kprobe_blacklist(unsigned long addr)
 {
-	bool is_in_entry_trampoline_section = false;
-
-#ifdef CONFIG_X86_64
-	is_in_entry_trampoline_section =
-		(addr >= (unsigned long)__entry_trampoline_start &&
-		 addr < (unsigned long)__entry_trampoline_end);
-#endif
 	return  (addr >= (unsigned long)__kprobes_text_start &&
 		 addr < (unsigned long)__kprobes_text_end) ||
 		(addr >= (unsigned long)__entry_text_start &&
-		 addr < (unsigned long)__entry_text_end) ||
-		is_in_entry_trampoline_section;
+		 addr < (unsigned long)__entry_text_end);
 }
 
 int __init arch_init_kprobes(void)
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 8bde0a419f86..9c77d2df9c27 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -118,16 +118,6 @@ SECTIONS
 		*(.fixup)
 		*(.gnu.warning)
 
-#ifdef CONFIG_X86_64
-		. = ALIGN(PAGE_SIZE);
-		__entry_trampoline_start = .;
-		_entry_trampoline = .;
-		*(.entry_trampoline)
-		. = ALIGN(PAGE_SIZE);
-		__entry_trampoline_end = .;
-		ASSERT(. - _entry_trampoline == PAGE_SIZE, "entry trampoline is too big");
-#endif
-
 #ifdef CONFIG_RETPOLINE
 		__indirect_thunk_start = .;
 		*(.text.__x86.indirect_thunk)
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index 076ebdce9bd4..12d7e7fb4efd 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -15,7 +15,6 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(struct entry_stack_page, entry_stack_storage)
 #ifdef CONFIG_X86_64
 static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
 	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
-static DEFINE_PER_CPU(struct kcore_list, kcore_entry_trampoline);
 #endif
 
 struct cpu_entry_area *get_cpu_entry_area(int cpu)
@@ -83,8 +82,6 @@ static void percpu_setup_debug_store(int cpu)
 static void __init setup_cpu_entry_area(int cpu)
 {
 #ifdef CONFIG_X86_64
-	extern char _entry_trampoline[];
-
 	/* On 64-bit systems, we use a read-only fixmap GDT and TSS. */
 	pgprot_t gdt_prot = PAGE_KERNEL_RO;
 	pgprot_t tss_prot = PAGE_KERNEL_RO;
@@ -146,43 +143,10 @@ static void __init setup_cpu_entry_area(int cpu)
 	cea_map_percpu_pages(&get_cpu_entry_area(cpu)->exception_stacks,
 			     &per_cpu(exception_stacks, cpu),
 			     sizeof(exception_stacks) / PAGE_SIZE, PAGE_KERNEL);
-
-	cea_set_pte(&get_cpu_entry_area(cpu)->entry_trampoline,
-		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
-	/*
-	 * The cpu_entry_area alias addresses are not in the kernel binary
-	 * so they do not show up in /proc/kcore normally.  This adds entries
-	 * for them manually.
-	 */
-	kclist_add_remap(&per_cpu(kcore_entry_trampoline, cpu),
-			 _entry_trampoline,
-			 &get_cpu_entry_area(cpu)->entry_trampoline, PAGE_SIZE);
 #endif
 	percpu_setup_debug_store(cpu);
 }
 
-#ifdef CONFIG_X86_64
-int arch_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
-		     char *name)
-{
-	unsigned int cpu, ncpu = 0;
-
-	if (symnum >= num_possible_cpus())
-		return -EINVAL;
-
-	for_each_possible_cpu(cpu) {
-		if (ncpu++ >= symnum)
-			break;
-	}
-
-	*value = (unsigned long)&get_cpu_entry_area(cpu)->entry_trampoline;
-	*type = 't';
-	strlcpy(name, "__entry_SYSCALL_64_trampoline", KSYM_NAME_LEN);
-
-	return 0;
-}
-#endif
-
 static __init void setup_cpu_entry_area_ptes(void)
 {
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index c1fc1ae6b429..4fee5c3003ed 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -434,11 +434,42 @@ static void __init pti_clone_p4d(unsigned long addr)
 }
 
 /*
- * Clone the CPU_ENTRY_AREA into the user space visible page table.
+ * Clone the CPU_ENTRY_AREA and associated data into the user space visible
+ * page table.
  */
 static void __init pti_clone_user_shared(void)
 {
+	unsigned int cpu;
+
 	pti_clone_p4d(CPU_ENTRY_AREA_BASE);
+
+	for_each_possible_cpu(cpu) {
+		/*
+		 * The SYSCALL64 entry code needs to be able to find the
+		 * thread stack and needs one word of scratch space in which
+		 * to spill a register.  All of this lives in the TSS, in
+		 * the sp1 and sp2 slots.
+		 *
+		 * This is done for all possible CPUs during boot to ensure
+		 * that it's propagated to all mms.  If we were to add one of
+		 * these mappings during CPU hotplug, we would need to take
+		 * some measure to make sure that every mm that subsequently
+		 * ran on that CPU would have the relevant PGD entry in its
+		 * pagetables.  The usual vmalloc_fault() mechanism would not
+		 * work for page faults taken in entry_SYSCALL_64 before RSP
+		 * is set up.
+		 */
+
+		unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu);
+		phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
+		pte_t *target_pte;
+
+		target_pte = pti_user_pagetable_walk_pte(va);
+		if (WARN_ON(!target_pte))
+			return;
+
+		*target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL);
+	}
 }
 
 #else /* CONFIG_X86_64 */

^ permalink raw reply related	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2018-09-12 19:37 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-03 22:59 [PATCH v2 0/3] x86/pti: Get rid of entry trampolines and add some docs Andy Lutomirski
2018-09-03 22:59 ` [PATCH v2 1/3] x86/entry/64: Document idtentry Andy Lutomirski
2018-09-06  9:50   ` Borislav Petkov
2018-09-08  9:33   ` [tip:x86/pti] " tip-bot for Andy Lutomirski
2018-09-03 22:59 ` [PATCH v2 2/3] x86/entry/64: Use the TSS sp2 slot for SYSCALL/SYSRET scratch space Andy Lutomirski
2018-09-07  8:00   ` Borislav Petkov
2018-09-08  9:34   ` [tip:x86/pti] " tip-bot for Andy Lutomirski
2018-09-03 22:59 ` [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline Andy Lutomirski
2018-09-04  7:04   ` Peter Zijlstra
2018-09-05 21:31     ` Andy Lutomirski
2018-09-07 12:36       ` Peter Zijlstra
2018-09-07 19:54       ` Thomas Gleixner
2018-09-08  0:04         ` Linus Torvalds
2018-09-08  4:32           ` Andy Lutomirski
2018-09-08  6:36             ` Thomas Gleixner
2018-09-08  6:33           ` Thomas Gleixner
2018-09-07  9:35   ` Borislav Petkov
2018-09-07 16:40   ` Josh Poimboeuf
2018-09-08  4:35     ` Andy Lutomirski
2018-09-08  9:35   ` [tip:x86/pti] " tip-bot for Andy Lutomirski
2018-09-08  9:57   ` tip-bot for Andy Lutomirski
2018-09-12 19:33   ` tip-bot for Andy Lutomirski
2018-09-12 19:36   ` tip-bot for Andy Lutomirski
2018-09-04  3:43 ` [PATCH v2 0/3] x86/pti: Get rid of entry trampolines and add some docs Linus Torvalds
