linux-kernel.vger.kernel.org archive mirror
* [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code
@ 2020-11-09 11:22 Alexandre Chartre
  2020-11-09 11:22 ` [RFC][PATCH 01/24] x86/syscall: Add wrapper for invoking syscall function Alexandre Chartre
                   ` (24 more replies)
  0 siblings, 25 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:22 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com


With Page Table Isolation (PTI), syscalls as well as interrupts and
exceptions occurring in userspace enter the kernel with a user
page-table. The kernel entry code will then switch the page-table
from the user page-table to the kernel page-table by updating the
CR3 control register. This CR3 switch is currently done early in
the kernel entry sequence using assembly code.

This RFC proposes to defer the PTI CR3 switch until we reach C code.
The benefit is that this simplifies the assembly entry code and makes
the PTI CR3 switch code easier to understand. This also paves the way
for further possible projects, such as an easier integration of Address
Space Isolation (ASI), or the possibility to execute some selected
syscall or interrupt handlers without switching to the kernel page-table
(and thus avoid the PTI page-table switch overhead).

Deferring the CR3 switch to C code means that more of the kernel entry
code needs to run with the user page-table. To do so, we need to:

 - map more syscall, interrupt and exception entry code into the user
   page-table (map all noinstr code);

 - map additional data used in the entry code (such as stack canary);

 - run more entry code on the trampoline stack (which is mapped both
   in the kernel and in the user page-table) until we switch to the
   kernel page-table and then switch to the kernel stack;

 - have a per-task trampoline stack instead of a per-cpu trampoline
   stack, so the task can be scheduled out before it has switched to
   the kernel stack.

Note that, for now, the CR3 switch can only be pushed as far as
interrupts remain disabled in the entry code. This is because the CR3
switch decision is based on the privilege level of the CS register in
the interrupt frame. I plan to fix this, but it adds some extra
complication (we need to track whether the user page-table is in use
or not).
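
To give an idea of the intended direction, here is a rough C sketch of
a deferred CR3 switch in an entry handler. It is only an illustration:
it reuses helper names introduced later in this series
(switch_to_kernel_cr3() and switch_to_user_cr3(), see patch 06), and
the actual call sites in the patches are different:

/*
 * Sketch only: entered from assembly on the trampoline stack, with
 * the user page-table still active when PTI is enabled.
 */
__visible noinstr void example_handler(struct pt_regs *regs)
{
	if (user_mode(regs))
		switch_to_kernel_cr3();	/* deferred PTI CR3 switch */

	/* ... switch to the kernel stack and do the real work ... */

	if (user_mode(regs))
		switch_to_user_cr3();	/* back to the user page-table */
}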

This patchset is sent as an RFC to get early feedback about the
proposal.

The code survives running a kernel build and LTP. Note that the
changes are 64-bit only for the moment; I haven't looked at 32-bit yet
but I will definitely check it.

Code is based on v5.10-rc3.

Thanks,

alex.

-----

Alexandre Chartre (24):
  x86/syscall: Add wrapper for invoking syscall function
  x86/entry: Update asm_call_on_stack to support more function arguments
  x86/entry: Consolidate IST entry from userspace
  x86/sev-es: Define a setup stack function for the VC idtentry
  x86/entry: Implement ret_from_fork body with C code
  x86/pti: Provide C variants of PTI switch CR3 macros
  x86/entry: Fill ESPFIX stack using C code
  x86/entry: Add C version of SWAPGS and SWAPGS_UNSAFE_STACK
  x86/entry: Add C version of paranoid_entry/exit
  x86/pti: Introduce per-task PTI trampoline stack
  x86/pti: Function to clone page-table entries from a specified mm
  x86/pti: Function to map per-cpu page-table entry
  x86/pti: Extend PTI user mappings
  x86/pti: Use PTI stack instead of trampoline stack
  x86/pti: Execute syscall functions on the kernel stack
  x86/pti: Execute IDT handlers on the kernel stack
  x86/pti: Execute IDT handlers with error code on the kernel stack
  x86/pti: Execute system vector handlers on the kernel stack
  x86/pti: Execute page fault handler on the kernel stack
  x86/pti: Execute NMI handler on the kernel stack
  x86/entry: Disable stack-protector for IST entry C handlers
  x86/entry: Defer paranoid entry/exit to C code
  x86/entry: Remove paranoid_entry and paranoid_exit
  x86/pti: Defer CR3 switch to C code for non-IST and syscall entries

 arch/x86/entry/common.c               | 259 ++++++++++++-
 arch/x86/entry/entry_64.S             | 513 ++++++++------------------
 arch/x86/entry/entry_64_compat.S      |  22 --
 arch/x86/include/asm/entry-common.h   | 108 ++++++
 arch/x86/include/asm/idtentry.h       | 153 +++++++-
 arch/x86/include/asm/irq_stack.h      |  11 +
 arch/x86/include/asm/page_64_types.h  |  36 +-
 arch/x86/include/asm/paravirt.h       |  15 +
 arch/x86/include/asm/paravirt_types.h |  17 +-
 arch/x86/include/asm/processor.h      |   3 +
 arch/x86/include/asm/pti.h            |  18 +
 arch/x86/include/asm/switch_to.h      |   7 +-
 arch/x86/include/asm/traps.h          |   2 +-
 arch/x86/kernel/cpu/mce/core.c        |   7 +-
 arch/x86/kernel/espfix_64.c           |  41 ++
 arch/x86/kernel/nmi.c                 |  34 +-
 arch/x86/kernel/sev-es.c              |  52 +++
 arch/x86/kernel/traps.c               |  61 +--
 arch/x86/mm/fault.c                   |  11 +-
 arch/x86/mm/pti.c                     |  71 ++--
 kernel/fork.c                         |  22 ++
 21 files changed, 1002 insertions(+), 461 deletions(-)

-- 
2.18.4


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC][PATCH 01/24] x86/syscall: Add wrapper for invoking syscall function
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
@ 2020-11-09 11:22 ` Alexandre Chartre
  2020-11-09 17:25   ` Andy Lutomirski
  2020-11-09 11:22 ` [RFC][PATCH 02/24] x86/entry: Update asm_call_on_stack to support more function arguments Alexandre Chartre
                   ` (23 subsequent siblings)
  24 siblings, 1 reply; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:22 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com

Add a wrapper function for invoking a syscall function.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/entry/common.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 870efeec8bda..d222212908ad 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -35,6 +35,15 @@
 #include <asm/syscall.h>
 #include <asm/irq_stack.h>
 
+static __always_inline void run_syscall(sys_call_ptr_t sysfunc,
+					struct pt_regs *regs)
+{
+	if (!sysfunc)
+		return;
+
+	regs->ax = sysfunc(regs);
+}
+
 #ifdef CONFIG_X86_64
 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
@@ -43,15 +52,16 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 	instrumentation_begin();
 	if (likely(nr < NR_syscalls)) {
 		nr = array_index_nospec(nr, NR_syscalls);
-		regs->ax = sys_call_table[nr](regs);
+		run_syscall(sys_call_table[nr], regs);
 #ifdef CONFIG_X86_X32_ABI
 	} else if (likely((nr & __X32_SYSCALL_BIT) &&
 			  (nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) {
 		nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT,
 					X32_NR_syscalls);
-		regs->ax = x32_sys_call_table[nr](regs);
+		run_syscall(x32_sys_call_table[nr], regs);
 #endif
 	}
+
 	instrumentation_end();
 	syscall_exit_to_user_mode(regs);
 }
@@ -75,7 +85,7 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs,
 	if (likely(nr < IA32_NR_syscalls)) {
 		instrumentation_begin();
 		nr = array_index_nospec(nr, IA32_NR_syscalls);
-		regs->ax = ia32_sys_call_table[nr](regs);
+		run_syscall(ia32_sys_call_table[nr], regs);
 		instrumentation_end();
 	}
 }
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC][PATCH 02/24] x86/entry: Update asm_call_on_stack to support more function arguments
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
  2020-11-09 11:22 ` [RFC][PATCH 01/24] x86/syscall: Add wrapper for invoking syscall function Alexandre Chartre
@ 2020-11-09 11:22 ` Alexandre Chartre
  2020-11-09 11:22 ` [RFC][PATCH 03/24] x86/entry: Consolidate IST entry from userspace Alexandre Chartre
                   ` (22 subsequent siblings)
  24 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:22 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com

Update the asm_call_on_stack() function so that it can be invoked
with a function having up to three arguments instead of only one.
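
As a hedged illustration of the resulting C interface (not part of
this patch; the function and variable names below are made up), a
three-argument call would look like:

#include <asm/irq_stack.h>

static void three_arg_func(void *a, void *b, void *c)
{
	/* runs with the stack pointer set to the stack passed below */
}

static void example(void *stack_top, void *a, void *b, void *c)
{
	/* invoke three_arg_func(a, b, c) on the stack at stack_top */
	asm_call_on_stack_3(stack_top, (void (*)(void))three_arg_func,
			    a, b, c);
}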

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/entry/entry_64.S        | 15 +++++++++++----
 arch/x86/include/asm/irq_stack.h |  8 ++++++++
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index cad08703c4ad..c42948aca0a8 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -759,9 +759,14 @@ SYM_CODE_END(.Lbad_gs)
 /*
  * rdi: New stack pointer points to the top word of the stack
  * rsi: Function pointer
- * rdx: Function argument (can be NULL if none)
+ * rdx: Function argument 1 (can be NULL if none)
+ * rcx: Function argument 2 (can be NULL if none)
+ * r8 : Function argument 3 (can be NULL if none)
  */
 SYM_FUNC_START(asm_call_on_stack)
+SYM_FUNC_START(asm_call_on_stack_1)
+SYM_FUNC_START(asm_call_on_stack_2)
+SYM_FUNC_START(asm_call_on_stack_3)
 SYM_INNER_LABEL(asm_call_sysvec_on_stack, SYM_L_GLOBAL)
 SYM_INNER_LABEL(asm_call_irq_on_stack, SYM_L_GLOBAL)
 	/*
@@ -777,15 +782,17 @@ SYM_INNER_LABEL(asm_call_irq_on_stack, SYM_L_GLOBAL)
 	 */
 	mov		%rsp, (%rdi)
 	mov		%rdi, %rsp
-	/* Move the argument to the right place */
+	mov		%rsi, %rax
+	/* Move arguments to the right place */
 	mov		%rdx, %rdi
-
+	mov		%rcx, %rsi
+	mov		%r8, %rdx
 1:
 	.pushsection .discard.instr_begin
 	.long 1b - .
 	.popsection
 
-	CALL_NOSPEC	rsi
+	CALL_NOSPEC	rax
 
 2:
 	.pushsection .discard.instr_end
diff --git a/arch/x86/include/asm/irq_stack.h b/arch/x86/include/asm/irq_stack.h
index 775816965c6a..359427216336 100644
--- a/arch/x86/include/asm/irq_stack.h
+++ b/arch/x86/include/asm/irq_stack.h
@@ -13,6 +13,14 @@ static __always_inline bool irqstack_active(void)
 }
 
 void asm_call_on_stack(void *sp, void (*func)(void), void *arg);
+
+void asm_call_on_stack_1(void *sp, void (*func)(void),
+			 void *arg1);
+void asm_call_on_stack_2(void *sp, void (*func)(void),
+			 void *arg1, void *arg2);
+void asm_call_on_stack_3(void *sp, void (*func)(void),
+			 void *arg1, void *arg2, void *arg3);
+
 void asm_call_sysvec_on_stack(void *sp, void (*func)(struct pt_regs *regs),
 			      struct pt_regs *regs);
 void asm_call_irq_on_stack(void *sp, void (*func)(struct irq_desc *desc),
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC][PATCH 03/24] x86/entry: Consolidate IST entry from userspace
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
  2020-11-09 11:22 ` [RFC][PATCH 01/24] x86/syscall: Add wrapper for invoking syscall function Alexandre Chartre
  2020-11-09 11:22 ` [RFC][PATCH 02/24] x86/entry: Update asm_call_on_stack to support more function arguments Alexandre Chartre
@ 2020-11-09 11:22 ` Alexandre Chartre
  2020-11-09 11:22 ` [RFC][PATCH 04/24] x86/sev-es: Define a setup stack function for the VC idtentry Alexandre Chartre
                   ` (21 subsequent siblings)
  24 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:22 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com

Most IST entries (NMI, MCE, DEBUG and VC, but not DF) handle an entry
from userspace the same way: they switch from the IST stack to the
kernel stack, call the handler and then return to userspace. However,
NMI, MCE/DEBUG and VC currently implement this same behavior using
different code paths, so consolidate this code into a single assembly
macro.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/entry/entry_64.S | 137 +++++++++++++++++++++-----------------
 1 file changed, 75 insertions(+), 62 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index c42948aca0a8..51df9f1871c6 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -316,6 +316,72 @@ SYM_CODE_END(ret_from_fork)
 #endif
 .endm
 
+/*
+ * Macro to handle an IDT entry defined with the IST mechanism. It should
+ * be invoked at the beginning of the IDT handler with a pointer to the C
+ * function (cfunc_user) to invoke if the IDT was entered from userspace.
+ *
+ * If the IDT was entered from userspace, the macro will switch from the
+ * IST stack to the regular task stack, call the provided function and
+ * return to userland.
+ *
+ * If IDT was entered from the kernel, the macro will just return.
+ */
+.macro ist_entry_user cfunc_user has_error_code=0
+	UNWIND_HINT_IRET_REGS
+	ASM_CLAC
+
+	/* only process entry from userspace */
+	.if \has_error_code == 1
+		testb	$3, CS-ORIG_RAX(%rsp)
+		jz	.List_entry_from_kernel_\@
+	.else
+		testb	$3, CS-RIP(%rsp)
+		jz	.List_entry_from_kernel_\@
+		pushq	$-1	/* ORIG_RAX: no syscall to restart */
+	.endif
+
+	/* Use %rdx as a temp variable */
+	pushq	%rdx
+
+	/*
+	 * Switch from the IST stack to the regular task stack and
+	 * use the provided entry point.
+	 */
+	swapgs
+	cld
+	FENCE_SWAPGS_USER_ENTRY
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx
+	movq	%rsp, %rdx
+	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
+	UNWIND_HINT_IRET_REGS base=%rdx offset=8
+	pushq	6*8(%rdx)	/* pt_regs->ss */
+	pushq	5*8(%rdx)	/* pt_regs->rsp */
+	pushq	4*8(%rdx)	/* pt_regs->flags */
+	pushq	3*8(%rdx)	/* pt_regs->cs */
+	pushq	2*8(%rdx)	/* pt_regs->rip */
+	UNWIND_HINT_IRET_REGS
+	pushq   1*8(%rdx)	/* pt_regs->orig_ax */
+	PUSH_AND_CLEAR_REGS rdx=(%rdx)
+	ENCODE_FRAME_POINTER
+
+	/*
+	 * At this point we no longer need to worry about stack damage
+	 * due to nesting -- we're on the normal thread stack and we're
+	 * done with the IST stack.
+	 */
+
+	mov	%rsp, %rdi
+	.if \has_error_code == 1
+		movq	ORIG_RAX(%rsp), %rsi	/* get error code into 2nd argument*/
+		movq	$-1, ORIG_RAX(%rsp)	/* no syscall to restart */
+	.endif
+	call	\cfunc_user
+	jmp	swapgs_restore_regs_and_return_to_usermode
+
+.List_entry_from_kernel_\@:
+.endm
+
 /**
  * idtentry_body - Macro to emit code calling the C function
  * @cfunc:		C function to be called
@@ -417,18 +483,15 @@ SYM_CODE_END(\asmsym)
  */
 .macro idtentry_mce_db vector asmsym cfunc
 SYM_CODE_START(\asmsym)
-	UNWIND_HINT_IRET_REGS
-	ASM_CLAC
-
-	pushq	$-1			/* ORIG_RAX: no syscall to restart */
-
 	/*
 	 * If the entry is from userspace, switch stacks and treat it as
 	 * a normal entry.
 	 */
-	testb	$3, CS-ORIG_RAX(%rsp)
-	jnz	.Lfrom_usermode_switch_stack_\@
+	ist_entry_user noist_\cfunc
 
+	/* Entry from kernel */
+
+	pushq	$-1			/* ORIG_RAX: no syscall to restart */
 	/* paranoid_entry returns GS information for paranoid_exit in EBX. */
 	call	paranoid_entry
 
@@ -440,10 +503,6 @@ SYM_CODE_START(\asmsym)
 
 	jmp	paranoid_exit
 
-	/* Switch to the regular task stack and use the noist entry point */
-.Lfrom_usermode_switch_stack_\@:
-	idtentry_body noist_\cfunc, has_error_code=0
-
 _ASM_NOKPROBE(\asmsym)
 SYM_CODE_END(\asmsym)
 .endm
@@ -472,15 +531,11 @@ SYM_CODE_END(\asmsym)
  */
 .macro idtentry_vc vector asmsym cfunc
 SYM_CODE_START(\asmsym)
-	UNWIND_HINT_IRET_REGS
-	ASM_CLAC
-
 	/*
 	 * If the entry is from userspace, switch stacks and treat it as
 	 * a normal entry.
 	 */
-	testb	$3, CS-ORIG_RAX(%rsp)
-	jnz	.Lfrom_usermode_switch_stack_\@
+	ist_entry_user safe_stack_\cfunc, has_error_code=1
 
 	/*
 	 * paranoid_entry returns SWAPGS flag for paranoid_exit in EBX.
@@ -517,10 +572,6 @@ SYM_CODE_START(\asmsym)
 	 */
 	jmp	paranoid_exit
 
-	/* Switch to the regular task stack */
-.Lfrom_usermode_switch_stack_\@:
-	idtentry_body safe_stack_\cfunc, has_error_code=1
-
 _ASM_NOKPROBE(\asmsym)
 SYM_CODE_END(\asmsym)
 .endm
@@ -1113,8 +1164,6 @@ SYM_CODE_END(error_return)
  *	      when PAGE_TABLE_ISOLATION is in use.  Do not clobber.
  */
 SYM_CODE_START(asm_exc_nmi)
-	UNWIND_HINT_IRET_REGS
-
 	/*
 	 * We allow breakpoints in NMIs. If a breakpoint occurs, then
 	 * the iretq it performs will take us out of NMI context.
@@ -1153,14 +1202,6 @@ SYM_CODE_START(asm_exc_nmi)
 	 * other IST entries.
 	 */
 
-	ASM_CLAC
-
-	/* Use %rdx as our temp variable throughout */
-	pushq	%rdx
-
-	testb	$3, CS-RIP+8(%rsp)
-	jz	.Lnmi_from_kernel
-
 	/*
 	 * NMI from user mode.  We need to run on the thread stack, but we
 	 * can't go through the normal entry paths: NMIs are masked, and
@@ -1171,41 +1212,13 @@ SYM_CODE_START(asm_exc_nmi)
 	 * We also must not push anything to the stack before switching
 	 * stacks lest we corrupt the "NMI executing" variable.
 	 */
+	ist_entry_user exc_nmi
 
-	swapgs
-	cld
-	FENCE_SWAPGS_USER_ENTRY
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx
-	movq	%rsp, %rdx
-	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
-	UNWIND_HINT_IRET_REGS base=%rdx offset=8
-	pushq	5*8(%rdx)	/* pt_regs->ss */
-	pushq	4*8(%rdx)	/* pt_regs->rsp */
-	pushq	3*8(%rdx)	/* pt_regs->flags */
-	pushq	2*8(%rdx)	/* pt_regs->cs */
-	pushq	1*8(%rdx)	/* pt_regs->rip */
-	UNWIND_HINT_IRET_REGS
-	pushq   $-1		/* pt_regs->orig_ax */
-	PUSH_AND_CLEAR_REGS rdx=(%rdx)
-	ENCODE_FRAME_POINTER
-
-	/*
-	 * At this point we no longer need to worry about stack damage
-	 * due to nesting -- we're on the normal thread stack and we're
-	 * done with the NMI stack.
-	 */
-
-	movq	%rsp, %rdi
-	movq	$-1, %rsi
-	call	exc_nmi
+	/* NMI from kernel */
 
-	/*
-	 * Return back to user mode.  We must *not* do the normal exit
-	 * work, because we don't want to enable interrupts.
-	 */
-	jmp	swapgs_restore_regs_and_return_to_usermode
+	/* Use %rdx as our temp variable throughout */
+	pushq	%rdx
 
-.Lnmi_from_kernel:
 	/*
 	 * Here's what our stack frame will look like:
 	 * +---------------------------------------------------------+
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC][PATCH 04/24] x86/sev-es: Define a setup stack function for the VC idtentry
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
                   ` (2 preceding siblings ...)
  2020-11-09 11:22 ` [RFC][PATCH 03/24] x86/entry: Consolidate IST entry from userspace Alexandre Chartre
@ 2020-11-09 11:22 ` Alexandre Chartre
  2020-11-09 11:23 ` [RFC][PATCH 05/24] x86/entry: Implement ret_from_fork body with C code Alexandre Chartre
                   ` (20 subsequent siblings)
  24 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:22 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com

The #VC exception assembly entry code uses C code (vc_switch_off_ist)
to get and configure a stack, then returns to assembly to switch to
that stack, and finally invokes the exception handler C function.

To pave the way for deferring CR3 switch from assembly to C code,
define a setup stack function for the VC idtentry. This function is
used to get and configure the stack before invoking idtentry handler.

For now, the setup stack function is just a wrapper around the
vc_switch_off_ist() function but it will eventually also contain
the C code to switch CR3. The vc_switch_off_ist() function is also
refactored to just return the stack pointer, and the stack
configuration is done in the setup stack function (so that the
stack can also be used to propagate CR3 switch information to
the idtentry handler for switching CR3 back).

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/entry/entry_64.S       |  8 +++-----
 arch/x86/include/asm/idtentry.h | 14 ++++++++++++++
 arch/x86/include/asm/traps.h    |  2 +-
 arch/x86/kernel/sev-es.c        | 34 +++++++++++++++++++++++++++++++++
 arch/x86/kernel/traps.c         | 19 +++---------------
 5 files changed, 55 insertions(+), 22 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 51df9f1871c6..274384644b5e 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -546,13 +546,11 @@ SYM_CODE_START(\asmsym)
 	UNWIND_HINT_REGS
 
 	/*
-	 * Switch off the IST stack to make it free for nested exceptions. The
-	 * vc_switch_off_ist() function will switch back to the interrupted
-	 * stack if it is safe to do so. If not it switches to the VC fall-back
-	 * stack.
+	 * Call the setup stack function. It configures and returns
+	 * the stack we should be using to run the exception handler.
 	 */
 	movq	%rsp, %rdi		/* pt_regs pointer */
-	call	vc_switch_off_ist
+	call	setup_stack_\cfunc
 	movq	%rax, %rsp		/* Switch to new stack */
 
 	UNWIND_HINT_REGS
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index b2442eb0ac2f..4b4aca2b1420 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -318,6 +318,7 @@ static __always_inline void __##func(struct pt_regs *regs)
  */
 #define DECLARE_IDTENTRY_VC(vector, func)				\
 	DECLARE_IDTENTRY_RAW_ERRORCODE(vector, func);			\
+	__visible noinstr unsigned long setup_stack_##func(struct pt_regs *regs);	\
 	__visible noinstr void ist_##func(struct pt_regs *regs, unsigned long error_code);	\
 	__visible noinstr void safe_stack_##func(struct pt_regs *regs, unsigned long error_code)
 
@@ -380,6 +381,19 @@ static __always_inline void __##func(struct pt_regs *regs)
 #define DEFINE_IDTENTRY_VC_IST(func)				\
 	DEFINE_IDTENTRY_RAW_ERRORCODE(ist_##func)
 
+/**
+ * DEFINE_IDTENTRY_VC_SETUP_STACK - Emit code for setting up the stack to
+ *				    run the VMM communication handler
+ * @func:	Function name of the entry point
+ *
+ * The stack setup code is executed before the VMM communication handler.
+ * It configures and returns the stack to switch to before running the
+ * VMM communication handler.
+ */
+#define DEFINE_IDTENTRY_VC_SETUP_STACK(func)			\
+	__visible noinstr					\
+	unsigned long setup_stack_##func(struct pt_regs *regs)
+
 /**
  * DEFINE_IDTENTRY_VC - Emit code for VMM communication handler
  * @func:	Function name of the entry point
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 7f7200021bd1..cfcc9d34d2a0 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -15,7 +15,7 @@ asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs);
 asmlinkage __visible notrace
 struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s);
 void __init trap_init(void);
-asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *eregs);
+asmlinkage __visible noinstr unsigned long vc_switch_off_ist(struct pt_regs *eregs);
 #endif
 
 #ifdef CONFIG_X86_F00F_BUG
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index 0bd1a0fc587e..bd977c917cd6 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -1349,6 +1349,40 @@ DEFINE_IDTENTRY_VC_IST(exc_vmm_communication)
 	instrumentation_end();
 }
 
+struct exc_vc_frame {
+	/* pt_regs should be first */
+	struct pt_regs regs;
+};
+
+DEFINE_IDTENTRY_VC_SETUP_STACK(exc_vmm_communication)
+{
+	struct exc_vc_frame *frame;
+	unsigned long sp;
+
+	/*
+	 * Switch off the IST stack to make it free for nested exceptions.
+	 * The vc_switch_off_ist() function will switch back to the
+	 * interrupted stack if it is safe to do so. If not it switches
+	 * to the VC fall-back stack.
+	 */
+	sp = vc_switch_off_ist(regs);
+
+	/*
+	 * Found a safe stack. Set it up as if the entry has happened on
+	 * that stack. This means that we need to have pt_regs at the top
+	 * of the stack.
+	 *
+	 * The effective stack switch happens in assembly code before
+	 * the #VC handler is called.
+	 */
+	sp = ALIGN_DOWN(sp, 8) - sizeof(*frame);
+
+	frame = (struct exc_vc_frame *)sp;
+	frame->regs = *regs;
+
+	return sp;
+}
+
 DEFINE_IDTENTRY_VC(exc_vmm_communication)
 {
 	if (likely(!on_vc_fallback_stack(regs)))
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index e19df6cde35d..09b22a611d99 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -675,11 +675,10 @@ asmlinkage __visible noinstr struct pt_regs *sync_regs(struct pt_regs *eregs)
 }
 
 #ifdef CONFIG_AMD_MEM_ENCRYPT
-asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *regs)
+asmlinkage __visible noinstr unsigned long vc_switch_off_ist(struct pt_regs *regs)
 {
 	unsigned long sp, *stack;
 	struct stack_info info;
-	struct pt_regs *regs_ret;
 
 	/*
 	 * In the SYSCALL entry path the RSP value comes from user-space - don't
@@ -687,8 +686,7 @@ asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *r
 	 */
 	if (regs->ip >= (unsigned long)entry_SYSCALL_64 &&
 	    regs->ip <  (unsigned long)entry_SYSCALL_64_safe_stack) {
-		sp = this_cpu_read(cpu_current_top_of_stack);
-		goto sync;
+		return this_cpu_read(cpu_current_top_of_stack);
 	}
 
 	/*
@@ -703,18 +701,7 @@ asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *r
 	    info.type >= STACK_TYPE_EXCEPTION_LAST)
 		sp = __this_cpu_ist_top_va(VC2);
 
-sync:
-	/*
-	 * Found a safe stack - switch to it as if the entry didn't happen via
-	 * IST stack. The code below only copies pt_regs, the real switch happens
-	 * in assembly code.
-	 */
-	sp = ALIGN_DOWN(sp, 8) - sizeof(*regs_ret);
-
-	regs_ret = (struct pt_regs *)sp;
-	*regs_ret = *regs;
-
-	return regs_ret;
+	return sp;
 }
 #endif
 
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC][PATCH 05/24] x86/entry: Implement ret_from_fork body with C code
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
                   ` (3 preceding siblings ...)
  2020-11-09 11:22 ` [RFC][PATCH 04/24] x86/sev-es: Define a setup stack function for the VC idtentry Alexandre Chartre
@ 2020-11-09 11:23 ` Alexandre Chartre
  2020-11-09 11:23 ` [RFC][PATCH 06/24] x86/pti: Provide C variants of PTI switch CR3 macros Alexandre Chartre
                   ` (19 subsequent siblings)
  24 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:23 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com

ret_from_fork is a mix of assembly code and calls to C functions.
Re-implement ret_from_fork so that it calls a single C function.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/entry/common.c   | 18 ++++++++++++++++++
 arch/x86/entry/entry_64.S | 28 +++++-----------------------
 2 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index d222212908ad..7ee15a12c115 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -35,6 +35,24 @@
 #include <asm/syscall.h>
 #include <asm/irq_stack.h>
 
+__visible noinstr void return_from_fork(struct pt_regs *regs,
+					struct task_struct *prev,
+					void (*kfunc)(void *), void *kargs)
+{
+	schedule_tail(prev);
+	if (kfunc) {
+		/* kernel thread */
+		kfunc(kargs);
+		/*
+		 * A kernel thread is allowed to return here after
+		 * successfully calling kernel_execve(). Exit to
+		 * userspace to complete the execve() syscall.
+		 */
+		regs->ax = 0;
+	}
+	syscall_exit_to_user_mode(regs);
+}
+
 static __always_inline void run_syscall(sys_call_ptr_t sysfunc,
 					struct pt_regs *regs)
 {
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 274384644b5e..73e9cd47dc83 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -276,31 +276,13 @@ SYM_FUNC_END(__switch_to_asm)
  */
 .pushsection .text, "ax"
 SYM_CODE_START(ret_from_fork)
-	UNWIND_HINT_EMPTY
-	movq	%rax, %rdi
-	call	schedule_tail			/* rdi: 'prev' task parameter */
-
-	testq	%rbx, %rbx			/* from kernel_thread? */
-	jnz	1f				/* kernel threads are uncommon */
-
-2:
 	UNWIND_HINT_REGS
-	movq	%rsp, %rdi
-	call	syscall_exit_to_user_mode	/* returns with IRQs disabled */
+	movq	%rsp, %rdi			/* pt_regs */
+	movq	%rax, %rsi			/* 'prev' task parameter */
+	movq	%rbx, %rdx			/* kernel thread func */
+	movq	%r12, %rcx			/* kernel thread arg */
+	call	return_from_fork		/* returns with IRQs disabled */
 	jmp	swapgs_restore_regs_and_return_to_usermode
-
-1:
-	/* kernel thread */
-	UNWIND_HINT_EMPTY
-	movq	%r12, %rdi
-	CALL_NOSPEC rbx
-	/*
-	 * A kernel thread is allowed to return here after successfully
-	 * calling kernel_execve().  Exit to userspace to complete the execve()
-	 * syscall.
-	 */
-	movq	$0, RAX(%rsp)
-	jmp	2b
 SYM_CODE_END(ret_from_fork)
 .popsection
 
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC][PATCH 06/24] x86/pti: Provide C variants of PTI switch CR3 macros
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
                   ` (4 preceding siblings ...)
  2020-11-09 11:23 ` [RFC][PATCH 05/24] x86/entry: Implement ret_from_fork body with C code Alexandre Chartre
@ 2020-11-09 11:23 ` Alexandre Chartre
  2020-11-09 11:23 ` [RFC][PATCH 07/24] x86/entry: Fill ESPFIX stack using C code Alexandre Chartre
                   ` (18 subsequent siblings)
  24 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:23 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com

Page Table Isolation (PTI) uses assembly macros to switch the CR3
register between the kernel and user page-tables. Add C functions
which implement the same features. For now, these C functions are
not used, but they will eventually replace the assembly macros.
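
As a hedged usage sketch (not from this patch), C entry code could
later bracket work that needs the kernel page-table like this:

	unsigned long saved_cr3;

	/* switch to the kernel page-table, remembering the entry CR3 */
	saved_cr3 = save_and_switch_to_kernel_cr3();

	/* ... code that must run with the kernel page-table ... */

	/* go back to whatever page-table was in use on entry */
	restore_cr3(saved_cr3);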

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/entry/common.c             | 44 +++++++++++++++
 arch/x86/include/asm/entry-common.h | 84 +++++++++++++++++++++++++++++
 2 files changed, 128 insertions(+)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 7ee15a12c115..d09b1ded5287 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -343,3 +343,47 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 	}
 }
 #endif /* CONFIG_XEN_PV */
+
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+
+static __always_inline unsigned long save_and_switch_to_kernel_cr3(void)
+{
+	unsigned long cr3, saved_cr3;
+
+	if (!static_cpu_has(X86_FEATURE_PTI))
+		return 0;
+
+	saved_cr3 = cr3 = __read_cr3();
+	if (cr3 & PTI_USER_PGTABLE_MASK) {
+		adjust_kernel_cr3(&cr3);
+		native_write_cr3(cr3);
+	}
+
+	return saved_cr3;
+}
+
+static __always_inline void restore_cr3(unsigned long cr3)
+{
+	if (!static_cpu_has(X86_FEATURE_PTI))
+		return;
+
+	if (static_cpu_has(X86_FEATURE_PCID)) {
+		if (cr3 & PTI_USER_PGTABLE_MASK)
+			adjust_user_cr3(&cr3);
+		else
+			cr3 |= X86_CR3_PCID_NOFLUSH;
+	}
+
+	native_write_cr3(cr3);
+}
+
+#else /* CONFIG_PAGE_TABLE_ISOLATION */
+
+static __always_inline unsigned long save_and_switch_to_kernel_cr3(void)
+{
+	return 0;
+}
+
+static __always_inline void restore_cr3(unsigned long cr3) {}
+
+#endif /* CONFIG_PAGE_TABLE_ISOLATION */
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 6fe54b2813c1..b05b212f5ebc 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -7,6 +7,7 @@
 #include <asm/nospec-branch.h>
 #include <asm/io_bitmap.h>
 #include <asm/fpu/api.h>
+#include <asm/tlbflush.h>
 
 /* Check that the stack and regs on entry from user mode are sane. */
 static __always_inline void arch_check_user_regs(struct pt_regs *regs)
@@ -81,4 +82,87 @@ static __always_inline void arch_exit_to_user_mode(void)
 }
 #define arch_exit_to_user_mode arch_exit_to_user_mode
 
+#ifndef MODULE
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+
+/*
+ * PAGE_TABLE_ISOLATION PGDs are 8k.  Flip bit 12 to switch between the two
+ * halves:
+ */
+#define PTI_USER_PGTABLE_BIT		PAGE_SHIFT
+#define PTI_USER_PGTABLE_MASK		(1 << PTI_USER_PGTABLE_BIT)
+#define PTI_USER_PCID_BIT		X86_CR3_PTI_PCID_USER_BIT
+#define PTI_USER_PCID_MASK		(1 << PTI_USER_PCID_BIT)
+#define PTI_USER_PGTABLE_AND_PCID_MASK  \
+	(PTI_USER_PCID_MASK | PTI_USER_PGTABLE_MASK)
+
+static __always_inline void adjust_kernel_cr3(unsigned long *cr3)
+{
+	if (static_cpu_has(X86_FEATURE_PCID))
+		*cr3 |= X86_CR3_PCID_NOFLUSH;
+
+	/*
+	 * Clear PCID and "PAGE_TABLE_ISOLATION bit", point CR3
+	 * at kernel pagetables.
+	 */
+	*cr3 &= ~PTI_USER_PGTABLE_AND_PCID_MASK;
+}
+
+static __always_inline void adjust_user_cr3(unsigned long *cr3)
+{
+	unsigned short mask;
+	unsigned long asid;
+
+	/*
+	 * Test if the ASID needs a flush.
+	 */
+	asid = *cr3 & 0x7ff;
+	mask = this_cpu_read(cpu_tlbstate.user_pcid_flush_mask);
+	if (mask & (1 << asid)) {
+		/* Flush needed, clear the bit */
+		this_cpu_and(cpu_tlbstate.user_pcid_flush_mask, ~(1 << asid));
+	} else {
+		*cr3 |= X86_CR3_PCID_NOFLUSH;
+	}
+}
+
+static __always_inline void switch_to_kernel_cr3(void)
+{
+	unsigned long cr3;
+
+	if (!static_cpu_has(X86_FEATURE_PTI))
+		return;
+
+	cr3 = __read_cr3();
+	adjust_kernel_cr3(&cr3);
+	native_write_cr3(cr3);
+}
+
+static __always_inline void switch_to_user_cr3(void)
+{
+	unsigned long cr3;
+
+	if (!static_cpu_has(X86_FEATURE_PTI))
+		return;
+
+	cr3 = __read_cr3();
+	if (static_cpu_has(X86_FEATURE_PCID)) {
+		adjust_user_cr3(&cr3);
+		/* Flip the ASID to the user version */
+		cr3 |= PTI_USER_PCID_MASK;
+	}
+
+	/* Flip the PGD to the user version */
+	cr3 |= PTI_USER_PGTABLE_MASK;
+	native_write_cr3(cr3);
+}
+
+#else /* CONFIG_PAGE_TABLE_ISOLATION */
+
+static inline void switch_to_kernel_cr3(void) {}
+static inline void switch_to_user_cr3(void) {}
+
+#endif /* CONFIG_PAGE_TABLE_ISOLATION */
+#endif /* MODULE */
+
 #endif
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC][PATCH 07/24] x86/entry: Fill ESPFIX stack using C code
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
                   ` (5 preceding siblings ...)
  2020-11-09 11:23 ` [RFC][PATCH 06/24] x86/pti: Provide C variants of PTI switch CR3 macros Alexandre Chartre
@ 2020-11-09 11:23 ` Alexandre Chartre
  2020-11-09 11:23 ` [RFC][PATCH 08/24] x86/entry: Add C version of SWAPGS and SWAPGS_UNSAFE_STACK Alexandre Chartre
                   ` (17 subsequent siblings)
  24 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:23 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com

The ESPFIX stack is filled using assembly code. Move this code to a C
function so that it is easier to read and modify.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/entry/entry_64.S   | 62 ++++++++++++++++++-------------------
 arch/x86/kernel/espfix_64.c | 41 ++++++++++++++++++++++++
 2 files changed, 72 insertions(+), 31 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 73e9cd47dc83..6e0b5b010e0b 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -684,8 +684,10 @@ native_irq_return_ldt:
 	 * long (see ESPFIX_STACK_SIZE).  espfix_waddr points to the bottom
 	 * of the ESPFIX stack.
 	 *
-	 * We clobber RAX and RDI in this code.  We stash RDI on the
-	 * normal stack and RAX on the ESPFIX stack.
+	 * We call into C code to fill the ESPFIX stack. We stash registers
+	 * that the C function can clobber on the normal stack. The user RAX
+	 * is stashed first so that it is adjacent to the iret frame which
+	 * will be copied to the ESPFIX stack.
 	 *
 	 * The ESPFIX stack layout we set up looks like this:
 	 *
@@ -699,39 +701,37 @@ native_irq_return_ldt:
 	 * --- bottom of ESPFIX stack ---
 	 */
 
-	pushq	%rdi				/* Stash user RDI */
-	SWAPGS					/* to kernel GS */
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi	/* to kernel CR3 */
-
-	movq	PER_CPU_VAR(espfix_waddr), %rdi
-	movq	%rax, (0*8)(%rdi)		/* user RAX */
-	movq	(1*8)(%rsp), %rax		/* user RIP */
-	movq	%rax, (1*8)(%rdi)
-	movq	(2*8)(%rsp), %rax		/* user CS */
-	movq	%rax, (2*8)(%rdi)
-	movq	(3*8)(%rsp), %rax		/* user RFLAGS */
-	movq	%rax, (3*8)(%rdi)
-	movq	(5*8)(%rsp), %rax		/* user SS */
-	movq	%rax, (5*8)(%rdi)
-	movq	(4*8)(%rsp), %rax		/* user RSP */
-	movq	%rax, (4*8)(%rdi)
-	/* Now RAX == RSP. */
-
-	andl	$0xffff0000, %eax		/* RAX = (RSP & 0xffff0000) */
+	/* save registers */
+	pushq	%rax
+	pushq	%rdi
+	pushq	%rsi
+	pushq	%rdx
+	pushq	%rcx
+	pushq	%r8
+	pushq	%r9
+	pushq	%r10
+	pushq	%r11
 
 	/*
-	 * espfix_stack[31:16] == 0.  The page tables are set up such that
-	 * (espfix_stack | (X & 0xffff0000)) points to a read-only alias of
-	 * espfix_waddr for any X.  That is, there are 65536 RO aliases of
-	 * the same page.  Set up RSP so that RSP[31:16] contains the
-	 * respective 16 bits of the /userspace/ RSP and RSP nonetheless
-	 * still points to an RO alias of the ESPFIX stack.
+	 * fill_espfix_stack will copy the iret+rax frame to the ESPFIX
+	 * stack and return with RAX containing a pointer to the ESPFIX
+	 * stack.
 	 */
-	orq	PER_CPU_VAR(espfix_stack), %rax
+	leaq	8*8(%rsp), %rdi		/* points to the iret+rax frame */
+	call	fill_espfix_stack
 
-	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
-	SWAPGS					/* to user GS */
-	popq	%rdi				/* Restore user RDI */
+	/*
+	 * RAX contains a pointer to the ESPFIX stack, so restore all the
+	 * registers except RAX; RAX will be restored from the ESPFIX stack.
+	 */
+	popq	%r11
+	popq	%r10
+	popq	%r9
+	popq	%r8
+	popq	%rcx
+	popq	%rdx
+	popq	%rsi
+	popq	%rdi
 
 	movq	%rax, %rsp
 	UNWIND_HINT_IRET_REGS offset=8
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 4fe7af58cfe1..6a81c4bd1542 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -33,6 +33,7 @@
 #include <asm/pgalloc.h>
 #include <asm/setup.h>
 #include <asm/espfix.h>
+#include <asm/entry-common.h>
 
 /*
  * Note: we only need 6*8 = 48 bytes for the espfix stack, but round
@@ -205,3 +206,43 @@ void init_espfix_ap(int cpu)
 	per_cpu(espfix_waddr, cpu) = (unsigned long)stack_page
 				      + (addr & ~PAGE_MASK);
 }
+
+/*
+ * iret frame with an additional user_rax register.
+ */
+struct iret_rax_frame {
+	unsigned long user_rax;
+	unsigned long rip;
+	unsigned long cs;
+	unsigned long rflags;
+	unsigned long rsp;
+	unsigned long ss;
+};
+
+noinstr unsigned long fill_espfix_stack(struct iret_rax_frame *frame)
+{
+	struct iret_rax_frame *espfix_frame;
+	unsigned long rsp;
+
+	native_swapgs();
+	switch_to_kernel_cr3();
+
+	espfix_frame = (struct iret_rax_frame *)this_cpu_read(espfix_waddr);
+	*espfix_frame = *frame;
+
+	/*
+	 * espfix_stack[31:16] == 0.  The page tables are set up such that
+	 * (espfix_stack | (X & 0xffff0000)) points to a read-only alias of
+	 * espfix_waddr for any X.  That is, there are 65536 RO aliases of
+	 * the same page.  Set up RSP so that RSP[31:16] contains the
+	 * respective 16 bits of the /userspace/ RSP and RSP nonetheless
+	 * still points to an RO alias of the ESPFIX stack.
+	 */
+	rsp = ((unsigned long)espfix_frame) & 0xffff0000;
+	rsp |= this_cpu_read(espfix_stack);
+
+	switch_to_user_cr3();
+	native_swapgs();
+
+	return rsp;
+}
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC][PATCH 08/24] x86/entry: Add C version of SWAPGS and SWAPGS_UNSAFE_STACK
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
                   ` (6 preceding siblings ...)
  2020-11-09 11:23 ` [RFC][PATCH 07/24] x86/entry: Fill ESPFIX stack using C code Alexandre Chartre
@ 2020-11-09 11:23 ` Alexandre Chartre
  2020-11-09 17:38   ` Andy Lutomirski
  2020-11-09 11:23 ` [RFC][PATCH 09/24] x86/entry: Add C version of paranoid_entry/exit Alexandre Chartre
                   ` (16 subsequent siblings)
  24 siblings, 1 reply; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:23 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com

SWAPGS and SWAPGS_UNSAFE_STACK are assembly macros. Add C versions
of these macros (swapgs() and swapgs_unsafe_stack()).

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/include/asm/paravirt.h       | 15 +++++++++++++++
 arch/x86/include/asm/paravirt_types.h | 17 ++++++++++++-----
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index d25cc6830e89..a4898130b36b 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -145,6 +145,21 @@ static inline void __write_cr4(unsigned long x)
 	PVOP_VCALL1(cpu.write_cr4, x);
 }
 
+static inline void swapgs(void)
+{
+	PVOP_VCALL0(cpu.swapgs);
+}
+
+/*
+ * If swapgs is used while the userspace stack is still current,
+ * there's no way to call a pvop.  The PV replacement *must* be
+ * inlined, or the swapgs instruction must be trapped and emulated.
+ */
+static inline void swapgs_unsafe_stack(void)
+{
+	PVOP_VCALL0_ALT(cpu.swapgs, "swapgs");
+}
+
 static inline void arch_safe_halt(void)
 {
 	PVOP_VCALL0(irq.safe_halt);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 0fad9f61c76a..eea9acc942a3 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -532,12 +532,12 @@ int paravirt_disable_iospace(void);
 		      pre, post, ##__VA_ARGS__)
 
 
-#define ____PVOP_VCALL(op, clbr, call_clbr, extra_clbr, pre, post, ...)	\
+#define ____PVOP_VCALL(op, insn, clbr, call_clbr, extra_clbr, pre, post, ...) \
 	({								\
 		PVOP_VCALL_ARGS;					\
 		PVOP_TEST_NULL(op);					\
 		asm volatile(pre					\
-			     paravirt_alt(PARAVIRT_CALL)		\
+			     paravirt_alt(insn)				\
 			     post					\
 			     : call_clbr, ASM_CALL_CONSTRAINT		\
 			     : paravirt_type(op),			\
@@ -547,12 +547,17 @@ int paravirt_disable_iospace(void);
 	})
 
 #define __PVOP_VCALL(op, pre, post, ...)				\
-	____PVOP_VCALL(op, CLBR_ANY, PVOP_VCALL_CLOBBERS,		\
-		       VEXTRA_CLOBBERS,					\
+	____PVOP_VCALL(op, PARAVIRT_CALL, CLBR_ANY,			\
+		       PVOP_VCALL_CLOBBERS, VEXTRA_CLOBBERS,		\
 		       pre, post, ##__VA_ARGS__)
 
+#define __PVOP_VCALL_ALT(op, insn)					\
+	____PVOP_VCALL(op, insn, CLBR_ANY,				\
+		       PVOP_VCALL_CLOBBERS, VEXTRA_CLOBBERS,		\
+		       "", "")
+
 #define __PVOP_VCALLEESAVE(op, pre, post, ...)				\
-	____PVOP_VCALL(op.func, CLBR_RET_REG,				\
+	____PVOP_VCALL(op.func, PARAVIRT_CALL, CLBR_RET_REG,		\
 		      PVOP_VCALLEE_CLOBBERS, ,				\
 		      pre, post, ##__VA_ARGS__)
 
@@ -562,6 +567,8 @@ int paravirt_disable_iospace(void);
 	__PVOP_CALL(rettype, op, "", "")
 #define PVOP_VCALL0(op)							\
 	__PVOP_VCALL(op, "", "")
+#define PVOP_VCALL0_ALT(op, insn)					\
+	__PVOP_VCALL_ALT(op, insn)
 
 #define PVOP_CALLEE0(rettype, op)					\
 	__PVOP_CALLEESAVE(rettype, op, "", "")
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC][PATCH 09/24] x86/entry: Add C version of paranoid_entry/exit
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
                   ` (7 preceding siblings ...)
  2020-11-09 11:23 ` [RFC][PATCH 08/24] x86/entry: Add C version of SWAPGS and SWAPGS_UNSAFE_STACK Alexandre Chartre
@ 2020-11-09 11:23 ` Alexandre Chartre
  2020-11-09 11:23 ` [RFC][PATCH 10/24] x86/pti: Introduce per-task PTI trampoline stack Alexandre Chartre
                   ` (15 subsequent siblings)
  24 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:23 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com

paranoid_entry/exit are assembly macros. Provide C versions of
these macros (kernel_paranoid_entry() and kernel_paranoid_exit()).
The C functions are functionally equivalent to the assembly macros,
except that kernel_paranoid_entry() doesn't save registers in
pt_regs like paranoid_entry does.
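
A hedged sketch of how a C-coded IST handler might use this pair (the
handler name and body below are illustrative only):

__visible noinstr void example_ist_handler(struct pt_regs *regs)
{
	struct kernel_entry_state state;

	kernel_paranoid_entry(&state);	/* make CR3 and GSBASE kernel-safe */

	/* ... handler body; per-cpu data is accessible from here on ... */

	kernel_paranoid_exit(&state);	/* restore CR3/GSBASE as on entry */
}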

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/entry/common.c             | 157 ++++++++++++++++++++++++++++
 arch/x86/include/asm/entry-common.h |  10 ++
 2 files changed, 167 insertions(+)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index d09b1ded5287..54d0931801e1 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -387,3 +387,160 @@ static __always_inline unsigned long save_and_switch_to_kernel_cr3(void)
 static __always_inline void restore_cr3(unsigned long cr3) {}
 
 #endif /* CONFIG_PAGE_TABLE_ISOLATION */
+
+/*
+ * "Paranoid" entry path from exception stack. Ensure that the CR3 and
+ * GS registers are correctly set for the kernel. Return GSBASE related
+ * information in kernel_entry_state depending on the availability of
+ * the FSGSBASE instructions:
+ *
+ * FSGSBASE	kernel_entry_state
+ *     N        swapgs=true -> SWAPGS on exit
+ *              swapgs=false -> no SWAPGS on exit
+ *
+ *     Y        gsbase=GSBASE value at entry, must be restored in
+ *              kernel_paranoid_exit()
+ *
+ * Note that per-cpu variables are accessed using the GS register,
+ * so paranoid entry code cannot access per-cpu variables before
+ * kernel_paranoid_entry() has been called.
+ */
+noinstr void kernel_paranoid_entry(struct kernel_entry_state *state)
+{
+	unsigned long gsbase;
+	unsigned int cpu;
+
+	/*
+	 * Save CR3 in the kernel entry state.  This value will be
+	 * restored, verbatim, at exit.  Needed if the paranoid entry
+	 * interrupted another entry that already switched to the user
+	 * CR3 value but has not yet returned to userspace.
+	 *
+	 * This is also why CS (stashed in the "iret frame" by the
+	 * hardware at entry) can not be used: this may be a return
+	 * to kernel code, but with a user CR3 value.
+	 *
+	 * Switching CR3 does not depend on kernel GSBASE so it can
+	 * be done before switching to the kernel GSBASE. This is
+	 * required for FSGSBASE because the kernel GSBASE has to
+	 * be retrieved from a kernel internal table.
+	 */
+	state->cr3 = save_and_switch_to_kernel_cr3();
+
+	/*
+	 * Handling GSBASE depends on the availability of FSGSBASE.
+	 *
+	 * Without FSGSBASE the kernel enforces that negative GSBASE
+	 * values indicate kernel GSBASE. With FSGSBASE no assumptions
+	 * can be made about the GSBASE value when entering from user
+	 * space.
+	 */
+	if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+		/*
+		 * Read the current GSBASE and store it in the kernel
+		 * entry state unconditionally, retrieve and set the
+		 * current CPUs kernel GSBASE. The stored value has to
+		 * be restored at exit unconditionally.
+		 *
+		 * The unconditional write to GS base below ensures that
+		 * no subsequent loads based on a mispredicted GS base
+		 * can happen, therefore no LFENCE is needed here.
+		 */
+		state->gsbase = rdgsbase();
+
+		/*
+		 * Fetch the per-CPU GSBASE value for this processor. We
+		 * normally use %gs for accessing per-CPU data, but we
+		 * are setting up %gs here and obviously can not use %gs
+		 * itself to access per-CPU data.
+		 */
+		if (IS_ENABLED(CONFIG_SMP)) {
+			/*
+			 * Load CPU from the GDT. Do not use RDPID,
+			 * because KVM loads guest's TSC_AUX on vm-entry
+			 * and may not restore the host's value until
+			 * the CPU returns to userspace. Thus the kernel
+			 * would consume a guest's TSC_AUX if an NMI
+			 * arrives while running KVM's run loop.
+			 */
+			asm_inline volatile ("lsl %[seg],%[p]"
+					     : [p] "=r" (cpu)
+					     : [seg] "r" (__CPUNODE_SEG));
+
+			cpu &= VDSO_CPUNODE_MASK;
+			gsbase = __per_cpu_offset[cpu];
+		} else {
+			gsbase = *pcpu_unit_offsets;
+		}
+
+		wrgsbase(gsbase);
+
+	} else {
+		/*
+		 * The kernel-enforced convention is a negative GSBASE
+		 * indicates a kernel value. No SWAPGS needed on entry
+		 * and exit.
+		 */
+		rdmsrl(MSR_GS_BASE, gsbase);
+		if (((long)gsbase) >= 0) {
+			swapgs();
+			/*
+			 * Do an lfence to prevent GS speculation.
+			 */
+			alternative("", "lfence",
+				    X86_FEATURE_FENCE_SWAPGS_KERNEL);
+			state->swapgs = true;
+		} else {
+			state->swapgs = false;
+		}
+	}
+}
+
+/*
+ * "Paranoid" exit path from exception stack. Restore the CR3 and
+ * GS registers are as they were on entry. This is invoked only
+ * on return from IST interrupts that came from kernel space.
+ *
+ * We may be returning to very strange contexts (e.g. very early
+ * in syscall entry), so checking for preemption here would
+ * be complicated.  Fortunately, there's no good reason to try
+ * to handle preemption here.
+ *
+ * The kernel_entry_state contains the GSBASE related information
+ * depending on the availability of the FSGSBASE instructions:
+ *
+ * FSGSBASE	kernel_entry_state
+ *     N        swapgs=true  -> SWAPGS on exit
+ *              swapgs=false -> no SWAPGS on exit
+ *
+ *     Y        gsbase=GSBASE value at entry, must be restored
+ *              unconditionally
+ *
+ * Note that per-cpu variables are accessed using the GS register,
+ * so paranoid exit code cannot access per-cpu variables after
+ * kernel_paranoid_exit() has been called.
+ */
+noinstr void kernel_paranoid_exit(struct kernel_entry_state *state)
+{
+	/*
+	 * The order of operations is important. RESTORE_CR3 requires
+	 * kernel GSBASE.
+	 *
+	 * NB to anyone to try to optimize this code: this code does
+	 * not execute at all for exceptions from user mode. Those
+	 * exceptions go through error_exit instead.
+	 */
+	restore_cr3(state->cr3);
+
+	/* With FSGSBASE enabled, unconditionally restore GSBASE */
+	if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+		wrgsbase(state->gsbase);
+		return;
+	}
+
+	/* On non-FSGSBASE systems, conditionally do SWAPGS */
+	if (state->swapgs) {
+		/* We are returning to a context with user GSBASE */
+		swapgs_unsafe_stack();
+	}
+}
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index b05b212f5ebc..b75e9230c990 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -163,6 +163,16 @@ static inline void switch_to_kernel_cr3(void) {}
 static inline void switch_to_user_cr3(void) {}
 
 #endif /* CONFIG_PAGE_TABLE_ISOLATION */
+
+struct kernel_entry_state {
+	unsigned long cr3;
+	unsigned long gsbase;
+	bool swapgs;
+};
+
+void kernel_paranoid_entry(struct kernel_entry_state *state);
+void kernel_paranoid_exit(struct kernel_entry_state *state);
+
 #endif /* MODULE */
 
 #endif
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC][PATCH 10/24] x86/pti: Introduce per-task PTI trampoline stack
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
                   ` (8 preceding siblings ...)
  2020-11-09 11:23 ` [RFC][PATCH 09/24] x86/entry: Add C version of paranoid_entry/exit Alexandre Chartre
@ 2020-11-09 11:23 ` Alexandre Chartre
  2020-11-09 11:23 ` [RFC][PATCH 11/24] x86/pti: Function to clone page-table entries from a specified mm Alexandre Chartre
                   ` (14 subsequent siblings)
  24 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:23 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com

Double the size of the kernel stack when using PTI. The entire stack
is mapped into the kernel address space, and the top half of the stack
(the PTI stack) is also mapped into the user address space.

The PTI stack will be used as a per-task trampoline stack instead of
the current per-cpu trampoline stack. This will allow running more
code on the trampoline stack, in particular code that schedules the
task out.
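
As a concrete sizing example (assuming 4 KB pages and KASAN disabled),
the new constants work out as follows:

/*
 * KERNEL_STACK_ORDER = 2           ->  KERNEL_STACK_SIZE = 16 KB
 * PTI_STACK_ORDER    = 1 (PTI on)  ->  THREAD_SIZE       = 32 KB
 *
 * i.e. the per-task stack allocation doubles; only the PTI half,
 * between task_top_of_kernel_stack(task) and task_top_of_stack(task),
 * is also mapped into the user page-table.
 */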

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/include/asm/page_64_types.h | 36 +++++++++++++++++++++++++++-
 arch/x86/include/asm/processor.h     |  3 +++
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 3f49dac03617..733accc20fdb 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -12,7 +12,41 @@
 #define KASAN_STACK_ORDER 0
 #endif
 
-#define THREAD_SIZE_ORDER	(2 + KASAN_STACK_ORDER)
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+/*
+ * PTI doubles the size of the stack. The entire stack is mapped into
+ * the kernel address space. However, only the top half of the stack is
+ * mapped into the user address space.
+ *
+ * On syscall or interrupt, user mode enters the kernel with the user
+ * page-table, and the stack pointer is switched to the top of the
+ * stack (which is mapped in the user address space and in the kernel).
+ * The syscall/interrupt handler will then later decide when to switch
+ * to the kernel address space, and to switch to the top of the kernel
+ * stack which is only mapped in the kernel.
+ *
+ *   +-------------+
+ *   |             | ^                       ^
+ *   | kernel-only | | KERNEL_STACK_SIZE     |
+ *   |    stack    | |                       |
+ *   |             | V                       |
+ *   +-------------+ <- top of kernel stack  | THREAD_SIZE
+ *   |             | ^                       |
+ *   | kernel and  | | KERNEL_STACK_SIZE     |
+ *   | PTI stack   | |                       |
+ *   |             | V                       v
+ *   +-------------+ <- top of stack
+ */
+#define PTI_STACK_ORDER 1
+#else
+#define PTI_STACK_ORDER 0
+#endif
+
+#define KERNEL_STACK_ORDER 2
+#define KERNEL_STACK_SIZE (PAGE_SIZE << KERNEL_STACK_ORDER)
+
+#define THREAD_SIZE_ORDER	\
+	(KERNEL_STACK_ORDER + PTI_STACK_ORDER + KASAN_STACK_ORDER)
 #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
 
 #define EXCEPTION_STACK_ORDER (0 + KASAN_STACK_ORDER)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 82a08b585818..47b1b806535b 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -769,6 +769,9 @@ static inline void spin_lock_prefetch(const void *x)
 
 #define task_top_of_stack(task) ((unsigned long)(task_pt_regs(task) + 1))
 
+#define task_top_of_kernel_stack(task) \
+	((void *)(((unsigned long)task_stack_page(task)) + KERNEL_STACK_SIZE))
+
 #define task_pt_regs(task) \
 ({									\
 	unsigned long __ptr = (unsigned long)task_stack_page(task);	\
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC][PATCH 11/24] x86/pti: Function to clone page-table entries from a specified mm
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
                   ` (9 preceding siblings ...)
  2020-11-09 11:23 ` [RFC][PATCH 10/24] x86/pti: Introduce per-task PTI trampoline stack Alexandre Chartre
@ 2020-11-09 11:23 ` Alexandre Chartre
  2020-11-09 11:23 ` [RFC][PATCH 12/24] x86/pti: Function to map per-cpu page-table entry Alexandre Chartre
                   ` (13 subsequent siblings)
  24 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:23 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com

PTI has a function to clone page-table entries but only from the
init_mm page-table. Provide a new function to clone page-table
entries from a specified mm page-table.
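
A hedged usage sketch of the generalized interface (pti_clone_pgtable() and
PTI_CLONE_PTE come from this patch; the caller and the address range are
made up for illustration):

/* Sketch only: clone one PTE-level range of "mm" into the PTI user
 * page-table of that mm, instead of always cloning from init_mm. */
static void example_clone_range(struct mm_struct *mm,
				unsigned long start, unsigned long end)
{
	pti_clone_pgtable(mm, start, end, PTI_CLONE_PTE);
}

Existing callers keep their behaviour through the pti_clone_init_pgtable()
wrapper added in the same hunk.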

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/include/asm/pti.h | 10 ++++++++++
 arch/x86/mm/pti.c          | 32 ++++++++++++++++----------------
 2 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/pti.h b/arch/x86/include/asm/pti.h
index 07375b476c4f..5484e69ff8d3 100644
--- a/arch/x86/include/asm/pti.h
+++ b/arch/x86/include/asm/pti.h
@@ -4,9 +4,19 @@
 #ifndef __ASSEMBLY__
 
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
+
+enum pti_clone_level {
+	PTI_CLONE_PMD,
+	PTI_CLONE_PTE,
+};
+
+struct mm_struct;
+
 extern void pti_init(void);
 extern void pti_check_boottime_disable(void);
 extern void pti_finalize(void);
+extern void pti_clone_pgtable(struct mm_struct *mm, unsigned long start,
+			      unsigned long end, enum pti_clone_level level);
 #else
 static inline void pti_check_boottime_disable(void) { }
 #endif
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 1aab92930569..ebc8cd2f1cd8 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -294,14 +294,8 @@ static void __init pti_setup_vsyscall(void)
 static void __init pti_setup_vsyscall(void) { }
 #endif
 
-enum pti_clone_level {
-	PTI_CLONE_PMD,
-	PTI_CLONE_PTE,
-};
-
-static void
-pti_clone_pgtable(unsigned long start, unsigned long end,
-		  enum pti_clone_level level)
+void pti_clone_pgtable(struct mm_struct *mm, unsigned long start,
+		       unsigned long end, enum pti_clone_level level)
 {
 	unsigned long addr;
 
@@ -320,7 +314,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
 		if (addr < start)
 			break;
 
-		pgd = pgd_offset_k(addr);
+		pgd = pgd_offset(mm, addr);
 		if (WARN_ON(pgd_none(*pgd)))
 			return;
 		p4d = p4d_offset(pgd, addr);
@@ -409,6 +403,12 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
 	}
 }
 
+static void pti_clone_init_pgtable(unsigned long start, unsigned long end,
+				   enum pti_clone_level level)
+{
+	pti_clone_pgtable(&init_mm, start, end, level);
+}
+
 #ifdef CONFIG_X86_64
 /*
  * Clone a single p4d (i.e. a top-level entry on 4-level systems and a
@@ -476,7 +476,7 @@ static void __init pti_clone_user_shared(void)
 	start = CPU_ENTRY_AREA_BASE;
 	end   = start + (PAGE_SIZE * CPU_ENTRY_AREA_PAGES);
 
-	pti_clone_pgtable(start, end, PTI_CLONE_PMD);
+	pti_clone_init_pgtable(start, end, PTI_CLONE_PMD);
 }
 #endif /* CONFIG_X86_64 */
 
@@ -495,9 +495,9 @@ static void __init pti_setup_espfix64(void)
  */
 static void pti_clone_entry_text(void)
 {
-	pti_clone_pgtable((unsigned long) __entry_text_start,
-			  (unsigned long) __entry_text_end,
-			  PTI_CLONE_PMD);
+	pti_clone_init_pgtable((unsigned long) __entry_text_start,
+			       (unsigned long) __entry_text_end,
+			       PTI_CLONE_PMD);
 }
 
 /*
@@ -572,11 +572,11 @@ static void pti_clone_kernel_text(void)
 	 * pti_set_kernel_image_nonglobal() did to clear the
 	 * global bit.
 	 */
-	pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE);
+	pti_clone_init_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE);
 
 	/*
-	 * pti_clone_pgtable() will set the global bit in any PMDs
-	 * that it clones, but we also need to get any PTEs in
+	 * pti_clone_init_pgtable() will set the global bit in any
+	 * PMDs that it clones, but we also need to get any PTEs in
 	 * the last level for areas that are not huge-page-aligned.
 	 */
 
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC][PATCH 12/24] x86/pti: Function to map per-cpu page-table entry
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
                   ` (10 preceding siblings ...)
  2020-11-09 11:23 ` [RFC][PATCH 11/24] x86/pti: Function to clone page-table entries from a specified mm Alexandre Chartre
@ 2020-11-09 11:23 ` Alexandre Chartre
  2020-11-09 11:23 ` [RFC][PATCH 13/24] x86/pti: Extend PTI user mappings Alexandre Chartre
                   ` (12 subsequent siblings)
  24 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:23 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com

Wrap the code used by PTI to map a per-cpu page-table entry into
a new function so that this code can be re-used to map other
per-cpu entries.
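
A hedged sketch of the intended reuse (pti_clone_percpu_page() is added
below; cpu_tss_rw is its existing user, and a later patch in this series
maps the stack canary area the same way):

/* Sketch only: map one per-cpu variable into the PTI user page-table
 * for every possible CPU. */
static void __init example_map_percpu_var(void)
{
	unsigned int cpu;

	for_each_possible_cpu(cpu)
		pti_clone_percpu_page(&per_cpu(cpu_tss_rw, cpu));
}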

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/mm/pti.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index ebc8cd2f1cd8..71ca245d7b38 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -428,6 +428,21 @@ static void __init pti_clone_p4d(unsigned long addr)
 	*user_p4d = *kernel_p4d;
 }
 
+/*
+ * Clone a single percpu page.
+ */
+static void __init pti_clone_percpu_page(void *addr)
+{
+	phys_addr_t pa = per_cpu_ptr_to_phys(addr);
+	pte_t *target_pte;
+
+	target_pte = pti_user_pagetable_walk_pte((unsigned long)addr);
+	if (WARN_ON(!target_pte))
+		return;
+
+	*target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL);
+}
+
 /*
  * Clone the CPU_ENTRY_AREA and associated data into the user space visible
  * page table.
@@ -448,16 +463,8 @@ static void __init pti_clone_user_shared(void)
 		 * This is done for all possible CPUs during boot to ensure
 		 * that it's propagated to all mms.
 		 */
+		pti_clone_percpu_page(&per_cpu(cpu_tss_rw, cpu));
 
-		unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu);
-		phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
-		pte_t *target_pte;
-
-		target_pte = pti_user_pagetable_walk_pte(va);
-		if (WARN_ON(!target_pte))
-			return;
-
-		*target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL);
 	}
 }
 
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC][PATCH 13/24] x86/pti: Extend PTI user mappings
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
                   ` (11 preceding siblings ...)
  2020-11-09 11:23 ` [RFC][PATCH 12/24] x86/pti: Function to map per-cpu page-table entry Alexandre Chartre
@ 2020-11-09 11:23 ` Alexandre Chartre
  2020-11-09 17:28   ` Andy Lutomirski
  2020-11-09 11:23 ` [RFC][PATCH 14/24] x86/pti: Use PTI stack instead of trampoline stack Alexandre Chartre
                   ` (11 subsequent siblings)
  24 siblings, 1 reply; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:23 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com

Extend PTI user mappings so that more kernel entry code can be executed
with the user page-table. To do so, we need to map syscall and interrupt
entry code, per-cpu offsets (__per_cpu_offset, which is used in some of
the entry code), the stack canary, and the PTI stack (which is defined
per task).
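
To make the stack layout concrete, here is a small user-space sketch of the
range that the new mm_map_task() clones into the user page-table (the base
address is made up; 4 KB pages and the stack orders from patch 10 are
assumed):

#include <stdio.h>

#define PAGE_SIZE		4096UL
#define KERNEL_STACK_ORDER	2
#define KERNEL_STACK_SIZE	(PAGE_SIZE << KERNEL_STACK_ORDER)

int main(void)
{
	/* made-up task stack base, purely for illustration */
	unsigned long stack_page = 0xffffc90000100000UL;
	unsigned long pti_stack  = stack_page + KERNEL_STACK_SIZE;

	printf("kernel-only half      : %#lx - %#lx\n",
	       stack_page, pti_stack);
	printf("PTI half (user-mapped): %#lx - %#lx\n",
	       pti_stack, pti_stack + KERNEL_STACK_SIZE);
	return 0;
}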

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/entry/entry_64.S |  2 --
 arch/x86/mm/pti.c         | 14 ++++++++++++++
 kernel/fork.c             | 22 ++++++++++++++++++++++
 3 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 6e0b5b010e0b..458af12ed9a1 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -274,7 +274,6 @@ SYM_FUNC_END(__switch_to_asm)
  * rbx: kernel thread func (NULL for user thread)
  * r12: kernel thread arg
  */
-.pushsection .text, "ax"
 SYM_CODE_START(ret_from_fork)
 	UNWIND_HINT_REGS
 	movq	%rsp, %rdi			/* pt_regs */
@@ -284,7 +283,6 @@ SYM_CODE_START(ret_from_fork)
 	call	return_from_fork		/* returns with IRQs disabled */
 	jmp	swapgs_restore_regs_and_return_to_usermode
 SYM_CODE_END(ret_from_fork)
-.popsection
 
 .macro DEBUG_ENTRY_ASSERT_IRQS_OFF
 #ifdef CONFIG_DEBUG_ENTRY
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 71ca245d7b38..f4f3d9ae4449 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -465,6 +465,11 @@ static void __init pti_clone_user_shared(void)
 		 */
 		pti_clone_percpu_page(&per_cpu(cpu_tss_rw, cpu));
 
+		/*
+		 * Map fixed_percpu_data to get the stack canary.
+		 */
+		if (IS_ENABLED(CONFIG_STACKPROTECTOR))
+			pti_clone_percpu_page(&per_cpu(fixed_percpu_data, cpu));
 	}
 }
 
@@ -505,6 +510,15 @@ static void pti_clone_entry_text(void)
 	pti_clone_init_pgtable((unsigned long) __entry_text_start,
 			       (unsigned long) __entry_text_end,
 			       PTI_CLONE_PMD);
+
+	/*
+	 * Syscall and interrupt entry code (which is in the noinstr
+	 * section) will be entered with the user page-table, so that
+	 * code has to be mapped in.
+	 */
+	pti_clone_init_pgtable((unsigned long) __noinstr_text_start,
+			       (unsigned long) __noinstr_text_end,
+			       PTI_CLONE_PMD);
 }
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 6d266388d380..31cd77dbdba3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -999,6 +999,25 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
 #endif
 }
 
+static void mm_map_task(struct mm_struct *mm, struct task_struct *tsk)
+{
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+	unsigned long addr;
+
+	if (!tsk || !static_cpu_has(X86_FEATURE_PTI))
+		return;
+
+	/*
+	 * Map the task stack after the kernel stack into the user
+	 * address space, so that this stack can be used when entering
+	 * syscall or interrupt from user mode.
+	 */
+	BUG_ON(!task_stack_page(tsk));
+	addr = (unsigned long)task_top_of_kernel_stack(tsk);
+	pti_clone_pgtable(mm, addr, addr + KERNEL_STACK_SIZE, PTI_CLONE_PTE);
+#endif
+}
+
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	struct user_namespace *user_ns)
 {
@@ -1043,6 +1062,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (init_new_context(p, mm))
 		goto fail_nocontext;
 
+	mm_map_task(mm, p);
+
 	mm->user_ns = get_user_ns(user_ns);
 	return mm;
 
@@ -1404,6 +1425,7 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 	vmacache_flush(tsk);
 
 	if (clone_flags & CLONE_VM) {
+		mm_map_task(oldmm, tsk);
 		mmget(oldmm);
 		mm = oldmm;
 		goto good_mm;
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC][PATCH 14/24] x86/pti: Use PTI stack instead of trampoline stack
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
                   ` (12 preceding siblings ...)
  2020-11-09 11:23 ` [RFC][PATCH 13/24] x86/pti: Extend PTI user mappings Alexandre Chartre
@ 2020-11-09 11:23 ` Alexandre Chartre
  2020-11-09 11:23 ` [RFC][PATCH 15/24] x86/pti: Execute syscall functions on the kernel stack Alexandre Chartre
                   ` (10 subsequent siblings)
  24 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:23 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com

When entering the kernel from userland, use the per-task PTI stack
instead of the per-cpu trampoline stack. Like the trampoline stack,
the PTI stack is mapped both in the kernel and in the user page-table.
Using a per-task stack which is mapped into the kernel and the user
page-table instead of a per-cpu stack will allow executing more code
before switching to the kernel stack and to the kernel page-table.

Additional changes will be made later to switch to the kernel stack
(which is only mapped in the kernel page-table).

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/entry/entry_64.S        | 42 +++++++++-----------------------
 arch/x86/include/asm/pti.h       |  8 ++++++
 arch/x86/include/asm/switch_to.h |  7 +++++-
 3 files changed, 26 insertions(+), 31 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 458af12ed9a1..29beab46bedd 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -194,19 +194,9 @@ syscall_return_via_sysret:
 	/* rcx and r11 are already restored (see code above) */
 	POP_REGS pop_rdi=0 skip_r11rcx=1
 
-	/*
-	 * Now all regs are restored except RSP and RDI.
-	 * Save old stack pointer and switch to trampoline stack.
-	 */
-	movq	%rsp, %rdi
-	movq	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
-	UNWIND_HINT_EMPTY
-
-	pushq	RSP-RDI(%rdi)	/* RSP */
-	pushq	(%rdi)		/* RDI */
-
 	/*
 	 * We are on the trampoline stack.  All regs except RDI are live.
+	 * We are on the trampoline stack.  All regs except RSP are live.
 	 * We can do future final exit work right here.
 	 */
 	STACKLEAK_ERASE_NOCLOBBER
@@ -214,7 +204,7 @@ syscall_return_via_sysret:
 	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
 
 	popq	%rdi
-	popq	%rsp
+	movq	RSP-ORIG_RAX(%rsp), %rsp
 	USERGS_SYSRET64
 SYM_CODE_END(entry_SYSCALL_64)
 
@@ -606,24 +596,6 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
 #endif
 	POP_REGS pop_rdi=0
 
-	/*
-	 * The stack is now user RDI, orig_ax, RIP, CS, EFLAGS, RSP, SS.
-	 * Save old stack pointer and switch to trampoline stack.
-	 */
-	movq	%rsp, %rdi
-	movq	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
-	UNWIND_HINT_EMPTY
-
-	/* Copy the IRET frame to the trampoline stack. */
-	pushq	6*8(%rdi)	/* SS */
-	pushq	5*8(%rdi)	/* RSP */
-	pushq	4*8(%rdi)	/* EFLAGS */
-	pushq	3*8(%rdi)	/* CS */
-	pushq	2*8(%rdi)	/* RIP */
-
-	/* Push user RDI on the trampoline stack. */
-	pushq	(%rdi)
-
 	/*
 	 * We are on the trampoline stack.  All regs except RDI are live.
 	 * We can do future final exit work right here.
@@ -634,6 +606,7 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
 
 	/* Restore RDI. */
 	popq	%rdi
+	addq	$8, %rsp	/* skip regs->orig_ax */
 	SWAPGS
 	INTERRUPT_RETURN
 
@@ -1062,6 +1035,15 @@ SYM_CODE_START_LOCAL(error_entry)
 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 
 .Lerror_entry_from_usermode_after_swapgs:
+	/*
+	 * We are on the trampoline stack. With PTI, the trampoline
+	 * stack is a per-thread stack so we are all set and we can
+	 * return.
+	 *
+	 * Without PTI, the trampoline stack is a per-cpu stack and
+	 * we need to switch to the normal thread stack.
+	 */
+	ALTERNATIVE "", "ret", X86_FEATURE_PTI
 	/* Put us onto the real thread stack. */
 	popq	%r12				/* save return addr in %12 */
 	movq	%rsp, %rdi			/* arg0 = pt_regs pointer */
diff --git a/arch/x86/include/asm/pti.h b/arch/x86/include/asm/pti.h
index 5484e69ff8d3..ed211fcc3a50 100644
--- a/arch/x86/include/asm/pti.h
+++ b/arch/x86/include/asm/pti.h
@@ -17,8 +17,16 @@ extern void pti_check_boottime_disable(void);
 extern void pti_finalize(void);
 extern void pti_clone_pgtable(struct mm_struct *mm, unsigned long start,
 			      unsigned long end, enum pti_clone_level level);
+static inline bool pti_enabled(void)
+{
+	return static_cpu_has(X86_FEATURE_PTI);
+}
 #else
 static inline void pti_check_boottime_disable(void) { }
+static inline bool pti_enabled(void)
+{
+	return false;
+}
 #endif
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 9f69cc497f4b..457458228462 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_SWITCH_TO_H
 
 #include <linux/sched/task_stack.h>
+#include <asm/pti.h>
 
 struct task_struct; /* one of the stranger aspects of C forward declarations */
 
@@ -76,8 +77,12 @@ static inline void update_task_stack(struct task_struct *task)
 	 * doesn't work on x86-32 because sp1 and
 	 * cpu_current_top_of_stack have different values (because of
 	 * the non-zero stack-padding on 32bit).
+	 *
+	 * If PTI is enabled, sp0 points to the PTI stack (mapped in
+	 * the kernel and user page-table) which is used when entering
+	 * the kernel.
 	 */
-	if (static_cpu_has(X86_FEATURE_XENPV))
+	if (static_cpu_has(X86_FEATURE_XENPV) || pti_enabled())
 		load_sp0(task_top_of_stack(task));
 #endif
 }
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC][PATCH 15/24] x86/pti: Execute syscall functions on the kernel stack
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
                   ` (13 preceding siblings ...)
  2020-11-09 11:23 ` [RFC][PATCH 14/24] x86/pti: Use PTI stack instead of trampoline stack Alexandre Chartre
@ 2020-11-09 11:23 ` Alexandre Chartre
  2020-11-09 11:23 ` [RFC][PATCH 16/24] x86/pti: Execute IDT handlers " Alexandre Chartre
                   ` (9 subsequent siblings)
  24 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:23 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com

During a syscall, the kernel is entered and it switches the stack
to the PTI stack, which is mapped both in the kernel and in the
user page-table. When executing the syscall function, switch to
the kernel stack (which is mapped only in the kernel page-table)
so that no kernel data can leak to userland through the stack.
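
The stack switch itself goes through the asm_call_syscall_on_stack() thunk
added below. As a rough user-space analogy (not kernel code, and with no
CR3 or paging involved), running a function on a separately allocated stack
looks like this:

#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, alt_ctx;

static void on_alt_stack(void)
{
	int marker;

	/* &marker lives inside the alternate stack buffer */
	printf("handler running near %p\n", (void *)&marker);
}

int main(void)
{
	static char stack[64 * 1024];

	getcontext(&alt_ctx);
	alt_ctx.uc_stack.ss_sp = stack;
	alt_ctx.uc_stack.ss_size = sizeof(stack);
	alt_ctx.uc_link = &main_ctx;		/* come back when done */
	makecontext(&alt_ctx, on_alt_stack, 0);

	swapcontext(&main_ctx, &alt_ctx);	/* run on the other stack */
	printf("back on the original stack\n");
	return 0;
}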

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/entry/common.c          | 11 ++++++++++-
 arch/x86/entry/entry_64.S        |  1 +
 arch/x86/include/asm/irq_stack.h |  3 +++
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 54d0931801e1..ead6a4c72e6a 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -56,10 +56,19 @@ __visible noinstr void return_from_fork(struct pt_regs *regs,
 static __always_inline void run_syscall(sys_call_ptr_t sysfunc,
 					struct pt_regs *regs)
 {
+	unsigned long stack;
+
 	if (!sysfunc)
 		return;
 
-	regs->ax = sysfunc(regs);
+	if (!pti_enabled()) {
+		regs->ax = sysfunc(regs);
+		return;
+	}
+
+	stack = (unsigned long)task_top_of_kernel_stack(current);
+	regs->ax = asm_call_syscall_on_stack((void *)(stack - 8),
+					     sysfunc, regs);
 }
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 29beab46bedd..6b88a0eb8975 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -771,6 +771,7 @@ SYM_FUNC_START(asm_call_on_stack_2)
 SYM_FUNC_START(asm_call_on_stack_3)
 SYM_INNER_LABEL(asm_call_sysvec_on_stack, SYM_L_GLOBAL)
 SYM_INNER_LABEL(asm_call_irq_on_stack, SYM_L_GLOBAL)
+SYM_INNER_LABEL(asm_call_syscall_on_stack, SYM_L_GLOBAL)
 	/*
 	 * Save the frame pointer unconditionally. This allows the ORC
 	 * unwinder to handle the stack switch.
diff --git a/arch/x86/include/asm/irq_stack.h b/arch/x86/include/asm/irq_stack.h
index 359427216336..108d9da7c01c 100644
--- a/arch/x86/include/asm/irq_stack.h
+++ b/arch/x86/include/asm/irq_stack.h
@@ -5,6 +5,7 @@
 #include <linux/ptrace.h>
 
 #include <asm/processor.h>
+#include <asm/syscall.h>
 
 #ifdef CONFIG_X86_64
 static __always_inline bool irqstack_active(void)
@@ -25,6 +26,8 @@ void asm_call_sysvec_on_stack(void *sp, void (*func)(struct pt_regs *regs),
 			      struct pt_regs *regs);
 void asm_call_irq_on_stack(void *sp, void (*func)(struct irq_desc *desc),
 			   struct irq_desc *desc);
+long asm_call_syscall_on_stack(void *sp, sys_call_ptr_t func,
+			       struct pt_regs *regs);
 
 static __always_inline void __run_on_irqstack(void (*func)(void))
 {
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC][PATCH 16/24] x86/pti: Execute IDT handlers on the kernel stack
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
                   ` (14 preceding siblings ...)
  2020-11-09 11:23 ` [RFC][PATCH 15/24] x86/pti: Execute syscall functions on the kernel stack Alexandre Chartre
@ 2020-11-09 11:23 ` Alexandre Chartre
  2020-11-09 11:23 ` [RFC][PATCH 17/24] x86/pti: Execute IDT handlers with error code " Alexandre Chartre
                   ` (8 subsequent siblings)
  24 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:23 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com

After an interrupt/exception in userland, the kernel is entered
and it switches the stack to the PTI stack, which is mapped both in
the kernel and in the user page-table. When executing the interrupt
handler, switch to the kernel stack (which is mapped only in the
kernel page-table) so that no kernel data can leak to userland
through the stack.

For now, only change IDT handlers which take no argument other
than the pt_regs pointer.
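
For illustration, a stand-alone model of the CALL_ON_STACK_1() dispatch that
this patch adds: call through a helper when an alternate stack is given,
otherwise call the handler directly. The helper here is a plain C call; the
real asm_call_on_stack_1() switches RSP to the given stack first.

#include <stdio.h>

static void call_on_stack_1(void *sp, void (*func)(void *), void *arg1)
{
	(void)sp;		/* no real stack switch in this model */
	func(arg1);
}

#define CALL_ON_STACK_1(stack, func, arg1)				\
	((stack) ?							\
	 call_on_stack_1((stack), (func), (void *)(arg1)) :		\
	 (func)(arg1))

static void handler(void *regs)
{
	printf("handler(%p)\n", regs);
}

int main(void)
{
	char alt_stack[256];

	CALL_ON_STACK_1(NULL, handler, (void *)0x1UL);	/* direct call */
	CALL_ON_STACK_1(alt_stack + sizeof(alt_stack),
			handler, (void *)0x2UL);	/* via the helper */
	return 0;
}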

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/include/asm/idtentry.h | 43 +++++++++++++++++++++++++++++++--
 arch/x86/kernel/cpu/mce/core.c  |  2 +-
 arch/x86/kernel/traps.c         |  4 +--
 3 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 4b4aca2b1420..3595a31947b3 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -10,10 +10,49 @@
 #include <linux/hardirq.h>
 
 #include <asm/irq_stack.h>
+#include <asm/pti.h>
 
 bool idtentry_enter_nmi(struct pt_regs *regs);
 void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state);
 
+/*
+ * The CALL_ON_STACK_* macros call the specified function either directly
+ * if no stack is provided, or on the specified stack.
+ */
+#define CALL_ON_STACK_1(stack, func, arg1)				\
+	((stack) ?							\
+	 asm_call_on_stack_1(stack,					\
+		(void (*)(void))(func), (void *)(arg1)) :		\
+	 func(arg1))
+
+/*
+ * Functions to return the top of the kernel stack if we are using the
+ * user page-table (and thus not running with the kernel stack). If we
+ * are using the kernel page-table (and so already using the kernel
+ * stack) then it returns NULL.
+ */
+static __always_inline void *pti_kernel_stack(struct pt_regs *regs)
+{
+	unsigned long stack;
+
+	if (pti_enabled() && user_mode(regs)) {
+		stack = (unsigned long)task_top_of_kernel_stack(current);
+		return (void *)(stack - 8);
+	} else {
+		return NULL;
+	}
+}
+
+/*
+ * Wrappers to run an IDT handler on the kernel stack if we are not
+ * already using this stack.
+ */
+static __always_inline
+void run_idt(void (*func)(struct pt_regs *), struct pt_regs *regs)
+{
+	CALL_ON_STACK_1(pti_kernel_stack(regs), func, regs);
+}
+
 /**
  * DECLARE_IDTENTRY - Declare functions for simple IDT entry points
  *		      No error code pushed by hardware
@@ -55,7 +94,7 @@ __visible noinstr void func(struct pt_regs *regs)			\
 	irqentry_state_t state = irqentry_enter(regs);			\
 									\
 	instrumentation_begin();					\
-	__##func (regs);						\
+	run_idt(__##func, regs);					\
 	instrumentation_end();						\
 	irqentry_exit(regs, state);					\
 }									\
@@ -271,7 +310,7 @@ __visible noinstr void func(struct pt_regs *regs)			\
 	instrumentation_begin();					\
 	__irq_enter_raw();						\
 	kvm_set_cpu_l1tf_flush_l1d();					\
-	__##func (regs);						\
+	run_idt(__##func, regs);					\
 	__irq_exit_raw();						\
 	instrumentation_end();						\
 	irqentry_exit(regs, state);					\
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 4102b866e7c0..9407c3cd9355 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2035,7 +2035,7 @@ DEFINE_IDTENTRY_MCE_USER(exc_machine_check)
 	unsigned long dr7;
 
 	dr7 = local_db_save();
-	exc_machine_check_user(regs);
+	run_idt(exc_machine_check_user, regs);
 	local_db_restore(dr7);
 }
 #else
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 09b22a611d99..5161385b3670 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -257,7 +257,7 @@ DEFINE_IDTENTRY_RAW(exc_invalid_op)
 
 	state = irqentry_enter(regs);
 	instrumentation_begin();
-	handle_invalid_op(regs);
+	run_idt(handle_invalid_op, regs);
 	instrumentation_end();
 	irqentry_exit(regs, state);
 }
@@ -647,7 +647,7 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 	if (user_mode(regs)) {
 		irqentry_enter_from_user_mode(regs);
 		instrumentation_begin();
-		do_int3_user(regs);
+		run_idt(do_int3_user, regs);
 		instrumentation_end();
 		irqentry_exit_to_user_mode(regs);
 	} else {
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC][PATCH 17/24] x86/pti: Execute IDT handlers with error code on the kernel stack
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
                   ` (15 preceding siblings ...)
  2020-11-09 11:23 ` [RFC][PATCH 16/24] x86/pti: Execute IDT handlers " Alexandre Chartre
@ 2020-11-09 11:23 ` Alexandre Chartre
  2020-11-09 11:23 ` [RFC][PATCH 18/24] x86/pti: Execute system vector handlers " Alexandre Chartre
                   ` (7 subsequent siblings)
  24 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:23 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com

After an interrupt/exception in userland, the kernel is entered
and it switches the stack to the PTI stack, which is mapped both in
the kernel and in the user page-table. When executing the interrupt
handler, switch to the kernel stack (which is mapped only in the
kernel page-table) so that no kernel data can leak to userland
through the stack.

This changes IDT handlers which take an error code.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/include/asm/idtentry.h | 18 ++++++++++++++++--
 arch/x86/kernel/traps.c         |  2 +-
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 3595a31947b3..a82e31b45442 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -25,6 +25,12 @@ void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state);
 		(void (*)(void))(func), (void *)(arg1)) :		\
 	 func(arg1))
 
+#define CALL_ON_STACK_2(stack, func, arg1, arg2)			\
+	((stack) ?							\
+	 asm_call_on_stack_2(stack,					\
+		(void (*)(void))(func), (void *)(arg1), (void *)(arg2)) : \
+	 func(arg1, arg2))
+
 /*
  * Functions to return the top of the kernel stack if we are using the
  * user page-table (and thus not running with the kernel stack). If we
@@ -53,6 +59,13 @@ void run_idt(void (*func)(struct pt_regs *), struct pt_regs *regs)
 	CALL_ON_STACK_1(pti_kernel_stack(regs), func, regs);
 }
 
+static __always_inline
+void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long),
+		     struct pt_regs *regs, unsigned long error_code)
+{
+	CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code);
+}
+
 /**
  * DECLARE_IDTENTRY - Declare functions for simple IDT entry points
  *		      No error code pushed by hardware
@@ -141,7 +154,7 @@ __visible noinstr void func(struct pt_regs *regs,			\
 	irqentry_state_t state = irqentry_enter(regs);			\
 									\
 	instrumentation_begin();					\
-	__##func (regs, error_code);					\
+	run_idt_errcode(__##func, regs, error_code);			\
 	instrumentation_end();						\
 	irqentry_exit(regs, state);					\
 }									\
@@ -239,7 +252,8 @@ __visible noinstr void func(struct pt_regs *regs,			\
 	instrumentation_begin();					\
 	irq_enter_rcu();						\
 	kvm_set_cpu_l1tf_flush_l1d();					\
-	__##func (regs, (u8)error_code);				\
+	run_idt_errcode((void (*)(struct pt_regs *, unsigned long))__##func, \
+			regs, (u8)error_code);				\
 	irq_exit_rcu();							\
 	instrumentation_end();						\
 	irqentry_exit(regs, state);					\
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 5161385b3670..9a51aa016fb3 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -979,7 +979,7 @@ DEFINE_IDTENTRY_DEBUG(exc_debug)
 /* User entry, runs on regular task stack */
 DEFINE_IDTENTRY_DEBUG_USER(exc_debug)
 {
-	exc_debug_user(regs, debug_read_clear_dr6());
+	run_idt_errcode(exc_debug_user, regs, debug_read_clear_dr6());
 }
 #else
 /* 32 bit does not have separate entry points. */
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC][PATCH 18/24] x86/pti: Execute system vector handlers on the kernel stack
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
                   ` (16 preceding siblings ...)
  2020-11-09 11:23 ` [RFC][PATCH 17/24] x86/pti: Execute IDT handlers with error code " Alexandre Chartre
@ 2020-11-09 11:23 ` Alexandre Chartre
  2020-11-09 11:23 ` [RFC][PATCH 19/24] x86/pti: Execute page fault handler " Alexandre Chartre
                   ` (6 subsequent siblings)
  24 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:23 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com

After an interrupt/exception in userland, the kernel is entered
and it switches the stack to the PTI stack, which is mapped both in
the kernel and in the user page-table. When executing the interrupt
handler, switch to the kernel stack (which is mapped only in the
kernel page-table) so that no kernel data can leak to userland
through the stack.

This changes system vector handlers to execute on the kernel stack.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/include/asm/idtentry.h | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index a82e31b45442..0c5d9f027112 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -66,6 +66,17 @@ void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long),
 	CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code);
 }
 
+static __always_inline
+void run_sysvec(void (*func)(struct pt_regs *regs), struct pt_regs *regs)
+{
+	void *stack = pti_kernel_stack(regs);
+
+	if (stack)
+		asm_call_on_stack_1(stack, (void (*)(void))func, regs);
+	else
+		run_sysvec_on_irqstack_cond(func, regs);
+}
+
 /**
  * DECLARE_IDTENTRY - Declare functions for simple IDT entry points
  *		      No error code pushed by hardware
@@ -295,7 +306,7 @@ __visible noinstr void func(struct pt_regs *regs)			\
 	instrumentation_begin();					\
 	irq_enter_rcu();						\
 	kvm_set_cpu_l1tf_flush_l1d();					\
-	run_sysvec_on_irqstack_cond(__##func, regs);			\
+	run_sysvec(__##func, regs);					\
 	irq_exit_rcu();							\
 	instrumentation_end();						\
 	irqentry_exit(regs, state);					\
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC][PATCH 19/24] x86/pti: Execute page fault handler on the kernel stack
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
                   ` (17 preceding siblings ...)
  2020-11-09 11:23 ` [RFC][PATCH 18/24] x86/pti: Execute system vector handlers " Alexandre Chartre
@ 2020-11-09 11:23 ` Alexandre Chartre
  2020-11-09 11:23 ` [RFC][PATCH 20/24] x86/pti: Execute NMI " Alexandre Chartre
                   ` (5 subsequent siblings)
  24 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:23 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com

After a page fault from userland, the kernel is entered and it
switches the stack to the PTI stack, which is mapped both in the
kernel and in the user page-table. When executing the page fault
handler, switch to the kernel stack (which is mapped only in the
kernel page-table) so that no kernel data can leak to userland
through the stack.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/include/asm/idtentry.h | 17 +++++++++++++++++
 arch/x86/mm/fault.c             |  2 +-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 0c5d9f027112..a6725afaaec0 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -31,6 +31,13 @@ void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state);
 		(void (*)(void))(func), (void *)(arg1), (void *)(arg2)) : \
 	 func(arg1, arg2))
 
+#define CALL_ON_STACK_3(stack, func, arg1, arg2, arg3)			\
+	((stack) ?							\
+	 asm_call_on_stack_3(stack,					\
+		(void (*)(void))(func), (void *)(arg1), (void *)(arg2),	\
+					(void *)(arg3)) :		\
+	 func(arg1, arg2, arg3))
+
 /*
  * Functions to return the top of the kernel stack if we are using the
  * user page-table (and thus not running with the kernel stack). If we
@@ -66,6 +73,16 @@ void run_idt_errcode(void (*func)(struct pt_regs *, unsigned long),
 	CALL_ON_STACK_2(pti_kernel_stack(regs), func, regs, error_code);
 }
 
+static __always_inline
+void run_idt_pagefault(void (*func)(struct pt_regs *, unsigned long,
+				    unsigned long),
+		       struct pt_regs *regs, unsigned long error_code,
+		       unsigned long address)
+{
+	CALL_ON_STACK_3(pti_kernel_stack(regs),
+			func, regs, error_code, address);
+}
+
 static __always_inline
 void run_sysvec(void (*func)(struct pt_regs *regs), struct pt_regs *regs)
 {
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 82bf37a5c9ec..b9d03603d95d 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1482,7 +1482,7 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
 	state = irqentry_enter(regs);
 
 	instrumentation_begin();
-	handle_page_fault(regs, error_code, address);
+	run_idt_pagefault(handle_page_fault, regs, error_code, address);
 	instrumentation_end();
 
 	irqentry_exit(regs, state);
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC][PATCH 20/24] x86/pti: Execute NMI handler on the kernel stack
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
                   ` (18 preceding siblings ...)
  2020-11-09 11:23 ` [RFC][PATCH 19/24] x86/pti: Execute page fault handler " Alexandre Chartre
@ 2020-11-09 11:23 ` Alexandre Chartre
  2020-11-09 11:23 ` [RFC][PATCH 21/24] x86/entry: Disable stack-protector for IST entry C handlers Alexandre Chartre
                   ` (4 subsequent siblings)
  24 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:23 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com

After an NMI from userland, the kernel is entered and it switches
the stack to the PTI stack, which is mapped both in the kernel and
in the user page-table. When executing the NMI handler, switch to
the kernel stack (which is mapped only in the kernel page-table) so
that no kernel data can leak to userland through the stack.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/kernel/nmi.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 4bc77aaf1303..be0f654c3095 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -506,8 +506,18 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
 
 	inc_irq_stat(__nmi_count);
 
-	if (!ignore_nmis)
-		default_do_nmi(regs);
+	if (!ignore_nmis) {
+		if (user_mode(regs)) {
+			/*
+			 * If we come from userland then we are on the
+			 * trampoline stack, switch to the kernel stack
+			 * to execute the NMI handler.
+			 */
+			run_idt(default_do_nmi, regs);
+		} else {
+			default_do_nmi(regs);
+		}
+	}
 
 	idtentry_exit_nmi(regs, irq_state);
 
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC][PATCH 21/24] x86/entry: Disable stack-protector for IST entry C handlers
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
                   ` (19 preceding siblings ...)
  2020-11-09 11:23 ` [RFC][PATCH 20/24] x86/pti: Execute NMI " Alexandre Chartre
@ 2020-11-09 11:23 ` Alexandre Chartre
  2020-11-09 11:23 ` [RFC][PATCH 22/24] x86/entry: Defer paranoid entry/exit to C code Alexandre Chartre
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:23 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com

The stack-protector option adds a stack canary to functions vulnerable
to stack buffer overflow. The stack canary is defined through the GS
register. Add an attribute to disable the stack-protector option; it
will be used for C functions which can be called while the GS register
might not be properly configured yet.

The GS register is not properly configured for the kernel when we enter
the kernel from userspace. The assembly entry code sets the GS register
for the kernel using the swapgs instruction or the paranoid_entry function,
and so, currently, the GS register is correctly configured when the
assembly entry code subsequently transfers control to C code.

Deferring the CR3 register switch from assembly to C code will require
reimplementing paranoid_entry in C and hence also deferring the GS register
setup for IST entries to C code. To prepare for this change, disable the
stack-protector for the IST entry C handlers where the GS register setup
will eventually happen.
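
For reference, the attribute as added below, applied to a stand-alone
function (GCC-specific; the function and its body are purely illustrative):

/* Mirrors the no_stack_protector definition added in this patch. */
#define no_stack_protector	\
	__attribute__ ((optimize("-O2,-fno-stack-protector,-fno-omit-frame-pointer")))

/*
 * Illustrative only: a function that must not rely on the GS-based
 * canary because it can run before GSBASE is set up for the kernel.
 */
no_stack_protector void early_entry_helper(void)
{
	char buf[64];

	buf[0] = 0;	/* no canary check is emitted for this frame */
	(void)buf;
}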

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/include/asm/idtentry.h | 25 ++++++++++++++++++++-----
 arch/x86/kernel/nmi.c           |  2 +-
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index a6725afaaec0..647af7ea3bf1 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -94,6 +94,21 @@ void run_sysvec(void (*func)(struct pt_regs *regs), struct pt_regs *regs)
 		run_sysvec_on_irqstack_cond(func, regs);
 }
 
+/*
+ * Attribute to disable the stack-protector option. The option is
+ * disabled using the optimize attribute which clears all optimize
+ * options. So we need to specify the optimize option to disable but
+ * also optimize options we want to preserve.
+ *
+ * The stack-protector option adds a stack canary to functions
+ * vulnerable to stack buffer overflow. The stack canary is defined
+ * through the GS register. So the attribute is used to disable the
+ * stack-protector option for functions which can be called while the
+ * GS register might not be properly configured yet.
+ */
+#define no_stack_protector	\
+	__attribute__ ((optimize("-O2,-fno-stack-protector,-fno-omit-frame-pointer")))
+
 /**
  * DECLARE_IDTENTRY - Declare functions for simple IDT entry points
  *		      No error code pushed by hardware
@@ -410,7 +425,7 @@ static __always_inline void __##func(struct pt_regs *regs)
  * Maps to DEFINE_IDTENTRY_RAW
  */
 #define DEFINE_IDTENTRY_IST(func)					\
-	DEFINE_IDTENTRY_RAW(func)
+	no_stack_protector DEFINE_IDTENTRY_RAW(func)
 
 /**
  * DEFINE_IDTENTRY_NOIST - Emit code for NOIST entry points which
@@ -440,7 +455,7 @@ static __always_inline void __##func(struct pt_regs *regs)
  * Maps to DEFINE_IDTENTRY_RAW_ERRORCODE
  */
 #define DEFINE_IDTENTRY_DF(func)					\
-	DEFINE_IDTENTRY_RAW_ERRORCODE(func)
+	no_stack_protector DEFINE_IDTENTRY_RAW_ERRORCODE(func)
 
 /**
  * DEFINE_IDTENTRY_VC_SAFE_STACK - Emit code for VMM communication handler
@@ -472,7 +487,7 @@ static __always_inline void __##func(struct pt_regs *regs)
  * VMM communication handler.
  */
 #define DEFINE_IDTENTRY_VC_SETUP_STACK(func)			\
-	__visible noinstr					\
+	no_stack_protector __visible noinstr			\
 	unsigned long setup_stack_##func(struct pt_regs *regs)
 
 /**
@@ -482,7 +497,7 @@ static __always_inline void __##func(struct pt_regs *regs)
  * Maps to DEFINE_IDTENTRY_RAW_ERRORCODE
  */
 #define DEFINE_IDTENTRY_VC(func)					\
-	DEFINE_IDTENTRY_RAW_ERRORCODE(func)
+	no_stack_protector DEFINE_IDTENTRY_RAW_ERRORCODE(func)
 
 #else	/* CONFIG_X86_64 */
 
@@ -517,7 +532,7 @@ __visible noinstr void func(struct pt_regs *regs,			\
 
 /* C-Code mapping */
 #define DECLARE_IDTENTRY_NMI		DECLARE_IDTENTRY_RAW
-#define DEFINE_IDTENTRY_NMI		DEFINE_IDTENTRY_RAW
+#define DEFINE_IDTENTRY_NMI		no_stack_protector DEFINE_IDTENTRY_RAW
 
 #ifdef CONFIG_X86_64
 #define DECLARE_IDTENTRY_MCE		DECLARE_IDTENTRY_IST
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index be0f654c3095..b6291b683be1 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -473,7 +473,7 @@ static DEFINE_PER_CPU(enum nmi_states, nmi_state);
 static DEFINE_PER_CPU(unsigned long, nmi_cr2);
 static DEFINE_PER_CPU(unsigned long, nmi_dr7);
 
-DEFINE_IDTENTRY_RAW(exc_nmi)
+DEFINE_IDTENTRY_NMI(exc_nmi)
 {
 	bool irq_state;
 
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC][PATCH 22/24] x86/entry: Defer paranoid entry/exit to C code
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
                   ` (20 preceding siblings ...)
  2020-11-09 11:23 ` [RFC][PATCH 21/24] x86/entry: Disable stack-protector for IST entry C handlers Alexandre Chartre
@ 2020-11-09 11:23 ` Alexandre Chartre
  2020-11-09 11:23 ` [RFC][PATCH 23/24] x86/entry: Remove paranoid_entry and paranoid_exit Alexandre Chartre
                   ` (2 subsequent siblings)
  24 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:23 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com

IST entries from the kernel use paranoid entry and exit
assembly functions to ensure the CR3 and GS registers are
updated with correct values for the kernel. Move the update
of the CR3 and GS registers inside the C code of IST handlers.
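
The resulting shape of an IST handler is sketched below; this mirrors the
exc_debug hunk in this patch, with kernel_paranoid_entry/exit and struct
kernel_entry_state coming from patch 09:

/* Sketch: save/switch CR3 and GSBASE in C on entry, restore them on exit. */
DEFINE_IDTENTRY_DEBUG(exc_debug)
{
	struct kernel_entry_state entry_state;

	kernel_paranoid_entry(&entry_state);	/* kernel CR3 + GSBASE */
	exc_debug_kernel(regs, debug_read_clear_dr6());
	kernel_paranoid_exit(&entry_state);	/* restore saved state */
}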

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/entry/entry_64.S      | 72 ++++++++++------------------------
 arch/x86/kernel/cpu/mce/core.c |  3 ++
 arch/x86/kernel/nmi.c          | 18 +++++++--
 arch/x86/kernel/sev-es.c       | 20 +++++++++-
 arch/x86/kernel/traps.c        | 30 ++++++++++++--
 5 files changed, 83 insertions(+), 60 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 6b88a0eb8975..9ea8187d4405 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -462,16 +462,16 @@ SYM_CODE_START(\asmsym)
 	/* Entry from kernel */
 
 	pushq	$-1			/* ORIG_RAX: no syscall to restart */
-	/* paranoid_entry returns GS information for paranoid_exit in EBX. */
-	call	paranoid_entry
-
+	cld
+	PUSH_AND_CLEAR_REGS
+	ENCODE_FRAME_POINTER
 	UNWIND_HINT_REGS
 
 	movq	%rsp, %rdi		/* pt_regs pointer */
 
 	call	\cfunc
 
-	jmp	paranoid_exit
+	jmp	restore_regs_and_return_to_kernel
 
 _ASM_NOKPROBE(\asmsym)
 SYM_CODE_END(\asmsym)
@@ -507,12 +507,9 @@ SYM_CODE_START(\asmsym)
 	 */
 	ist_entry_user safe_stack_\cfunc, has_error_code=1
 
-	/*
-	 * paranoid_entry returns SWAPGS flag for paranoid_exit in EBX.
-	 * EBX == 0 -> SWAPGS, EBX == 1 -> no SWAPGS
-	 */
-	call	paranoid_entry
-
+	cld
+	PUSH_AND_CLEAR_REGS
+	ENCODE_FRAME_POINTER
 	UNWIND_HINT_REGS
 
 	/*
@@ -538,7 +535,7 @@ SYM_CODE_START(\asmsym)
 	 * identical to the stack in the IRET frame or the VC fall-back stack,
 	 * so it is definitly mapped even with PTI enabled.
 	 */
-	jmp	paranoid_exit
+	jmp	restore_regs_and_return_to_kernel
 
 _ASM_NOKPROBE(\asmsym)
 SYM_CODE_END(\asmsym)
@@ -555,8 +552,9 @@ SYM_CODE_START(\asmsym)
 	UNWIND_HINT_IRET_REGS offset=8
 	ASM_CLAC
 
-	/* paranoid_entry returns GS information for paranoid_exit in EBX. */
-	call	paranoid_entry
+	cld
+	PUSH_AND_CLEAR_REGS
+	ENCODE_FRAME_POINTER
 	UNWIND_HINT_REGS
 
 	movq	%rsp, %rdi		/* pt_regs pointer into first argument */
@@ -564,7 +562,7 @@ SYM_CODE_START(\asmsym)
 	movq	$-1, ORIG_RAX(%rsp)	/* no syscall to restart */
 	call	\cfunc
 
-	jmp	paranoid_exit
+	jmp	restore_regs_and_return_to_kernel
 
 _ASM_NOKPROBE(\asmsym)
 SYM_CODE_END(\asmsym)
@@ -1119,10 +1117,6 @@ SYM_CODE_END(error_return)
 /*
  * Runs on exception stack.  Xen PV does not go through this path at all,
  * so we can use real assembly here.
- *
- * Registers:
- *	%r14: Used to save/restore the CR3 of the interrupted context
- *	      when PAGE_TABLE_ISOLATION is in use.  Do not clobber.
  */
 SYM_CODE_START(asm_exc_nmi)
 	/*
@@ -1173,7 +1167,7 @@ SYM_CODE_START(asm_exc_nmi)
 	 * We also must not push anything to the stack before switching
 	 * stacks lest we corrupt the "NMI executing" variable.
 	 */
-	ist_entry_user exc_nmi
+	ist_entry_user exc_nmi_user
 
 	/* NMI from kernel */
 
@@ -1346,9 +1340,7 @@ repeat_nmi:
 	 *
 	 * RSP is pointing to "outermost RIP".  gsbase is unknown, but, if
 	 * we're repeating an NMI, gsbase has the same value that it had on
-	 * the first iteration.  paranoid_entry will load the kernel
-	 * gsbase if needed before we call exc_nmi().  "NMI executing"
-	 * is zero.
+	 * the first iteration.  "NMI executing" is zero.
 	 */
 	movq	$1, 10*8(%rsp)		/* Set "NMI executing". */
 
@@ -1372,44 +1364,20 @@ end_repeat_nmi:
 	pushq	$-1				/* ORIG_RAX: no syscall to restart */
 
 	/*
-	 * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit
-	 * as we should not be calling schedule in NMI context.
-	 * Even with normal interrupts enabled. An NMI should not be
-	 * setting NEED_RESCHED or anything that normal interrupts and
+	 * We should not be calling schedule in NMI context. Even with
+	 * normal interrupts enabled. An NMI should not be setting
+	 * NEED_RESCHED or anything that normal interrupts and
 	 * exceptions might do.
 	 */
-	call	paranoid_entry
+	cld
+	PUSH_AND_CLEAR_REGS
+	ENCODE_FRAME_POINTER
 	UNWIND_HINT_REGS
 
 	movq	%rsp, %rdi
 	movq	$-1, %rsi
 	call	exc_nmi
 
-	/* Always restore stashed CR3 value (see paranoid_entry) */
-	RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
-
-	/*
-	 * The above invocation of paranoid_entry stored the GSBASE
-	 * related information in R/EBX depending on the availability
-	 * of FSGSBASE.
-	 *
-	 * If FSGSBASE is enabled, restore the saved GSBASE value
-	 * unconditionally, otherwise take the conditional SWAPGS path.
-	 */
-	ALTERNATIVE "jmp nmi_no_fsgsbase", "", X86_FEATURE_FSGSBASE
-
-	wrgsbase	%rbx
-	jmp	nmi_restore
-
-nmi_no_fsgsbase:
-	/* EBX == 0 -> invoke SWAPGS */
-	testl	%ebx, %ebx
-	jnz	nmi_restore
-
-nmi_swapgs:
-	SWAPGS_UNSAFE_STACK
-
-nmi_restore:
 	POP_REGS
 
 	/*
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 9407c3cd9355..827088f981c6 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2022,11 +2022,14 @@ static __always_inline void exc_machine_check_user(struct pt_regs *regs)
 /* MCE hit kernel mode */
 DEFINE_IDTENTRY_MCE(exc_machine_check)
 {
+	struct kernel_entry_state entry_state;
 	unsigned long dr7;
 
+	kernel_paranoid_entry(&entry_state);
 	dr7 = local_db_save();
 	exc_machine_check_kernel(regs);
 	local_db_restore(dr7);
+	kernel_paranoid_exit(&entry_state);
 }
 
 /* The user mode variant. */
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index b6291b683be1..23c92ffd58fe 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -473,7 +473,7 @@ static DEFINE_PER_CPU(enum nmi_states, nmi_state);
 static DEFINE_PER_CPU(unsigned long, nmi_cr2);
 static DEFINE_PER_CPU(unsigned long, nmi_dr7);
 
-DEFINE_IDTENTRY_NMI(exc_nmi)
+static noinstr void handle_nmi(struct pt_regs *regs)
 {
 	bool irq_state;
 
@@ -529,9 +529,21 @@ DEFINE_IDTENTRY_NMI(exc_nmi)
 		write_cr2(this_cpu_read(nmi_cr2));
 	if (this_cpu_dec_return(nmi_state))
 		goto nmi_restart;
+}
+
+DEFINE_IDTENTRY_NMI(exc_nmi)
+{
+	struct kernel_entry_state entry_state;
+
+	kernel_paranoid_entry(&entry_state);
+	handle_nmi(regs);
+	kernel_paranoid_exit(&entry_state);
+}
 
-	if (user_mode(regs))
-		mds_user_clear_cpu_buffers();
+__visible noinstr void exc_nmi_user(struct pt_regs *regs)
+{
+	handle_nmi(regs);
+	mds_user_clear_cpu_buffers();
 }
 
 void stop_nmi(void)
diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
index bd977c917cd6..ef9a8b69c25c 100644
--- a/arch/x86/kernel/sev-es.c
+++ b/arch/x86/kernel/sev-es.c
@@ -1352,13 +1352,25 @@ DEFINE_IDTENTRY_VC_IST(exc_vmm_communication)
 struct exc_vc_frame {
 	/* pt_regs should be first */
 	struct pt_regs regs;
+	/* extra parameters for the handler */
+	struct kernel_entry_state entry_state;
 };
 
 DEFINE_IDTENTRY_VC_SETUP_STACK(exc_vmm_communication)
 {
+	struct kernel_entry_state entry_state;
 	struct exc_vc_frame *frame;
 	unsigned long sp;
 
+	/*
+	 * kernel_paranoid_entry() is called first to properly set
+	 * the GS register which is used to access per-cpu variables.
+	 *
+	 * vc_switch_off_ist() uses per-cpu variables so it has to be
+	 * called after kernel_paranoid_entry().
+	 */
+	kernel_paranoid_entry(&entry_state);
+
 	/*
 	 * Switch off the IST stack to make it free for nested exceptions.
 	 * The vc_switch_off_ist() function will switch back to the
@@ -1370,7 +1382,8 @@ DEFINE_IDTENTRY_VC_SETUP_STACK(exc_vmm_communication)
 	/*
 	 * Found a safe stack. Set it up as if the entry has happened on
 	 * that stack. This means that we need to have pt_regs at the top
-	 * of the stack.
+	 * of the stack, and we can use the bottom of the stack to pass
+	 * extra parameters (like the kernel entry state) to the handler.
 	 *
 	 * The effective stack switch happens in assembly code before
 	 * the #VC handler is called.
@@ -1379,16 +1392,21 @@ DEFINE_IDTENTRY_VC_SETUP_STACK(exc_vmm_communication)
 
 	frame = (struct exc_vc_frame *)sp;
 	frame->regs = *regs;
+	frame->entry_state = entry_state;
 
 	return sp;
 }
 
 DEFINE_IDTENTRY_VC(exc_vmm_communication)
 {
+	struct exc_vc_frame *frame = (struct exc_vc_frame *)regs;
+
 	if (likely(!on_vc_fallback_stack(regs)))
 		safe_stack_exc_vmm_communication(regs, error_code);
 	else
 		ist_exc_vmm_communication(regs, error_code);
+
+	kernel_paranoid_exit(&frame->entry_state);
 }
 
 bool __init handle_vc_boot_ghcb(struct pt_regs *regs)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 9a51aa016fb3..1801791748b8 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -344,10 +344,10 @@ __visible void __noreturn handle_stack_overflow(const char *message,
 DEFINE_IDTENTRY_DF(exc_double_fault)
 {
 	static const char str[] = "double fault";
-	struct task_struct *tsk = current;
-
+	struct task_struct *tsk;
+	struct kernel_entry_state entry_state;
 #ifdef CONFIG_VMAP_STACK
-	unsigned long address = read_cr2();
+	unsigned long address;
 #endif
 
 #ifdef CONFIG_X86_ESPFIX64
@@ -371,8 +371,12 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
 		regs->cs == __KERNEL_CS &&
 		regs->ip == (unsigned long)native_irq_return_iret)
 	{
-		struct pt_regs *gpregs = (struct pt_regs *)this_cpu_read(cpu_tss_rw.x86_tss.sp0) - 1;
 		unsigned long *p = (unsigned long *)regs->sp;
+		struct pt_regs *gpregs;
+
+		kernel_paranoid_entry(&entry_state);
+
+		gpregs = (struct pt_regs *)this_cpu_read(cpu_tss_rw.x86_tss.sp0) - 1;
 
 		/*
 		 * regs->sp points to the failing IRET frame on the
@@ -401,14 +405,28 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
 		regs->ip = (unsigned long)asm_exc_general_protection;
 		regs->sp = (unsigned long)&gpregs->orig_ax;
 
+		kernel_paranoid_exit(&entry_state);
+
 		return;
 	}
 #endif
 
+	/*
+	 * Switch to the kernel page-table. We are on an IST stack, and
+	 * we are going to die so there is no need to switch to the kernel
+	 * stack even if we are coming from userspace.
+	 */
+	kernel_paranoid_entry(&entry_state);
+
+#ifdef CONFIG_VMAP_STACK
+	address = read_cr2();
+#endif
+
 	idtentry_enter_nmi(regs);
 	instrumentation_begin();
 	notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
 
+	tsk = current;
 	tsk->thread.error_code = error_code;
 	tsk->thread.trap_nr = X86_TRAP_DF;
 
@@ -973,7 +991,11 @@ static __always_inline void exc_debug_user(struct pt_regs *regs,
 /* IST stack entry */
 DEFINE_IDTENTRY_DEBUG(exc_debug)
 {
+	struct kernel_entry_state entry_state;
+
+	kernel_paranoid_entry(&entry_state);
 	exc_debug_kernel(regs, debug_read_clear_dr6());
+	kernel_paranoid_exit(&entry_state);
 }
 
 /* User entry, runs on regular task stack */
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC][PATCH 23/24] x86/entry: Remove paranoid_entry and paranoid_exit
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
                   ` (21 preceding siblings ...)
  2020-11-09 11:23 ` [RFC][PATCH 22/24] x86/entry: Defer paranoid entry/exit to C code Alexandre Chartre
@ 2020-11-09 11:23 ` Alexandre Chartre
  2020-11-09 11:23 ` [RFC][PATCH 24/24] x86/pti: Defer CR3 switch to C code for non-IST and syscall entries Alexandre Chartre
  2020-11-09 14:00 ` [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
  24 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:23 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com

The paranoid_entry and paranoid_exit assembly functions have been
replaced by the kernel_paranoid_entry() and kernel_paranoid_exit()
C functions. paranoid_entry/exit are no longer used and can be
removed.
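
For reference, here is a rough C sketch of the exit half of that
replacement, mirroring what the paranoid_exit assembly removed below
does. The struct layout and field names are illustrative only; the real
definitions were added by the earlier "x86/entry: Add C version of
paranoid_entry/exit" patch, and the real RESTORE_CR3 logic also handles
PCID and the NOFLUSH bit:

struct kernel_entry_state {
	unsigned long	cr3;	/* CR3 at entry, restored verbatim on exit */
	unsigned long	gsbase;	/* saved GSBASE (FSGSBASE case) */
	bool		swapgs;	/* SWAPGS needed on exit (non-FSGSBASE case) */
};

static __always_inline void
kernel_paranoid_exit(struct kernel_entry_state *state)
{
	/*
	 * Restore CR3 first: like RESTORE_CR3, this still runs with the
	 * kernel GSBASE, and it may switch back to a user CR3.
	 */
	if (state->cr3 != __native_read_cr3())
		native_write_cr3(state->cr3);

	if (static_cpu_has(X86_FEATURE_FSGSBASE))
		wrgsbase(state->gsbase);	/* restore GSBASE unconditionally */
	else if (state->swapgs)
		native_swapgs();		/* back to the user GSBASE */
}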

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/entry/entry_64.S | 131 --------------------------------------
 1 file changed, 131 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9ea8187d4405..797effbe65b6 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -882,137 +882,6 @@ SYM_CODE_START(xen_failsafe_callback)
 SYM_CODE_END(xen_failsafe_callback)
 #endif /* CONFIG_XEN_PV */
 
-/*
- * Save all registers in pt_regs. Return GSBASE related information
- * in EBX depending on the availability of the FSGSBASE instructions:
- *
- * FSGSBASE	R/EBX
- *     N        0 -> SWAPGS on exit
- *              1 -> no SWAPGS on exit
- *
- *     Y        GSBASE value at entry, must be restored in paranoid_exit
- */
-SYM_CODE_START_LOCAL(paranoid_entry)
-	UNWIND_HINT_FUNC
-	cld
-	PUSH_AND_CLEAR_REGS save_ret=1
-	ENCODE_FRAME_POINTER 8
-
-	/*
-	 * Always stash CR3 in %r14.  This value will be restored,
-	 * verbatim, at exit.  Needed if paranoid_entry interrupted
-	 * another entry that already switched to the user CR3 value
-	 * but has not yet returned to userspace.
-	 *
-	 * This is also why CS (stashed in the "iret frame" by the
-	 * hardware at entry) can not be used: this may be a return
-	 * to kernel code, but with a user CR3 value.
-	 *
-	 * Switching CR3 does not depend on kernel GSBASE so it can
-	 * be done before switching to the kernel GSBASE. This is
-	 * required for FSGSBASE because the kernel GSBASE has to
-	 * be retrieved from a kernel internal table.
-	 */
-	SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14
-
-	/*
-	 * Handling GSBASE depends on the availability of FSGSBASE.
-	 *
-	 * Without FSGSBASE the kernel enforces that negative GSBASE
-	 * values indicate kernel GSBASE. With FSGSBASE no assumptions
-	 * can be made about the GSBASE value when entering from user
-	 * space.
-	 */
-	ALTERNATIVE "jmp .Lparanoid_entry_checkgs", "", X86_FEATURE_FSGSBASE
-
-	/*
-	 * Read the current GSBASE and store it in %rbx unconditionally,
-	 * retrieve and set the current CPUs kernel GSBASE. The stored value
-	 * has to be restored in paranoid_exit unconditionally.
-	 *
-	 * The unconditional write to GS base below ensures that no subsequent
-	 * loads based on a mispredicted GS base can happen, therefore no LFENCE
-	 * is needed here.
-	 */
-	SAVE_AND_SET_GSBASE scratch_reg=%rax save_reg=%rbx
-	ret
-
-.Lparanoid_entry_checkgs:
-	/* EBX = 1 -> kernel GSBASE active, no restore required */
-	movl	$1, %ebx
-	/*
-	 * The kernel-enforced convention is a negative GSBASE indicates
-	 * a kernel value. No SWAPGS needed on entry and exit.
-	 */
-	movl	$MSR_GS_BASE, %ecx
-	rdmsr
-	testl	%edx, %edx
-	jns	.Lparanoid_entry_swapgs
-	ret
-
-.Lparanoid_entry_swapgs:
-	SWAPGS
-
-	/*
-	 * The above SAVE_AND_SWITCH_TO_KERNEL_CR3 macro doesn't do an
-	 * unconditional CR3 write, even in the PTI case.  So do an lfence
-	 * to prevent GS speculation, regardless of whether PTI is enabled.
-	 */
-	FENCE_SWAPGS_KERNEL_ENTRY
-
-	/* EBX = 0 -> SWAPGS required on exit */
-	xorl	%ebx, %ebx
-	ret
-SYM_CODE_END(paranoid_entry)
-
-/*
- * "Paranoid" exit path from exception stack.  This is invoked
- * only on return from non-NMI IST interrupts that came
- * from kernel space.
- *
- * We may be returning to very strange contexts (e.g. very early
- * in syscall entry), so checking for preemption here would
- * be complicated.  Fortunately, there's no good reason to try
- * to handle preemption here.
- *
- * R/EBX contains the GSBASE related information depending on the
- * availability of the FSGSBASE instructions:
- *
- * FSGSBASE	R/EBX
- *     N        0 -> SWAPGS on exit
- *              1 -> no SWAPGS on exit
- *
- *     Y        User space GSBASE, must be restored unconditionally
- */
-SYM_CODE_START_LOCAL(paranoid_exit)
-	UNWIND_HINT_REGS
-	/*
-	 * The order of operations is important. RESTORE_CR3 requires
-	 * kernel GSBASE.
-	 *
-	 * NB to anyone to try to optimize this code: this code does
-	 * not execute at all for exceptions from user mode. Those
-	 * exceptions go through error_exit instead.
-	 */
-	RESTORE_CR3	scratch_reg=%rax save_reg=%r14
-
-	/* Handle the three GSBASE cases */
-	ALTERNATIVE "jmp .Lparanoid_exit_checkgs", "", X86_FEATURE_FSGSBASE
-
-	/* With FSGSBASE enabled, unconditionally restore GSBASE */
-	wrgsbase	%rbx
-	jmp		restore_regs_and_return_to_kernel
-
-.Lparanoid_exit_checkgs:
-	/* On non-FSGSBASE systems, conditionally do SWAPGS */
-	testl		%ebx, %ebx
-	jnz		restore_regs_and_return_to_kernel
-
-	/* We are returning to a context with user GSBASE */
-	SWAPGS_UNSAFE_STACK
-	jmp		restore_regs_and_return_to_kernel
-SYM_CODE_END(paranoid_exit)
-
 /*
  * Save all registers in pt_regs, and switch GS if needed.
  */
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC][PATCH 24/24] x86/pti: Defer CR3 switch to C code for non-IST and syscall entries
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
                   ` (22 preceding siblings ...)
  2020-11-09 11:23 ` [RFC][PATCH 23/24] x86/entry: Remove paranoid_entry and paranoid_exit Alexandre Chartre
@ 2020-11-09 11:23 ` Alexandre Chartre
  2020-11-09 14:00 ` [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
  24 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 11:23 UTC (permalink / raw)
  To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, dave.hansen@linux.intel.com,
	luto@kernel.org, peterz@infradead.org,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	jroedel@suse.de
  Cc: konrad.wilk@oracle.com, jan.setjeeilers@oracle.com,
	junaids@google.com, oweisse@google.com, rppt@linux.vnet.ibm.com,
	graf@amazon.de, mgross@linux.intel.com, kuzuno@gmail.com,
	alexandre.chartre@oracle.com

With PTI, syscall/interrupt/exception entries switch the CR3 register
to change the page-table in assembly code. Move this CR3 switch into
the C code of the syscall/interrupt/exception entry handlers.
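
For illustration, a minimal sketch of what the kernel-CR3 switch helper
looks like conceptually. This is simplified: the real switch_to_kernel_cr3()
and switch_to_user_cr3() were added by the earlier "x86/pti: Provide C
variants of PTI switch CR3 macros" patch and also deal with PCID and the
NOFLUSH bit:

static __always_inline void switch_to_kernel_cr3(void)
{
	unsigned long cr3;

	if (!static_cpu_has(X86_FEATURE_PTI))
		return;

	/*
	 * Clear the PTI user page-table and user-PCID bits to derive
	 * the kernel CR3 from the user CR3.
	 */
	cr3 = __native_read_cr3();
	if (cr3 & PTI_USER_PGTABLE_MASK)
		native_write_cr3(cr3 & ~PTI_USER_PGTABLE_AND_PCID_MASK);
}

The kernel_pgtable_enter()/kernel_pgtable_exit() wrappers added below
simply call these helpers when user_mode(regs) is true, so kernel-mode
entries are left untouched.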

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/entry/common.c             | 15 ++++++++++++---
 arch/x86/entry/entry_64.S           | 23 +++++------------------
 arch/x86/entry/entry_64_compat.S    | 22 ----------------------
 arch/x86/include/asm/entry-common.h | 14 ++++++++++++++
 arch/x86/include/asm/idtentry.h     | 25 ++++++++++++++++++++-----
 arch/x86/kernel/cpu/mce/core.c      |  2 ++
 arch/x86/kernel/nmi.c               |  2 ++
 arch/x86/kernel/traps.c             |  6 ++++++
 arch/x86/mm/fault.c                 |  9 +++++++--
 9 files changed, 68 insertions(+), 50 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index ead6a4c72e6a..3f4788dbbde7 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -51,6 +51,7 @@ __visible noinstr void return_from_fork(struct pt_regs *regs,
 		regs->ax = 0;
 	}
 	syscall_exit_to_user_mode(regs);
+	switch_to_user_cr3();
 }
 
 static __always_inline void run_syscall(sys_call_ptr_t sysfunc,
@@ -74,6 +75,7 @@ static __always_inline void run_syscall(sys_call_ptr_t sysfunc,
 #ifdef CONFIG_X86_64
 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
+	switch_to_kernel_cr3();
 	nr = syscall_enter_from_user_mode(regs, nr);
 
 	instrumentation_begin();
@@ -91,12 +93,14 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 
 	instrumentation_end();
 	syscall_exit_to_user_mode(regs);
+	switch_to_user_cr3();
 }
 #endif
 
 #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
 static __always_inline unsigned int syscall_32_enter(struct pt_regs *regs)
 {
+	switch_to_kernel_cr3();
 	if (IS_ENABLED(CONFIG_IA32_EMULATION))
 		current_thread_info()->status |= TS_COMPAT;
 
@@ -131,11 +135,11 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
 
 	do_syscall_32_irqs_on(regs, nr);
 	syscall_exit_to_user_mode(regs);
+	switch_to_user_cr3();
 }
 
-static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
+static noinstr bool __do_fast_syscall_32(struct pt_regs *regs, long nr)
 {
-	unsigned int nr = syscall_32_enter(regs);
 	int res;
 
 	/*
@@ -179,6 +183,9 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
 /* Returns 0 to return using IRET or 1 to return using SYSEXIT/SYSRETL. */
 __visible noinstr long do_fast_syscall_32(struct pt_regs *regs)
 {
+	unsigned int nr = syscall_32_enter(regs);
+	bool syscall_done;
+
 	/*
 	 * Called using the internal vDSO SYSENTER/SYSCALL32 calling
 	 * convention.  Adjust regs so it looks like we entered using int80.
@@ -194,7 +201,9 @@ __visible noinstr long do_fast_syscall_32(struct pt_regs *regs)
 	regs->ip = landing_pad;
 
 	/* Invoke the syscall. If it failed, keep it simple: use IRET. */
-	if (!__do_fast_syscall_32(regs))
+	syscall_done = __do_fast_syscall_32(regs, nr);
+	switch_to_user_cr3();
+	if (!syscall_done)
 		return 0;
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 797effbe65b6..4be15a5ffe68 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -98,7 +98,6 @@ SYM_CODE_START(entry_SYSCALL_64)
 	swapgs
 	/* tss.sp2 is scratch space. */
 	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack, SYM_L_GLOBAL)
@@ -192,18 +191,14 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
 	 */
 syscall_return_via_sysret:
 	/* rcx and r11 are already restored (see code above) */
-	POP_REGS pop_rdi=0 skip_r11rcx=1
+	POP_REGS skip_r11rcx=1
 
 	/*
-	 * We are on the trampoline stack.  All regs except RDI are live.
 	 * We are on the trampoline stack.  All regs except RSP are live.
 	 * We can do future final exit work right here.
 	 */
 	STACKLEAK_ERASE_NOCLOBBER
 
-	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
-
-	popq	%rdi
 	movq	RSP-ORIG_RAX(%rsp), %rsp
 	USERGS_SYSRET64
 SYM_CODE_END(entry_SYSCALL_64)
@@ -321,7 +316,6 @@ SYM_CODE_END(ret_from_fork)
 	swapgs
 	cld
 	FENCE_SWAPGS_USER_ENTRY
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx
 	movq	%rsp, %rdx
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 	UNWIND_HINT_IRET_REGS base=%rdx offset=8
@@ -592,19 +586,15 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
 	ud2
 1:
 #endif
-	POP_REGS pop_rdi=0
+	POP_REGS
+	addq	$8, %rsp	/* skip regs->orig_ax */
 
 	/*
-	 * We are on the trampoline stack.  All regs except RDI are live.
+	 * We are on the trampoline stack.  All regs are live.
 	 * We can do future final exit work right here.
 	 */
 	STACKLEAK_ERASE_NOCLOBBER
 
-	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
-
-	/* Restore RDI. */
-	popq	%rdi
-	addq	$8, %rsp	/* skip regs->orig_ax */
 	SWAPGS
 	INTERRUPT_RETURN
 
@@ -899,8 +889,6 @@ SYM_CODE_START_LOCAL(error_entry)
 	 */
 	SWAPGS
 	FENCE_SWAPGS_USER_ENTRY
-	/* We have user CR3.  Change to kernel CR3. */
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 
 .Lerror_entry_from_usermode_after_swapgs:
 	/*
@@ -959,11 +947,10 @@ SYM_CODE_START_LOCAL(error_entry)
 .Lerror_bad_iret:
 	/*
 	 * We came from an IRET to user mode, so we have user
-	 * gsbase and CR3.  Switch to kernel gsbase and CR3:
+	 * gsbase and CR3.  Switch to kernel gsbase.
 	 */
 	SWAPGS
 	FENCE_SWAPGS_USER_ENTRY
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 
 	/*
 	 * Pretend that the exception came from user mode: set up pt_regs
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 541fdaf64045..a6fb5807bf42 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -51,10 +51,6 @@ SYM_CODE_START(entry_SYSENTER_compat)
 	/* Interrupts are off on entry. */
 	SWAPGS
 
-	pushq	%rax
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
-	popq	%rax
-
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/* Construct struct pt_regs on stack */
@@ -204,9 +200,6 @@ SYM_CODE_START(entry_SYSCALL_compat)
 	/* Stash user ESP */
 	movl	%esp, %r8d
 
-	/* Use %rsp as scratch reg. User ESP is stashed in r8 */
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
-
 	/* Switch to the kernel stack */
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
@@ -291,18 +284,6 @@ sysret32_from_system_call:
 	 * code.  We zero R8-R10 to avoid info leaks.
          */
 	movq	RSP-ORIG_RAX(%rsp), %rsp
-
-	/*
-	 * The original userspace %rsp (RSP-ORIG_RAX(%rsp)) is stored
-	 * on the process stack which is not mapped to userspace and
-	 * not readable after we SWITCH_TO_USER_CR3.  Delay the CR3
-	 * switch until after after the last reference to the process
-	 * stack.
-	 *
-	 * %r8/%r9 are zeroed before the sysret, thus safe to clobber.
-	 */
-	SWITCH_TO_USER_CR3_NOSTACK scratch_reg=%r8 scratch_reg2=%r9
-
 	xorl	%r8d, %r8d
 	xorl	%r9d, %r9d
 	xorl	%r10d, %r10d
@@ -357,9 +338,6 @@ SYM_CODE_START(entry_INT80_compat)
 	pushq	%rax			/* pt_regs->orig_ax */
 	pushq	%rdi			/* pt_regs->di */
 
-	/* Need to switch before accessing the thread stack. */
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
-
 	/* In the Xen PV case we already run on the thread stack. */
 	ALTERNATIVE "", "jmp .Lint80_keep_stack", X86_FEATURE_XENPV
 
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index b75e9230c990..32e9f3159131 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -157,10 +157,24 @@ static __always_inline void switch_to_user_cr3(void)
 	native_write_cr3(cr3);
 }
 
+static __always_inline void kernel_pgtable_enter(struct pt_regs *regs)
+{
+	if (user_mode(regs))
+		switch_to_kernel_cr3();
+}
+
+static __always_inline void kernel_pgtable_exit(struct pt_regs *regs)
+{
+	if (user_mode(regs))
+		switch_to_user_cr3();
+}
+
 #else /* CONFIG_PAGE_TABLE_ISOLATION */
 
 static inline void switch_to_kernel_cr3(void) {}
 static inline void switch_to_user_cr3(void) {}
+static inline void kernel_pgtable_enter(struct pt_regs *regs) {};
+static inline void kernel_pgtable_exit(struct pt_regs *regs) {};
 
 #endif /* CONFIG_PAGE_TABLE_ISOLATION */
 
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 647af7ea3bf1..d8bfcd8a4db4 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -147,12 +147,15 @@ static __always_inline void __##func(struct pt_regs *regs);		\
 									\
 __visible noinstr void func(struct pt_regs *regs)			\
 {									\
-	irqentry_state_t state = irqentry_enter(regs);			\
+	irqentry_state_t state;						\
 									\
+	kernel_pgtable_enter(regs);					\
+	state = irqentry_enter(regs);					\
 	instrumentation_begin();					\
 	run_idt(__##func, regs);					\
 	instrumentation_end();						\
 	irqentry_exit(regs, state);					\
+	kernel_pgtable_exit(regs);					\
 }									\
 									\
 static __always_inline void __##func(struct pt_regs *regs)
@@ -194,12 +197,15 @@ static __always_inline void __##func(struct pt_regs *regs,		\
 __visible noinstr void func(struct pt_regs *regs,			\
 			    unsigned long error_code)			\
 {									\
-	irqentry_state_t state = irqentry_enter(regs);			\
+	irqentry_state_t state;						\
 									\
+	kernel_pgtable_enter(regs);					\
+	state = irqentry_enter(regs);					\
 	instrumentation_begin();					\
 	run_idt_errcode(__##func, regs, error_code);			\
 	instrumentation_end();						\
 	irqentry_exit(regs, state);					\
+	kernel_pgtable_exit(regs);					\
 }									\
 									\
 static __always_inline void __##func(struct pt_regs *regs,		\
@@ -290,8 +296,10 @@ static __always_inline void __##func(struct pt_regs *regs, u8 vector);	\
 __visible noinstr void func(struct pt_regs *regs,			\
 			    unsigned long error_code)			\
 {									\
-	irqentry_state_t state = irqentry_enter(regs);			\
+	irqentry_state_t state;						\
 									\
+	kernel_pgtable_enter(regs);					\
+	state = irqentry_enter(regs);					\
 	instrumentation_begin();					\
 	irq_enter_rcu();						\
 	kvm_set_cpu_l1tf_flush_l1d();					\
@@ -300,6 +308,7 @@ __visible noinstr void func(struct pt_regs *regs,			\
 	irq_exit_rcu();							\
 	instrumentation_end();						\
 	irqentry_exit(regs, state);					\
+	kernel_pgtable_exit(regs);					\
 }									\
 									\
 static __always_inline void __##func(struct pt_regs *regs, u8 vector)
@@ -333,8 +342,10 @@ static void __##func(struct pt_regs *regs);				\
 									\
 __visible noinstr void func(struct pt_regs *regs)			\
 {									\
-	irqentry_state_t state = irqentry_enter(regs);			\
+	irqentry_state_t state;						\
 									\
+	kernel_pgtable_enter(regs);					\
+	state = irqentry_enter(regs);					\
 	instrumentation_begin();					\
 	irq_enter_rcu();						\
 	kvm_set_cpu_l1tf_flush_l1d();					\
@@ -342,6 +353,7 @@ __visible noinstr void func(struct pt_regs *regs)			\
 	irq_exit_rcu();							\
 	instrumentation_end();						\
 	irqentry_exit(regs, state);					\
+	kernel_pgtable_exit(regs);					\
 }									\
 									\
 static noinline void __##func(struct pt_regs *regs)
@@ -362,8 +374,10 @@ static __always_inline void __##func(struct pt_regs *regs);		\
 									\
 __visible noinstr void func(struct pt_regs *regs)			\
 {									\
-	irqentry_state_t state = irqentry_enter(regs);			\
+	irqentry_state_t state;						\
 									\
+	kernel_pgtable_enter(regs);					\
+	state = irqentry_enter(regs);					\
 	instrumentation_begin();					\
 	__irq_enter_raw();						\
 	kvm_set_cpu_l1tf_flush_l1d();					\
@@ -371,6 +385,7 @@ __visible noinstr void func(struct pt_regs *regs)			\
 	__irq_exit_raw();						\
 	instrumentation_end();						\
 	irqentry_exit(regs, state);					\
+	kernel_pgtable_exit(regs);					\
 }									\
 									\
 static __always_inline void __##func(struct pt_regs *regs)
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 827088f981c6..e1ae901c4925 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2037,9 +2037,11 @@ DEFINE_IDTENTRY_MCE_USER(exc_machine_check)
 {
 	unsigned long dr7;
 
+	switch_to_kernel_cr3();
 	dr7 = local_db_save();
 	run_idt(exc_machine_check_user, regs);
 	local_db_restore(dr7);
+	switch_to_user_cr3();
 }
 #else
 /* 32bit unified entry point */
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 23c92ffd58fe..063474f5b5fe 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -542,8 +542,10 @@ DEFINE_IDTENTRY_NMI(exc_nmi)
 
 __visible noinstr void exc_nmi_user(struct pt_regs *regs)
 {
+	switch_to_kernel_cr3();
 	handle_nmi(regs);
 	mds_user_clear_cpu_buffers();
+	switch_to_user_cr3();
 }
 
 void stop_nmi(void)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 1801791748b8..6c78eeb60d19 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -255,11 +255,13 @@ DEFINE_IDTENTRY_RAW(exc_invalid_op)
 	if (!user_mode(regs) && handle_bug(regs))
 		return;
 
+	kernel_pgtable_enter(regs);
 	state = irqentry_enter(regs);
 	instrumentation_begin();
 	run_idt(handle_invalid_op, regs);
 	instrumentation_end();
 	irqentry_exit(regs, state);
+	kernel_pgtable_exit(regs);
 }
 
 DEFINE_IDTENTRY(exc_coproc_segment_overrun)
@@ -663,11 +665,13 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 	 * including NMI.
 	 */
 	if (user_mode(regs)) {
+		switch_to_kernel_cr3();
 		irqentry_enter_from_user_mode(regs);
 		instrumentation_begin();
 		run_idt(do_int3_user, regs);
 		instrumentation_end();
 		irqentry_exit_to_user_mode(regs);
+		switch_to_user_cr3();
 	} else {
 		bool irq_state = idtentry_enter_nmi(regs);
 		instrumentation_begin();
@@ -1001,7 +1005,9 @@ DEFINE_IDTENTRY_DEBUG(exc_debug)
 /* User entry, runs on regular task stack */
 DEFINE_IDTENTRY_DEBUG_USER(exc_debug)
 {
+	switch_to_kernel_cr3();
 	run_idt_errcode(exc_debug_user, regs, debug_read_clear_dr6());
+	switch_to_user_cr3();
 }
 #else
 /* 32 bit does not have separate entry points. */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index b9d03603d95d..613a864840ab 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1440,9 +1440,11 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
 
 DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
 {
-	unsigned long address = read_cr2();
+	unsigned long address;
 	irqentry_state_t state;
 
+	kernel_pgtable_enter(regs);
+	address = read_cr2();
 	prefetchw(&current->mm->mmap_lock);
 
 	/*
@@ -1466,8 +1468,10 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
 	 * The async #PF handling code takes care of idtentry handling
 	 * itself.
 	 */
-	if (kvm_handle_async_pf(regs, (u32)address))
+	if (kvm_handle_async_pf(regs, (u32)address)) {
+		kernel_pgtable_exit(regs);
 		return;
+	}
 
 	/*
 	 * Entry handling for valid #PF from kernel mode is slightly
@@ -1486,4 +1490,5 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
 	instrumentation_end();
 
 	irqentry_exit(regs, state);
+	kernel_pgtable_exit(regs);
 }
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code
  2020-11-09 11:22 [RFC][PATCH 00/24] x86/pti: Defer CR3 switch to C code Alexandre Chartre
                   ` (23 preceding siblings ...)
  2020-11-09 11:23 ` [RFC][PATCH 24/24] x86/pti: Defer CR3 switch to C code for non-IST and syscall entries Alexandre Chartre
@ 2020-11-09 14:00 ` Alexandre Chartre
  24 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 14:00 UTC (permalink / raw)
  To: tglx, mingo, bp, hpa, x86, dave.hansen, luto, peterz,
	linux-kernel, thomas.lendacky, jroedel


Sorry, but it looks like the email addresses are messed up in my emails. Our
email server has a new security "feature" which has the good idea of changing
external email addresses.

I will resend the patches with the correct addresses once I've found out
how to prevent this mess.

alex.

On 11/9/20 12:22 PM, Alexandre Chartre wrote:
> With Page Table Isolation (PTI), syscalls as well as interrupts and
> exceptions occurring in userspace enter the kernel with a user
> page-table. The kernel entry code will then switch the page-table
> from the user page-table to the kernel page-table by updating the
> CR3 control register. This CR3 switch is currently done early in
> the kernel entry sequence using assembly code.
> 
> This RFC proposes to defer the PTI CR3 switch until we reach C code.
> The benefit is that this simplifies the assembly entry code, and make
> the PTI CR3 switch code easier to understand. This also paves the way
> for further possible projects such an easier integration of Address
> Space Isolation (ASI), or the possibilily to execute some selected
> syscall or interrupt handlers without switching to the kernel page-table
> (and thus avoid the PTI page-table switch overhead).
> 
> Deferring CR3 switch to C code means that we need to run more of the
> kernel entry code with the user page-table. To do so, we need to:
> 
>   - map more syscall, interrupt and exception entry code into the user
>     page-table (map all noinstr code);
> 
>   - map additional data used in the entry code (such as stack canary);
> 
>   - run more entry code on the trampoline stack (which is mapped both
>     in the kernel and in the user page-table) until we switch to the
>     kernel page-table and then switch to the kernel stack;
> 
>   - have a per-task trampoline stack instead of a per-cpu trampoline
>     stack, so the task can be scheduled out while it hasn't switched
>     to the kernel stack.
> 
> Note that, for now, the CR3 switch can only be pushed as far as interrupts
> remain disabled in the entry code. This is because the CR3 switch is done
> based on the privilege level from the CS register from the interrupt frame.
> I plan to fix this but that's some extra complication (need to track if the
> user page-table is used or not).
> 
> The proposed patchset is in RFC state to get early feedback about this
> proposal.
> 
> The code survives running a kernel build and LTP. Note that changes are
> only for 64-bit at the moment, I haven't looked at 32-bit yet but I will
> definitively check it.
> 
> Code is based on v5.10-rc3.
> 
> Thanks,
> 
> alex.
> 
> -----
> 
> Alexandre Chartre (24):
>    x86/syscall: Add wrapper for invoking syscall function
>    x86/entry: Update asm_call_on_stack to support more function arguments
>    x86/entry: Consolidate IST entry from userspace
>    x86/sev-es: Define a setup stack function for the VC idtentry
>    x86/entry: Implement ret_from_fork body with C code
>    x86/pti: Provide C variants of PTI switch CR3 macros
>    x86/entry: Fill ESPFIX stack using C code
>    x86/entry: Add C version of SWAPGS and SWAPGS_UNSAFE_STACK
>    x86/entry: Add C version of paranoid_entry/exit
>    x86/pti: Introduce per-task PTI trampoline stack
>    x86/pti: Function to clone page-table entries from a specified mm
>    x86/pti: Function to map per-cpu page-table entry
>    x86/pti: Extend PTI user mappings
>    x86/pti: Use PTI stack instead of trampoline stack
>    x86/pti: Execute syscall functions on the kernel stack
>    x86/pti: Execute IDT handlers on the kernel stack
>    x86/pti: Execute IDT handlers with error code on the kernel stack
>    x86/pti: Execute system vector handlers on the kernel stack
>    x86/pti: Execute page fault handler on the kernel stack
>    x86/pti: Execute NMI handler on the kernel stack
>    x86/entry: Disable stack-protector for IST entry C handlers
>    x86/entry: Defer paranoid entry/exit to C code
>    x86/entry: Remove paranoid_entry and paranoid_exit
>    x86/pti: Defer CR3 switch to C code for non-IST and syscall entries
> 
>   arch/x86/entry/common.c               | 259 ++++++++++++-
>   arch/x86/entry/entry_64.S             | 513 ++++++++------------------
>   arch/x86/entry/entry_64_compat.S      |  22 --
>   arch/x86/include/asm/entry-common.h   | 108 ++++++
>   arch/x86/include/asm/idtentry.h       | 153 +++++++-
>   arch/x86/include/asm/irq_stack.h      |  11 +
>   arch/x86/include/asm/page_64_types.h  |  36 +-
>   arch/x86/include/asm/paravirt.h       |  15 +
>   arch/x86/include/asm/paravirt_types.h |  17 +-
>   arch/x86/include/asm/processor.h      |   3 +
>   arch/x86/include/asm/pti.h            |  18 +
>   arch/x86/include/asm/switch_to.h      |   7 +-
>   arch/x86/include/asm/traps.h          |   2 +-
>   arch/x86/kernel/cpu/mce/core.c        |   7 +-
>   arch/x86/kernel/espfix_64.c           |  41 ++
>   arch/x86/kernel/nmi.c                 |  34 +-
>   arch/x86/kernel/sev-es.c              |  52 +++
>   arch/x86/kernel/traps.c               |  61 +--
>   arch/x86/mm/fault.c                   |  11 +-
>   arch/x86/mm/pti.c                     |  71 ++--
>   kernel/fork.c                         |  22 ++
>   21 files changed, 1002 insertions(+), 461 deletions(-)
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC][PATCH 01/24] x86/syscall: Add wrapper for invoking syscall function
  2020-11-09 11:22 ` [RFC][PATCH 01/24] x86/syscall: Add wrapper for invoking syscall function Alexandre Chartre
@ 2020-11-09 17:25   ` Andy Lutomirski
  2020-11-09 17:45     ` Alexandre Chartre
  0 siblings, 1 reply; 33+ messages in thread
From: Andy Lutomirski @ 2020-11-09 17:25 UTC (permalink / raw)
  To: Alexandre Chartre; +Cc: X86 ML, LKML

Hi Alexandre-

You appear to be infected by corporate malware that has appended the
string "@aserv0122.oracle.com" to the end of all the email addresses
in your to: list.  "luto@kernel.org"@aserv0122.oracle.com, for
example, is not me.  Can you fix this?


On Mon, Nov 9, 2020 at 3:21 AM Alexandre Chartre
<alexandre.chartre@oracle.com> wrote:
>
> Add a wrapper function for invoking a syscall function.

This needs some explanation of why.

>
> Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
> ---
>  arch/x86/entry/common.c | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 870efeec8bda..d222212908ad 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -35,6 +35,15 @@
>  #include <asm/syscall.h>
>  #include <asm/irq_stack.h>
>
> +static __always_inline void run_syscall(sys_call_ptr_t sysfunc,
> +                                       struct pt_regs *regs)
> +{
> +       if (!sysfunc)
> +               return;

What's this for?

> +
> +       regs->ax = sysfunc(regs);
> +}
> +

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC][PATCH 13/24] x86/pti: Extend PTI user mappings
  2020-11-09 11:23 ` [RFC][PATCH 13/24] x86/pti: Extend PTI user mappings Alexandre Chartre
@ 2020-11-09 17:28   ` Andy Lutomirski
  2020-11-09 17:52     ` Alexandre Chartre
  0 siblings, 1 reply; 33+ messages in thread
From: Andy Lutomirski @ 2020-11-09 17:28 UTC (permalink / raw)
  To: Alexandre Chartre; +Cc: X86 ML, LKML

On Mon, Nov 9, 2020 at 3:22 AM Alexandre Chartre
<alexandre.chartre@oracle.com> wrote:
>
> Extend PTI user mappings so that more kernel entry code can be executed
> with the user page-table. To do so, we need to map syscall and interrupt
> entry code,

Probably fine.

> per cpu offsets (__per_cpu_offset, which is used some in
> entry code),

This likely already leaks due to vulnerable CPUs leaking address space
layout info.

> the stack canary,

That's going to be a very tough sell.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC][PATCH 08/24] x86/entry: Add C version of SWAPGS and SWAPGS_UNSAFE_STACK
  2020-11-09 11:23 ` [RFC][PATCH 08/24] x86/entry: Add C version of SWAPGS and SWAPGS_UNSAFE_STACK Alexandre Chartre
@ 2020-11-09 17:38   ` Andy Lutomirski
  2020-11-09 18:04     ` Alexandre Chartre
  0 siblings, 1 reply; 33+ messages in thread
From: Andy Lutomirski @ 2020-11-09 17:38 UTC (permalink / raw)
  To: Alexandre Chartre; +Cc: X86 ML, LKML

On Mon, Nov 9, 2020 at 3:22 AM Alexandre Chartre
<alexandre.chartre@oracle.com> wrote:
>
> SWAPGS and SWAPGS_UNSAFE_STACK are assembly macros. Add C versions
> of these macros (swapgs() and swapgs_unsafe_stack()).

This needs a very good justification.  It also needs some kind of
static verification that these helpers are only used by noinstr code,
and they need to be __always_inline.  And I cannot fathom how C code
could possibly use SWAPGS_UNSAFE_STACK in a meaningful way.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC][PATCH 01/24] x86/syscall: Add wrapper for invoking syscall function
  2020-11-09 17:25   ` Andy Lutomirski
@ 2020-11-09 17:45     ` Alexandre Chartre
  0 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 17:45 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: X86 ML, LKML



On 11/9/20 6:25 PM, Andy Lutomirski wrote:
> Hi Alexander-
> 
> You appear to be infected by corporate malware that has inserted the
> string "@aserv0122.oracle.com" to the end of all the email addresses
> in your to: list.  "luto@kernel.org"@aserv0122.oracle.com, for
> example, is not me.  Can you fix this?
>

I know, I messed up :-(
I have already resent the entire RFC with correct addresses.
Sorry about that.

alex.

> 
> On Mon, Nov 9, 2020 at 3:21 AM Alexandre Chartre
> <alexandre.chartre@oracle.com> wrote:
>>
>> Add a wrapper function for invoking a syscall function.
> 
> This needs some explanation of why.
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC][PATCH 13/24] x86/pti: Extend PTI user mappings
  2020-11-09 17:28   ` Andy Lutomirski
@ 2020-11-09 17:52     ` Alexandre Chartre
  0 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 17:52 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: X86 ML, LKML


On 11/9/20 6:28 PM, Andy Lutomirski wrote:
> On Mon, Nov 9, 2020 at 3:22 AM Alexandre Chartre
> <alexandre.chartre@oracle.com> wrote:
>>
>> Extend PTI user mappings so that more kernel entry code can be executed
>> with the user page-table. To do so, we need to map syscall and interrupt
>> entry code,
> 
> Probably fine.
> 
>> per cpu offsets (__per_cpu_offset, which is used some in
>> entry code),
> 
> This likely already leaks due to vulnerable CPUs leaking address space
> layout info.

I forgot to update the comment; I am not mapping __per_cpu_offset anymore.

However, if we do map __per_cpu_offset then we don't need to enforce the
ordering in paranoid_entry to switch CR3 before GS.

> 
>> the stack canary,
> 
> That's going to be a very tough sell.
> 

I can get rid of this, but it will require disabling the stack-protector for
any function that can be called while using the user page-table, as already
done in patch 21 (x86/entry: Disable stack-protector for IST entry C handlers).

alex.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC][PATCH 08/24] x86/entry: Add C version of SWAPGS and SWAPGS_UNSAFE_STACK
  2020-11-09 17:38   ` Andy Lutomirski
@ 2020-11-09 18:04     ` Alexandre Chartre
  0 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 18:04 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: X86 ML, LKML


On 11/9/20 6:38 PM, Andy Lutomirski wrote:
> On Mon, Nov 9, 2020 at 3:22 AM Alexandre Chartre
> <alexandre.chartre@oracle.com> wrote:
>>
>> SWAPGS and SWAPGS_UNSAFE_STACK are assembly macros. Add C versions
>> of these macros (swapgs() and swapgs_unsafe_stack()).
> 
> This needs a very good justification.  It also needs some kind of
> static verification that these helpers are only used by noinstr code,
> and they need to be __always_inline.  And I cannot fathom how C code
> could possibly use SWAPGS_UNSAFE_STACK in a meaningful way.
> 

You're right, I probably need to revisit the usage of SWAPGS_UNSAFE_STACK
in C code; that doesn't make sense. It looks like only SWAPGS is needed then.

Or maybe we can just use native_swapgs() instead?

I have added a C version of SWAPGS for moving paranoid_entry() to C because,
in this function, we need to switch CR3 before updating GS. But I really
wonder whether we need a paravirt swapgs here; we can probably just use
native_swapgs().
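
For illustration, the C helper could then be as small as this (a sketch,
assuming native_swapgs() is indeed sufficient on that path, i.e. no
paravirt hook is needed):

static __always_inline void swapgs(void)
{
	/* Only for noinstr entry code: no instrumentation around this. */
	native_swapgs();
}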

Also, if we map the per-cpu offsets (__per_cpu_offset) in the user page-table
then we will be able to update GS before switching CR3. That way we can keep the
GS update in assembly code and only do the CR3 switch in C code. This would also
avoid having to disable the stack-protector (patch 21).

alex.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC][PATCH 01/24] x86/syscall: Add wrapper for invoking syscall function
  2020-11-09 14:44 Alexandre Chartre
@ 2020-11-09 14:44 ` Alexandre Chartre
  0 siblings, 0 replies; 33+ messages in thread
From: Alexandre Chartre @ 2020-11-09 14:44 UTC (permalink / raw)
  To: tglx, mingo, bp, hpa, x86, dave.hansen, luto, peterz,
	linux-kernel, thomas.lendacky, jroedel
  Cc: konrad.wilk, jan.setjeeilers, junaids, oweisse, rppt, graf,
	mgross, kuzuno, alexandre.chartre

Add a wrapper function for invoking a syscall function.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
---
 arch/x86/entry/common.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 870efeec8bda..d222212908ad 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -35,6 +35,15 @@
 #include <asm/syscall.h>
 #include <asm/irq_stack.h>
 
+static __always_inline void run_syscall(sys_call_ptr_t sysfunc,
+					struct pt_regs *regs)
+{
+	if (!sysfunc)
+		return;
+
+	regs->ax = sysfunc(regs);
+}
+
 #ifdef CONFIG_X86_64
 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
@@ -43,15 +52,16 @@ __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 	instrumentation_begin();
 	if (likely(nr < NR_syscalls)) {
 		nr = array_index_nospec(nr, NR_syscalls);
-		regs->ax = sys_call_table[nr](regs);
+		run_syscall(sys_call_table[nr], regs);
 #ifdef CONFIG_X86_X32_ABI
 	} else if (likely((nr & __X32_SYSCALL_BIT) &&
 			  (nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) {
 		nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT,
 					X32_NR_syscalls);
-		regs->ax = x32_sys_call_table[nr](regs);
+		run_syscall(x32_sys_call_table[nr], regs);
 #endif
 	}
+
 	instrumentation_end();
 	syscall_exit_to_user_mode(regs);
 }
@@ -75,7 +85,7 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs,
 	if (likely(nr < IA32_NR_syscalls)) {
 		instrumentation_begin();
 		nr = array_index_nospec(nr, IA32_NR_syscalls);
-		regs->ax = ia32_sys_call_table[nr](regs);
+		run_syscall(ia32_sys_call_table[nr], regs);
 		instrumentation_end();
 	}
 }
-- 
2.18.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

