* [PATCH 00/34 v3] PTI support for x32
@ 2018-03-05 10:25 Joerg Roedel
  2018-03-05 10:25 ` [PATCH 01/34] x86/asm-offsets: Move TSS_sp0 and TSS_sp1 to asm-offsets.c Joerg Roedel
                   ` (33 more replies)
  0 siblings, 34 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

Hi,

here is an updated version of my PTI patches for x86-32. It
incorporates the review comments and fixes a few bugs that
were found during the review of v2.

In particular, the changes since v2 are:

	* Switched from movsb to movsl for stack copy
	* Simplified the sysexit path to no longer do a full
	  pt_regs copy; it now copies only 2 dwords to the
	  entry stack
	* Renamed pti_set_user_pgd() to pti_set_user_pgtbl()
	* Added a warning in case someone boots a 32 bit
	  kernel on a PCID capable machine
	* Added switch to user-cr3 to the paranoid exit path
	  in case we entered the kernel from kernel-mode
	  with user-cr3
	* Simplified the debug handler now that all its
	  special cases are handled in the stack/cr3-switch
	  code
	* Fixed PGD_PAE_PHYS_MASK for XEN-PV
	
Here is a link to my post of the v2 patches:

	https://marc.info/?l=linux-kernel&m=151816914932088&w=2

For easier testing, I also pushed these patches out to:

	git://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git pti-x32-v3

Please review.

Thanks,

	Joerg

Joerg Roedel (34):
  x86/asm-offsets: Move TSS_sp0 and TSS_sp1 to asm-offsets.c
  x86/entry/32: Rename TSS_sysenter_sp0 to TSS_entry_stack
  x86/entry/32: Load task stack from x86_tss.sp1 in SYSENTER handler
  x86/entry/32: Put ESPFIX code into a macro
  x86/entry/32: Unshare NMI return path
  x86/entry/32: Split off return-to-kernel path
  x86/entry/32: Restore segments before int registers
  x86/entry/32: Enter the kernel via trampoline stack
  x86/entry/32: Leave the kernel via trampoline stack
  x86/entry/32: Introduce SAVE_ALL_NMI and RESTORE_ALL_NMI
  x86/entry/32: Handle Entry from Kernel-Mode on Entry-Stack
  x86/entry/32: Simplify debug entry point
  x86/entry/32: Add PTI cr3 switches to NMI handler code
  x86/entry/32: Add PTI cr3 switch to non-NMI entry/exit points
  x86/pgtable: Rename pti_set_user_pgd to pti_set_user_pgtbl
  x86/pgtable/pae: Unshare kernel PMDs when PTI is enabled
  x86/pgtable/32: Allocate 8k page-tables when PTI is enabled
  x86/pgtable: Move pgdp kernel/user conversion functions to pgtable.h
  x86/pgtable: Move pti_set_user_pgtbl() to pgtable.h
  x86/pgtable: Move two more functions from pgtable_64.h to pgtable.h
  x86/mm/pae: Populate valid user PGD entries
  x86/mm/pae: Populate the user page-table with user pgd's
  x86/mm/legacy: Populate the user page-table with user pgd's
  x86/mm/pti: Add an overflow check to pti_clone_pmds()
  x86/mm/pti: Define X86_CR3_PTI_PCID_USER_BIT on x86_32
  x86/mm/pti: Clone CPU_ENTRY_AREA on PMD level on x86_32
  x86/mm/dump_pagetables: Define INIT_PGD
  x86/pgtable/pae: Use separate kernel PMDs for user page-table
  x86/ldt: Reserve address-space range on 32 bit for the LDT
  x86/ldt: Define LDT_END_ADDR
  x86/ldt: Split out sanity check in map_ldt_struct()
  x86/ldt: Enable LDT user-mapping for PAE
  x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32
  x86/mm/pti: Add Warning when booting on a PCID capable CPU

 arch/x86/entry/entry_32.S                   | 668 ++++++++++++++++++++++------
 arch/x86/include/asm/mmu_context.h          |   4 -
 arch/x86/include/asm/pgtable-2level.h       |   9 +
 arch/x86/include/asm/pgtable-2level_types.h |   3 +
 arch/x86/include/asm/pgtable-3level.h       |   7 +
 arch/x86/include/asm/pgtable-3level_types.h |   6 +-
 arch/x86/include/asm/pgtable.h              |  88 ++++
 arch/x86/include/asm/pgtable_32_types.h     |   9 +-
 arch/x86/include/asm/pgtable_64.h           |  89 +---
 arch/x86/include/asm/pgtable_64_types.h     |   4 +
 arch/x86/include/asm/pgtable_types.h        |  28 +-
 arch/x86/include/asm/processor-flags.h      |   8 +-
 arch/x86/include/asm/switch_to.h            |   6 +-
 arch/x86/kernel/asm-offsets.c               |   5 +
 arch/x86/kernel/asm-offsets_32.c            |   2 +-
 arch/x86/kernel/asm-offsets_64.c            |   2 -
 arch/x86/kernel/cpu/common.c                |   5 +-
 arch/x86/kernel/head_32.S                   |  20 +-
 arch/x86/kernel/ldt.c                       | 137 ++++--
 arch/x86/kernel/process.c                   |   2 -
 arch/x86/kernel/process_32.c                |  10 +-
 arch/x86/mm/dump_pagetables.c               |  21 +-
 arch/x86/mm/pgtable.c                       | 105 ++++-
 arch/x86/mm/pti.c                           |  42 +-
 security/Kconfig                            |   2 +-
 25 files changed, 963 insertions(+), 319 deletions(-)

-- 
2.7.4


* [PATCH 01/34] x86/asm-offsets: Move TSS_sp0 and TSS_sp1 to asm-offsets.c
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 02/34] x86/entry/32: Rename TSS_sysenter_sp0 to TSS_entry_stack Joerg Roedel
                   ` (32 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

These offsets will be used in 32 bit assembly code as well,
so make them available for all of x86 code.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/asm-offsets.c    | 4 ++++
 arch/x86/kernel/asm-offsets_64.c | 2 --
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 76417a9..232152c 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -103,4 +103,8 @@ void common(void) {
 	OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
 	OFFSET(CPU_ENTRY_AREA_entry_stack, cpu_entry_area, entry_stack_page);
 	DEFINE(SIZEOF_entry_stack, sizeof(struct entry_stack));
+
+	/* Offset for sp0 and sp1 into the tss_struct */
+	OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
+	OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
 }
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index bf51e51..d2eba73 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -65,8 +65,6 @@ int main(void)
 #undef ENTRY
 
 	OFFSET(TSS_ist, tss_struct, x86_tss.ist);
-	OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
-	OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
 	BLANK();
 
 #ifdef CONFIG_CC_STACKPROTECTOR
-- 
2.7.4
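
A rough, user-space illustration of what asm-offsets.c is for: the file
is compiled to assembly and kbuild scrapes marker strings out of it to
generate asm-offsets.h, so that .S code can use structure offsets such
as TSS_sp0(%reg) as plain displacements. The structs below are
simplified mocks, not the real tss_struct; only the offsetof() idea is
the point.

/*
 * Stand-alone sketch; mock_tss_struct is a simplified stand-in for the
 * kernel's tss_struct, and the printf output mimics what ends up in the
 * generated asm-offsets.h.
 */
#include <stdio.h>
#include <stddef.h>

struct mock_x86_hw_tss {
        unsigned long   back_link;
        unsigned long   sp0;    /* stack used on entries to ring 0 */
        unsigned long   ss0;
        unsigned long   sp1;    /* ring-1 slot, unused by Linux */
        /* ... */
};

struct mock_tss_struct {
        struct mock_x86_hw_tss  x86_tss;
        /* ... */
};

int main(void)
{
        /* These would become the TSS_sp0 / TSS_sp1 assembler constants */
        printf("#define TSS_sp0 %zu\n",
               offsetof(struct mock_tss_struct, x86_tss.sp0));
        printf("#define TSS_sp1 %zu\n",
               offsetof(struct mock_tss_struct, x86_tss.sp1));
        return 0;
}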


* [PATCH 02/34] x86/entry/32: Rename TSS_sysenter_sp0 to TSS_entry_stack
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
  2018-03-05 10:25 ` [PATCH 01/34] x86/asm-offsets: Move TSS_sp0 and TSS_sp1 to asm-offsets.c Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 03/34] x86/entry/32: Load task stack from x86_tss.sp1 in SYSENTER handler Joerg Roedel
                   ` (31 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

The stack address doesn't need to be stored in tss.sp0 when
we switch stacks manually, as we do on sysenter. Rename the
offset so that it still makes sense when we change its location.

We will also use this stack for all kernel-entry points, not
just sysenter. Reflect that in the name as well.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/entry/entry_32.S        | 2 +-
 arch/x86/kernel/asm-offsets_32.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 2a35b1e..e659776 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -413,7 +413,7 @@ ENTRY(xen_sysenter_target)
  * 0(%ebp) arg6
  */
 ENTRY(entry_SYSENTER_32)
-	movl	TSS_sysenter_sp0(%esp), %esp
+	movl	TSS_entry_stack(%esp), %esp
 .Lsysenter_past_esp:
 	pushl	$__USER_DS		/* pt_regs->ss */
 	pushl	%ebp			/* pt_regs->sp (stashed in bp) */
diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index fa1261e..f452bfd 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -47,7 +47,7 @@ void foo(void)
 	BLANK();
 
 	/* Offset from the sysenter stack to tss.sp0 */
-	DEFINE(TSS_sysenter_sp0, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
+	DEFINE(TSS_entry_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
 	       offsetofend(struct cpu_entry_area, entry_stack_page.stack));
 
 #ifdef CONFIG_CC_STACKPROTECTOR
-- 
2.7.4
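
The DEFINE() above encodes a displacement rather than a plain offset:
right after SYSENTER, %esp sits at the end of the per-cpu entry stack
(the SYSENTER ESP MSR is set up to point there), and adding the
difference between offsetof() of the tss field and offsetofend() of the
entry stack lands on the field that holds the task stack pointer. A
stand-alone sketch of that arithmetic, using a simplified mock of
cpu_entry_area rather than the real layout:

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

#define offsetofend(TYPE, MEMBER) \
        (offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

struct mock_entry_stack_page { char stack[4096]; };
struct mock_tss { unsigned long sp0; unsigned long sp1; };

/* Simplified mock, not the kernel's cpu_entry_area */
struct mock_cpu_entry_area {
        struct mock_entry_stack_page    entry_stack_page;
        struct mock_tss                 tss;
};

int main(void)
{
        static struct mock_cpu_entry_area area;

        long TSS_entry_stack =
                offsetof(struct mock_cpu_entry_area, tss.sp0) -
                offsetofend(struct mock_cpu_entry_area, entry_stack_page.stack);

        /* %esp right after SYSENTER: end of the entry stack */
        uintptr_t esp = (uintptr_t)&area +
                offsetofend(struct mock_cpu_entry_area, entry_stack_page.stack);

        /* movl TSS_entry_stack(%esp), %esp reaches the stored stack pointer */
        printf("%d\n", (void *)(esp + TSS_entry_stack) == (void *)&area.tss.sp0);
        return 0;
}

The next patch only changes which field the displacement points to
(sp1 instead of sp0); the arithmetic stays the same.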


* [PATCH 03/34] x86/entry/32: Load task stack from x86_tss.sp1 in SYSENTER handler
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
  2018-03-05 10:25 ` [PATCH 01/34] x86/asm-offsets: Move TSS_sp0 and TSS_sp1 to asm-offsets.c Joerg Roedel
  2018-03-05 10:25 ` [PATCH 02/34] x86/entry/32: Rename TSS_sysenter_sp0 to TSS_entry_stack Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 04/34] x86/entry/32: Put ESPFIX code into a macro Joerg Roedel
                   ` (30 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

We want x86_tss.sp0 to point to the entry stack later, so
that it can be used as a trampoline stack for kernel entry
points besides SYSENTER.

So store the task stack pointer in x86_tss.sp1, which is
otherwise unused by the hardware, as Linux doesn't make use
of Ring 1.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/asm-offsets_32.c | 2 +-
 arch/x86/kernel/process_32.c     | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index f452bfd..b97e48c 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -47,7 +47,7 @@ void foo(void)
 	BLANK();
 
 	/* Offset from the sysenter stack to tss.sp0 */
-	DEFINE(TSS_entry_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
+	DEFINE(TSS_entry_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp1) -
 	       offsetofend(struct cpu_entry_area, entry_stack_page.stack));
 
 #ifdef CONFIG_CC_STACKPROTECTOR
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 5224c60..097d36a 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -292,6 +292,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	this_cpu_write(cpu_current_top_of_stack,
 		       (unsigned long)task_stack_page(next_p) +
 		       THREAD_SIZE);
+	/* SYSENTER reads the task-stack from tss.sp1 */
+	this_cpu_write(cpu_tss_rw.x86_tss.sp1, next_p->thread.sp0);
 
 	/*
 	 * Restore %gs if needed (which is common)
-- 
2.7.4


* [PATCH 04/34] x86/entry/32: Put ESPFIX code into a macro
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (2 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 03/34] x86/entry/32: Load task stack from x86_tss.sp1 in SYSENTER handler Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 05/34] x86/entry/32: Unshare NMI return path Joerg Roedel
                   ` (29 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

This makes it easier to split up the shared iret code path.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/entry/entry_32.S | 97 ++++++++++++++++++++++++-----------------------
 1 file changed, 49 insertions(+), 48 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index e659776..0289bde 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -221,6 +221,54 @@
 	POP_GS_EX
 .endm
 
+.macro CHECK_AND_APPLY_ESPFIX
+#ifdef CONFIG_X86_ESPFIX32
+#define GDT_ESPFIX_SS PER_CPU_VAR(gdt_page) + (GDT_ENTRY_ESPFIX_SS * 8)
+
+	ALTERNATIVE	"jmp .Lend_\@", "", X86_BUG_ESPFIX
+
+	movl	PT_EFLAGS(%esp), %eax		# mix EFLAGS, SS and CS
+	/*
+	 * Warning: PT_OLDSS(%esp) contains the wrong/random values if we
+	 * are returning to the kernel.
+	 * See comments in process.c:copy_thread() for details.
+	 */
+	movb	PT_OLDSS(%esp), %ah
+	movb	PT_CS(%esp), %al
+	andl	$(X86_EFLAGS_VM | (SEGMENT_TI_MASK << 8) | SEGMENT_RPL_MASK), %eax
+	cmpl	$((SEGMENT_LDT << 8) | USER_RPL), %eax
+	jne	.Lend_\@	# returning to user-space with LDT SS
+
+	/*
+	 * Setup and switch to ESPFIX stack
+	 *
+	 * We're returning to userspace with a 16 bit stack. The CPU will not
+	 * restore the high word of ESP for us on executing iret... This is an
+	 * "official" bug of all the x86-compatible CPUs, which we can work
+	 * around to make dosemu and wine happy. We do this by preloading the
+	 * high word of ESP with the high word of the userspace ESP while
+	 * compensating for the offset by changing to the ESPFIX segment with
+	 * a base address that matches for the difference.
+	 */
+	mov	%esp, %edx			/* load kernel esp */
+	mov	PT_OLDESP(%esp), %eax		/* load userspace esp */
+	mov	%dx, %ax			/* eax: new kernel esp */
+	sub	%eax, %edx			/* offset (low word is 0) */
+	shr	$16, %edx
+	mov	%dl, GDT_ESPFIX_SS + 4		/* bits 16..23 */
+	mov	%dh, GDT_ESPFIX_SS + 7		/* bits 24..31 */
+	pushl	$__ESPFIX_SS
+	pushl	%eax				/* new kernel esp */
+	/*
+	 * Disable interrupts, but do not irqtrace this section: we
+	 * will soon execute iret and the tracer was already set to
+	 * the irqstate after the IRET:
+	 */
+	DISABLE_INTERRUPTS(CLBR_ANY)
+	lss	(%esp), %esp			/* switch to espfix segment */
+.Lend_\@:
+#endif /* CONFIG_X86_ESPFIX32 */
+.endm
 /*
  * %eax: prev task
  * %edx: next task
@@ -548,21 +596,7 @@ ENTRY(entry_INT80_32)
 restore_all:
 	TRACE_IRQS_IRET
 .Lrestore_all_notrace:
-#ifdef CONFIG_X86_ESPFIX32
-	ALTERNATIVE	"jmp .Lrestore_nocheck", "", X86_BUG_ESPFIX
-
-	movl	PT_EFLAGS(%esp), %eax		# mix EFLAGS, SS and CS
-	/*
-	 * Warning: PT_OLDSS(%esp) contains the wrong/random values if we
-	 * are returning to the kernel.
-	 * See comments in process.c:copy_thread() for details.
-	 */
-	movb	PT_OLDSS(%esp), %ah
-	movb	PT_CS(%esp), %al
-	andl	$(X86_EFLAGS_VM | (SEGMENT_TI_MASK << 8) | SEGMENT_RPL_MASK), %eax
-	cmpl	$((SEGMENT_LDT << 8) | USER_RPL), %eax
-	je .Lldt_ss				# returning to user-space with LDT SS
-#endif
+	CHECK_AND_APPLY_ESPFIX
 .Lrestore_nocheck:
 	RESTORE_REGS 4				# skip orig_eax/error_code
 .Lirq_return:
@@ -575,39 +609,6 @@ ENTRY(iret_exc	)
 	jmp	common_exception
 .previous
 	_ASM_EXTABLE(.Lirq_return, iret_exc)
-
-#ifdef CONFIG_X86_ESPFIX32
-.Lldt_ss:
-/*
- * Setup and switch to ESPFIX stack
- *
- * We're returning to userspace with a 16 bit stack. The CPU will not
- * restore the high word of ESP for us on executing iret... This is an
- * "official" bug of all the x86-compatible CPUs, which we can work
- * around to make dosemu and wine happy. We do this by preloading the
- * high word of ESP with the high word of the userspace ESP while
- * compensating for the offset by changing to the ESPFIX segment with
- * a base address that matches for the difference.
- */
-#define GDT_ESPFIX_SS PER_CPU_VAR(gdt_page) + (GDT_ENTRY_ESPFIX_SS * 8)
-	mov	%esp, %edx			/* load kernel esp */
-	mov	PT_OLDESP(%esp), %eax		/* load userspace esp */
-	mov	%dx, %ax			/* eax: new kernel esp */
-	sub	%eax, %edx			/* offset (low word is 0) */
-	shr	$16, %edx
-	mov	%dl, GDT_ESPFIX_SS + 4		/* bits 16..23 */
-	mov	%dh, GDT_ESPFIX_SS + 7		/* bits 24..31 */
-	pushl	$__ESPFIX_SS
-	pushl	%eax				/* new kernel esp */
-	/*
-	 * Disable interrupts, but do not irqtrace this section: we
-	 * will soon execute iret and the tracer was already set to
-	 * the irqstate after the IRET:
-	 */
-	DISABLE_INTERRUPTS(CLBR_ANY)
-	lss	(%esp), %esp			/* switch to espfix segment */
-	jmp	.Lrestore_nocheck
-#endif
 ENDPROC(entry_INT80_32)
 
 .macro FIXUP_ESPFIX_STACK
-- 
2.7.4
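
The address math in CHECK_AND_APPLY_ESPFIX is compact enough to be easy
to misread, so here is a stand-alone sketch of just the arithmetic, with
made-up stack values; the GDT write and the actual segment switch are of
course not modelled:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint32_t kernel_esp = 0xc1234f00;  /* hypothetical kernel stack pointer */
        uint32_t user_esp   = 0x0056789a;  /* hypothetical 16-bit app stack */

        /* mov %dx, %ax: user high word, kernel low word */
        uint32_t new_esp = (user_esp & 0xffff0000) | (kernel_esp & 0x0000ffff);

        /* sub %eax, %edx: segment base; the low word is zero by construction */
        uint32_t base = kernel_esp - new_esp;

        /* %dl/%dh go into base bits 16..23 and 24..31 of the ESPFIX descriptor */
        printf("base=%#x (low word %#x)\n", base, base & 0xffff);

        /* through the ESPFIX segment, base + new_esp hits the kernel stack */
        printf("%d\n", base + new_esp == kernel_esp);

        /*
         * Only the high word of new_esp differs from the kernel esp, and it
         * carries the user value, so 16-bit code that sees just %sp after
         * iret still gets a sensible stack pointer.
         */
        return 0;
}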


* [PATCH 05/34] x86/entry/32: Unshare NMI return path
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (3 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 04/34] x86/entry/32: Put ESPFIX code into a macro Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 06/34] x86/entry/32: Split off return-to-kernel path Joerg Roedel
                   ` (28 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

NMI will no longer use most of the shared return path,
because NMI needs special handling when the CR3 switches for
PTI are added. This patch prepares for that.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/entry/entry_32.S | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 0289bde..00ae759 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1007,7 +1007,7 @@ ENTRY(nmi)
 
 	/* Not on SYSENTER stack. */
 	call	do_nmi
-	jmp	.Lrestore_all_notrace
+	jmp	.Lnmi_return
 
 .Lnmi_from_sysenter_stack:
 	/*
@@ -1018,7 +1018,11 @@ ENTRY(nmi)
 	movl	PER_CPU_VAR(cpu_current_top_of_stack), %esp
 	call	do_nmi
 	movl	%ebx, %esp
-	jmp	.Lrestore_all_notrace
+
+.Lnmi_return:
+	CHECK_AND_APPLY_ESPFIX
+	RESTORE_REGS 4
+	jmp	.Lirq_return
 
 #ifdef CONFIG_X86_ESPFIX32
 .Lnmi_espfix_stack:
-- 
2.7.4


* [PATCH 06/34] x86/entry/32: Split off return-to-kernel path
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (4 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 05/34] x86/entry/32: Unshare NMI return path Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 07/34] x86/entry/32: Restore segments before int registers Joerg Roedel
                   ` (27 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Use a separate return path when we know we are returning to
the kernel. This allows us to put the PTI cr3-switch and the
switch to the entry-stack into the return-to-user path
without further checking.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/entry/entry_32.S | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 00ae759..9bd7718 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -65,7 +65,7 @@
 # define preempt_stop(clobbers)	DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF
 #else
 # define preempt_stop(clobbers)
-# define resume_kernel		restore_all
+# define resume_kernel		restore_all_kernel
 #endif
 
 .macro TRACE_IRQS_IRET
@@ -400,9 +400,9 @@ ENTRY(resume_kernel)
 	DISABLE_INTERRUPTS(CLBR_ANY)
 .Lneed_resched:
 	cmpl	$0, PER_CPU_VAR(__preempt_count)
-	jnz	restore_all
+	jnz	restore_all_kernel
 	testl	$X86_EFLAGS_IF, PT_EFLAGS(%esp)	# interrupts off (exception path) ?
-	jz	restore_all
+	jz	restore_all_kernel
 	call	preempt_schedule_irq
 	jmp	.Lneed_resched
 END(resume_kernel)
@@ -602,6 +602,11 @@ restore_all:
 .Lirq_return:
 	INTERRUPT_RETURN
 
+restore_all_kernel:
+	TRACE_IRQS_IRET
+	RESTORE_REGS 4
+	jmp	.Lirq_return
+
 .section .fixup, "ax"
 ENTRY(iret_exc	)
 	pushl	$0				# no error code
-- 
2.7.4


* [PATCH 07/34] x86/entry/32: Restore segments before int registers
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (5 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 06/34] x86/entry/32: Split off return-to-kernel path Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 12:17   ` Linus Torvalds
  2018-03-05 10:25 ` [PATCH 08/34] x86/entry/32: Enter the kernel via trampoline stack Joerg Roedel
                   ` (26 subsequent siblings)
  33 siblings, 1 reply; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Restoring the segments can cause exceptions that need to be
handled. With PTI enabled, we still need to be on kernel cr3
when the exception happens. For the cr3-switch we need
at least one integer scratch register, so we can't switch
with the user integer registers already loaded.

Avoid a push/pop cycle to free a register for the cr3 switch
by restoring the segments first. That way the integer
registers are not live yet and we can use them for the cr3
switch.

This also helps in the NMI path, where we need to leave with
the same cr3 as we entered. There we still have the
callee-saved registers live when switching cr3s.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/entry/entry_32.S | 50 ++++++++++++++++++++---------------------------
 1 file changed, 21 insertions(+), 29 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 9bd7718..b39c5e2 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -92,11 +92,6 @@
 .macro PUSH_GS
 	pushl	$0
 .endm
-.macro POP_GS pop=0
-	addl	$(4 + \pop), %esp
-.endm
-.macro POP_GS_EX
-.endm
 
  /* all the rest are no-op */
 .macro PTGS_TO_GS
@@ -116,20 +111,6 @@
 	pushl	%gs
 .endm
 
-.macro POP_GS pop=0
-98:	popl	%gs
-  .if \pop <> 0
-	add	$\pop, %esp
-  .endif
-.endm
-.macro POP_GS_EX
-.pushsection .fixup, "ax"
-99:	movl	$0, (%esp)
-	jmp	98b
-.popsection
-	_ASM_EXTABLE(98b, 99b)
-.endm
-
 .macro PTGS_TO_GS
 98:	mov	PT_GS(%esp), %gs
 .endm
@@ -201,24 +182,35 @@
 	popl	%eax
 .endm
 
-.macro RESTORE_REGS pop=0
-	RESTORE_INT_REGS
-1:	popl	%ds
-2:	popl	%es
-3:	popl	%fs
-	POP_GS \pop
+.macro RESTORE_SEGMENTS
+1:	mov	PT_DS(%esp), %ds
+2:	mov	PT_ES(%esp), %es
+3:	mov	PT_FS(%esp), %fs
+	PTGS_TO_GS
 .pushsection .fixup, "ax"
-4:	movl	$0, (%esp)
+4:	movl	$0, PT_DS(%esp)
 	jmp	1b
-5:	movl	$0, (%esp)
+5:	movl	$0, PT_ES(%esp)
 	jmp	2b
-6:	movl	$0, (%esp)
+6:	movl	$0, PT_FS(%esp)
 	jmp	3b
 .popsection
 	_ASM_EXTABLE(1b, 4b)
 	_ASM_EXTABLE(2b, 5b)
 	_ASM_EXTABLE(3b, 6b)
-	POP_GS_EX
+	PTGS_TO_GS_EX
+.endm
+
+.macro RESTORE_SKIP_SEGMENTS pop=0
+	/* Jump over the segments stored on stack */
+	addl	$((4 * 4) + \pop), %esp
+.endm
+
+.macro RESTORE_REGS pop=0
+	RESTORE_SEGMENTS
+	RESTORE_INT_REGS
+	/* Skip over already restored segment registers */
+	RESTORE_SKIP_SEGMENTS \pop
 .endm
 
 .macro CHECK_AND_APPLY_ESPFIX
-- 
2.7.4


* [PATCH 08/34] x86/entry/32: Enter the kernel via trampoline stack
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (6 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 07/34] x86/entry/32: Restore segments before int registers Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 09/34] x86/entry/32: Leave " Joerg Roedel
                   ` (25 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Use the entry-stack as a trampoline to enter the kernel. The
entry-stack is already in the cpu_entry_area and will be
mapped to userspace when PTI is enabled.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/entry/entry_32.S        | 136 +++++++++++++++++++++++++++++++--------
 arch/x86/include/asm/switch_to.h |   6 +-
 arch/x86/kernel/asm-offsets.c    |   1 +
 arch/x86/kernel/cpu/common.c     |   5 +-
 arch/x86/kernel/process.c        |   2 -
 arch/x86/kernel/process_32.c     |  10 +--
 6 files changed, 121 insertions(+), 39 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index b39c5e2..1737da2 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -135,25 +135,36 @@
 
 #endif /* CONFIG_X86_32_LAZY_GS */
 
-.macro SAVE_ALL pt_regs_ax=%eax
+.macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0
 	cld
+	/* Push segment registers and %eax */
 	PUSH_GS
 	pushl	%fs
 	pushl	%es
 	pushl	%ds
 	pushl	\pt_regs_ax
+
+	/* Load kernel segments */
+	movl	$(__USER_DS), %eax
+	movl	%eax, %ds
+	movl	%eax, %es
+	movl	$(__KERNEL_PERCPU), %eax
+	movl	%eax, %fs
+	SET_KERNEL_GS %eax
+
+	/* Push integer registers and complete PT_REGS */
 	pushl	%ebp
 	pushl	%edi
 	pushl	%esi
 	pushl	%edx
 	pushl	%ecx
 	pushl	%ebx
-	movl	$(__USER_DS), %edx
-	movl	%edx, %ds
-	movl	%edx, %es
-	movl	$(__KERNEL_PERCPU), %edx
-	movl	%edx, %fs
-	SET_KERNEL_GS %edx
+
+	/* Switch to kernel stack if necessary */
+.if \switch_stacks > 0
+	SWITCH_TO_KERNEL_STACK
+.endif
+
 .endm
 
 /*
@@ -261,6 +272,72 @@
 .Lend_\@:
 #endif /* CONFIG_X86_ESPFIX32 */
 .endm
+
+
+/*
+ * Called with pt_regs fully populated and kernel segments loaded,
+ * so we can access PER_CPU and use the integer registers.
+ *
+ * We need to be very careful here with the %esp switch, because an NMI
+ * can happen everywhere. If the NMI handler finds itself on the
+ * entry-stack, it will overwrite the task-stack and everything we
+ * copied there. So allocate the stack-frame on the task-stack and
+ * switch to it before we do any copying.
+ */
+.macro SWITCH_TO_KERNEL_STACK
+
+	ALTERNATIVE     "", "jmp .Lend_\@", X86_FEATURE_XENPV
+
+	/* Are we on the entry stack? Bail out if not! */
+	movl	PER_CPU_VAR(cpu_entry_area), %edi
+	addl	$CPU_ENTRY_AREA_entry_stack, %edi
+	cmpl	%esp, %edi
+	jae	.Lend_\@
+
+	/* Load stack pointer into %esi and %edi */
+	movl	%esp, %esi
+	movl	%esi, %edi
+
+	/* Move %edi to the top of the entry stack */
+	andl	$(MASK_entry_stack), %edi
+	addl	$(SIZEOF_entry_stack), %edi
+
+	/* Load top of task-stack into %edi */
+	movl	TSS_entry_stack(%edi), %edi
+
+	/* Bytes to copy */
+	movl	$PTREGS_SIZE, %ecx
+
+#ifdef CONFIG_VM86
+	testl	$X86_EFLAGS_VM, PT_EFLAGS(%esi)
+	jz	.Lcopy_pt_regs_\@
+
+	/*
+	 * Stack-frame contains 4 additional segment registers when
+	 * coming from VM86 mode
+	 */
+	addl	$(4 * 4), %ecx
+
+.Lcopy_pt_regs_\@:
+#endif
+
+	/* Allocate frame on task-stack */
+	subl	%ecx, %edi
+
+	/* Switch to task-stack */
+	movl	%edi, %esp
+
+	/*
+	 * We are now on the task-stack and can safely copy over the
+	 * stack-frame
+	 */
+	shrl	$2, %ecx
+	cld
+	rep movsl
+
+.Lend_\@:
+.endm
+
 /*
  * %eax: prev task
  * %edx: next task
@@ -454,6 +531,7 @@ ENTRY(xen_sysenter_target)
  */
 ENTRY(entry_SYSENTER_32)
 	movl	TSS_entry_stack(%esp), %esp
+
 .Lsysenter_past_esp:
 	pushl	$__USER_DS		/* pt_regs->ss */
 	pushl	%ebp			/* pt_regs->sp (stashed in bp) */
@@ -462,7 +540,7 @@ ENTRY(entry_SYSENTER_32)
 	pushl	$__USER_CS		/* pt_regs->cs */
 	pushl	$0			/* pt_regs->ip = 0 (placeholder) */
 	pushl	%eax			/* pt_regs->orig_ax */
-	SAVE_ALL pt_regs_ax=$-ENOSYS	/* save rest */
+	SAVE_ALL pt_regs_ax=$-ENOSYS	/* save rest, stack already switched */
 
 	/*
 	 * SYSENTER doesn't filter flags, so we need to clear NT, AC
@@ -573,7 +651,8 @@ ENDPROC(entry_SYSENTER_32)
 ENTRY(entry_INT80_32)
 	ASM_CLAC
 	pushl	%eax			/* pt_regs->orig_ax */
-	SAVE_ALL pt_regs_ax=$-ENOSYS	/* save rest */
+
+	SAVE_ALL pt_regs_ax=$-ENOSYS switch_stacks=1	/* save rest */
 
 	/*
 	 * User mode is traced as though IRQs are on, and the interrupt gate
@@ -665,7 +744,8 @@ END(irq_entries_start)
 common_interrupt:
 	ASM_CLAC
 	addl	$-0x80, (%esp)			/* Adjust vector into the [-256, -1] range */
-	SAVE_ALL
+
+	SAVE_ALL switch_stacks=1
 	ENCODE_FRAME_POINTER
 	TRACE_IRQS_OFF
 	movl	%esp, %eax
@@ -673,16 +753,16 @@ common_interrupt:
 	jmp	ret_from_intr
 ENDPROC(common_interrupt)
 
-#define BUILD_INTERRUPT3(name, nr, fn)	\
-ENTRY(name)				\
-	ASM_CLAC;			\
-	pushl	$~(nr);			\
-	SAVE_ALL;			\
-	ENCODE_FRAME_POINTER;		\
-	TRACE_IRQS_OFF			\
-	movl	%esp, %eax;		\
-	call	fn;			\
-	jmp	ret_from_intr;		\
+#define BUILD_INTERRUPT3(name, nr, fn)			\
+ENTRY(name)						\
+	ASM_CLAC;					\
+	pushl	$~(nr);					\
+	SAVE_ALL switch_stacks=1;			\
+	ENCODE_FRAME_POINTER;				\
+	TRACE_IRQS_OFF					\
+	movl	%esp, %eax;				\
+	call	fn;					\
+	jmp	ret_from_intr;				\
 ENDPROC(name)
 
 #define BUILD_INTERRUPT(name, nr)		\
@@ -908,16 +988,20 @@ common_exception:
 	pushl	%es
 	pushl	%ds
 	pushl	%eax
+	movl	$(__USER_DS), %eax
+	movl	%eax, %ds
+	movl	%eax, %es
+	movl	$(__KERNEL_PERCPU), %eax
+	movl	%eax, %fs
 	pushl	%ebp
 	pushl	%edi
 	pushl	%esi
 	pushl	%edx
 	pushl	%ecx
 	pushl	%ebx
+	SWITCH_TO_KERNEL_STACK
 	ENCODE_FRAME_POINTER
 	cld
-	movl	$(__KERNEL_PERCPU), %ecx
-	movl	%ecx, %fs
 	UNWIND_ESPFIX_STACK
 	GS_TO_REG %ecx
 	movl	PT_GS(%esp), %edi		# get the function address
@@ -925,9 +1009,6 @@ common_exception:
 	movl	$-1, PT_ORIG_EAX(%esp)		# no syscall to restart
 	REG_TO_PTGS %ecx
 	SET_KERNEL_GS %ecx
-	movl	$(__USER_DS), %ecx
-	movl	%ecx, %ds
-	movl	%ecx, %es
 	TRACE_IRQS_OFF
 	movl	%esp, %eax			# pt_regs pointer
 	CALL_NOSPEC %edi
@@ -946,6 +1027,7 @@ ENTRY(debug)
 	 */
 	ASM_CLAC
 	pushl	$-1				# mark this as an int
+
 	SAVE_ALL
 	ENCODE_FRAME_POINTER
 	xorl	%edx, %edx			# error code 0
@@ -981,6 +1063,7 @@ END(debug)
  */
 ENTRY(nmi)
 	ASM_CLAC
+
 #ifdef CONFIG_X86_ESPFIX32
 	pushl	%eax
 	movl	%ss, %eax
@@ -1048,7 +1131,8 @@ END(nmi)
 ENTRY(int3)
 	ASM_CLAC
 	pushl	$-1				# mark this as an int
-	SAVE_ALL
+
+	SAVE_ALL switch_stacks=1
 	ENCODE_FRAME_POINTER
 	TRACE_IRQS_OFF
 	xorl	%edx, %edx			# zero error code
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index eb5f799..20e5f7ab 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -89,13 +89,9 @@ static inline void refresh_sysenter_cs(struct thread_struct *thread)
 /* This is used when switching tasks or entering/exiting vm86 mode. */
 static inline void update_sp0(struct task_struct *task)
 {
-	/* On x86_64, sp0 always points to the entry trampoline stack, which is constant: */
-#ifdef CONFIG_X86_32
-	load_sp0(task->thread.sp0);
-#else
+	/* sp0 always points to the entry trampoline stack, which is constant: */
 	if (static_cpu_has(X86_FEATURE_XENPV))
 		load_sp0(task_top_of_stack(task));
-#endif
 }
 
 #endif /* _ASM_X86_SWITCH_TO_H */
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 232152c..86f06e8 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -103,6 +103,7 @@ void common(void) {
 	OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
 	OFFSET(CPU_ENTRY_AREA_entry_stack, cpu_entry_area, entry_stack_page);
 	DEFINE(SIZEOF_entry_stack, sizeof(struct entry_stack));
+	DEFINE(MASK_entry_stack, (~(sizeof(struct entry_stack) - 1)));
 
 	/* Offset for sp0 and sp1 into the tss_struct */
 	OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index d63f4b57..a0ed348 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1709,11 +1709,12 @@ void cpu_init(void)
 	enter_lazy_tlb(&init_mm, curr);
 
 	/*
-	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
-	 * task never enters user mode.
+	 * Initialize the TSS.  sp0 points to the entry trampoline stack
+	 * regardless of what task is running.
 	 */
 	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
+	load_sp0((unsigned long)(cpu_entry_stack(cpu) + 1));
 
 	load_mm_ldt(&init_mm);
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index cb368c2..7fab4fa 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -57,14 +57,12 @@ __visible DEFINE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss_rw) = {
 		 */
 		.sp0 = (1UL << (BITS_PER_LONG-1)) + 1,
 
-#ifdef CONFIG_X86_64
 		/*
 		 * .sp1 is cpu_current_top_of_stack.  The init task never
 		 * runs user code, but cpu_current_top_of_stack should still
 		 * be well defined before the first context switch.
 		 */
 		.sp1 = TOP_OF_INIT_STACK,
-#endif
 
 #ifdef CONFIG_X86_32
 		.ss0 = __KERNEL_DS,
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 097d36a..3f3a8c6 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -289,10 +289,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	 */
 	update_sp0(next_p);
 	refresh_sysenter_cs(next);
-	this_cpu_write(cpu_current_top_of_stack,
-		       (unsigned long)task_stack_page(next_p) +
-		       THREAD_SIZE);
-	/* SYSENTER reads the task-stack from tss.sp1 */
+	this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));
+	/*
+	 * TODO: Find a way to let cpu_current_top_of_stack point to
+	 * cpu_tss_rw.x86_tss.sp1. Doing so now results in stack corruption with
+	 * iret exceptions.
+	 */
 	this_cpu_write(cpu_tss_rw.x86_tss.sp1, next_p->thread.sp0);
 
 	/*
-- 
2.7.4
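
One detail of SWITCH_TO_KERNEL_STACK worth spelling out is how the top
of the entry stack is found from %esp alone: the entry stack has a
power-of-two size and is aligned to at least that size, so masking with
MASK_entry_stack and adding SIZEOF_entry_stack recovers the top. A
small stand-alone sketch of that, with made-up sizes:

#include <stdio.h>
#include <stdint.h>

#define ENTRY_STACK_SIZE        4096    /* stand-in for SIZEOF_entry_stack */
#define ENTRY_STACK_MASK        (~(uintptr_t)(ENTRY_STACK_SIZE - 1))

static _Alignas(ENTRY_STACK_SIZE) unsigned char entry_stack[ENTRY_STACK_SIZE];

int main(void)
{
        /* some %esp value pointing into the partially used entry stack */
        uintptr_t esp = (uintptr_t)&entry_stack[ENTRY_STACK_SIZE - 40];

        /* andl $(MASK_entry_stack), %edi; addl $(SIZEOF_entry_stack), %edi */
        uintptr_t top = (esp & ENTRY_STACK_MASK) + ENTRY_STACK_SIZE;

        printf("%d\n", top == (uintptr_t)entry_stack + ENTRY_STACK_SIZE);
        return 0;
}

From that top the macro reads TSS_entry_stack() to get the task stack,
allocates PTREGS_SIZE bytes there (plus four extra slots for the vm86
segment registers), copies the frame over with rep movsl and only then
moves %esp, so an NMI arriving mid-copy never sees a half-built frame.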


* [PATCH 09/34] x86/entry/32: Leave the kernel via trampoline stack
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (7 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 08/34] x86/entry/32: Enter the kernel via trampoline stack Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 10/34] x86/entry/32: Introduce SAVE_ALL_NMI and RESTORE_ALL_NMI Joerg Roedel
                   ` (24 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Switch back to the trampoline stack before returning to
userspace.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/entry/entry_32.S | 79 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 77 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 1737da2..1b5656d 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -339,6 +339,60 @@
 .endm
 
 /*
+ * Switch back from the kernel stack to the entry stack.
+ *
+ * The %esp register must point to pt_regs on the task stack. It will
+ * first calculate the size of the stack-frame to copy, depending on
+ * whether we return to VM86 mode or not. With that it uses 'rep movsl'
+ * to copy the contents of the stack over to the entry stack.
+ *
+ * We must be very careful here, as we can't trust the contents of the
+ * task-stack once we switched to the entry-stack. When an NMI happens
+ * while on the entry-stack, the NMI handler will switch back to the top
+ * of the task stack, overwriting our stack-frame we are about to copy.
+ * Therefore we switch the stack only after everything is copied over.
+ */
+.macro SWITCH_TO_ENTRY_STACK
+
+	ALTERNATIVE     "", "jmp .Lend_\@", X86_FEATURE_XENPV
+
+	/* Bytes to copy */
+	movl	$PTREGS_SIZE, %ecx
+
+#ifdef CONFIG_VM86
+	testl	$(X86_EFLAGS_VM), PT_EFLAGS(%esp)
+	jz	.Lcopy_pt_regs_\@
+
+	/* Additional 4 registers to copy when returning to VM86 mode */
+	addl    $(4 * 4), %ecx
+
+.Lcopy_pt_regs_\@:
+#endif
+
+	/* Initialize source and destination for movsb */
+	movl	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %edi
+	subl	%ecx, %edi
+	movl	%esp, %esi
+
+	/* Save future stack pointer in %ebx */
+	movl	%edi, %ebx
+
+	/* Copy over the stack-frame */
+	shrl	$2, %ecx
+	cld
+	rep movsl
+
+	/*
+	 * Switch to entry-stack - needs to happen after everything is
+	 * copied because the NMI handler will overwrite the task-stack
+	 * when on entry-stack
+	 */
+	movl	%ebx, %esp
+
+.Lend_\@:
+.endm
+
+/*
  * %eax: prev task
  * %edx: next task
  */
@@ -579,25 +633,45 @@ ENTRY(entry_SYSENTER_32)
 
 /* Opportunistic SYSEXIT */
 	TRACE_IRQS_ON			/* User mode traces as IRQs on. */
+
+	/*
+	 * Setup entry stack - we keep the pointer in %eax and do the
+	 * switch after almost all user-state is restored.
+	 */
+
+	/* Load entry stack pointer and allocate frame for eflags/eax */ 
+	movl	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %eax
+	subl	$(2*4), %eax
+
+	/* Copy eflags and eax to entry stack */
+	movl	PT_EFLAGS(%esp), %edi
+	movl	PT_EAX(%esp), %esi
+	movl	%edi, (%eax)
+	movl	%esi, 4(%eax)
+
+	/* Restore user registers and segments */
 	movl	PT_EIP(%esp), %edx	/* pt_regs->ip */
 	movl	PT_OLDESP(%esp), %ecx	/* pt_regs->sp */
 1:	mov	PT_FS(%esp), %fs
 	PTGS_TO_GS
+
 	popl	%ebx			/* pt_regs->bx */
 	addl	$2*4, %esp		/* skip pt_regs->cx and pt_regs->dx */
 	popl	%esi			/* pt_regs->si */
 	popl	%edi			/* pt_regs->di */
 	popl	%ebp			/* pt_regs->bp */
-	popl	%eax			/* pt_regs->ax */
+
+	/* Switch to entry stack */
+	movl	%eax, %esp
 
 	/*
 	 * Restore all flags except IF. (We restore IF separately because
 	 * STI gives a one-instruction window in which we won't be interrupted,
 	 * whereas POPF does not.)
 	 */
-	addl	$PT_EFLAGS-PT_DS, %esp	/* point esp at pt_regs->flags */
 	btr	$X86_EFLAGS_IF_BIT, (%esp)
 	popfl
+	popl	%eax
 
 	/*
 	 * Return back to the vDSO, which will pop ecx and edx.
@@ -666,6 +740,7 @@ ENTRY(entry_INT80_32)
 
 restore_all:
 	TRACE_IRQS_IRET
+	SWITCH_TO_ENTRY_STACK
 .Lrestore_all_notrace:
 	CHECK_AND_APPLY_ESPFIX
 .Lrestore_nocheck:
-- 
2.7.4
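
The sysexit path mentioned in the cover letter now builds only a
two-dword frame on the entry stack instead of copying a whole pt_regs.
A toy model of that ordering, with a simplified pt_regs stand-in rather
than the kernel's struct:

#include <stdio.h>
#include <stdint.h>

struct mock_pt_regs {
        uint32_t bx, cx, dx, si, di, bp, ax;
        uint32_t ds, es, fs, gs, orig_ax, ip, cs, flags, sp, ss;
};

static uint32_t entry_stack[1024];

int main(void)
{
        struct mock_pt_regs regs = { .ax = 0xaa, .flags = 0x246 };
        uint32_t *sp0 = entry_stack + 1024;     /* cpu_tss_rw + TSS_sp0 */

        /* subl $(2*4), %eax: room for eflags and eax on the entry stack */
        uint32_t *frame = sp0 - 2;
        frame[0] = regs.flags;
        frame[1] = regs.ax;

        /* ... the other user registers are restored from the task stack ... */

        /* movl %eax, %esp: switch stacks only now, then popfl and popl %eax */
        uint32_t *esp = frame;
        printf("eflags=%#x eax=%#x\n", esp[0], esp[1]);
        return 0;
}

The slower restore_all path keeps copying the full frame via the
SWITCH_TO_ENTRY_STACK macro above, and likewise switches %esp only
after the copy is complete so that an NMI cannot clobber a live stack.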


* [PATCH 10/34] x86/entry/32: Introduce SAVE_ALL_NMI and RESTORE_ALL_NMI
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (8 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 09/34] x86/entry/32: Leave " Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 11/34] x86/entry/32: Handle Entry from Kernel-Mode on Entry-Stack Joerg Roedel
                   ` (23 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

These macros will be used in the NMI handler code and
replace plain SAVE_ALL and RESTORE_REGS there. We will add
the NMI-specific CR3-switch to these macros later.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/entry/entry_32.S | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 1b5656d..bb0bd896 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -167,6 +167,9 @@
 
 .endm
 
+.macro SAVE_ALL_NMI
+	SAVE_ALL
+.endm
 /*
  * This is a sneaky trick to help the unwinder find pt_regs on the stack.  The
  * frame pointer is replaced with an encoded pointer to pt_regs.  The encoding
@@ -224,6 +227,18 @@
 	RESTORE_SKIP_SEGMENTS \pop
 .endm
 
+.macro RESTORE_ALL_NMI pop=0
+	/*
+	 * Restore segments - might cause exceptions when loading
+	 * user-space segments
+	 */
+	RESTORE_SEGMENTS
+
+	/* Restore integer registers and unwind stack to iret frame */
+	RESTORE_INT_REGS
+	RESTORE_SKIP_SEGMENTS \pop
+.endm
+
 .macro CHECK_AND_APPLY_ESPFIX
 #ifdef CONFIG_X86_ESPFIX32
 #define GDT_ESPFIX_SS PER_CPU_VAR(gdt_page) + (GDT_ENTRY_ESPFIX_SS * 8)
@@ -1148,7 +1163,7 @@ ENTRY(nmi)
 #endif
 
 	pushl	%eax				# pt_regs->orig_ax
-	SAVE_ALL
+	SAVE_ALL_NMI
 	ENCODE_FRAME_POINTER
 	xorl	%edx, %edx			# zero error code
 	movl	%esp, %eax			# pt_regs pointer
@@ -1176,7 +1191,7 @@ ENTRY(nmi)
 
 .Lnmi_return:
 	CHECK_AND_APPLY_ESPFIX
-	RESTORE_REGS 4
+	RESTORE_ALL_NMI pop=4
 	jmp	.Lirq_return
 
 #ifdef CONFIG_X86_ESPFIX32
@@ -1192,12 +1207,12 @@ ENTRY(nmi)
 	pushl	16(%esp)
 	.endr
 	pushl	%eax
-	SAVE_ALL
+	SAVE_ALL_NMI
 	ENCODE_FRAME_POINTER
 	FIXUP_ESPFIX_STACK			# %eax == %esp
 	xorl	%edx, %edx			# zero error code
 	call	do_nmi
-	RESTORE_REGS
+	RESTORE_ALL_NMI
 	lss	12+4(%esp), %esp		# back to espfix stack
 	jmp	.Lirq_return
 #endif
-- 
2.7.4


* [PATCH 11/34] x86/entry/32: Handle Entry from Kernel-Mode on Entry-Stack
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (9 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 10/34] x86/entry/32: Introduce SAVE_ALL_NMI and RESTORE_ALL_NMI Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 16:41   ` Brian Gerst
  2018-03-05 10:25 ` [PATCH 12/34] x86/entry/32: Simplify debug entry point Joerg Roedel
                   ` (22 subsequent siblings)
  33 siblings, 1 reply; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

It can happen that we enter the kernel from kernel-mode and
on the entry-stack. The most common way this happens is when
we get an exception while loading the user-space segment
registers on the kernel-to-userspace exit path.

The segment loading needs to be done after the entry-stack
switch, because the stack-switch needs kernel %fs for
per_cpu access.

When this happens, we need to make sure that we leave the
kernel with the entry-stack again, so that the interrupted
code-path runs on the right stack when switching to the
user-cr3.

We detect this condition on kernel-entry by checking CS.RPL
and %esp, and when it happens, we copy the complete contents
of the entry stack over to the task-stack.
This needs to be done because once we enter the exception
handlers we might be scheduled out or even migrated to a
different CPU, so that we can't rely on the entry-stack
contents. We also leave a marker in the stack-frame to
detect this condition on the exit path.

On the exit path the copy is reversed: we copy all of the
remaining task-stack back to the entry-stack and switch
to it.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/entry/entry_32.S | 110 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 109 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index bb0bd896..3a84945 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -299,6 +299,9 @@
  * copied there. So allocate the stack-frame on the task-stack and
  * switch to it before we do any copying.
  */
+
+#define CS_FROM_ENTRY_STACK	(1 << 31)
+
 .macro SWITCH_TO_KERNEL_STACK
 
 	ALTERNATIVE     "", "jmp .Lend_\@", X86_FEATURE_XENPV
@@ -320,6 +323,10 @@
 	/* Load top of task-stack into %edi */
 	movl	TSS_entry_stack(%edi), %edi
 
+	/* Special case - entry from kernel mode via entry stack */
+	testl	$SEGMENT_RPL_MASK, PT_CS(%esp)
+	jz	.Lentry_from_kernel_\@
+
 	/* Bytes to copy */
 	movl	$PTREGS_SIZE, %ecx
 
@@ -333,8 +340,8 @@
 	 */
 	addl	$(4 * 4), %ecx
 
-.Lcopy_pt_regs_\@:
 #endif
+.Lcopy_pt_regs_\@:
 
 	/* Allocate frame on task-stack */
 	subl	%ecx, %edi
@@ -350,6 +357,56 @@
 	cld
 	rep movsl
 
+	jmp .Lend_\@
+
+.Lentry_from_kernel_\@:
+
+	/*
+	 * This handles the case when we enter the kernel from
+	 * kernel-mode and %esp points to the entry-stack. When this
+	 * happens we need to switch to the task-stack to run C code,
+	 * but switch back to the entry-stack again when we approach
+	 * iret and return to the interrupted code-path. This usually
+	 * happens when we hit an exception while restoring user-space
+	 * segment registers on the way back to user-space.
+	 *
+	 * When we switch to the task-stack here, we can't trust the
+	 * contents of the entry-stack anymore, as the exception handler
+	 * might be scheduled out or moved to another CPU. Therefore we
+	 * copy the complete entry-stack to the task-stack and set a
+	 * marker in the iret-frame (bit 31 of the CS dword) to detect
+	 * what we've done on the iret path.
+	 *
+	 * On the iret path we copy everything back and switch to the
+	 * entry-stack, so that the interrupted kernel code-path
+	 * continues on the same stack it was interrupted with.
+	 *
+	 * Be aware that an NMI can happen anytime in this code.
+	 *
+	 * %esi: Entry-Stack pointer (same as %esp)
+	 * %edi: Top of the task stack
+	 */
+
+	/* Calculate number of bytes on the entry stack in %ecx */
+	movl	%esi, %ecx
+
+	/* %ecx to the top of entry-stack */
+	andl	$(MASK_entry_stack), %ecx
+	addl	$(SIZEOF_entry_stack), %ecx
+
+	/* Number of bytes on the entry stack to %ecx */
+	sub	%esi, %ecx
+
+	/* Mark stackframe as coming from entry stack */
+	orl	$CS_FROM_ENTRY_STACK, PT_CS(%esp)
+
+	/*
+	 * %esi and %edi are unchanged, %ecx contains the number of
+	 * bytes to copy. The code at .Lcopy_pt_regs_\@ will allocate
+	 * the stack-frame on task-stack and copy everything over
+	 */
+	jmp .Lcopy_pt_regs_\@
+
 .Lend_\@:
 .endm
 
@@ -408,6 +465,56 @@
 .endm
 
 /*
+ * This macro handles the case when we return to kernel-mode on the iret
+ * path and have to switch back to the entry stack.
+ *
+ * See the comments below the .Lentry_from_kernel_\@ label in the
+ * SWITCH_TO_KERNEL_STACK macro for more details.
+ */
+.macro PARANOID_EXIT_TO_KERNEL_MODE
+
+	/*
+	 * Test if we entered the kernel with the entry-stack. Most
+	 * likely we did not, because this code only runs on the
+	 * return-to-kernel path.
+	 */
+	testl	$CS_FROM_ENTRY_STACK, PT_CS(%esp)
+	jz	.Lend_\@
+
+	/* Unlikely slow-path */
+
+	/* Clear marker from stack-frame */
+	andl	$(~CS_FROM_ENTRY_STACK), PT_CS(%esp)
+
+	/* Copy the remaining task-stack contents to entry-stack */
+	movl	%esp, %esi
+	movl	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %edi
+
+	/* Bytes on the task-stack to ecx */
+	movl	PER_CPU_VAR(cpu_current_top_of_stack), %ecx
+	subl	%esi, %ecx
+
+	/* Allocate stack-frame on entry-stack */
+	subl	%ecx, %edi
+
+	/*
+	 * Save future stack-pointer, we must not switch until the
+	 * copy is done, otherwise the NMI handler could destroy the
+	 * contents of the task-stack we are about to copy.
+	 */
+	movl	%edi, %ebx
+
+	/* Do the copy */
+	shrl	$2, %ecx
+	cld
+	rep movsl
+
+	/* Safe to switch to entry-stack now */
+	movl	%ebx, %esp
+
+.Lend_\@:
+.endm
+/*
  * %eax: prev task
  * %edx: next task
  */
@@ -765,6 +872,7 @@ restore_all:
 
 restore_all_kernel:
 	TRACE_IRQS_IRET
+	PARANOID_EXIT_TO_KERNEL_MODE
 	RESTORE_REGS 4
 	jmp	.Lirq_return
 
-- 
2.7.4
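
The marker bit works because a CS selector only occupies the low 16
bits of its 32-bit slot in the saved iret frame, so the top bits of
PT_CS are free to carry state across the exception. A small sketch of
the set/test/clear pattern (the selector value is just an example):

#include <stdio.h>
#include <stdint.h>

#define CS_FROM_ENTRY_STACK     (1u << 31)

int main(void)
{
        uint32_t pt_cs = 0x60;                  /* e.g. a kernel CS selector */

        /* entry path: remember that we came in on the entry stack */
        pt_cs |= CS_FROM_ENTRY_STACK;

        /* exit path: detect the marker and clear it before iret */
        if (pt_cs & CS_FROM_ENTRY_STACK) {
                pt_cs &= ~CS_FROM_ENTRY_STACK;
                /* ... copy the task stack back to the entry stack here ... */
        }

        printf("cs=%#x\n", pt_cs);              /* selector itself is untouched */
        return 0;
}

The entry-from-kernel case itself is detected via the selector's RPL
bits: a CS with RPL 0 in PT_CS means the interrupted code was already
running in the kernel.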


* [PATCH 12/34] x86/entry/32: Simplify debug entry point
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (10 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 11/34] x86/entry/32: Handle Entry from Kernel-Mode on Entry-Stack Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 13/34] x86/entry/32: Add PTI cr3 switches to NMI handler code Joerg Roedel
                   ` (21 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

The common exception entry code now handles the
entry-from-sysenter-stack situation and makes sure to leave
on the same stack it entered the kernel on.

So there is no need anymore for the special handling in the
debug entry code.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/entry/entry_32.S | 35 +++--------------------------------
 1 file changed, 3 insertions(+), 32 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 3a84945..b1a5f34ee 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1215,41 +1215,12 @@ END(common_exception)
 
 ENTRY(debug)
 	/*
-	 * #DB can happen at the first instruction of
-	 * entry_SYSENTER_32 or in Xen's SYSENTER prologue.  If this
-	 * happens, then we will be running on a very small stack.  We
-	 * need to detect this condition and switch to the thread
-	 * stack before calling any C code at all.
-	 *
-	 * If you edit this code, keep in mind that NMIs can happen in here.
+	 * Entry from sysenter is now handled in common_exception
 	 */
 	ASM_CLAC
 	pushl	$-1				# mark this as an int
-
-	SAVE_ALL
-	ENCODE_FRAME_POINTER
-	xorl	%edx, %edx			# error code 0
-	movl	%esp, %eax			# pt_regs pointer
-
-	/* Are we currently on the SYSENTER stack? */
-	movl	PER_CPU_VAR(cpu_entry_area), %ecx
-	addl	$CPU_ENTRY_AREA_entry_stack + SIZEOF_entry_stack, %ecx
-	subl	%eax, %ecx	/* ecx = (end of entry_stack) - esp */
-	cmpl	$SIZEOF_entry_stack, %ecx
-	jb	.Ldebug_from_sysenter_stack
-
-	TRACE_IRQS_OFF
-	call	do_debug
-	jmp	ret_from_exception
-
-.Ldebug_from_sysenter_stack:
-	/* We're on the SYSENTER stack.  Switch off. */
-	movl	%esp, %ebx
-	movl	PER_CPU_VAR(cpu_current_top_of_stack), %esp
-	TRACE_IRQS_OFF
-	call	do_debug
-	movl	%ebx, %esp
-	jmp	ret_from_exception
+	pushl	$do_debug
+	jmp	common_exception
 END(debug)
 
 /*
-- 
2.7.4


* [PATCH 13/34] x86/entry/32: Add PTI cr3 switches to NMI handler code
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (11 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 12/34] x86/entry/32: Simplify debug entry point Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 14/34] x86/entry/32: Add PTI cr3 switch to non-NMI entry/exit points Joerg Roedel
                   ` (20 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

The NMI handler is special, as it needs to leave with the
same cr3 as it was entered with. We need to do this because
we could enter the NMI handler from kernel code with
user-cr3 already loaded.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/entry/entry_32.S | 52 +++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 46 insertions(+), 6 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index b1a5f34ee..35379e5 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -77,6 +77,8 @@
 #endif
 .endm
 
+#define PTI_SWITCH_MASK         (1 << PAGE_SHIFT)
+
 /*
  * User gs save/restore
  *
@@ -167,8 +169,30 @@
 
 .endm
 
-.macro SAVE_ALL_NMI
+.macro SAVE_ALL_NMI cr3_reg:req
 	SAVE_ALL
+
+	/*
+	 * Now switch the CR3 when PTI is enabled.
+	 *
+	 * We can enter with either user or kernel cr3, the code will
+	 * store the old cr3 in \cr3_reg and switches to the kernel cr3
+	 * if necessary.
+	 */
+	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+
+	movl	%cr3, \cr3_reg
+	testl	$PTI_SWITCH_MASK, \cr3_reg
+	jz	.Lend_\@	/* Already on kernel cr3 */
+
+	/* On user cr3 - write new kernel cr3 */
+	andl	$(~PTI_SWITCH_MASK), \cr3_reg
+	movl	\cr3_reg, %cr3
+
+	/* Restore user cr3 value */
+	orl	$PTI_SWITCH_MASK, \cr3_reg
+
+.Lend_\@:
 .endm
 /*
  * This is a sneaky trick to help the unwinder find pt_regs on the stack.  The
@@ -227,13 +251,29 @@
 	RESTORE_SKIP_SEGMENTS \pop
 .endm
 
-.macro RESTORE_ALL_NMI pop=0
+.macro RESTORE_ALL_NMI cr3_reg:req pop=0
 	/*
 	 * Restore segments - might cause exceptions when loading
 	 * user-space segments
 	 */
 	RESTORE_SEGMENTS
 
+	/*
+	 * Now switch the CR3 when PTI is enabled.
+	 *
+	 * We enter with kernel cr3 and switch the cr3 to the value
+	 * stored on \cr3_reg, which is either a user or a kernel cr3.
+	 */
+	ALTERNATIVE "jmp .Lswitched_\@", "", X86_FEATURE_PTI
+
+	testl	$PTI_SWITCH_MASK, \cr3_reg
+	jz	.Lswitched_\@
+
+	/* User cr3 in \cr3_reg - write it to hardware cr3 */
+	movl	\cr3_reg, %cr3
+
+.Lswitched_\@:
+
 	/* Restore integer registers and unwind stack to iret frame */
 	RESTORE_INT_REGS
 	RESTORE_SKIP_SEGMENTS \pop
@@ -1242,7 +1282,7 @@ ENTRY(nmi)
 #endif
 
 	pushl	%eax				# pt_regs->orig_ax
-	SAVE_ALL_NMI
+	SAVE_ALL_NMI cr3_reg=%edi
 	ENCODE_FRAME_POINTER
 	xorl	%edx, %edx			# zero error code
 	movl	%esp, %eax			# pt_regs pointer
@@ -1270,7 +1310,7 @@ ENTRY(nmi)
 
 .Lnmi_return:
 	CHECK_AND_APPLY_ESPFIX
-	RESTORE_ALL_NMI pop=4
+	RESTORE_ALL_NMI cr3_reg=%edi pop=4
 	jmp	.Lirq_return
 
 #ifdef CONFIG_X86_ESPFIX32
@@ -1286,12 +1326,12 @@ ENTRY(nmi)
 	pushl	16(%esp)
 	.endr
 	pushl	%eax
-	SAVE_ALL_NMI
+	SAVE_ALL_NMI cr3_reg=%edi
 	ENCODE_FRAME_POINTER
 	FIXUP_ESPFIX_STACK			# %eax == %esp
 	xorl	%edx, %edx			# zero error code
 	call	do_nmi
-	RESTORE_ALL_NMI
+	RESTORE_ALL_NMI cr3_reg=%edi
 	lss	12+4(%esp), %esp		# back to espfix stack
 	jmp	.Lirq_return
 #endif
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 14/34] x86/entry/32: Add PTI cr3 switch to non-NMI entry/exit points
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (12 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 13/34] x86/entry/32: Add PTI cr3 switches to NMI handler code Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 15/34] x86/pgtable: Rename pti_set_user_pgd to pti_set_user_pgtbl Joerg Roedel
                   ` (19 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Add unconditional cr3 switches between user and kernel cr3
to all non-NMI entry and exit points.
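
One detail worth calling out in the diff below: when the kernel is
entered from kernel-mode while the user-cr3 is still live, the entry
code records that in an otherwise unused bit of the saved CS slot
(CS_FROM_USER_CR3, bit 30) so the iret path knows to switch back. A
small user-space sketch of that marker scheme (illustrative only):

#include <assert.h>
#include <stdint.h>

#define CS_FROM_USER_CR3	(1U << 30)	/* bit 30 of the saved CS slot */
#define PTI_SWITCH_MASK		(1U << 12)

/* On entry: remember in the saved CS whether we came in on user cr3. */
static void mark_entry_cr3(uint32_t *saved_cs, uint32_t entry_cr3)
{
	if (entry_cr3 & PTI_SWITCH_MASK)
		*saved_cs |= CS_FROM_USER_CR3;
}

/* On exit: consume the marker and decide whether to go back to user cr3. */
static int need_user_cr3_on_exit(uint32_t *saved_cs)
{
	if (!(*saved_cs & CS_FROM_USER_CR3))
		return 0;
	*saved_cs &= ~CS_FROM_USER_CR3;	/* clear marker before iret */
	return 1;
}

int main(void)
{
	uint32_t cs = 0x10;			/* kernel code segment */

	mark_entry_cr3(&cs, 0x2000U | PTI_SWITCH_MASK);
	assert(need_user_cr3_on_exit(&cs) == 1);
	assert(cs == 0x10);			/* marker is gone again */
	return 0;
}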

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/entry/entry_32.S | 91 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 88 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 35379e5..8f78abc 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -328,6 +328,30 @@
 #endif /* CONFIG_X86_ESPFIX32 */
 .endm
 
+/* Unconditionally switch to user cr3 */
+.macro SWITCH_TO_USER_CR3 scratch_reg:req
+	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+
+	movl	%cr3, \scratch_reg
+	orl	$PTI_SWITCH_MASK, \scratch_reg
+	movl	\scratch_reg, %cr3
+.Lend_\@:
+.endm
+
+/* Unconditionally switch to kernel cr3 */
+.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+	movl	%cr3, \scratch_reg
+	/* Test if we are already on kernel CR3 */
+	testl	$PTI_SWITCH_MASK, \scratch_reg
+	jz	.Lend_\@
+	andl	$(~PTI_SWITCH_MASK), \scratch_reg
+	movl	\scratch_reg, %cr3
+	/* Return original CR3 in \scratch_reg */
+	orl	$PTI_SWITCH_MASK, \scratch_reg
+.Lend_\@:
+.endm
+
 
 /*
  * Called with pt_regs fully populated and kernel segments loaded,
@@ -341,11 +365,19 @@
  */
 
 #define CS_FROM_ENTRY_STACK	(1 << 31)
+#define CS_FROM_USER_CR3	(1 << 30)
 
 .macro SWITCH_TO_KERNEL_STACK
 
 	ALTERNATIVE     "", "jmp .Lend_\@", X86_FEATURE_XENPV
 
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%eax
+
+	/*
+	 * %eax now contains the entry cr3 and we carry it forward in
+	 * that register while this macro runs.
+	 */
+
 	/* Are we on the entry stack? Bail out if not! */
 	movl	PER_CPU_VAR(cpu_entry_area), %edi
 	addl	$CPU_ENTRY_AREA_entry_stack, %edi
@@ -408,7 +440,8 @@
 	 * but switch back to the entry-stack again when we approach
 	 * iret and return to the interrupted code-path. This usually
 	 * happens when we hit an exception while restoring user-space
-	 * segment registers on the way back to user-space.
+	 * segment registers on the way back to user-space or when the
+	 * sysenter handler runs with eflags.tf set.
 	 *
 	 * When we switch to the task-stack here, we can't trust the
 	 * contents of the entry-stack anymore, as the exception handler
@@ -425,6 +458,7 @@
 	 *
 	 * %esi: Entry-Stack pointer (same as %esp)
 	 * %edi: Top of the task stack
+	 * %eax: CR3 on kernel entry
 	 */
 
 	/* Calculate number of bytes on the entry stack in %ecx */
@@ -441,6 +475,14 @@
 	orl	$CS_FROM_ENTRY_STACK, PT_CS(%esp)
 
 	/*
+	 * Test the cr3 used to enter the kernel and add a marker
+	 * so that we can switch back to it before iret.
+	 */
+	testl	$PTI_SWITCH_MASK, %eax
+	jz	.Lcopy_pt_regs_\@
+	orl	$CS_FROM_USER_CR3, PT_CS(%esp)
+
+	/*
 	 * %esi and %edi are unchanged, %ecx contains the number of
 	 * bytes to copy. The code at .Lcopy_pt_regs_\@ will allocate
 	 * the stack-frame on task-stack and copy everything over
@@ -506,7 +548,7 @@
 
 /*
  * This macro handles the case when we return to kernel-mode on the iret
- * path and have to switch back to the entry stack.
+ * path and have to switch back to the entry stack and/or user-cr3.
  *
  * See the comments below the .Lentry_from_kernel_\@ label in the
  * SWITCH_TO_KERNEL_STACK macro for more details.
@@ -552,6 +594,18 @@
 	/* Safe to switch to entry-stack now */
 	movl	%ebx, %esp
 
+	/*
+	 * We came from entry-stack and need to check if we also need to
+	 * switch back to user cr3.
+	 */
+	testl	$CS_FROM_USER_CR3, PT_CS(%esp)
+	jz	.Lend_\@
+
+	/* Clear marker from stack-frame */
+	andl	$(~CS_FROM_USER_CR3), PT_CS(%esp)
+
+	SWITCH_TO_USER_CR3 scratch_reg=%eax
+
 .Lend_\@:
 .endm
 /*
@@ -746,6 +800,18 @@ ENTRY(xen_sysenter_target)
  * 0(%ebp) arg6
  */
 ENTRY(entry_SYSENTER_32)
+	/*
+	 * On entry-stack with all userspace-regs live - save and
+	 * restore eflags and %eax to use it as scratch-reg for the cr3
+	 * switch.
+	 */
+	pushfl
+	pushl	%eax
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%eax
+	popl	%eax
+	popfl
+
+	/* Stack empty again, switch to task stack */
 	movl	TSS_entry_stack(%esp), %esp
 
 .Lsysenter_past_esp:
@@ -826,6 +892,9 @@ ENTRY(entry_SYSENTER_32)
 	/* Switch to entry stack */
 	movl	%eax, %esp
 
+	/* Now ready to switch the cr3 */
+	SWITCH_TO_USER_CR3 scratch_reg=%eax
+
 	/*
 	 * Restore all flags except IF. (We restore IF separately because
 	 * STI gives a one-instruction window in which we won't be interrupted,
@@ -906,7 +975,23 @@ restore_all:
 .Lrestore_all_notrace:
 	CHECK_AND_APPLY_ESPFIX
 .Lrestore_nocheck:
-	RESTORE_REGS 4				# skip orig_eax/error_code
+	/*
+	 * First restore user segments. This can cause exceptions, so we
+	 * run it with kernel cr3.
+	 */
+	RESTORE_SEGMENTS
+
+	/*
+	 * Segments are restored - no more exceptions from here on except on
+	 * iret, but that is handled safely.
+	 */
+	SWITCH_TO_USER_CR3 scratch_reg=%eax
+
+	/* Restore rest */
+	RESTORE_INT_REGS
+
+	/* Unwind stack to the iret frame */
+	RESTORE_SKIP_SEGMENTS 4			# skip orig_eax/error_code
 .Lirq_return:
 	INTERRUPT_RETURN
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 15/34] x86/pgtable: Rename pti_set_user_pgd to pti_set_user_pgtbl
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (13 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 14/34] x86/entry/32: Add PTI cr3 switch to non-NMI entry/exit points Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 16/34] x86/pgtable/pae: Unshare kernel PMDs when PTI is enabled Joerg Roedel
                   ` (18 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

With the way page-table folding is implemented on 32 bit, we
are not only setting PGDs with this function, but also PUDs
and even PMDs. Give the function a more generic name to
reflect that.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/pgtable_64.h | 12 ++++++------
 arch/x86/mm/pti.c                 |  2 +-
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 81462e9..b68bda5 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -195,21 +195,21 @@ static inline bool pgdp_maps_userspace(void *__ptr)
 }
 
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
-pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd);
+pgd_t __pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd);
 
 /*
  * Take a PGD location (pgdp) and a pgd value that needs to be set there.
  * Populates the user and returns the resulting PGD that must be set in
  * the kernel copy of the page tables.
  */
-static inline pgd_t pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
+static inline pgd_t pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
 {
 	if (!static_cpu_has(X86_FEATURE_PTI))
 		return pgd;
-	return __pti_set_user_pgd(pgdp, pgd);
+	return __pti_set_user_pgtbl(pgdp, pgd);
 }
 #else
-static inline pgd_t pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
+static inline pgd_t pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
 {
 	return pgd;
 }
@@ -218,7 +218,7 @@ static inline pgd_t pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
 static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
 {
 #if defined(CONFIG_PAGE_TABLE_ISOLATION) && !defined(CONFIG_X86_5LEVEL)
-	p4dp->pgd = pti_set_user_pgd(&p4dp->pgd, p4d.pgd);
+	p4dp->pgd = pti_set_user_pgtbl(&p4dp->pgd, p4d.pgd);
 #else
 	*p4dp = p4d;
 #endif
@@ -236,7 +236,7 @@ static inline void native_p4d_clear(p4d_t *p4d)
 static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
 {
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
-	*pgdp = pti_set_user_pgd(pgdp, pgd);
+	*pgdp = pti_set_user_pgtbl(pgdp, pgd);
 #else
 	*pgdp = pgd;
 #endif
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index ce38f16..8f53d21 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -102,7 +102,7 @@ void __init pti_check_boottime_disable(void)
 	setup_force_cpu_cap(X86_FEATURE_PTI);
 }
 
-pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
+pgd_t __pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
 {
 	/*
 	 * Changes to the high (kernel) portion of the kernelmode page
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 16/34] x86/pgtable/pae: Unshare kernel PMDs when PTI is enabled
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (14 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 15/34] x86/pgtable: Rename pti_set_user_pgd to pti_set_user_pgtbl Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 17/34] x86/pgtable/32: Allocate 8k page-tables " Joerg Roedel
                   ` (17 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

With PTI we need to map the per-process LDT into each
process' kernel address-space, so we need separate kernel
PMDs per PGD.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/pgtable-3level_types.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable-3level_types.h b/arch/x86/include/asm/pgtable-3level_types.h
index 876b4c7..ed8a200 100644
--- a/arch/x86/include/asm/pgtable-3level_types.h
+++ b/arch/x86/include/asm/pgtable-3level_types.h
@@ -21,9 +21,10 @@ typedef union {
 #endif	/* !__ASSEMBLY__ */
 
 #ifdef CONFIG_PARAVIRT
-#define SHARED_KERNEL_PMD	(pv_info.shared_kernel_pmd)
+#define SHARED_KERNEL_PMD	((!static_cpu_has(X86_FEATURE_PTI) &&	\
+				 (pv_info.shared_kernel_pmd)))
 #else
-#define SHARED_KERNEL_PMD	1
+#define SHARED_KERNEL_PMD	(!static_cpu_has(X86_FEATURE_PTI))
 #endif
 
 /*
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 17/34] x86/pgtable/32: Allocate 8k page-tables when PTI is enabled
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (15 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 16/34] x86/pgtable/pae: Unshare kernel PMDs when PTI is enabled Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 18/34] x86/pgtable: Move pgdp kernel/user conversion functions to pgtable.h Joerg Roedel
                   ` (16 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Allocate a kernel and a user page-table root when PTI is
enabled. Also allocate a full page per root for PAE because
otherwise the bit to flip in cr3 to switch between them
would be non-constant, which creates a lot of hassle. A
smaller allocation is left as a later optimization.
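
To illustrate the "non-constant bit" argument (user-space sketch only,
not kernel code): with an 8k-aligned, 8k-sized allocation the user
root always sits exactly one page above the kernel root, so the cr3
switch is a fixed flip of bit 12; two independent 4k allocations would
put the two roots at an arbitrary distance from each other.

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

int main(void)
{
	/* Order-1 allocation: 8k-aligned, 8k-sized, kernel root first. */
	void *kernel_pgd = aligned_alloc(8192, 8192);
	uintptr_t k = (uintptr_t)kernel_pgd;
	uintptr_t u = k + 4096;			/* user root in the second 4k */

	/* The two roots differ in exactly bit 12 ... */
	assert((k ^ u) == (1UL << 12));
	/* ... so cr3 switching can simply set or clear that one bit. */
	assert((k | (1UL << 12)) == u);
	assert((u & ~(1UL << 12)) == k);

	free(kernel_pgd);
	return 0;
}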

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/head_32.S | 20 +++++++++++++++-----
 arch/x86/mm/pgtable.c     |  5 +++--
 2 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index c290209..1f35d60 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -512,11 +512,18 @@ ENTRY(initial_code)
 ENTRY(setup_once_ref)
 	.long setup_once
 
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+#define	PGD_ALIGN	(2 * PAGE_SIZE)
+#define PTI_USER_PGD_FILL	1024
+#else
+#define	PGD_ALIGN	(PAGE_SIZE)
+#define PTI_USER_PGD_FILL	0
+#endif
 /*
  * BSS section
  */
 __PAGE_ALIGNED_BSS
-	.align PAGE_SIZE
+	.align PGD_ALIGN
 #ifdef CONFIG_X86_PAE
 .globl initial_pg_pmd
 initial_pg_pmd:
@@ -526,14 +533,17 @@ initial_pg_pmd:
 initial_page_table:
 	.fill 1024,4,0
 #endif
+	.align PGD_ALIGN
 initial_pg_fixmap:
 	.fill 1024,4,0
-.globl empty_zero_page
-empty_zero_page:
-	.fill 4096,1,0
 .globl swapper_pg_dir
+	.align PGD_ALIGN
 swapper_pg_dir:
 	.fill 1024,4,0
+	.fill PTI_USER_PGD_FILL,4,0
+.globl empty_zero_page
+empty_zero_page:
+	.fill 4096,1,0
 EXPORT_SYMBOL(empty_zero_page)
 
 /*
@@ -542,7 +552,7 @@ EXPORT_SYMBOL(empty_zero_page)
 #ifdef CONFIG_X86_PAE
 __PAGE_ALIGNED_DATA
 	/* Page-aligned for the benefit of paravirt? */
-	.align PAGE_SIZE
+	.align PGD_ALIGN
 ENTRY(initial_page_table)
 	.long	pa(initial_pg_pmd+PGD_IDENT_ATTR),0	/* low identity map */
 # if KPMDS == 3
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 004abf9..a81d42e 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -338,7 +338,8 @@ static inline pgd_t *_pgd_alloc(void)
 	 * We allocate one page for pgd.
 	 */
 	if (!SHARED_KERNEL_PMD)
-		return (pgd_t *)__get_free_page(PGALLOC_GFP);
+		return (pgd_t *)__get_free_pages(PGALLOC_GFP,
+						 PGD_ALLOCATION_ORDER);
 
 	/*
 	 * Now PAE kernel is not running as a Xen domain. We can allocate
@@ -350,7 +351,7 @@ static inline pgd_t *_pgd_alloc(void)
 static inline void _pgd_free(pgd_t *pgd)
 {
 	if (!SHARED_KERNEL_PMD)
-		free_page((unsigned long)pgd);
+		free_pages((unsigned long)pgd, PGD_ALLOCATION_ORDER);
 	else
 		kmem_cache_free(pgd_cache, pgd);
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 18/34] x86/pgtable: Move pgdp kernel/user conversion functions to pgtable.h
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (16 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 17/34] x86/pgtable/32: Allocate 8k page-tables " Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 19/34] x86/pgtable: Move pti_set_user_pgtbl() " Joerg Roedel
                   ` (15 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Make them available on 32 bit and make clone_pgd_range() happy.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/pgtable.h    | 49 +++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/pgtable_64.h | 49 ---------------------------------------
 2 files changed, 49 insertions(+), 49 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e42b894..0a9f746 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1109,6 +1109,55 @@ static inline int pud_write(pud_t pud)
 	return pud_flags(pud) & _PAGE_RW;
 }
 
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+/*
+ * All top-level PAGE_TABLE_ISOLATION page tables are order-1 pages
+ * (8k-aligned and 8k in size).  The kernel one is at the beginning 4k and
+ * the user one is in the last 4k.  To switch between them, you
+ * just need to flip the 12th bit in their addresses.
+ */
+#define PTI_PGTABLE_SWITCH_BIT	PAGE_SHIFT
+
+/*
+ * This generates better code than the inline assembly in
+ * __set_bit().
+ */
+static inline void *ptr_set_bit(void *ptr, int bit)
+{
+	unsigned long __ptr = (unsigned long)ptr;
+
+	__ptr |= BIT(bit);
+	return (void *)__ptr;
+}
+static inline void *ptr_clear_bit(void *ptr, int bit)
+{
+	unsigned long __ptr = (unsigned long)ptr;
+
+	__ptr &= ~BIT(bit);
+	return (void *)__ptr;
+}
+
+static inline pgd_t *kernel_to_user_pgdp(pgd_t *pgdp)
+{
+	return ptr_set_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
+}
+
+static inline pgd_t *user_to_kernel_pgdp(pgd_t *pgdp)
+{
+	return ptr_clear_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
+}
+
+static inline p4d_t *kernel_to_user_p4dp(p4d_t *p4dp)
+{
+	return ptr_set_bit(p4dp, PTI_PGTABLE_SWITCH_BIT);
+}
+
+static inline p4d_t *user_to_kernel_p4dp(p4d_t *p4dp)
+{
+	return ptr_clear_bit(p4dp, PTI_PGTABLE_SWITCH_BIT);
+}
+#endif /* CONFIG_PAGE_TABLE_ISOLATION */
+
 /*
  * clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
  *
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index b68bda5..5e68083 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -131,55 +131,6 @@ static inline pud_t native_pudp_get_and_clear(pud_t *xp)
 #endif
 }
 
-#ifdef CONFIG_PAGE_TABLE_ISOLATION
-/*
- * All top-level PAGE_TABLE_ISOLATION page tables are order-1 pages
- * (8k-aligned and 8k in size).  The kernel one is at the beginning 4k and
- * the user one is in the last 4k.  To switch between them, you
- * just need to flip the 12th bit in their addresses.
- */
-#define PTI_PGTABLE_SWITCH_BIT	PAGE_SHIFT
-
-/*
- * This generates better code than the inline assembly in
- * __set_bit().
- */
-static inline void *ptr_set_bit(void *ptr, int bit)
-{
-	unsigned long __ptr = (unsigned long)ptr;
-
-	__ptr |= BIT(bit);
-	return (void *)__ptr;
-}
-static inline void *ptr_clear_bit(void *ptr, int bit)
-{
-	unsigned long __ptr = (unsigned long)ptr;
-
-	__ptr &= ~BIT(bit);
-	return (void *)__ptr;
-}
-
-static inline pgd_t *kernel_to_user_pgdp(pgd_t *pgdp)
-{
-	return ptr_set_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
-}
-
-static inline pgd_t *user_to_kernel_pgdp(pgd_t *pgdp)
-{
-	return ptr_clear_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
-}
-
-static inline p4d_t *kernel_to_user_p4dp(p4d_t *p4dp)
-{
-	return ptr_set_bit(p4dp, PTI_PGTABLE_SWITCH_BIT);
-}
-
-static inline p4d_t *user_to_kernel_p4dp(p4d_t *p4dp)
-{
-	return ptr_clear_bit(p4dp, PTI_PGTABLE_SWITCH_BIT);
-}
-#endif /* CONFIG_PAGE_TABLE_ISOLATION */
-
 /*
  * Page table pages are page-aligned.  The lower half of the top
  * level is used for userspace and the top half for the kernel.
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 19/34] x86/pgtable: Move pti_set_user_pgtbl() to pgtable.h
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (17 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 18/34] x86/pgtable: Move pgdp kernel/user conversion functions to pgtable.h Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 20/34] x86/pgtable: Move two more functions from pgtable_64.h " Joerg Roedel
                   ` (14 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

There it is also usable from 32 bit code.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/pgtable.h    | 23 +++++++++++++++++++++++
 arch/x86/include/asm/pgtable_64.h | 21 ---------------------
 2 files changed, 23 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0a9f746..1e900c1 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -618,8 +618,31 @@ static inline int is_new_memtype_allowed(u64 paddr, unsigned long size,
 
 pmd_t *populate_extra_pmd(unsigned long vaddr);
 pte_t *populate_extra_pte(unsigned long vaddr);
+
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+pgd_t __pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd);
+
+/*
+ * Take a PGD location (pgdp) and a pgd value that needs to be set there.
+ * Populates the user and returns the resulting PGD that must be set in
+ * the kernel copy of the page tables.
+ */
+static inline pgd_t pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
+{
+	if (!static_cpu_has(X86_FEATURE_PTI))
+		return pgd;
+	return __pti_set_user_pgtbl(pgdp, pgd);
+}
+#else   /* CONFIG_PAGE_TABLE_ISOLATION */
+static inline pgd_t pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
+{
+	return pgd;
+}
+#endif  /* CONFIG_PAGE_TABLE_ISOLATION */
+
 #endif	/* __ASSEMBLY__ */
 
+
 #ifdef CONFIG_X86_32
 # include <asm/pgtable_32.h>
 #else
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 5e68083..4cbf517 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -145,27 +145,6 @@ static inline bool pgdp_maps_userspace(void *__ptr)
 	return (ptr & ~PAGE_MASK) < (PAGE_SIZE / 2);
 }
 
-#ifdef CONFIG_PAGE_TABLE_ISOLATION
-pgd_t __pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd);
-
-/*
- * Take a PGD location (pgdp) and a pgd value that needs to be set there.
- * Populates the user and returns the resulting PGD that must be set in
- * the kernel copy of the page tables.
- */
-static inline pgd_t pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
-{
-	if (!static_cpu_has(X86_FEATURE_PTI))
-		return pgd;
-	return __pti_set_user_pgtbl(pgdp, pgd);
-}
-#else
-static inline pgd_t pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
-{
-	return pgd;
-}
-#endif
-
 static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
 {
 #if defined(CONFIG_PAGE_TABLE_ISOLATION) && !defined(CONFIG_X86_5LEVEL)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 20/34] x86/pgtable: Move two more functions from pgtable_64.h to pgtable.h
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (18 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 19/34] x86/pgtable: Move pti_set_user_pgtbl() " Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 21/34] x86/mm/pae: Populate valid user PGD entries Joerg Roedel
                   ` (13 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

These two functions are required for PTI on 32 bit:

	* pgdp_maps_userspace()
	* pgd_large()

Also re-implement pgdp_maps_userspace() so that it will work
on 64 and 32 bit kernels.
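
A quick worked example of the re-implemented helper (user-space model,
assuming the default VMSPLIT_3G split and 2-level paging):
CONFIG_PAGE_OFFSET is 0xC0000000 and PGDIR_SHIFT is 22, so
PGD_KERNEL_START is 768 and only the first 768 of the 1024 PGD slots
map userspace.

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t pgd; } pgd_t;

#define PAGE_MASK		(~0xfffUL)
#define CONFIG_PAGE_OFFSET	0xC0000000UL	/* VMSPLIT_3G */
#define PGDIR_SHIFT		22		/* 2-level paging */
#define PGD_KERNEL_START	(CONFIG_PAGE_OFFSET >> PGDIR_SHIFT)

/* One page-table page worth of PGD entries. */
static pgd_t pgd[1024] __attribute__((aligned(4096)));

static bool pgdp_maps_userspace(void *__ptr)
{
	unsigned long ptr = (unsigned long)__ptr;

	return (((ptr & ~PAGE_MASK) / sizeof(pgd_t)) < PGD_KERNEL_START);
}

int main(void)
{
	assert(PGD_KERNEL_START == 768);
	assert(pgdp_maps_userspace(&pgd[0]));		/* userspace half */
	assert(pgdp_maps_userspace(&pgd[767]));
	assert(!pgdp_maps_userspace(&pgd[768]));	/* kernel half */
	return 0;
}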

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/pgtable-2level_types.h |  3 +++
 arch/x86/include/asm/pgtable-3level_types.h |  1 +
 arch/x86/include/asm/pgtable.h              | 16 ++++++++++++++++
 arch/x86/include/asm/pgtable_64.h           | 15 ---------------
 arch/x86/include/asm/pgtable_64_types.h     |  2 ++
 5 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/pgtable-2level_types.h b/arch/x86/include/asm/pgtable-2level_types.h
index f982ef8..6deb6cd 100644
--- a/arch/x86/include/asm/pgtable-2level_types.h
+++ b/arch/x86/include/asm/pgtable-2level_types.h
@@ -35,4 +35,7 @@ typedef union {
 
 #define PTRS_PER_PTE	1024
 
+/* This covers all VMSPLIT_* and VMSPLIT_*_OPT variants */
+#define PGD_KERNEL_START	(CONFIG_PAGE_OFFSET >> PGDIR_SHIFT)
+
 #endif /* _ASM_X86_PGTABLE_2LEVEL_DEFS_H */
diff --git a/arch/x86/include/asm/pgtable-3level_types.h b/arch/x86/include/asm/pgtable-3level_types.h
index ed8a200..925ac1b 100644
--- a/arch/x86/include/asm/pgtable-3level_types.h
+++ b/arch/x86/include/asm/pgtable-3level_types.h
@@ -45,5 +45,6 @@ typedef union {
  */
 #define PTRS_PER_PTE	512
 
+#define PGD_KERNEL_START	(CONFIG_PAGE_OFFSET >> PGDIR_SHIFT)
 
 #endif /* _ASM_X86_PGTABLE_3LEVEL_DEFS_H */
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1e900c1..981e49a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1132,6 +1132,22 @@ static inline int pud_write(pud_t pud)
 	return pud_flags(pud) & _PAGE_RW;
 }
 
+/*
+ * Page table pages are page-aligned.  The lower half of the top
+ * level is used for userspace and the top half for the kernel.
+ *
+ * Returns true for parts of the PGD that map userspace and
+ * false for the parts that map the kernel.
+ */
+static inline bool pgdp_maps_userspace(void *__ptr)
+{
+	unsigned long ptr = (unsigned long)__ptr;
+
+	return (((ptr & ~PAGE_MASK) / sizeof(pgd_t)) < PGD_KERNEL_START);
+}
+
+static inline int pgd_large(pgd_t pgd) { return 0; }
+
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
 /*
  * All top-level PAGE_TABLE_ISOLATION page tables are order-1 pages
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 4cbf517..e11b925 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -131,20 +131,6 @@ static inline pud_t native_pudp_get_and_clear(pud_t *xp)
 #endif
 }
 
-/*
- * Page table pages are page-aligned.  The lower half of the top
- * level is used for userspace and the top half for the kernel.
- *
- * Returns true for parts of the PGD that map userspace and
- * false for the parts that map the kernel.
- */
-static inline bool pgdp_maps_userspace(void *__ptr)
-{
-	unsigned long ptr = (unsigned long)__ptr;
-
-	return (ptr & ~PAGE_MASK) < (PAGE_SIZE / 2);
-}
-
 static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
 {
 #if defined(CONFIG_PAGE_TABLE_ISOLATION) && !defined(CONFIG_X86_5LEVEL)
@@ -187,7 +173,6 @@ extern void sync_global_pgds(unsigned long start, unsigned long end);
 /*
  * Level 4 access.
  */
-static inline int pgd_large(pgd_t pgd) { return 0; }
 #define mk_kernel_pgd(address) __pgd((address) | _KERNPG_TABLE)
 
 /* PUD - Level3 access */
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 6b8f73d..e57003a 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -124,4 +124,6 @@ typedef struct { pteval_t pte; } pte_t;
 
 #define EARLY_DYNAMIC_PAGE_TABLES	64
 
+#define PGD_KERNEL_START	((PAGE_SIZE / 2) / sizeof(pgd_t))
+
 #endif /* _ASM_X86_PGTABLE_64_DEFS_H */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 21/34] x86/mm/pae: Populate valid user PGD entries
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (19 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 20/34] x86/pgtable: Move two more functions from pgtable_64.h " Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 22/34] x86/mm/pae: Populate the user page-table with user pgd's Joerg Roedel
                   ` (12 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Generic page-table code populates all non-leaf entries with
_KERNPG_TABLE bits set. This is fine for all paging modes
except PAE.

In PAE mode only a subset of the bits is allowed to be set.
Make sure we only set allowed bits by masking out the
reserved bits.
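
A small user-space model of the masking (illustrative only; the
software/AVL bits and the exact physical-address width are simplified
here): bits such as _PAGE_RW, _PAGE_ACCESSED and _PAGE_DIRTY, which
generic code sets via _KERNPG_TABLE, get stripped because they are
reserved/MBZ in a PAE top-level entry.

#include <assert.h>
#include <stdint.h>

typedef uint64_t pgdval_t;

#define _PAGE_PRESENT	(1ULL << 0)
#define _PAGE_RW	(1ULL << 1)	/* reserved in a PAE PDPTE */
#define _PAGE_PWT	(1ULL << 3)
#define _PAGE_PCD	(1ULL << 4)
#define _PAGE_ACCESSED	(1ULL << 5)	/* reserved in a PAE PDPTE */
#define _PAGE_DIRTY	(1ULL << 6)	/* reserved in a PAE PDPTE */

/* Simplified: 36 address bits, software/AVL bits left out. */
#define PGD_PAE_PHYS_MASK	(((1ULL << 36) - 1) & ~0xfffULL)
#define PGD_ALLOWED_BITS	(PGD_PAE_PHYS_MASK | _PAGE_PRESENT | \
				 _PAGE_PWT | _PAGE_PCD)

int main(void)
{
	/* What generic page-table code would write (_KERNPG_TABLE style). */
	pgdval_t val = 0x12345000ULL | _PAGE_PRESENT | _PAGE_RW |
		       _PAGE_ACCESSED | _PAGE_DIRTY;
	/* What native_make_pgd() now stores after masking. */
	pgdval_t pgd = val & PGD_ALLOWED_BITS;

	assert(pgd == (0x12345000ULL | _PAGE_PRESENT));	/* MBZ bits stripped */
	return 0;
}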

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/pgtable_types.h | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 3696398..48fc70b 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -50,6 +50,7 @@
 #define _PAGE_GLOBAL	(_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL)
 #define _PAGE_SOFTW1	(_AT(pteval_t, 1) << _PAGE_BIT_SOFTW1)
 #define _PAGE_SOFTW2	(_AT(pteval_t, 1) << _PAGE_BIT_SOFTW2)
+#define _PAGE_SOFTW3	(_AT(pteval_t, 1) << _PAGE_BIT_SOFTW3)
 #define _PAGE_PAT	(_AT(pteval_t, 1) << _PAGE_BIT_PAT)
 #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
 #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
@@ -267,14 +268,37 @@ typedef struct pgprot { pgprotval_t pgprot; } pgprot_t;
 
 typedef struct { pgdval_t pgd; } pgd_t;
 
+#ifdef CONFIG_X86_PAE
+
+/*
+ * PHYSICAL_PAGE_MASK might be non-constant when SME is compiled in, so we can't
+ * use it here.
+ */
+
+#define PGD_PAE_PAGE_MASK	((signed long)PAGE_MASK)
+#define PGD_PAE_PHYS_MASK	(((1ULL << __PHYSICAL_MASK_SHIFT)-1) & PGD_PAE_PAGE_MASK)
+
+/*
+ * PAE allows Base Address, P, PWT, PCD and AVL bits to be set in PGD entries.
+ * All other bits are Reserved MBZ
+ */
+#define PGD_ALLOWED_BITS	(PGD_PAE_PHYS_MASK | _PAGE_PRESENT | \
+				 _PAGE_PWT | _PAGE_PCD | \
+				 _PAGE_SOFTW1 | _PAGE_SOFTW2 | _PAGE_SOFTW3 )
+
+#else
+/* No need to mask any bits for !PAE */
+#define PGD_ALLOWED_BITS	(~0ULL)
+#endif
+
 static inline pgd_t native_make_pgd(pgdval_t val)
 {
-	return (pgd_t) { val };
+	return (pgd_t) { val & PGD_ALLOWED_BITS };
 }
 
 static inline pgdval_t native_pgd_val(pgd_t pgd)
 {
-	return pgd.pgd;
+	return pgd.pgd & PGD_ALLOWED_BITS;
 }
 
 static inline pgdval_t pgd_flags(pgd_t pgd)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 22/34] x86/mm/pae: Populate the user page-table with user pgd's
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (20 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 21/34] x86/mm/pae: Populate valid user PGD entries Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 23/34] x86/mm/legacy: " Joerg Roedel
                   ` (11 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

When we populate a PGD entry, make sure we populate it in
the user page-table too.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/pgtable-3level.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index bc4af54..ab2aa44 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -98,6 +98,9 @@ static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
 
 static inline void native_set_pud(pud_t *pudp, pud_t pud)
 {
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+	pud.p4d.pgd = pti_set_user_pgtbl(&pudp->p4d.pgd, pud.p4d.pgd);
+#endif
 	set_64bit((unsigned long long *)(pudp), native_pud_val(pud));
 }
 
@@ -194,6 +197,10 @@ static inline pud_t native_pudp_get_and_clear(pud_t *pudp)
 {
 	union split_pud res, *orig = (union split_pud *)pudp;
 
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+	pti_set_user_pgtbl(&pudp->p4d.pgd, __pgd(0));
+#endif
+
 	/* xchg acts as a barrier before setting of the high bits */
 	res.pud_low = xchg(&orig->pud_low, 0);
 	res.pud_high = orig->pud_high;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 23/34] x86/mm/legacy: Populate the user page-table with user pgd's
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (21 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 22/34] x86/mm/pae: Populate the user page-table with user pgd's Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 24/34] x86/mm/pti: Add an overflow check to pti_clone_pmds() Joerg Roedel
                   ` (10 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Also populate the user-space pgd's in the user page-table.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/pgtable-2level.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/include/asm/pgtable-2level.h b/arch/x86/include/asm/pgtable-2level.h
index 685ffe8..c399ea5 100644
--- a/arch/x86/include/asm/pgtable-2level.h
+++ b/arch/x86/include/asm/pgtable-2level.h
@@ -19,6 +19,9 @@ static inline void native_set_pte(pte_t *ptep , pte_t pte)
 
 static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
 {
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+	pmd.pud.p4d.pgd = pti_set_user_pgtbl(&pmdp->pud.p4d.pgd, pmd.pud.p4d.pgd);
+#endif
 	*pmdp = pmd;
 }
 
@@ -58,6 +61,9 @@ static inline pte_t native_ptep_get_and_clear(pte_t *xp)
 #ifdef CONFIG_SMP
 static inline pmd_t native_pmdp_get_and_clear(pmd_t *xp)
 {
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+	pti_set_user_pgtbl(&xp->pud.p4d.pgd, __pgd(0));
+#endif
 	return __pmd(xchg((pmdval_t *)xp, 0));
 }
 #else
@@ -67,6 +73,9 @@ static inline pmd_t native_pmdp_get_and_clear(pmd_t *xp)
 #ifdef CONFIG_SMP
 static inline pud_t native_pudp_get_and_clear(pud_t *xp)
 {
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+	pti_set_user_pgtbl(&xp->p4d.pgd, __pgd(0));
+#endif
 	return __pud(xchg((pudval_t *)xp, 0));
 }
 #else
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 24/34] x86/mm/pti: Add an overflow check to pti_clone_pmds()
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (22 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 23/34] x86/mm/legacy: " Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 25/34] x86/mm/pti: Define X86_CR3_PTI_PCID_USER_BIT on x86_32 Joerg Roedel
                   ` (9 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

The addr counter will overflow if we clone the last PMD of
the address space, resulting in an endless loop.

Check for that and bail out of the loop when it happens.
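
A small illustration of the wrap-around (user-space sketch, 32 bit
arithmetic): cloning a PMD that ends at the top of the address space
makes addr wrap to 0 after the increment, which is still below end,
so without the new addr < start check the loop would never terminate.

#include <assert.h>
#include <stdint.h>

#define PMD_SIZE	(1UL << 21)	/* 2MB with PAE paging */

int main(void)
{
	/* 32 bit address arithmetic, as in the kernel's unsigned long. */
	uint32_t start = 0xffe00000u;	/* last PMD of the address space */
	uint32_t end   = 0xffffffffu;
	uint32_t addr;
	int iterations = 0;

	for (addr = start; addr < end; addr += PMD_SIZE) {
		if (addr < start)	/* the new overflow check */
			break;
		iterations++;
		if (iterations > 4)	/* would spin forever without the check */
			break;
	}
	assert(iterations == 1);	/* one PMD cloned, then the wrap is caught */
	return 0;
}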

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/mm/pti.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 8f53d21..96a690e 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -282,6 +282,10 @@ pti_clone_pmds(unsigned long start, unsigned long end, pmdval_t clear)
 		p4d_t *p4d;
 		pud_t *pud;
 
+		/* Overflow check */
+		if (addr < start)
+			break;
+
 		pgd = pgd_offset_k(addr);
 		if (WARN_ON(pgd_none(*pgd)))
 			return;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 25/34] x86/mm/pti: Define X86_CR3_PTI_PCID_USER_BIT on x86_32
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (23 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 24/34] x86/mm/pti: Add an overflow check to pti_clone_pmds() Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 26/34] x86/mm/pti: Clone CPU_ENTRY_AREA on PMD level " Joerg Roedel
                   ` (8 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Move it out of the X86_64-specific processor defines so
that it is visible for 32 bit too.

Reviewed-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/processor-flags.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h
index 625a52a..02c2cbd 100644
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -39,10 +39,6 @@
 #define CR3_PCID_MASK	0xFFFull
 #define CR3_NOFLUSH	BIT_ULL(63)
 
-#ifdef CONFIG_PAGE_TABLE_ISOLATION
-# define X86_CR3_PTI_PCID_USER_BIT	11
-#endif
-
 #else
 /*
  * CR3_ADDR_MASK needs at least bits 31:5 set on PAE systems, and we save
@@ -53,4 +49,8 @@
 #define CR3_NOFLUSH	0
 #endif
 
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+# define X86_CR3_PTI_PCID_USER_BIT	11
+#endif
+
 #endif /* _ASM_X86_PROCESSOR_FLAGS_H */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 26/34] x86/mm/pti: Clone CPU_ENTRY_AREA on PMD level on x86_32
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (24 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 25/34] x86/mm/pti: Define X86_CR3_PTI_PCID_USER_BIT on x86_32 Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 27/34] x86/mm/dump_pagetables: Define INIT_PGD Joerg Roedel
                   ` (7 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Cloning on the P4D level would clone the complete kernel
address space into the user-space page-tables for PAE
kernels. Cloning on PMD level is fine for PAE and legacy
paging.
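
For scale (user-space sketch, assuming PAE and a 3G/1G split): the top
level has only four entries of 1GB each, so the whole kernel mapping
sits under a single pgd/p4d entry; cloning that entry would share all
512 of its PMD entries with the user page-table, while PMD-level
cloning copies only the few entries covering the cpu_entry_area.

#include <assert.h>
#include <stdint.h>

int main(void)
{
	/* PAE: 2 address bits select one of 4 top-level (PDPT) entries. */
	uint32_t pgd_entries  = 4;
	uint32_t pgd_span     = 1u << 30;	/* 1GB per pgd/p4d entry */
	uint32_t pmd_span     = 1u << 21;	/* 2MB per PMD entry */
	uint32_t kernel_start = 0xC0000000u;	/* 3G/1G split */

	/* The entire kernel gigabyte sits under the last top-level entry ... */
	assert(kernel_start / pgd_span == 3 && pgd_entries - 1 == 3);
	/* ... so cloning that entry would hand over 512 PMD entries (1GB), */
	assert(pgd_span / pmd_span == 512);
	/* while PMD-level cloning copies only what covers the cpu_entry_area. */
	return 0;
}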

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/mm/pti.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 96a690e..3ffd923 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -312,6 +312,7 @@ pti_clone_pmds(unsigned long start, unsigned long end, pmdval_t clear)
 	}
 }
 
+#ifdef CONFIG_X86_64
 /*
  * Clone a single p4d (i.e. a top-level entry on 4-level systems and a
  * next-level entry on 5-level systems.
@@ -335,6 +336,25 @@ static void __init pti_clone_user_shared(void)
 	pti_clone_p4d(CPU_ENTRY_AREA_BASE);
 }
 
+#else /* CONFIG_X86_64 */
+
+/*
+ * On 32 bit PAE systems with 1GB of Kernel address space there is only
+ * one pgd/p4d for the whole kernel. Cloning that would map the whole
+ * address space into the user page-tables, making PTI useless. So clone
+ * the page-table on the PMD level to prevent that.
+ */
+static void __init pti_clone_user_shared(void)
+{
+	unsigned long start, end;
+
+	start = CPU_ENTRY_AREA_BASE;
+	end   = start + (PAGE_SIZE * CPU_ENTRY_AREA_PAGES);
+
+	pti_clone_pmds(start, end, _PAGE_GLOBAL);
+}
+#endif /* CONFIG_X86_64 */
+
 /*
  * Clone the ESPFIX P4D into the user space visible page table
  */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 27/34] x86/mm/dump_pagetables: Define INIT_PGD
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (25 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 26/34] x86/mm/pti: Clone CPU_ENTRY_AREA on PMD level " Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 28/34] x86/pgtable/pae: Use separate kernel PMDs for user page-table Joerg Roedel
                   ` (6 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Define INIT_PGD to point to the correct initial page-table
for 32 and 64 bit and use it where needed. This fixes the
build on 32 bit with CONFIG_PAGE_TABLE_ISOLATION enabled.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/mm/dump_pagetables.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 2a4849e..2151ebb 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -105,6 +105,8 @@ static struct addr_marker address_markers[] = {
 	[END_OF_SPACE_NR]	= { -1,			NULL }
 };
 
+#define INIT_PGD	((pgd_t *) &init_top_pgt)
+
 #else /* CONFIG_X86_64 */
 
 enum address_markers_idx {
@@ -133,6 +135,8 @@ static struct addr_marker address_markers[] = {
 	[END_OF_SPACE_NR]	= { -1,			NULL }
 };
 
+#define INIT_PGD	(swapper_pg_dir)
+
 #endif /* !CONFIG_X86_64 */
 
 /* Multipliers for offsets within the PTEs */
@@ -478,11 +482,7 @@ static inline bool is_hypervisor_range(int idx)
 static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
 				       bool checkwx, bool dmesg)
 {
-#ifdef CONFIG_X86_64
-	pgd_t *start = (pgd_t *) &init_top_pgt;
-#else
-	pgd_t *start = swapper_pg_dir;
-#endif
+	pgd_t *start = INIT_PGD;
 	pgprotval_t prot;
 	int i;
 	struct pg_state st = {};
@@ -543,7 +543,7 @@ EXPORT_SYMBOL_GPL(ptdump_walk_pgd_level_debugfs);
 static void ptdump_walk_user_pgd_level_checkwx(void)
 {
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
-	pgd_t *pgd = (pgd_t *) &init_top_pgt;
+	pgd_t *pgd = INIT_PGD;
 
 	if (!static_cpu_has(X86_FEATURE_PTI))
 		return;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 28/34] x86/pgtable/pae: Use separate kernel PMDs for user page-table
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (26 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 27/34] x86/mm/dump_pagetables: Define INIT_PGD Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 29/34] x86/ldt: Reserve address-space range on 32 bit for the LDT Joerg Roedel
                   ` (5 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

We need separate kernel PMDs in the user page-table when PTI
is enabled to map the per-process LDT for user-space.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/mm/pgtable.c | 100 ++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 81 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index a81d42e..d95bc7b 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -177,6 +177,14 @@ static void pgd_dtor(pgd_t *pgd)
  */
 #define PREALLOCATED_PMDS	UNSHARED_PTRS_PER_PGD
 
+/*
+ * We allocate separate PMDs for the kernel part of the user page-table
+ * when PTI is enabled. We need them to map the per-process LDT into the
+ * user-space page-table.
+ */
+#define PREALLOCATED_USER_PMDS	 (static_cpu_has(X86_FEATURE_PTI) ? \
+					KERNEL_PGD_PTRS : 0)
+
 void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmd)
 {
 	paravirt_alloc_pmd(mm, __pa(pmd) >> PAGE_SHIFT);
@@ -197,14 +205,14 @@ void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmd)
 
 /* No need to prepopulate any pagetable entries in non-PAE modes. */
 #define PREALLOCATED_PMDS	0
-
+#define PREALLOCATED_USER_PMDS	 0
 #endif	/* CONFIG_X86_PAE */
 
-static void free_pmds(struct mm_struct *mm, pmd_t *pmds[])
+static void free_pmds(struct mm_struct *mm, pmd_t *pmds[], int count)
 {
 	int i;
 
-	for(i = 0; i < PREALLOCATED_PMDS; i++)
+	for(i = 0; i < count; i++)
 		if (pmds[i]) {
 			pgtable_pmd_page_dtor(virt_to_page(pmds[i]));
 			free_page((unsigned long)pmds[i]);
@@ -212,7 +220,7 @@ static void free_pmds(struct mm_struct *mm, pmd_t *pmds[])
 		}
 }
 
-static int preallocate_pmds(struct mm_struct *mm, pmd_t *pmds[])
+static int preallocate_pmds(struct mm_struct *mm, pmd_t *pmds[], int count)
 {
 	int i;
 	bool failed = false;
@@ -221,7 +229,7 @@ static int preallocate_pmds(struct mm_struct *mm, pmd_t *pmds[])
 	if (mm == &init_mm)
 		gfp &= ~__GFP_ACCOUNT;
 
-	for(i = 0; i < PREALLOCATED_PMDS; i++) {
+	for(i = 0; i < count; i++) {
 		pmd_t *pmd = (pmd_t *)__get_free_page(gfp);
 		if (!pmd)
 			failed = true;
@@ -236,7 +244,7 @@ static int preallocate_pmds(struct mm_struct *mm, pmd_t *pmds[])
 	}
 
 	if (failed) {
-		free_pmds(mm, pmds);
+		free_pmds(mm, pmds, count);
 		return -ENOMEM;
 	}
 
@@ -249,23 +257,38 @@ static int preallocate_pmds(struct mm_struct *mm, pmd_t *pmds[])
  * preallocate which never got a corresponding vma will need to be
  * freed manually.
  */
+static void mop_up_one_pmd(struct mm_struct *mm, pgd_t *pgdp)
+{
+	pgd_t pgd = *pgdp;
+
+	if (pgd_val(pgd) != 0) {
+		pmd_t *pmd = (pmd_t *)pgd_page_vaddr(pgd);
+
+		*pgdp = native_make_pgd(0);
+
+		paravirt_release_pmd(pgd_val(pgd) >> PAGE_SHIFT);
+		pmd_free(mm, pmd);
+		mm_dec_nr_pmds(mm);
+	}
+}
+
 static void pgd_mop_up_pmds(struct mm_struct *mm, pgd_t *pgdp)
 {
 	int i;
 
-	for(i = 0; i < PREALLOCATED_PMDS; i++) {
-		pgd_t pgd = pgdp[i];
+	for(i = 0; i < PREALLOCATED_PMDS; i++)
+		mop_up_one_pmd(mm, &pgdp[i]);
 
-		if (pgd_val(pgd) != 0) {
-			pmd_t *pmd = (pmd_t *)pgd_page_vaddr(pgd);
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
 
-			pgdp[i] = native_make_pgd(0);
+	if (!static_cpu_has(X86_FEATURE_PTI))
+		return;
 
-			paravirt_release_pmd(pgd_val(pgd) >> PAGE_SHIFT);
-			pmd_free(mm, pmd);
-			mm_dec_nr_pmds(mm);
-		}
-	}
+	pgdp = kernel_to_user_pgdp(pgdp);
+
+	for (i = 0; i < PREALLOCATED_USER_PMDS; i++)
+		mop_up_one_pmd(mm, &pgdp[i + KERNEL_PGD_BOUNDARY]);
+#endif
 }
 
 static void pgd_prepopulate_pmd(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmds[])
@@ -291,6 +314,38 @@ static void pgd_prepopulate_pmd(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmds[])
 	}
 }
 
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+static void pgd_prepopulate_user_pmd(struct mm_struct *mm,
+				     pgd_t *k_pgd, pmd_t *pmds[])
+{
+	pgd_t *s_pgd = kernel_to_user_pgdp(swapper_pg_dir);
+	pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
+	p4d_t *u_p4d;
+	pud_t *u_pud;
+	int i;
+
+	u_p4d = p4d_offset(u_pgd, 0);
+	u_pud = pud_offset(u_p4d, 0);
+
+	s_pgd += KERNEL_PGD_BOUNDARY;
+	u_pud += KERNEL_PGD_BOUNDARY;
+
+	for (i = 0; i < PREALLOCATED_USER_PMDS; i++, u_pud++, s_pgd++) {
+		pmd_t *pmd = pmds[i];
+
+		memcpy(pmd, (pmd_t *)pgd_page_vaddr(*s_pgd),
+		       sizeof(pmd_t) * PTRS_PER_PMD);
+
+		pud_populate(mm, u_pud, pmd);
+	}
+
+}
+#else
+static void pgd_prepopulate_user_pmd(struct mm_struct *mm,
+				     pgd_t *k_pgd, pmd_t *pmds[])
+{
+}
+#endif
 /*
  * Xen paravirt assumes pgd table should be in one page. 64 bit kernel also
  * assumes that pgd should be in one page.
@@ -371,6 +426,7 @@ static inline void _pgd_free(pgd_t *pgd)
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
 	pgd_t *pgd;
+	pmd_t *u_pmds[PREALLOCATED_USER_PMDS];
 	pmd_t *pmds[PREALLOCATED_PMDS];
 
 	pgd = _pgd_alloc();
@@ -380,12 +436,15 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 
 	mm->pgd = pgd;
 
-	if (preallocate_pmds(mm, pmds) != 0)
+	if (preallocate_pmds(mm, pmds, PREALLOCATED_PMDS) != 0)
 		goto out_free_pgd;
 
-	if (paravirt_pgd_alloc(mm) != 0)
+	if (preallocate_pmds(mm, u_pmds, PREALLOCATED_USER_PMDS) != 0)
 		goto out_free_pmds;
 
+	if (paravirt_pgd_alloc(mm) != 0)
+		goto out_free_user_pmds;
+
 	/*
 	 * Make sure that pre-populating the pmds is atomic with
 	 * respect to anything walking the pgd_list, so that they
@@ -395,13 +454,16 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 
 	pgd_ctor(mm, pgd);
 	pgd_prepopulate_pmd(mm, pgd, pmds);
+	pgd_prepopulate_user_pmd(mm, pgd, u_pmds);
 
 	spin_unlock(&pgd_lock);
 
 	return pgd;
 
+out_free_user_pmds:
+	free_pmds(mm, u_pmds, PREALLOCATED_USER_PMDS);
 out_free_pmds:
-	free_pmds(mm, pmds);
+	free_pmds(mm, pmds, PREALLOCATED_PMDS);
 out_free_pgd:
 	_pgd_free(pgd);
 out:
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 29/34] x86/ldt: Reserve address-space range on 32 bit for the LDT
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (27 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 28/34] x86/pgtable/pae: Use separate kernel PMDs for user page-table Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:25 ` [PATCH 30/34] x86/ldt: Define LDT_END_ADDR Joerg Roedel
                   ` (4 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Reserve 2MB/4MB of address-space for mapping the LDT to
user-space on 32 bit PTI kernels.
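
A small sketch of the layout arithmetic (illustrative only; the
concrete base address depends on the configuration): the LDT gets one
PMD-aligned slot directly below the cpu_entry_area and PKMAP_BASE
moves down below that, which is where the 2MB (PAE) / 4MB (2-level)
figure comes from.

#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE	4096u
#define PMD_SHIFT	21		/* PAE: 2MB; 22 for 2-level (4MB) */
#define PMD_SIZE	(1u << PMD_SHIFT)
#define PMD_MASK	(~(PMD_SIZE - 1))

int main(void)
{
	/* Example value only; the real base depends on the kernel config. */
	uint32_t cpu_entry_area_base = 0xff000000u & PMD_MASK;

	uint32_t ldt_base   = (cpu_entry_area_base - PAGE_SIZE) & PMD_MASK;
	uint32_t pkmap_base = (ldt_base - PAGE_SIZE) & PMD_MASK;

	/* One full PMD-aligned slot (2MB here) is reserved for the LDT ... */
	assert(cpu_entry_area_base - ldt_base == PMD_SIZE);
	/* ... and the persistent kmap area simply moves down below it. */
	assert(ldt_base - pkmap_base == PMD_SIZE);
	return 0;
}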

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/pgtable_32_types.h | 7 +++++--
 arch/x86/mm/dump_pagetables.c           | 9 +++++++++
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_32_types.h b/arch/x86/include/asm/pgtable_32_types.h
index 0777e18..eb2e97a 100644
--- a/arch/x86/include/asm/pgtable_32_types.h
+++ b/arch/x86/include/asm/pgtable_32_types.h
@@ -48,13 +48,16 @@ extern bool __vmalloc_start_set; /* set once high_memory is set */
 	((FIXADDR_TOT_START - PAGE_SIZE * (CPU_ENTRY_AREA_PAGES + 1))   \
 	 & PMD_MASK)
 
-#define PKMAP_BASE		\
+#define LDT_BASE_ADDR		\
 	((CPU_ENTRY_AREA_BASE - PAGE_SIZE) & PMD_MASK)
 
+#define PKMAP_BASE		\
+	((LDT_BASE_ADDR - PAGE_SIZE) & PMD_MASK)
+
 #ifdef CONFIG_HIGHMEM
 # define VMALLOC_END	(PKMAP_BASE - 2 * PAGE_SIZE)
 #else
-# define VMALLOC_END	(CPU_ENTRY_AREA_BASE - 2 * PAGE_SIZE)
+# define VMALLOC_END	(LDT_BASE_ADDR - 2 * PAGE_SIZE)
 #endif
 
 #define MODULES_VADDR	VMALLOC_START
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 2151ebb..fdefdf0 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -117,6 +117,9 @@ enum address_markers_idx {
 #ifdef CONFIG_HIGHMEM
 	PKMAP_BASE_NR,
 #endif
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+	LDT_NR,
+#endif
 	CPU_ENTRY_AREA_NR,
 	FIXADDR_START_NR,
 	END_OF_SPACE_NR,
@@ -130,6 +133,9 @@ static struct addr_marker address_markers[] = {
 #ifdef CONFIG_HIGHMEM
 	[PKMAP_BASE_NR]		= { 0UL,		"Persistent kmap() Area" },
 #endif
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+	[LDT_NR]		= { 0UL,		"LDT remap" },
+#endif
 	[CPU_ENTRY_AREA_NR]	= { 0UL,		"CPU entry area" },
 	[FIXADDR_START_NR]	= { 0UL,		"Fixmap area" },
 	[END_OF_SPACE_NR]	= { -1,			NULL }
@@ -579,6 +585,9 @@ static int __init pt_dump_init(void)
 # endif
 	address_markers[FIXADDR_START_NR].start_address = FIXADDR_START;
 	address_markers[CPU_ENTRY_AREA_NR].start_address = CPU_ENTRY_AREA_BASE;
+# ifdef CONFIG_MODIFY_LDT_SYSCALL
+	address_markers[LDT_NR].start_address = LDT_BASE_ADDR;
+# endif
 #endif
 	return 0;
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 30/34] x86/ldt: Define LDT_END_ADDR
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (28 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 29/34] x86/ldt: Reserve address-space range on 32 bit for the LDT Joerg Roedel
@ 2018-03-05 10:25 ` Joerg Roedel
  2018-03-05 10:26 ` [PATCH 31/34] x86/ldt: Split out sanity check in map_ldt_struct() Joerg Roedel
                   ` (3 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:25 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

LDT_END_ADDR marks the end of the address-space range reserved
for the LDT. The LDT code will use it when unmapping the LDT for
user-space.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/pgtable_32_types.h | 2 ++
 arch/x86/include/asm/pgtable_64_types.h | 2 ++
 arch/x86/kernel/ldt.c                   | 2 +-
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable_32_types.h b/arch/x86/include/asm/pgtable_32_types.h
index eb2e97a..02bd445 100644
--- a/arch/x86/include/asm/pgtable_32_types.h
+++ b/arch/x86/include/asm/pgtable_32_types.h
@@ -51,6 +51,8 @@ extern bool __vmalloc_start_set; /* set once high_memory is set */
 #define LDT_BASE_ADDR		\
 	((CPU_ENTRY_AREA_BASE - PAGE_SIZE) & PMD_MASK)
 
+#define LDT_END_ADDR		(LDT_BASE_ADDR + PMD_SIZE)
+
 #define PKMAP_BASE		\
 	((LDT_BASE_ADDR - PAGE_SIZE) & PMD_MASK)
 
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index e57003a..15188baa 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -90,12 +90,14 @@ typedef struct { pteval_t pte; } pte_t;
 # define __VMEMMAP_BASE		_AC(0xffd4000000000000, UL)
 # define LDT_PGD_ENTRY		_AC(-112, UL)
 # define LDT_BASE_ADDR		(LDT_PGD_ENTRY << PGDIR_SHIFT)
+#define  LDT_END_ADDR		(LDT_BASE_ADDR + PGDIR_SIZE)
 #else
 # define VMALLOC_SIZE_TB	_AC(32, UL)
 # define __VMALLOC_BASE		_AC(0xffffc90000000000, UL)
 # define __VMEMMAP_BASE		_AC(0xffffea0000000000, UL)
 # define LDT_PGD_ENTRY		_AC(-3, UL)
 # define LDT_BASE_ADDR		(LDT_PGD_ENTRY << PGDIR_SHIFT)
+#define  LDT_END_ADDR		(LDT_BASE_ADDR + PGDIR_SIZE)
 #endif
 
 #ifdef CONFIG_RANDOMIZE_MEMORY
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 26d713e..f3c2fbf 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -202,7 +202,7 @@ static void free_ldt_pgtables(struct mm_struct *mm)
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
 	struct mmu_gather tlb;
 	unsigned long start = LDT_BASE_ADDR;
-	unsigned long end = start + (1UL << PGDIR_SHIFT);
+	unsigned long end = LDT_END_ADDR;
 
 	if (!static_cpu_has(X86_FEATURE_PTI))
 		return;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 31/34] x86/ldt: Split out sanity check in map_ldt_struct()
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (29 preceding siblings ...)
  2018-03-05 10:25 ` [PATCH 30/34] x86/ldt: Define LDT_END_ADDR Joerg Roedel
@ 2018-03-05 10:26 ` Joerg Roedel
  2018-03-05 10:26 ` [PATCH 32/34] x86/ldt: Enable LDT user-mapping for PAE Joerg Roedel
                   ` (2 subsequent siblings)
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:26 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

This splits the mapping sanity check and the actual mapping of
the LDT to user-space out of the map_ldt_struct() function, so
that both parts can be re-used for PAE paging.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/ldt.c | 82 ++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 58 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index f3c2fbf..8ab7df9 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -100,6 +100,49 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 	return new_ldt;
 }
 
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+
+static void do_sanity_check(struct mm_struct *mm,
+			    bool had_kernel_mapping,
+			    bool had_user_mapping)
+{
+	if (mm->context.ldt) {
+		/*
+		 * We already had an LDT.  The top-level entry should already
+		 * have been allocated and synchronized with the usermode
+		 * tables.
+		 */
+		WARN_ON(!had_kernel_mapping);
+		if (static_cpu_has(X86_FEATURE_PTI))
+			WARN_ON(!had_user_mapping);
+	} else {
+		/*
+		 * This is the first time we're mapping an LDT for this process.
+		 * Sync the pgd to the usermode tables.
+		 */
+		WARN_ON(had_kernel_mapping);
+		if (static_cpu_has(X86_FEATURE_PTI))
+			WARN_ON(had_user_mapping);
+	}
+}
+
+static void map_ldt_struct_to_user(struct mm_struct *mm)
+{
+	pgd_t *pgd = pgd_offset(mm, LDT_BASE_ADDR);
+
+	if (static_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
+		set_pgd(kernel_to_user_pgdp(pgd), *pgd);
+}
+
+static void sanity_check_ldt_mapping(struct mm_struct *mm)
+{
+	pgd_t *pgd = pgd_offset(mm, LDT_BASE_ADDR);
+	bool had_kernel = (pgd->pgd != 0);
+	bool had_user   = (kernel_to_user_pgdp(pgd)->pgd != 0);
+
+	do_sanity_check(mm, had_kernel, had_user);
+}
+
 /*
  * If PTI is enabled, this maps the LDT into the kernelmode and
  * usermode tables for the given mm.
@@ -115,9 +158,8 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 static int
 map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
 {
-#ifdef CONFIG_PAGE_TABLE_ISOLATION
-	bool is_vmalloc, had_top_level_entry;
 	unsigned long va;
+	bool is_vmalloc;
 	spinlock_t *ptl;
 	pgd_t *pgd;
 	int i;
@@ -131,13 +173,15 @@ map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
 	 */
 	WARN_ON(ldt->slot != -1);
 
+	/* Check if the current mappings are sane */
+	sanity_check_ldt_mapping(mm);
+
 	/*
 	 * Did we already have the top level entry allocated?  We can't
 	 * use pgd_none() for this because it doens't do anything on
 	 * 4-level page table kernels.
 	 */
 	pgd = pgd_offset(mm, LDT_BASE_ADDR);
-	had_top_level_entry = (pgd->pgd != 0);
 
 	is_vmalloc = is_vmalloc_addr(ldt->entries);
 
@@ -168,35 +212,25 @@ map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
 		pte_unmap_unlock(ptep, ptl);
 	}
 
-	if (mm->context.ldt) {
-		/*
-		 * We already had an LDT.  The top-level entry should already
-		 * have been allocated and synchronized with the usermode
-		 * tables.
-		 */
-		WARN_ON(!had_top_level_entry);
-		if (static_cpu_has(X86_FEATURE_PTI))
-			WARN_ON(!kernel_to_user_pgdp(pgd)->pgd);
-	} else {
-		/*
-		 * This is the first time we're mapping an LDT for this process.
-		 * Sync the pgd to the usermode tables.
-		 */
-		WARN_ON(had_top_level_entry);
-		if (static_cpu_has(X86_FEATURE_PTI)) {
-			WARN_ON(kernel_to_user_pgdp(pgd)->pgd);
-			set_pgd(kernel_to_user_pgdp(pgd), *pgd);
-		}
-	}
+	/* Propagate LDT mapping to the user page-table */
+	map_ldt_struct_to_user(mm);
 
 	va = (unsigned long)ldt_slot_va(slot);
 	flush_tlb_mm_range(mm, va, va + LDT_SLOT_STRIDE, 0);
 
 	ldt->slot = slot;
-#endif
 	return 0;
 }
 
+#else /* !CONFIG_PAGE_TABLE_ISOLATION */
+
+static int
+map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
+{
+	return 0;
+}
+#endif /* CONFIG_PAGE_TABLE_ISOLATION */
+
 static void free_ldt_pgtables(struct mm_struct *mm)
 {
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 32/34] x86/ldt: Enable LDT user-mapping for PAE
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (30 preceding siblings ...)
  2018-03-05 10:26 ` [PATCH 31/34] x86/ldt: Split out sanity check in map_ldt_struct() Joerg Roedel
@ 2018-03-05 10:26 ` Joerg Roedel
  2018-03-05 10:26 ` [PATCH 33/34] x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32 Joerg Roedel
  2018-03-05 10:26 ` [PATCH 34/34] x86/mm/pti: Add Warning when booting on a PCIE capable CPU Joerg Roedel
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:26 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

This adds the special case needed for PAE to get the LDT
mapped into the user page-table when PTI is enabled. The big
difference from the other paging modes is that we don't have a
full top-level PGD entry available for the LDT, but only a PMD
entry.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/mmu_context.h |  4 ---
 arch/x86/kernel/ldt.c              | 53 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index c931b88..af96cfb 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -70,11 +70,7 @@ struct ldt_struct {
 
 static inline void *ldt_slot_va(int slot)
 {
-#ifdef CONFIG_X86_64
 	return (void *)(LDT_BASE_ADDR + LDT_SLOT_STRIDE * slot);
-#else
-	BUG();
-#endif
 }
 
 /*
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 8ab7df9..7787451 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -126,6 +126,57 @@ static void do_sanity_check(struct mm_struct *mm,
 	}
 }
 
+#ifdef CONFIG_X86_PAE
+
+static pmd_t *pgd_to_pmd_walk(pgd_t *pgd, unsigned long va)
+{
+	p4d_t *p4d;
+	pud_t *pud;
+
+	if (pgd->pgd == 0)
+		return NULL;
+
+	p4d = p4d_offset(pgd, va);
+	if (p4d_none(*p4d))
+		return NULL;
+
+	pud = pud_offset(p4d, va);
+	if (pud_none(*pud))
+		return NULL;
+
+	return pmd_offset(pud, va);
+}
+
+static void map_ldt_struct_to_user(struct mm_struct *mm)
+{
+	pgd_t *k_pgd = pgd_offset(mm, LDT_BASE_ADDR);
+	pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
+	pmd_t *k_pmd, *u_pmd;
+
+	k_pmd = pgd_to_pmd_walk(k_pgd, LDT_BASE_ADDR);
+	u_pmd = pgd_to_pmd_walk(u_pgd, LDT_BASE_ADDR);
+
+	if (static_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
+		set_pmd(u_pmd, *k_pmd);
+}
+
+static void sanity_check_ldt_mapping(struct mm_struct *mm)
+{
+	pgd_t *k_pgd = pgd_offset(mm, LDT_BASE_ADDR);
+	pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
+	bool had_kernel, had_user;
+	pmd_t *k_pmd, *u_pmd;
+
+	k_pmd      = pgd_to_pmd_walk(k_pgd, LDT_BASE_ADDR);
+	u_pmd      = pgd_to_pmd_walk(u_pgd, LDT_BASE_ADDR);
+	had_kernel = (k_pmd->pmd != 0);
+	had_user   = (u_pmd->pmd != 0);
+
+	do_sanity_check(mm, had_kernel, had_user);
+}
+
+#else /* !CONFIG_X86_PAE */
+
 static void map_ldt_struct_to_user(struct mm_struct *mm)
 {
 	pgd_t *pgd = pgd_offset(mm, LDT_BASE_ADDR);
@@ -143,6 +194,8 @@ static void sanity_check_ldt_mapping(struct mm_struct *mm)
 	do_sanity_check(mm, had_kernel, had_user);
 }
 
+#endif /* CONFIG_X86_PAE */
+
 /*
  * If PTI is enabled, this maps the LDT into the kernelmode and
  * usermode tables for the given mm.
-- 
2.7.4
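
To see why PAE leaves only a PMD slot for the LDT, here is a
back-of-the-envelope sketch. LDT_BASE_ADDR is the same assumed example
address as in the earlier layout illustration, not a value from a real
build:

/* With PAE the top level holds only 4 entries covering 1GB each, so the
 * LDT cannot own a whole top-level entry; it gets one 2MB PMD instead.
 */
#include <stdio.h>

#define LDT_BASE_ADDR	0xff200000UL	/* assumed example value */
#define PGDIR_SHIFT	30		/* PAE: 1GB per top-level entry */
#define PMD_SHIFT	21		/* PAE: 2MB per PMD entry */
#define PTRS_PER_PMD	512

int main(void)
{
	printf("top-level (pgd) index: %lu of 4\n",
	       LDT_BASE_ADDR >> PGDIR_SHIFT);
	printf("pmd index:             %lu of %d\n",
	       (LDT_BASE_ADDR >> PMD_SHIFT) & (PTRS_PER_PMD - 1), PTRS_PER_PMD);
	return 0;
}

That single PMD is why map_ldt_struct_to_user() above walks down with
pgd_to_pmd_walk() and copies one PMD entry instead of a whole PGD entry.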

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 33/34] x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (31 preceding siblings ...)
  2018-03-05 10:26 ` [PATCH 32/34] x86/ldt: Enable LDT user-mapping for PAE Joerg Roedel
@ 2018-03-05 10:26 ` Joerg Roedel
  2018-03-05 10:26 ` [PATCH 34/34] x86/mm/pti: Add Warning when booting on a PCIE capable CPU Joerg Roedel
  33 siblings, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:26 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Allow PTI to be compiled on x86_32.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 security/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/security/Kconfig b/security/Kconfig
index b0cb9a5..93d85fd 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -57,7 +57,7 @@ config SECURITY_NETWORK
 config PAGE_TABLE_ISOLATION
 	bool "Remove the kernel mapping in user mode"
 	default y
-	depends on X86_64 && !UML
+	depends on X86 && !UML
 	help
 	  This feature reduces the number of hardware side channels by
 	  ensuring that the majority of kernel addresses are not mapped
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 34/34] x86/mm/pti: Add Warning when booting on a PCIE capable CPU
  2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
                   ` (32 preceding siblings ...)
  2018-03-05 10:26 ` [PATCH 33/34] x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32 Joerg Roedel
@ 2018-03-05 10:26 ` Joerg Roedel
  2018-03-05 13:39   ` Waiman Long
  2018-03-05 16:09   ` Denys Vlasenko
  33 siblings, 2 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 10:26 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Warn the user in case the performance can be significantly
improved by switching to a 64-bit kernel.

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/mm/pti.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 3ffd923..8f5aa0d 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -385,6 +385,22 @@ void __init pti_init(void)
 
 	pr_info("enabled\n");
 
+#ifdef CONFIG_X86_32
+	if (boot_cpu_has(X86_FEATURE_PCID)) {
+		/* Use printk to work around pr_fmt() */
+		printk(KERN_WARNING "\n");
+		printk(KERN_WARNING "************************************************************\n");
+		printk(KERN_WARNING "** WARNING! WARNING! WARNING! WARNING! WARNING! WARNING!  **\n");
+		printk(KERN_WARNING "**                                                        **\n");
+		printk(KERN_WARNING "** You are using 32-bit PTI on a 64-bit PCID-capable CPU. **\n");
+		printk(KERN_WARNING "** Your performance will increase dramatically if you     **\n");
+		printk(KERN_WARNING "** switch to a 64-bit kernel!                             **\n");
+		printk(KERN_WARNING "**                                                        **\n");
+		printk(KERN_WARNING "** WARNING! WARNING! WARNING! WARNING! WARNING! WARNING!  **\n");
+		printk(KERN_WARNING "************************************************************\n");
+	}
+#endif
+
 	pti_clone_user_shared();
 	pti_clone_entry_text();
 	pti_setup_espfix64();
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH 07/34] x86/entry/32: Restore segments before int registers
  2018-03-05 10:25 ` [PATCH 07/34] x86/entry/32: Restore segments before int registers Joerg Roedel
@ 2018-03-05 12:17   ` Linus Torvalds
  2018-03-05 13:12     ` Joerg Roedel
  0 siblings, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2018-03-05 12:17 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-mm,
	Andrew Lutomirski, Dave Hansen, Josh Poimboeuf,
	Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	Brian Gerst, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Pavel Machek, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 1444 bytes --]

[ On mobile, sorry for html ]

On Mar 5, 2018 02:26, "Joerg Roedel" <joro@8bytes.org> wrote:

From: Joerg Roedel <jroedel@suse.de>

Restoring the segments can cause exceptions that need to be
handled. With PTI enabled, we still need to be on kernel cr3
when the exception happens. For the cr3-switch we need
at least one integer scratch register, so we can't switch
with the user integer registers already loaded.


This fundamentally seems wrong.

The thing is, we *know* that we will restore two segment registers with
the user cr3 already loaded: CS and SS get restored with the final iret.

And yes, the final iret can fault due to CS/SS no longer being valid,
either because of ptrace or because the ldt was changed.

So making it be a "rule" that segment registers be restored with the kernel
cr3 active seems bogus. It just means that you're making a rule that cannot
possibly be generic.

So has this been tested with

 - single-stepping through sysenter

   This takes a DB fault in the first kernel instruction. We're in kernel
mode, but with user cr3.

 - ptracing and setting CS/SS to something bad

   That should test the "exception on iret" case - again in kernel mode,
but with user cr3 restored for the return.

I didn't look closely at the whole series, so maybe this is all fine. I
mainly reacted to the "With PTI enabled, we still need to be on kernel cr3
when the exception happens" part of the explanation..

      Linus

[-- Attachment #2: Type: text/html, Size: 2775 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 07/34] x86/entry/32: Restore segments before int registers
  2018-03-05 12:17   ` Linus Torvalds
@ 2018-03-05 13:12     ` Joerg Roedel
  2018-03-05 14:51       ` Brian Gerst
  2018-03-05 18:23       ` Linus Torvalds
  0 siblings, 2 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 13:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Ingo Molnar, Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-mm,
	Andrew Lutomirski, Dave Hansen, Josh Poimboeuf,
	Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	Brian Gerst, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Pavel Machek, Joerg Roedel

On Mon, Mar 05, 2018 at 04:17:45AM -0800, Linus Torvalds wrote:
>     Restoring the segments can cause exceptions that need to be
>     handled. With PTI enabled, we still need to be on kernel cr3
>     when the exception happens. For the cr3-switch we need
>     at least one integer scratch register, so we can't switch
>     with the user integer registers already loaded.
> 
> 
> This fundamentally seems wrong.

Okay, right, with v3 it is wrong, in v2 I still thought I could get away
without remembering the entry-cr3, but didn't think about the #DB case
then.

In v3 I added code which remembers the entry-cr3 and handles the
entry-from-kernel-mode-with-user-cr3 case for all exceptions including
#DB.

> The things is, we *know* that we will restore two segment registers with the
> user cr3 already loaded: CS and SS get restored with the final iret.

Yeah, I know, but the iret-exception path is fine because it will
deliver a SIGILL and doesn't return to the faulting iret.

Anyway, I will remove these restore-reorderings, they are not needed
anymore.

> So has this been tested with
> 
>  - single-stepping through sysenter
> 
>    This takes a DB fault in the first kernel instruction. We're in kernel mode,
> but with user cr3.
> 
>  - ptracing and setting CS/SS to something bad
> 
>    That should test the "exception on iret" case - again in kernel mode, but
> with user cr3 restored for the return.

The iret-exception case is tested by the ldt_gdt selftest (the
do_multicpu_tests subtest). But I didn't actually test single-stepping
through sysenter yet. I just re-ran the same tests I did with v2 on this
patch-set.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 34/34] x86/mm/pti: Add Warning when booting on a PCIE capable CPU
  2018-03-05 10:26 ` [PATCH 34/34] x86/mm/pti: Add Warning when booting on a PCIE capable CPU Joerg Roedel
@ 2018-03-05 13:39   ` Waiman Long
  2018-03-05 16:09   ` Denys Vlasenko
  1 sibling, 0 replies; 56+ messages in thread
From: Waiman Long @ 2018-03-05 13:39 UTC (permalink / raw)
  To: Joerg Roedel, Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel

On 03/05/2018 05:26 AM, Joerg Roedel wrote:
> From: Joerg Roedel <jroedel@suse.de>
>
> Warn the user in case the performance can be significantly
> improved by switching to a 64-bit kernel.
>
> Suggested-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/mm/pti.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
>
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index 3ffd923..8f5aa0d 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -385,6 +385,22 @@ void __init pti_init(void)
>  
>  	pr_info("enabled\n");
>  
> +#ifdef CONFIG_X86_32
> +	if (boot_cpu_has(X86_FEATURE_PCID)) {
> +		/* Use printk to work around pr_fmt() */
> +		printk(KERN_WARNING "\n");
> +		printk(KERN_WARNING "************************************************************\n");
> +		printk(KERN_WARNING "** WARNING! WARNING! WARNING! WARNING! WARNING! WARNING!  **\n");
> +		printk(KERN_WARNING "**                                                        **\n");
> +		printk(KERN_WARNING "** You are using 32-bit PTI on a 64-bit PCID-capable CPU. **\n");
> +		printk(KERN_WARNING "** Your performance will increase dramatically if you     **\n");
> +		printk(KERN_WARNING "** switch to a 64-bit kernel!                             **\n");
> +		printk(KERN_WARNING "**                                                        **\n");
> +		printk(KERN_WARNING "** WARNING! WARNING! WARNING! WARNING! WARNING! WARNING!  **\n");
> +		printk(KERN_WARNING "************************************************************\n");
> +	}
> +#endif
> +
>  	pti_clone_user_shared();
>  	pti_clone_entry_text();
>  	pti_setup_espfix64();

Typo in the patch title: PCIE => PCID.

-Longman

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 07/34] x86/entry/32: Restore segments before int registers
  2018-03-05 13:12     ` Joerg Roedel
@ 2018-03-05 14:51       ` Brian Gerst
  2018-03-05 16:44         ` Joerg Roedel
  2018-03-05 18:23       ` Linus Torvalds
  1 sibling, 1 reply; 56+ messages in thread
From: Brian Gerst @ 2018-03-05 14:51 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Linus Torvalds, Thomas Gleixner, Ingo Molnar, Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-mm,
	Andrew Lutomirski, Dave Hansen, Josh Poimboeuf,
	Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Pavel Machek, Joerg Roedel

On Mon, Mar 5, 2018 at 8:12 AM, Joerg Roedel <joro@8bytes.org> wrote:
> On Mon, Mar 05, 2018 at 04:17:45AM -0800, Linus Torvalds wrote:
>>     Restoring the segments can cause exceptions that need to be
>>     handled. With PTI enabled, we still need to be on kernel cr3
>>     when the exception happens. For the cr3-switch we need
>>     at least one integer scratch register, so we can't switch
>>     with the user integer registers already loaded.
>>
>>
>> This fundamentally seems wrong.
>
> Okay, right, with v3 it is wrong, in v2 I still thought I could get away
> without remembering the entry-cr3, but didn't think about the #DB case
> then.
>
> In v3 I added code which remembers the entry-cr3 and handles the
> entry-from-kernel-mode-with-user-cr3 case for all exceptions including
> #DB.
>
>> The things is, we *know* that we will restore two segment registers with the
>> user cr3 already loaded: CS and SS get restored with the final iret.
>
> Yeah, I know, but the iret-exception path is fine because it will
> deliver a SIGILL and doesn't return to the faulting iret.
>
> Anyway, I will remove these restore-reorderings, they are not needed
> anymore.
>
>> So has this been tested with
>>
>>  - single-stepping through sysenter
>>
>>    This takes a DB fault in the first kernel instruction. We're in kernel mode,
>> but with user cr3.
>>
>>  - ptracing and setting CS/SS to something bad
>>
>>    That should test the "exception on iret" case - again in kernel mode, but
>> with user cr3 restored for the return.
>
> The iret-exception case is tested by the ldt_gdt selftest (the
> do_multicpu_tests subtest). But I didn't actually tested single-stepping
> through sysenter yet. I just re-ran the same tests I did with v2 on this
> patch-set.
>
> Regards,
>
>         Joerg
>

For the IRET fault case you will still need to catch it in the
exception code.  See the 64-bit code (.Lerror_bad_iret) for example.
For 32-bit, you could just expand that check to cover the whole exit
prologue after the CR3 switch, including the data segment loads.

I do wonder though, how expensive is a CR3 read?  The SDM implies that
only writes are serializing.  It may be simpler to just
unconditionally check it.
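
For illustration, an unconditional check could look roughly like the C
sketch below. The real thing would live in the entry asm with a scratch
register, and treating bit 12 as the "user page-table" marker is an
assumption about how the user PGD copy might be placed, not something
taken from this patch-set:

/* Hedged sketch only: read CR3 and switch to the kernel copy if the
 * assumed user-pagetable marker bit is set.  Kernel context only,
 * since moves to and from CR3 are privileged instructions.
 */
#define SKETCH_USER_PGTABLE_MASK	(1UL << 12)	/* assumed marker bit */

static inline unsigned long sketch_read_cr3(void)
{
	unsigned long cr3;

	asm volatile("mov %%cr3, %0" : "=r" (cr3));
	return cr3;
}

static inline void sketch_switch_to_kernel_cr3(void)
{
	unsigned long cr3 = sketch_read_cr3();

	if (cr3 & SKETCH_USER_PGTABLE_MASK)	/* entered on user CR3? */
		asm volatile("mov %0, %%cr3"
			     : : "r" (cr3 & ~SKETCH_USER_PGTABLE_MASK)
			     : "memory");
}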

--
Brian Gerst

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 34/34] x86/mm/pti: Add Warning when booting on a PCIE capable CPU
  2018-03-05 10:26 ` [PATCH 34/34] x86/mm/pti: Add Warning when booting on a PCIE capable CPU Joerg Roedel
  2018-03-05 13:39   ` Waiman Long
@ 2018-03-05 16:09   ` Denys Vlasenko
  1 sibling, 0 replies; 56+ messages in thread
From: Denys Vlasenko @ 2018-03-05 16:09 UTC (permalink / raw)
  To: Joerg Roedel, Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Eduardo Valentin, Greg KH, Will Deacon, aliguori,
	daniel.gruss, hughd, keescook, Andrea Arcangeli, Waiman Long,
	Pavel Machek, jroedel

On 03/05/2018 11:26 AM, Joerg Roedel wrote:
> From: Joerg Roedel <jroedel@suse.de>
> 
> Warn the user in case the performance can be significantly
> improved by switching to a 64-bit kernel.
> 
> Suggested-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>   arch/x86/mm/pti.c | 16 ++++++++++++++++
>   1 file changed, 16 insertions(+)
> 
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index 3ffd923..8f5aa0d 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -385,6 +385,22 @@ void __init pti_init(void)
>   
>   	pr_info("enabled\n");
>   
> +#ifdef CONFIG_X86_32
> +	if (boot_cpu_has(X86_FEATURE_PCID)) {
> +		/* Use printk to work around pr_fmt() */
> +		printk(KERN_WARNING "\n");
> +		printk(KERN_WARNING "************************************************************\n");
> +		printk(KERN_WARNING "** WARNING! WARNING! WARNING! WARNING! WARNING! WARNING!  **\n");
> +		printk(KERN_WARNING "**                                                        **\n");
> +		printk(KERN_WARNING "** You are using 32-bit PTI on a 64-bit PCID-capable CPU. **\n");
> +		printk(KERN_WARNING "** Your performance will increase dramatically if you     **\n");
> +		printk(KERN_WARNING "** switch to a 64-bit kernel!                             **\n");
> +		printk(KERN_WARNING "**                                                        **\n");
> +		printk(KERN_WARNING "** WARNING! WARNING! WARNING! WARNING! WARNING! WARNING!  **\n");
> +		printk(KERN_WARNING "************************************************************\n");

Isn't it a bit too dramatic? Not one, but two lines of big fat warnings?

There are people who run 32-bit kernels on purpose, not because they
did not yet realize 64 bits are upon us.

E.g. industrial setups with strict regulations and licensing requirements.
In many such cases they already are more than satisfied with CPU speeds,
thus not interested in 64-bit migration for performance reasons,
and avoid it because it would incur mountains of paperwork
with no tangible gains.

The big fat warning on every boot would be irritating.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 11/34] x86/entry/32: Handle Entry from Kernel-Mode on Entry-Stack
  2018-03-05 10:25 ` [PATCH 11/34] x86/entry/32: Handle Entry from Kernel-Mode on Entry-Stack Joerg Roedel
@ 2018-03-05 16:41   ` Brian Gerst
  2018-03-05 18:25     ` Joerg Roedel
  2018-03-06 12:27     ` Joerg Roedel
  0 siblings, 2 replies; 56+ messages in thread
From: Brian Gerst @ 2018-03-05 16:41 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, Linux-MM,
	Linus Torvalds, Andy Lutomirski, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg KH, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Pavel Machek, Joerg Roedel

On Mon, Mar 5, 2018 at 5:25 AM, Joerg Roedel <joro@8bytes.org> wrote:
> From: Joerg Roedel <jroedel@suse.de>
>
> It can happen that we enter the kernel from kernel-mode and
> on the entry-stack. The most common way this happens is when
> we get an exception while loading the user-space segment
> registers on the kernel-to-userspace exit path.
>
> The segment loading needs to be done after the entry-stack
> switch, because the stack-switch needs kernel %fs for
> per_cpu access.
>
> When this happens, we need to make sure that we leave the
> kernel with the entry-stack again, so that the interrupted
> code-path runs on the right stack when switching to the
> user-cr3.
>
> We do this by detecting this condition on kernel-entry by
> checking CS.RPL and %esp, and if it happens, we copy over
> the complete content of the entry stack to the task-stack.
> This needs to be done because once we enter the exception
> handlers we might be scheduled out or even migrated to a
> different CPU, so that we can't rely on the entry-stack
> contents. We also leave a marker in the stack-frame to
> detect this condition on the exit path.
>
> On the exit path the copy is reversed, we copy all of the
> remaining task-stack back to the entry-stack and switch
> to it.
>
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/entry/entry_32.S | 110 +++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 109 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index bb0bd896..3a84945 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -299,6 +299,9 @@
>   * copied there. So allocate the stack-frame on the task-stack and
>   * switch to it before we do any copying.
>   */
> +
> +#define CS_FROM_ENTRY_STACK    (1 << 31)
> +
>  .macro SWITCH_TO_KERNEL_STACK
>
>         ALTERNATIVE     "", "jmp .Lend_\@", X86_FEATURE_XENPV
> @@ -320,6 +323,10 @@
>         /* Load top of task-stack into %edi */
>         movl    TSS_entry_stack(%edi), %edi
>
> +       /* Special case - entry from kernel mode via entry stack */
> +       testl   $SEGMENT_RPL_MASK, PT_CS(%esp)
> +       jz      .Lentry_from_kernel_\@
> +
>         /* Bytes to copy */
>         movl    $PTREGS_SIZE, %ecx
>
> @@ -333,8 +340,8 @@
>          */
>         addl    $(4 * 4), %ecx
>
> -.Lcopy_pt_regs_\@:
>  #endif
> +.Lcopy_pt_regs_\@:
>
>         /* Allocate frame on task-stack */
>         subl    %ecx, %edi
> @@ -350,6 +357,56 @@
>         cld
>         rep movsl
>
> +       jmp .Lend_\@
> +
> +.Lentry_from_kernel_\@:
> +
> +       /*
> +        * This handles the case when we enter the kernel from
> +        * kernel-mode and %esp points to the entry-stack. When this
> +        * happens we need to switch to the task-stack to run C code,
> +        * but switch back to the entry-stack again when we approach
> +        * iret and return to the interrupted code-path. This usually
> +        * happens when we hit an exception while restoring user-space
> +        * segment registers on the way back to user-space.
> +        *
> +        * When we switch to the task-stack here, we can't trust the
> +        * contents of the entry-stack anymore, as the exception handler
> +        * might be scheduled out or moved to another CPU. Therefore we
> +        * copy the complete entry-stack to the task-stack and set a
> +        * marker in the iret-frame (bit 31 of the CS dword) to detect
> +        * what we've done on the iret path.

We don't need to worry about preemption changing the entry stack.  The
faults that IRET or segment loads can generate just run the exception
fixup handler and return.  Interrupts were disabled when the fault
occurred, so the kernel cannot be preempted.  The other case to watch
is #DB on SYSENTER, but that simply returns and doesn't sleep either.

We can keep the same process as the existing debug/NMI handlers -
leave the current exception pt_regs on the entry stack and just switch
to the task stack for the call to the handler.  Then switch back to
the entry stack and continue.  No copying needed.

> +        *
> +        * On the iret path we copy everything back and switch to the
> +        * entry-stack, so that the interrupted kernel code-path
> +        * continues on the same stack it was interrupted with.
> +        *
> +        * Be aware that an NMI can happen anytime in this code.
> +        *
> +        * %esi: Entry-Stack pointer (same as %esp)
> +        * %edi: Top of the task stack
> +        */
> +
> +       /* Calculate number of bytes on the entry stack in %ecx */
> +       movl    %esi, %ecx
> +
> +       /* %ecx to the top of entry-stack */
> +       andl    $(MASK_entry_stack), %ecx
> +       addl    $(SIZEOF_entry_stack), %ecx
> +
> +       /* Number of bytes on the entry stack to %ecx */
> +       sub     %esi, %ecx
> +
> +       /* Mark stackframe as coming from entry stack */
> +       orl     $CS_FROM_ENTRY_STACK, PT_CS(%esp)

Not all 32-bit processors will zero-extend segment pushes.  You will
need to explicitly clear the bit in the case where we didn't switch
CR3.

> +
> +       /*
> +        * %esi and %edi are unchanged, %ecx contains the number of
> +        * bytes to copy. The code at .Lcopy_pt_regs_\@ will allocate
> +        * the stack-frame on task-stack and copy everything over
> +        */
> +       jmp .Lcopy_pt_regs_\@
> +
>  .Lend_\@:
>  .endm
>
> @@ -408,6 +465,56 @@
>  .endm
>
>  /*
> + * This macro handles the case when we return to kernel-mode on the iret
> + * path and have to switch back to the entry stack.
> + *
> + * See the comments below the .Lentry_from_kernel_\@ label in the
> + * SWITCH_TO_KERNEL_STACK macro for more details.
> + */
> +.macro PARANOID_EXIT_TO_KERNEL_MODE
> +
> +       /*
> +        * Test if we entered the kernel with the entry-stack. Most
> +        * likely we did not, because this code only runs on the
> +        * return-to-kernel path.
> +        */
> +       testl   $CS_FROM_ENTRY_STACK, PT_CS(%esp)
> +       jz      .Lend_\@
> +
> +       /* Unlikely slow-path */
> +
> +       /* Clear marker from stack-frame */
> +       andl    $(~CS_FROM_ENTRY_STACK), PT_CS(%esp)
> +
> +       /* Copy the remaining task-stack contents to entry-stack */
> +       movl    %esp, %esi
> +       movl    PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %edi
> +
> +       /* Bytes on the task-stack to ecx */
> +       movl    PER_CPU_VAR(cpu_current_top_of_stack), %ecx
> +       subl    %esi, %ecx
> +
> +       /* Allocate stack-frame on entry-stack */
> +       subl    %ecx, %edi
> +
> +       /*
> +        * Save future stack-pointer, we must not switch until the
> +        * copy is done, otherwise the NMI handler could destroy the
> +        * contents of the task-stack we are about to copy.
> +        */
> +       movl    %edi, %ebx
> +
> +       /* Do the copy */
> +       shrl    $2, %ecx
> +       cld
> +       rep movsl
> +
> +       /* Safe to switch to entry-stack now */
> +       movl    %ebx, %esp
> +
> +.Lend_\@:
> +.endm
> +/*
>   * %eax: prev task
>   * %edx: next task
>   */
> @@ -765,6 +872,7 @@ restore_all:
>
>  restore_all_kernel:
>         TRACE_IRQS_IRET
> +       PARANOID_EXIT_TO_KERNEL_MODE
>         RESTORE_REGS 4
>         jmp     .Lirq_return
>
> --
> 2.7.4
>

--
Brian Gerst

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 07/34] x86/entry/32: Restore segments before int registers
  2018-03-05 14:51       ` Brian Gerst
@ 2018-03-05 16:44         ` Joerg Roedel
  2018-03-05 17:21           ` Brian Gerst
  0 siblings, 1 reply; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 16:44 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Linus Torvalds, Thomas Gleixner, Ingo Molnar, Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-mm,
	Andrew Lutomirski, Dave Hansen, Josh Poimboeuf,
	Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Pavel Machek, Joerg Roedel

On Mon, Mar 05, 2018 at 09:51:29AM -0500, Brian Gerst wrote:
> For the IRET fault case you will still need to catch it in the
> exception code.  See the 64-bit code (.Lerror_bad_iret) for example.
> For 32-bit, you could just expand that check to cover the whole exit
> prologue after the CR3 switch, including the data segment loads.

I had a look at the 64 bit code and the exception-in-kernel case seems
to be handled differently than on 32 bit. The 64 bit entry code has
checks for certain kinds of errors like iret exceptions.

On 32 bit this is implemented via the standard exception tables which
get an entry for every EIP that might fault (usually segment loading
operations, but also iret).

So, unless I am missing something, all the exception entry code has to
do is to remember the stack and the cr3 with which it was entered (if
entered from kernel mode) and restore those before iret. And this is
what I implemented in v3 of this patch-set.
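
For readers not familiar with that mechanism, the pattern being referred
to looks roughly like the sketch below. It is a simplified variant in the
spirit of the kernel's loadsegment() helper and relies on the kernel's
_ASM_EXTABLE macro; the real macro in arch/x86/include/asm/segment.h
differs in detail:

/* Simplified sketch of an exception-table fixup for a segment load: if
 * the move at label 1 faults, the fault handler looks up label 1 in the
 * exception table and resumes at the fixup code at label 3, which loads
 * a NULL selector and jumps back to label 2.  Kernel context only.
 */
#define sketch_loadsegment(seg, value)					\
do {									\
	unsigned short __val = (value);					\
									\
	asm volatile("1:	movl %k0, %%" #seg "\n"			\
		     "2:\n"						\
		     ".section .fixup,\"ax\"\n"				\
		     "3:	xorl %k0, %k0\n"			\
		     "		jmp 2b\n"				\
		     ".previous\n"					\
		     _ASM_EXTABLE(1b, 3b)				\
		     : "+r" (__val) : : "memory");			\
} while (0)

The iret at the end of the 32-bit exit path is covered by an
exception-table entry in the same way.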


Regards,

	Joerg

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 07/34] x86/entry/32: Restore segments before int registers
  2018-03-05 16:44         ` Joerg Roedel
@ 2018-03-05 17:21           ` Brian Gerst
  0 siblings, 0 replies; 56+ messages in thread
From: Brian Gerst @ 2018-03-05 17:21 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Linus Torvalds, Thomas Gleixner, Ingo Molnar, Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-mm,
	Andrew Lutomirski, Dave Hansen, Josh Poimboeuf,
	Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Pavel Machek, Joerg Roedel

On Mon, Mar 5, 2018 at 11:44 AM, Joerg Roedel <joro@8bytes.org> wrote:
> On Mon, Mar 05, 2018 at 09:51:29AM -0500, Brian Gerst wrote:
>> For the IRET fault case you will still need to catch it in the
>> exception code.  See the 64-bit code (.Lerror_bad_iret) for example.
>> For 32-bit, you could just expand that check to cover the whole exit
>> prologue after the CR3 switch, including the data segment loads.
>
> I had a look at the 64 bit code and the exception-in-kernel case seems
> to be handled differently than on 32 bit. The 64 bit entry code has
> checks for certain kinds of errors like iret exceptions.
>
> On 32 bit this is implemented via the standard exception tables which
> get an entry for every EIP that might fault (usually segment loading
> operations, but also iret).
>
> So, unless I am missing something, all the exception entry code has to
> do is to remember the stack and the cr3 with which it was entered (if
> entered from kernel mode) and restore those before iret. And this is
> what I implemented in v3 of this patch-set.

I also noticed that 32-bit will raise SIGILL for all IRET faults,
while 64-bit will raise SIGBUS (#NP/#SS) or SIGSEGV (#GP).  The 64-bit
code is better since it doesn't lose the original fault type, whereas
SIGILL is wrong for this case (illegal opcode).

--
Brian Gerst

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 07/34] x86/entry/32: Restore segments before int registers
  2018-03-05 13:12     ` Joerg Roedel
  2018-03-05 14:51       ` Brian Gerst
@ 2018-03-05 18:23       ` Linus Torvalds
  2018-03-05 18:36         ` Joerg Roedel
  2018-03-05 20:38         ` Brian Gerst
  1 sibling, 2 replies; 56+ messages in thread
From: Linus Torvalds @ 2018-03-05 18:23 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-mm,
	Andrew Lutomirski, Dave Hansen, Josh Poimboeuf,
	Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	Brian Gerst, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Pavel Machek, Joerg Roedel

On Mon, Mar 5, 2018 at 5:12 AM, Joerg Roedel <joro@8bytes.org> wrote:
>
>> The things is, we *know* that we will restore two segment registers with the
>> user cr3 already loaded: CS and SS get restored with the final iret.
>
> Yeah, I know, but the iret-exception path is fine because it will
> deliver a SIGILL and doesn't return to the faulting iret.

That's not so much my worry, as just getting %cr3 wrong. The fact is,
we still take the exception, and we still have to handle it, and that
still needs to get the user<->kernel cr3 right.

So then the whole "restore segments early" must be wrong, because
*that* path must get it all right too, no?

And it appears that the code *does* get it right, and you can just
avoid this patch entirely?

> The iret-exception case is tested by the ldt_gdt selftest (the
> do_multicpu_tests subtest). But I didn't actually tested single-stepping
> through sysenter yet. I just re-ran the same tests I did with v2 on this
> patch-set.

Ok. Maybe we should have a test for the "take DB on first instruction
of sysenter".

           Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 11/34] x86/entry/32: Handle Entry from Kernel-Mode on Entry-Stack
  2018-03-05 16:41   ` Brian Gerst
@ 2018-03-05 18:25     ` Joerg Roedel
  2018-03-05 20:32       ` Brian Gerst
  2018-03-06 12:27     ` Joerg Roedel
  1 sibling, 1 reply; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 18:25 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, Linux-MM,
	Linus Torvalds, Andy Lutomirski, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg KH, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Pavel Machek, Joerg Roedel

Hi Brian,

thanks for your review and helpful input.

On Mon, Mar 05, 2018 at 11:41:01AM -0500, Brian Gerst wrote:
> On Mon, Mar 5, 2018 at 5:25 AM, Joerg Roedel <joro@8bytes.org> wrote:
> > +.Lentry_from_kernel_\@:
> > +
> > +       /*
> > +        * This handles the case when we enter the kernel from
> > +        * kernel-mode and %esp points to the entry-stack. When this
> > +        * happens we need to switch to the task-stack to run C code,
> > +        * but switch back to the entry-stack again when we approach
> > +        * iret and return to the interrupted code-path. This usually
> > +        * happens when we hit an exception while restoring user-space
> > +        * segment registers on the way back to user-space.
> > +        *
> > +        * When we switch to the task-stack here, we can't trust the
> > +        * contents of the entry-stack anymore, as the exception handler
> > +        * might be scheduled out or moved to another CPU. Therefore we
> > +        * copy the complete entry-stack to the task-stack and set a
> > +        * marker in the iret-frame (bit 31 of the CS dword) to detect
> > +        * what we've done on the iret path.
> 
> We don't need to worry about preemption changing the entry stack.  The
> faults that IRET or segment loads can generate just run the exception
> fixup handler and return.  Interrupts were disabled when the fault
> occurred, so the kernel cannot be preempted.  The other case to watch
> is #DB on SYSENTER, but that simply returns and doesn't sleep either.
> 
> We can keep the same process as the existing debug/NMI handlers -
> leave the current exception pt_regs on the entry stack and just switch
> to the task stack for the call to the handler.  Then switch back to
> the entry stack and continue.  No copying needed.

Okay, I'll look into that. Will it even be true for fully preemptible
and RT kernels that there can't be any preemption of these handlers?

> > +       /* Mark stackframe as coming from entry stack */
> > +       orl     $CS_FROM_ENTRY_STACK, PT_CS(%esp)
> 
> Not all 32-bit processors will zero-extend segment pushes.  You will
> need to explicitly clear the bit in the case where we didn't switch
> CR3.

Okay, thanks, will add that.


Regards,

	Joerg

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 07/34] x86/entry/32: Restore segments before int registers
  2018-03-05 18:23       ` Linus Torvalds
@ 2018-03-05 18:36         ` Joerg Roedel
  2018-03-05 20:38         ` Brian Gerst
  1 sibling, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 18:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Ingo Molnar, Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-mm,
	Andrew Lutomirski, Dave Hansen, Josh Poimboeuf,
	Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	Brian Gerst, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Pavel Machek, Joerg Roedel

On Mon, Mar 05, 2018 at 10:23:59AM -0800, Linus Torvalds wrote:
> On Mon, Mar 5, 2018 at 5:12 AM, Joerg Roedel <joro@8bytes.org> wrote:
> >
> >> The things is, we *know* that we will restore two segment registers with the
> >> user cr3 already loaded: CS and SS get restored with the final iret.
> >
> > Yeah, I know, but the iret-exception path is fine because it will
> > deliver a SIGILL and doesn't return to the faulting iret.
> 
> That's not so much my worry, as just getting %cr3 wrong. The fact is,
> we still take the exception, and we still have to handle it, and that
> still needs to get the user<->kernel cr3 right.

Right, as I said, up to v2 of this series I thought I could avoid the
whole from-kernel-with-user-cr3 game, but that turned out to be wrong.
Now I added the necessary check and handling for it, as at least the
#DB handler needs it.

> So then the whole "restore segments early" must be wrong, because
> *that* path must get it all right too, no?
> 
> And it appears that the code *does* get it right, and you can just
> avoid this patch entirely?

Right, I will drop this patch.

> 
> > The iret-exception case is tested by the ldt_gdt selftest (the
> > do_multicpu_tests subtest). But I didn't actually tested single-stepping
> > through sysenter yet. I just re-ran the same tests I did with v2 on this
> > patch-set.
> 
> Ok. Maybe we should have a test for the "take DB on first instruction
> of sysenter".

I put a selftest for that on my list of things to look into. I have
no idea how difficult this will be, but I'll certainly find out :)


Regards,

	Joerg

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 11/34] x86/entry/32: Handle Entry from Kernel-Mode on Entry-Stack
  2018-03-05 18:25     ` Joerg Roedel
@ 2018-03-05 20:32       ` Brian Gerst
  0 siblings, 0 replies; 56+ messages in thread
From: Brian Gerst @ 2018-03-05 20:32 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, Linux-MM,
	Linus Torvalds, Andy Lutomirski, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg KH, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Pavel Machek, Joerg Roedel

On Mon, Mar 5, 2018 at 1:25 PM, Joerg Roedel <joro@8bytes.org> wrote:
> Hi Brian,
>
> thanks for your review and helpful input.
>
> On Mon, Mar 05, 2018 at 11:41:01AM -0500, Brian Gerst wrote:
>> On Mon, Mar 5, 2018 at 5:25 AM, Joerg Roedel <joro@8bytes.org> wrote:
>> > +.Lentry_from_kernel_\@:
>> > +
>> > +       /*
>> > +        * This handles the case when we enter the kernel from
>> > +        * kernel-mode and %esp points to the entry-stack. When this
>> > +        * happens we need to switch to the task-stack to run C code,
>> > +        * but switch back to the entry-stack again when we approach
>> > +        * iret and return to the interrupted code-path. This usually
>> > +        * happens when we hit an exception while restoring user-space
>> > +        * segment registers on the way back to user-space.
>> > +        *
>> > +        * When we switch to the task-stack here, we can't trust the
>> > +        * contents of the entry-stack anymore, as the exception handler
>> > +        * might be scheduled out or moved to another CPU. Therefore we
>> > +        * copy the complete entry-stack to the task-stack and set a
>> > +        * marker in the iret-frame (bit 31 of the CS dword) to detect
>> > +        * what we've done on the iret path.
>>
>> We don't need to worry about preemption changing the entry stack.  The
>> faults that IRET or segment loads can generate just run the exception
>> fixup handler and return.  Interrupts were disabled when the fault
>> occurred, so the kernel cannot be preempted.  The other case to watch
>> is #DB on SYSENTER, but that simply returns and doesn't sleep either.
>>
>> We can keep the same process as the existing debug/NMI handlers -
>> leave the current exception pt_regs on the entry stack and just switch
>> to the task stack for the call to the handler.  Then switch back to
>> the entry stack and continue.  No copying needed.
>
> Okay, I'll look into that. Will it even be true for fully preemptible
> and RT kernels that there can't be any preemption of these handlers?

See resume_kernel in the 32-bit entry for how preemption is handled on
return to kernel mode.  Looking at the RT patches, they still respect
that disabling interrupts also disables preemption.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 07/34] x86/entry/32: Restore segments before int registers
  2018-03-05 18:23       ` Linus Torvalds
  2018-03-05 18:36         ` Joerg Roedel
@ 2018-03-05 20:38         ` Brian Gerst
  2018-03-05 20:50           ` Linus Torvalds
  1 sibling, 1 reply; 56+ messages in thread
From: Brian Gerst @ 2018-03-05 20:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Joerg Roedel, Thomas Gleixner, Ingo Molnar, Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-mm,
	Andrew Lutomirski, Dave Hansen, Josh Poimboeuf,
	Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Pavel Machek, Joerg Roedel

On Mon, Mar 5, 2018 at 1:23 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Mar 5, 2018 at 5:12 AM, Joerg Roedel <joro@8bytes.org> wrote:
>>
>>> The things is, we *know* that we will restore two segment registers with the
>>> user cr3 already loaded: CS and SS get restored with the final iret.
>>
>> Yeah, I know, but the iret-exception path is fine because it will
>> deliver a SIGILL and doesn't return to the faulting iret.
>
> That's not so much my worry, as just getting %cr3 wrong. The fact is,
> we still take the exception, and we still have to handle it, and that
> still needs to get the user<->kernel cr3 right.
>
> So then the whole "restore segments early" must be wrong, because
> *that* path must get it all right too, no?
>
> And it appears that the code *does* get it right, and you can just
> avoid this patch entirely?
>
>> The iret-exception case is tested by the ldt_gdt selftest (the
>> do_multicpu_tests subtest). But I didn't actually tested single-stepping
>> through sysenter yet. I just re-ran the same tests I did with v2 on this
>> patch-set.
>
> Ok. Maybe we should have a test for the "take DB on first instruction
> of sysenter".
>
>            Linus

There already is a test: single_step_syscall.c
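
For reference, the core of such a test boils down to something like the
sketch below. This is a trimmed-down illustration of the idea, not the
actual tools/testing/selftests/x86/single_step_syscall.c; using int $0x80
instead of the vDSO sysenter path and the bare-bones signal handling are
simplifications made for brevity:

/* Minimal sketch: set EFLAGS.TF and immediately enter the kernel, so the
 * very first kernel instruction of the entry path takes a #DB.  Meant
 * for a 32-bit build; error handling is intentionally omitted.
 */
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>

static volatile sig_atomic_t traps;

static void sigtrap(int sig)
{
	(void)sig;
	traps++;
}

int main(void)
{
	signal(SIGTRAP, sigtrap);

	asm volatile("pushf\n\t"
		     "orl	$0x100, (%%esp)\n\t"	/* set TF */
		     "popf\n\t"
		     "movl	%[nr], %%eax\n\t"
		     "int	$0x80\n\t"		/* enter the kernel with TF set */
		     "pushf\n\t"
		     "andl	$0xfffffeff, (%%esp)\n\t"	/* clear TF again */
		     "popf"
		     : : [nr] "i" (SYS_getpid) : "eax", "memory", "cc");

	printf("took %d single-step traps around the syscall\n", (int)traps);
	return 0;
}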

--
Brian Gerst

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 07/34] x86/entry/32: Restore segments before int registers
  2018-03-05 20:38         ` Brian Gerst
@ 2018-03-05 20:50           ` Linus Torvalds
  2018-03-05 21:35             ` Joerg Roedel
  0 siblings, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2018-03-05 20:50 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Joerg Roedel, Thomas Gleixner, Ingo Molnar, Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-mm,
	Andrew Lutomirski, Dave Hansen, Josh Poimboeuf,
	Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Pavel Machek, Joerg Roedel

On Mon, Mar 5, 2018 at 12:38 PM, Brian Gerst <brgerst@gmail.com> wrote:
>
> There already is a test: single_step_syscall.c

Ahh, good. So presumably Joerg actually did check it, just didn't even notice ;)

            Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 07/34] x86/entry/32: Restore segments before int registers
  2018-03-05 20:50           ` Linus Torvalds
@ 2018-03-05 21:35             ` Joerg Roedel
  2018-03-05 21:58               ` Linus Torvalds
  0 siblings, 1 reply; 56+ messages in thread
From: Joerg Roedel @ 2018-03-05 21:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Brian Gerst, Thomas Gleixner, Ingo Molnar, Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-mm,
	Andrew Lutomirski, Dave Hansen, Josh Poimboeuf,
	Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Pavel Machek, Joerg Roedel

On Mon, Mar 05, 2018 at 12:50:33PM -0800, Linus Torvalds wrote:
> On Mon, Mar 5, 2018 at 12:38 PM, Brian Gerst <brgerst@gmail.com> wrote:
> >
> > There already is a test: single_step_syscall.c
> 
> Ahh, good. So presumably Joerg actually did check it, just didn't even notice ;)

Yeah, sort of. I ran the test, but it didn't catch the failure case in
previous versions which was return to user with kernel-cr3 :)

I could probably add some debug instrumentation to check for that in my
future testing, as there is no NX protection in the user address-range
for the kernel-cr3.
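
For illustration, a minimal sketch of such a debug check, assuming the
x86-64 PTI convention that PTI_USER_PGTABLE_MASK in CR3 selects the user
copy of the page-table (the helper name is hypothetical and the exact
bit differs in this 32-bit series):

static __always_inline void debug_check_return_cr3(void)
{
	unsigned long cr3 = __read_cr3();

	/*
	 * Returning to user space on the kernel CR3 would leave the whole
	 * kernel mapped (and, on 32-bit, executable) from user mode, so
	 * warn loudly if the user bit is not set at this point.
	 */
	WARN_ONCE(static_cpu_has(X86_FEATURE_PTI) &&
		  !(cr3 & PTI_USER_PGTABLE_MASK),
		  "returning to user mode with kernel CR3: %lx\n", cr3);
}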


	Joerg

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 07/34] x86/entry/32: Restore segments before int registers
  2018-03-05 21:35             ` Joerg Roedel
@ 2018-03-05 21:58               ` Linus Torvalds
  2018-03-05 22:03                 ` H. Peter Anvin
  2018-03-06  8:38                 ` Joerg Roedel
  0 siblings, 2 replies; 56+ messages in thread
From: Linus Torvalds @ 2018-03-05 21:58 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Brian Gerst, Thomas Gleixner, Ingo Molnar, Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-mm,
	Andrew Lutomirski, Dave Hansen, Josh Poimboeuf,
	Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Pavel Machek, Joerg Roedel

On Mon, Mar 5, 2018 at 1:35 PM, Joerg Roedel <joro@8bytes.org> wrote:
> On Mon, Mar 05, 2018 at 12:50:33PM -0800, Linus Torvalds wrote:
>>
>> Ahh, good. So presumably Joerg actually did check it, just didn't even notice ;)
>
> Yeah, sort of. I ran the test, but it didn't catch the failure case in
> previous versions, which was returning to user space with kernel-cr3 :)

Ahh. Yes, that's bad. The NX protection guaranteeing that you can't
actually run user code after returning to user mode with the kernel
page tables was really good on x86-64.

So some other case could slip through, because user code can happily
run with the kernel page tables.
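
For context, an abridged sketch of that x86-64 mechanism (modelled on
__pti_set_user_pgd() in arch/x86/mm/pti.c at the time; simplified, so
treat it as illustration rather than the exact upstream code):

static pgd_t pti_set_user_pgd_sketch(pgd_t *pgdp, pgd_t pgd)
{
	/* Mirror the entry into the user copy of the page-table. */
	kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd;

	/*
	 * Make normal user memory non-executable in the kernel copy, so
	 * that a return to user mode on the kernel CR3 faults on the very
	 * first instruction fetch instead of silently running user code.
	 */
	if ((pgd.pgd & (_PAGE_USER | _PAGE_PRESENT)) ==
					(_PAGE_USER | _PAGE_PRESENT) &&
	    (__supported_pte_mask & _PAGE_NX))
		pgd.pgd |= _PAGE_NX;

	return pgd;
}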

> I could probably add some debug instrumentation to check for that in my
> future testing, as there is no NX protection in the user address-range
> for the kernel-cr3.

Does NX not work with PAE?

Oh, it looks like the NX bit is marked as "RSVD (must be 0)" in the
PDPT. Oh well.

                  Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 07/34] x86/entry/32: Restore segments before int registers
  2018-03-05 21:58               ` Linus Torvalds
@ 2018-03-05 22:03                 ` H. Peter Anvin
  2018-03-06  7:04                   ` Ingo Molnar
  2018-03-06  8:38                 ` Joerg Roedel
  1 sibling, 1 reply; 56+ messages in thread
From: H. Peter Anvin @ 2018-03-05 22:03 UTC (permalink / raw)
  To: Linus Torvalds, Joerg Roedel
  Cc: Brian Gerst, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-mm,
	Andrew Lutomirski, Dave Hansen, Josh Poimboeuf,
	Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Pavel Machek, Joerg Roedel

On 03/05/18 13:58, Linus Torvalds wrote:
> On Mon, Mar 5, 2018 at 1:35 PM, Joerg Roedel <joro@8bytes.org> wrote:
>> On Mon, Mar 05, 2018 at 12:50:33PM -0800, Linus Torvalds wrote:
>>>
>>> Ahh, good. So presumably Joerg actually did check it, just didn't even notice ;)
>>
>> Yeah, sort of. I ran the test, but it didn't catch the failure case in
>> previous versions, which was returning to user space with kernel-cr3 :)
> 
> Ahh. Yes, that's bad. The NX protection guaranteeing that you can't
> actually run user code after returning to user mode with the kernel
> page tables was really good on x86-64.
> 
> So some other case could slip through, because user code can happily
> run with the kernel page tables.
> 
>> I could probably add some debug instrumentation to check for that in my
>> future testing, as there is no NX protection in the user address-range
>> for the kernel-cr3.
> 
> Does NX not work with PAE?
> 
> Oh, it looks like the NX bit is marked as "RSVD (must be 0)" in the
> PDPT. Oh well.
> 

On NX-enabled hardware NX works at the PDE level, but the PDPT in general
doesn't have permission bits (it's really more of a set of four CR3s
than a page table level.)

	-hpa

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 07/34] x86/entry/32: Restore segments before int registers
  2018-03-05 22:03                 ` H. Peter Anvin
@ 2018-03-06  7:04                   ` Ingo Molnar
  2018-03-06 13:45                     ` Dave Hansen
  0 siblings, 1 reply; 56+ messages in thread
From: Ingo Molnar @ 2018-03-06  7:04 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Joerg Roedel, Brian Gerst, Thomas Gleixner,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-mm,
	Andrew Lutomirski, Dave Hansen, Josh Poimboeuf,
	Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Pavel Machek, Joerg Roedel


* H. Peter Anvin <hpa@zytor.com> wrote:

> On NX-enabled hardware NX works at the PDE level, but the PDPT in general
> doesn't have permission bits (it's really more of a set of four CR3s than
> a page table level.)

The 4 PDPT entries are also shadowed in the CPU and are only refreshed
on CR3 loads, not spontaneously reloaded from memory during a TLB walk
like regular page table entries, right?

This too strengthens the notion that the third page table level of PAE is more 
like a special in-memory CR3[4] array.
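
As a small illustration of the consequence (the helper name is made up;
__native_read_cr3()/native_write_cr3() are the usual accessors): because
the four PDPTEs are cached at MOV-to-CR3 time, a PAE kernel that changes
a top-level entry in memory has to rewrite CR3 before the CPU honours it:

static inline void pae_reload_pdptes(void)
{
	/*
	 * Rewriting CR3 with its current value makes the CPU re-read the
	 * four PDPTEs from memory (and flushes non-global TLB entries);
	 * without it, an updated top-level entry stays invisible.
	 */
	native_write_cr3(__native_read_cr3());
}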

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 07/34] x86/entry/32: Restore segments before int registers
  2018-03-05 21:58               ` Linus Torvalds
  2018-03-05 22:03                 ` H. Peter Anvin
@ 2018-03-06  8:38                 ` Joerg Roedel
  1 sibling, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-06  8:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Brian Gerst, Thomas Gleixner, Ingo Molnar, Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-mm,
	Andrew Lutomirski, Dave Hansen, Josh Poimboeuf,
	Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Pavel Machek, Joerg Roedel

On Mon, Mar 05, 2018 at 01:58:32PM -0800, Linus Torvalds wrote:
> On Mon, Mar 5, 2018 at 1:35 PM, Joerg Roedel <joro@8bytes.org> wrote:
> > I could probably add some debug instrumentation to check for that in my
> > future testing, as there is no NX protection in the user address-range
> > for the kernel-cr3.
> 
> Does NX not work with PAE?
> 
> Oh, it looks like the NX bit is marked as "RSVD (must be 0)" in the
> PDPT. Oh well.

I had a version of these patches running which implemented NX at the PDE
level by allocating 8k PMD pages. But that ended up needing five order-1
allocations for each page-table, which I could get to fail pretty easily
after some time. So I abandoned this approach for now.

Maybe it can be implemented with order-0 allocations for the PMD pages;
the open problem then is how to link the user and kernel PMD pages of
each pair together.
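
As illustration only, a minimal sketch of the order-1 pairing idea
described above (the helper names are hypothetical; the abandoned
patches themselves are not shown):

static pmd_t *pti_alloc_pmd_pair(gfp_t gfp)
{
	/*
	 * Each PMD is allocated as an 8k block so the user-space copy can
	 * live in the second 4k page.  Needing five such order-1
	 * allocations per page-table is what made this fragile under
	 * memory fragmentation.
	 */
	struct page *page = alloc_pages(gfp | __GFP_ZERO, 1);

	if (!page)
		return NULL;

	/* Kernel copy; the user copy follows at +PAGE_SIZE. */
	return (pmd_t *)page_address(page);
}

static pmd_t *pti_user_pmd(pmd_t *kernel_pmd)
{
	/* 512 entries * 8 bytes == 4k on PAE, i.e. the second page. */
	return kernel_pmd + PTRS_PER_PMD;
}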

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 11/34] x86/entry/32: Handle Entry from Kernel-Mode on Entry-Stack
  2018-03-05 16:41   ` Brian Gerst
  2018-03-05 18:25     ` Joerg Roedel
@ 2018-03-06 12:27     ` Joerg Roedel
  1 sibling, 0 replies; 56+ messages in thread
From: Joerg Roedel @ 2018-03-06 12:27 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, Linux-MM,
	Linus Torvalds, Andy Lutomirski, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg KH, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Pavel Machek, Joerg Roedel

Hi Brian,

On Mon, Mar 05, 2018 at 11:41:01AM -0500, Brian Gerst wrote:
> We can keep the same process as the existing debug/NMI handlers -
> leave the current exception pt_regs on the entry stack and just switch
> to the task stack for the call to the handler.  Then switch back to
> the entry stack and continue.  No copying needed.

I looked into this, and things are a bit more complicated than in the NMI
and debug handlers. After pt_regs is set up, the current code relies on
%esp pointing to the pt_regs structure. But if pt_regs could live on
another stack, we would need to carry the pt_regs pointer in another
register through the whole ret_from_exception code-path until we actually
switch the stack back.

Since that code-path is used for all stack/cr3 entry/exit cases, we would
need to set up the extra pt_regs pointer unconditionally and update all
places that currently reference pt_regs through %esp.

It can certainly be done, but it looks like another round of major surgery
in the entry code just to optimize a slow path that handles unlikely
segment-loading exceptions and debug traps. I am not sure it's worth it.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 07/34] x86/entry/32: Restore segments before int registers
  2018-03-06  7:04                   ` Ingo Molnar
@ 2018-03-06 13:45                     ` Dave Hansen
  0 siblings, 0 replies; 56+ messages in thread
From: Dave Hansen @ 2018-03-06 13:45 UTC (permalink / raw)
  To: Ingo Molnar, H. Peter Anvin
  Cc: Linus Torvalds, Joerg Roedel, Brian Gerst, Thomas Gleixner,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-mm,
	Andrew Lutomirski, Josh Poimboeuf, Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Pavel Machek, Joerg Roedel

On 03/05/2018 11:04 PM, Ingo Molnar wrote:
> * H. Peter Anvin <hpa@zytor.com> wrote:
>> On NX-enabled hardware NX works at the PDE level, but the PDPT in general
>> doesn't have permission bits (it's really more of a set of four CR3s than
>> a page table level.)
> The 4 PDPT entries are also shadowed in the CPU and are only refreshed
> on CR3 loads, not spontaneously reloaded from memory during a TLB walk
> like regular page table entries, right?

Yes.  The SDM even calls them non-architectural "PDPTE Registers" and
talks about them only being loaded at CR3 write time.

~5 years ago we even had a bug directly related to this feature:

> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=324cdc3f7e6a752fe0e95fa7b5c9664171a34ded

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2018-03-06 13:45 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-05 10:25 [PATCH 00/34 v3] PTI support for x32 Joerg Roedel
2018-03-05 10:25 ` [PATCH 01/34] x86/asm-offsets: Move TSS_sp0 and TSS_sp1 to asm-offsets.c Joerg Roedel
2018-03-05 10:25 ` [PATCH 02/34] x86/entry/32: Rename TSS_sysenter_sp0 to TSS_entry_stack Joerg Roedel
2018-03-05 10:25 ` [PATCH 03/34] x86/entry/32: Load task stack from x86_tss.sp1 in SYSENTER handler Joerg Roedel
2018-03-05 10:25 ` [PATCH 04/34] x86/entry/32: Put ESPFIX code into a macro Joerg Roedel
2018-03-05 10:25 ` [PATCH 05/34] x86/entry/32: Unshare NMI return path Joerg Roedel
2018-03-05 10:25 ` [PATCH 06/34] x86/entry/32: Split off return-to-kernel path Joerg Roedel
2018-03-05 10:25 ` [PATCH 07/34] x86/entry/32: Restore segments before int registers Joerg Roedel
2018-03-05 12:17   ` Linus Torvalds
2018-03-05 13:12     ` Joerg Roedel
2018-03-05 14:51       ` Brian Gerst
2018-03-05 16:44         ` Joerg Roedel
2018-03-05 17:21           ` Brian Gerst
2018-03-05 18:23       ` Linus Torvalds
2018-03-05 18:36         ` Joerg Roedel
2018-03-05 20:38         ` Brian Gerst
2018-03-05 20:50           ` Linus Torvalds
2018-03-05 21:35             ` Joerg Roedel
2018-03-05 21:58               ` Linus Torvalds
2018-03-05 22:03                 ` H. Peter Anvin
2018-03-06  7:04                   ` Ingo Molnar
2018-03-06 13:45                     ` Dave Hansen
2018-03-06  8:38                 ` Joerg Roedel
2018-03-05 10:25 ` [PATCH 08/34] x86/entry/32: Enter the kernel via trampoline stack Joerg Roedel
2018-03-05 10:25 ` [PATCH 09/34] x86/entry/32: Leave " Joerg Roedel
2018-03-05 10:25 ` [PATCH 10/34] x86/entry/32: Introduce SAVE_ALL_NMI and RESTORE_ALL_NMI Joerg Roedel
2018-03-05 10:25 ` [PATCH 11/34] x86/entry/32: Handle Entry from Kernel-Mode on Entry-Stack Joerg Roedel
2018-03-05 16:41   ` Brian Gerst
2018-03-05 18:25     ` Joerg Roedel
2018-03-05 20:32       ` Brian Gerst
2018-03-06 12:27     ` Joerg Roedel
2018-03-05 10:25 ` [PATCH 12/34] x86/entry/32: Simplify debug entry point Joerg Roedel
2018-03-05 10:25 ` [PATCH 13/34] x86/entry/32: Add PTI cr3 switches to NMI handler code Joerg Roedel
2018-03-05 10:25 ` [PATCH 14/34] x86/entry/32: Add PTI cr3 switch to non-NMI entry/exit points Joerg Roedel
2018-03-05 10:25 ` [PATCH 15/34] x86/pgtable: Rename pti_set_user_pgd to pti_set_user_pgtbl Joerg Roedel
2018-03-05 10:25 ` [PATCH 16/34] x86/pgtable/pae: Unshare kernel PMDs when PTI is enabled Joerg Roedel
2018-03-05 10:25 ` [PATCH 17/34] x86/pgtable/32: Allocate 8k page-tables " Joerg Roedel
2018-03-05 10:25 ` [PATCH 18/34] x86/pgtable: Move pgdp kernel/user conversion functions to pgtable.h Joerg Roedel
2018-03-05 10:25 ` [PATCH 19/34] x86/pgtable: Move pti_set_user_pgtbl() " Joerg Roedel
2018-03-05 10:25 ` [PATCH 20/34] x86/pgtable: Move two more functions from pgtable_64.h " Joerg Roedel
2018-03-05 10:25 ` [PATCH 21/34] x86/mm/pae: Populate valid user PGD entries Joerg Roedel
2018-03-05 10:25 ` [PATCH 22/34] x86/mm/pae: Populate the user page-table with user pgd's Joerg Roedel
2018-03-05 10:25 ` [PATCH 23/34] x86/mm/legacy: " Joerg Roedel
2018-03-05 10:25 ` [PATCH 24/34] x86/mm/pti: Add an overflow check to pti_clone_pmds() Joerg Roedel
2018-03-05 10:25 ` [PATCH 25/34] x86/mm/pti: Define X86_CR3_PTI_PCID_USER_BIT on x86_32 Joerg Roedel
2018-03-05 10:25 ` [PATCH 26/34] x86/mm/pti: Clone CPU_ENTRY_AREA on PMD level " Joerg Roedel
2018-03-05 10:25 ` [PATCH 27/34] x86/mm/dump_pagetables: Define INIT_PGD Joerg Roedel
2018-03-05 10:25 ` [PATCH 28/34] x86/pgtable/pae: Use separate kernel PMDs for user page-table Joerg Roedel
2018-03-05 10:25 ` [PATCH 29/34] x86/ldt: Reserve address-space range on 32 bit for the LDT Joerg Roedel
2018-03-05 10:25 ` [PATCH 30/34] x86/ldt: Define LDT_END_ADDR Joerg Roedel
2018-03-05 10:26 ` [PATCH 31/34] x86/ldt: Split out sanity check in map_ldt_struct() Joerg Roedel
2018-03-05 10:26 ` [PATCH 32/34] x86/ldt: Enable LDT user-mapping for PAE Joerg Roedel
2018-03-05 10:26 ` [PATCH 33/34] x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32 Joerg Roedel
2018-03-05 10:26 ` [PATCH 34/34] x86/mm/pti: Add Warning when booting on a PCIE capable CPU Joerg Roedel
2018-03-05 13:39   ` Waiman Long
2018-03-05 16:09   ` Denys Vlasenko
