* [RFC PATCH 00/16] PTI support for x86-32
@ 2018-01-16 16:36 ` Joerg Roedel
  0 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-16 16:36 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Hi,

here is my current WIP code to enable PTI on x86-32. It is
still in a pretty early state, but it successfully boots my
KVM guest with PAE and with legacy paging. The existing PTI
code for x86-64 already prepares a lot of what is needed
for 32 bit too - thanks to all the people involved in its
development :)

The patches are split as follows:

	- 1-3 contain the entry-code changes to enter and
	  exit the kernel via the sysenter trampoline stack.

	- 4-7 are fixes to get the code to compile on 32 bit
	  with CONFIG_PAGE_TABLE_ISOLATION=y.

	- 8-14 adapt the existing PTI code to work properly
	  on 32 bit and add the needed parts to 32 bit
	  page-table code.

	- 15 switches PTI on by adding the CR3 switches to
	  kernel entry/exit (see the sketch after this list).

	- 16 enables the Kconfig for all of X86
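
To illustrate what the CR3 switch in patch 15 boils down to,
here is a rough, hand-written sketch (the macro names are made
up and this is not code from the series): with the 8k pgd
layout from patch 5, the kernel copy and the user copy of the
page-table root differ only in bit 12 of their address, so
switching is just flipping that bit in the CR3 value:

	.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
		movl	%cr3, \scratch_reg
		andl	$(~(1 << 12)), \scratch_reg	/* clear bit 12 -> kernel pgd */
		movl	\scratch_reg, %cr3
	.endm

	.macro SWITCH_TO_USER_CR3 scratch_reg:req
		movl	%cr3, \scratch_reg
		orl	$(1 << 12), \scratch_reg	/* set bit 12 -> user pgd */
		movl	\scratch_reg, %cr3
	.endm

The real switches in patch 15 also have to handle the
PTI-disabled case; the above is only meant to show the idea.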

The code has not run on bare metal yet; I'll test that in
the next few days once I set up a 32 bit box again. I also
haven't tested Wine and DosEMU yet, so those might be
broken as well.

With this post I'd like to ask for all kinds of constructive
feedback on the approaches I have taken and of course on the
many things I broke with it :)

One of the things that are surely broken is XEN_PV support.
I'd appreciate any help with testing and bugfixing on that
front.

So please review and let me know your thoughts.

Thanks,

	Joerg

Joerg Roedel (16):
  x86/entry/32: Rename TSS_sysenter_sp0 to TSS_sysenter_stack
  x86/entry/32: Enter the kernel via trampoline stack
  x86/entry/32: Leave the kernel via the trampoline stack
  x86/pti: Define X86_CR3_PTI_PCID_USER_BIT on x86_32
  x86/pgtable: Move pgdp kernel/user conversion functions to pgtable.h
  x86/mm/ldt: Reserve high address-space range for the LDT
  x86/mm: Move two more functions from pgtable_64.h to pgtable.h
  x86/pgtable/32: Allocate 8k page-tables when PTI is enabled
  x86/mm/pti: Clone CPU_ENTRY_AREA on PMD level on x86_32
  x86/mm/pti: Populate valid user pud entries
  x86/mm/pgtable: Move pti_set_user_pgd() to pgtable.h
  x86/mm/pae: Populate the user page-table with user pgd's
  x86/mm/pti: Add an overflow check to pti_clone_pmds()
  x86/mm/legacy: Populate the user page-table with user pgd's
  x86/entry/32: Switch between kernel and user cr3 on entry/exit
  x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32

 arch/x86/entry/entry_32.S               | 170 +++++++++++++++++++++++++++++---
 arch/x86/include/asm/pgtable-2level.h   |   3 +
 arch/x86/include/asm/pgtable-3level.h   |   3 +
 arch/x86/include/asm/pgtable.h          |  88 +++++++++++++++++
 arch/x86/include/asm/pgtable_32_types.h |   5 +-
 arch/x86/include/asm/pgtable_64.h       |  85 ----------------
 arch/x86/include/asm/processor-flags.h  |   8 +-
 arch/x86/include/asm/switch_to.h        |   6 +-
 arch/x86/kernel/asm-offsets_32.c        |   5 +-
 arch/x86/kernel/cpu/common.c            |   5 +-
 arch/x86/kernel/head_32.S               |  23 ++++-
 arch/x86/kernel/process.c               |   2 -
 arch/x86/kernel/process_32.c            |   6 ++
 arch/x86/mm/pgtable.c                   |  11 ++-
 arch/x86/mm/pti.c                       |  34 ++++++-
 security/Kconfig                        |   2 +-
 16 files changed, 333 insertions(+), 123 deletions(-)

-- 
2.13.6

^ permalink raw reply	[flat|nested] 183+ messages in thread

* [PATCH 01/16] x86/entry/32: Rename TSS_sysenter_sp0 to TSS_sysenter_stack
  2018-01-16 16:36 ` Joerg Roedel
@ 2018-01-16 16:36   ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-16 16:36 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

The stack address doesn't need to be stored in tss.sp0 if
we switch stacks manually, as we do on sysenter. Rename the
offset so that it still makes sense when we change its
location.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
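Note on how this constant reaches the assembly code (a sketch of
the existing asm-offsets mechanism, nothing in it is changed by
this patch): asm-offsets_32.c is compiled to assembly by Kbuild
and the DEFINE() output is scraped into the generated header
include/generated/asm-offsets.h, roughly as

	#define TSS_sysenter_stack <value computed at build time>

which entry_32.S then uses as a plain displacement:

	movl	TSS_sysenter_stack(%esp), %esp
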
 arch/x86/entry/entry_32.S        | 2 +-
 arch/x86/kernel/asm-offsets_32.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index a1f28a54f23a..eb8c5615777b 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -401,7 +401,7 @@ ENTRY(xen_sysenter_target)
  * 0(%ebp) arg6
  */
 ENTRY(entry_SYSENTER_32)
-	movl	TSS_sysenter_sp0(%esp), %esp
+	movl	TSS_sysenter_stack(%esp), %esp
 .Lsysenter_past_esp:
 	pushl	$__USER_DS		/* pt_regs->ss */
 	pushl	%ebp			/* pt_regs->sp (stashed in bp) */
diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index fa1261eefa16..654229bac2fc 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -47,7 +47,7 @@ void foo(void)
 	BLANK();
 
 	/* Offset from the sysenter stack to tss.sp0 */
-	DEFINE(TSS_sysenter_sp0, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
+	DEFINE(TSS_sysenter_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
 	       offsetofend(struct cpu_entry_area, entry_stack_page.stack));
 
 #ifdef CONFIG_CC_STACKPROTECTOR
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack
  2018-01-16 16:36 ` Joerg Roedel
@ 2018-01-16 16:36   ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-16 16:36 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Use the sysenter stack as a trampoline stack to enter the
kernel. The sysenter stack is already in the cpu_entry_area
and will be mapped to userspace when PTI is enabled.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
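For readability, this is roughly what the macro expands to for the
INT80 entry below (hand-expanded for nr_regs=6, check_user=1; the
local label name is simplified):

	testb	$3, 8(%esp)		/* CS: did we enter from user mode? */
	jz	.Lend			/* no - stay on the current stack   */
	pushl	%edi
	movl	%esp, %edi
	/* read the task stack pointer from the cpu_entry_area */
	movl	TSS_sysenter_stack + 28(%esp), %esp
	pushl	24(%edi)		/* ss       */
	pushl	20(%edi)		/* esp      */
	pushl	16(%edi)		/* eflags   */
	pushl	12(%edi)		/* cs       */
	pushl	8(%edi)			/* eip      */
	pushl	4(%edi)			/* orig_eax */
	mov	(%edi), %edi		/* restore user %edi */
.Lend:
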
 arch/x86/entry/entry_32.S        | 89 +++++++++++++++++++++++++++++++++++-----
 arch/x86/include/asm/switch_to.h |  6 +--
 arch/x86/kernel/asm-offsets_32.c |  4 +-
 arch/x86/kernel/cpu/common.c     |  5 ++-
 arch/x86/kernel/process.c        |  2 -
 arch/x86/kernel/process_32.c     |  6 +++
 6 files changed, 91 insertions(+), 21 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index eb8c5615777b..5a7bdb73be9f 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -222,6 +222,47 @@
 .endm
 
 /*
+ * Switch from the entry-trampoline stack to the kernel stack of the
+ * running task.
+ *
+ * nr_regs is the number of dwords to push from the entry stack to the
+ * task stack. If it is > 0 it expects an irq frame at the bottom of the
+ * stack.
+ *
+ * If check_user != 0 it will add a check to only switch stacks if the
+ * kernel entry was from user-space.
+ */
+.macro SWITCH_TO_KERNEL_STACK nr_regs=0 check_user=0
+
+	.if \check_user > 0 && \nr_regs > 0
+	testb $3, (\nr_regs - 4)*4(%esp)		/* CS */
+	jz .Lend_\@
+	.endif
+
+	pushl %edi
+	movl  %esp, %edi
+
+	/*
+	 * TSS_sysenter_stack is the offset from the bottom of the
+	 * entry-stack
+	 */
+	movl  TSS_sysenter_stack + ((\nr_regs + 1) * 4)(%esp), %esp
+
+	/* Copy the registers over */
+	.if \nr_regs > 0
+	i = 0
+	.rept \nr_regs
+	pushl (\nr_regs - i) * 4(%edi)
+	i = i + 1
+	.endr
+	.endif
+
+	mov (%edi), %edi
+
+.Lend_\@:
+.endm
+
+/*
  * %eax: prev task
  * %edx: next task
  */
@@ -401,7 +442,9 @@ ENTRY(xen_sysenter_target)
  * 0(%ebp) arg6
  */
 ENTRY(entry_SYSENTER_32)
-	movl	TSS_sysenter_stack(%esp), %esp
+	/* Kernel stack is empty */
+	SWITCH_TO_KERNEL_STACK
+
 .Lsysenter_past_esp:
 	pushl	$__USER_DS		/* pt_regs->ss */
 	pushl	%ebp			/* pt_regs->sp (stashed in bp) */
@@ -521,6 +564,10 @@ ENDPROC(entry_SYSENTER_32)
 ENTRY(entry_INT80_32)
 	ASM_CLAC
 	pushl	%eax			/* pt_regs->orig_ax */
+
+	/* Stack layout: ss, esp, eflags, cs, eip, orig_eax */
+	SWITCH_TO_KERNEL_STACK nr_regs=6 check_user=1
+
 	SAVE_ALL pt_regs_ax=$-ENOSYS	/* save rest */
 
 	/*
@@ -655,6 +702,10 @@ END(irq_entries_start)
 common_interrupt:
 	ASM_CLAC
 	addl	$-0x80, (%esp)			/* Adjust vector into the [-256, -1] range */
+
+	/* Stack layout: ss, esp, eflags, cs, eip, vector */
+	SWITCH_TO_KERNEL_STACK nr_regs=6 check_user=1
+
 	SAVE_ALL
 	ENCODE_FRAME_POINTER
 	TRACE_IRQS_OFF
@@ -663,16 +714,17 @@ common_interrupt:
 	jmp	ret_from_intr
 ENDPROC(common_interrupt)
 
-#define BUILD_INTERRUPT3(name, nr, fn)	\
-ENTRY(name)				\
-	ASM_CLAC;			\
-	pushl	$~(nr);			\
-	SAVE_ALL;			\
-	ENCODE_FRAME_POINTER;		\
-	TRACE_IRQS_OFF			\
-	movl	%esp, %eax;		\
-	call	fn;			\
-	jmp	ret_from_intr;		\
+#define BUILD_INTERRUPT3(name, nr, fn)			\
+ENTRY(name)						\
+	ASM_CLAC;					\
+	pushl	$~(nr);					\
+	SWITCH_TO_KERNEL_STACK nr_regs=6 check_user=1;	\
+	SAVE_ALL;					\
+	ENCODE_FRAME_POINTER;				\
+	TRACE_IRQS_OFF					\
+	movl	%esp, %eax;				\
+	call	fn;					\
+	jmp	ret_from_intr;				\
 ENDPROC(name)
 
 #define BUILD_INTERRUPT(name, nr)		\
@@ -893,6 +945,9 @@ ENTRY(page_fault)
 END(page_fault)
 
 common_exception:
+	/* Stack layout: ss, esp, eflags, cs, eip, error_code, handler */
+	SWITCH_TO_KERNEL_STACK nr_regs=7 check_user=1
+
 	/* the function address is in %gs's slot on the stack */
 	pushl	%fs
 	pushl	%es
@@ -936,6 +991,10 @@ ENTRY(debug)
 	 */
 	ASM_CLAC
 	pushl	$-1				# mark this as an int
+
+	/* Stack layout: ss, esp, eflags, cs, eip, $-1 */
+	SWITCH_TO_KERNEL_STACK nr_regs=6 check_user=1
+
 	SAVE_ALL
 	ENCODE_FRAME_POINTER
 	xorl	%edx, %edx			# error code 0
@@ -971,6 +1030,10 @@ END(debug)
  */
 ENTRY(nmi)
 	ASM_CLAC
+
+	/* Stack layout: ss, esp, eflags, cs, eip */
+	SWITCH_TO_KERNEL_STACK nr_regs=5 check_user=1
+
 #ifdef CONFIG_X86_ESPFIX32
 	pushl	%eax
 	movl	%ss, %eax
@@ -1034,6 +1097,10 @@ END(nmi)
 ENTRY(int3)
 	ASM_CLAC
 	pushl	$-1				# mark this as an int
+
+	/* Stack layout: ss, esp, eflags, cs, eip, vector */
+	SWITCH_TO_KERNEL_STACK nr_regs=6 check_user=1
+
 	SAVE_ALL
 	ENCODE_FRAME_POINTER
 	TRACE_IRQS_OFF
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index eb5f7999a893..20e5f7ab8260 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -89,13 +89,9 @@ static inline void refresh_sysenter_cs(struct thread_struct *thread)
 /* This is used when switching tasks or entering/exiting vm86 mode. */
 static inline void update_sp0(struct task_struct *task)
 {
-	/* On x86_64, sp0 always points to the entry trampoline stack, which is constant: */
-#ifdef CONFIG_X86_32
-	load_sp0(task->thread.sp0);
-#else
+	/* sp0 always points to the entry trampoline stack, which is constant: */
 	if (static_cpu_has(X86_FEATURE_XENPV))
 		load_sp0(task_top_of_stack(task));
-#endif
 }
 
 #endif /* _ASM_X86_SWITCH_TO_H */
diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index 654229bac2fc..7270dd834f4b 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -47,9 +47,11 @@ void foo(void)
 	BLANK();
 
 	/* Offset from the sysenter stack to tss.sp0 */
-	DEFINE(TSS_sysenter_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
+	DEFINE(TSS_sysenter_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp1) -
 	       offsetofend(struct cpu_entry_area, entry_stack_page.stack));
 
+	OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
+
 #ifdef CONFIG_CC_STACKPROTECTOR
 	BLANK();
 	OFFSET(stack_canary_offset, stack_canary, canary);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index ef29ad001991..20a71c914e59 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1649,11 +1649,12 @@ void cpu_init(void)
 	enter_lazy_tlb(&init_mm, curr);
 
 	/*
-	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
-	 * task never enters user mode.
+	 * Initialize the TSS.  sp0 points to the entry trampoline stack
+	 * regardless of what task is running.
 	 */
 	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
+	load_sp0((unsigned long)(cpu_entry_stack(cpu) + 1));
 
 	load_mm_ldt(&init_mm);
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 832a6acd730f..a9950946b263 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -57,14 +57,12 @@ __visible DEFINE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss_rw) = {
 		 */
 		.sp0 = (1UL << (BITS_PER_LONG-1)) + 1,
 
-#ifdef CONFIG_X86_64
 		/*
 		 * .sp1 is cpu_current_top_of_stack.  The init task never
 		 * runs user code, but cpu_current_top_of_stack should still
 		 * be well defined before the first context switch.
 		 */
 		.sp1 = TOP_OF_INIT_STACK,
-#endif
 
 #ifdef CONFIG_X86_32
 		.ss0 = __KERNEL_DS,
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 5224c6099184..452eeac00b80 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -292,6 +292,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	this_cpu_write(cpu_current_top_of_stack,
 		       (unsigned long)task_stack_page(next_p) +
 		       THREAD_SIZE);
+	/*
+	 * TODO: Find a way to let cpu_current_top_of_stack point to
+	 * cpu_tss_rw.x86_tss.sp1. Doing so now results in stack corruption with
+	 * iret exceptions.
+	 */
+	this_cpu_write(cpu_tss_rw.x86_tss.sp1, next_p->thread.sp0);
 
 	/*
 	 * Restore %gs if needed (which is common)
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [PATCH 03/16] x86/entry/32: Leave the kernel via the trampoline stack
  2018-01-16 16:36 ` Joerg Roedel
@ 2018-01-16 16:36   ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-16 16:36 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Switch back to the trampoline stack before returning to
userspace.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
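A note on the %fs dance in SWITCH_TO_ENTRY_STACK (explanation only,
no new code): on 32-bit SMP kernels per-cpu variables are addressed
relative to %fs, i.e. PER_CPU_VAR(x) is essentially %fs:x, so

	movl	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %esp

only works after __KERNEL_PERCPU has been re-loaded into %fs. TSS_sp0
is the offset of tss.x86_tss.sp0, which since patch 2 always holds the
top of the per-cpu entry stack, so this load puts us back onto the
entry stack.
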
 arch/x86/entry/entry_32.S        | 58 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/asm-offsets_32.c |  1 +
 2 files changed, 59 insertions(+)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 5a7bdb73be9f..14018eeb11c3 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -263,6 +263,61 @@
 .endm
 
 /*
+ * Switch back from the kernel stack to the entry stack.
+ *
+ * iret_frame > 0 adds code to copy over an iret frame from the old to
+ *                the new stack. It also adds a check which bails out if
+ *                we are not returning to user-space.
+ *
+ * This macro is not allowed to modify eflags when iret_frame == 0.
+ */
+.macro SWITCH_TO_ENTRY_STACK iret_frame=0
+	.if \iret_frame > 0
+	/* Are we returning to userspace? */
+	testb   $3, 4(%esp) /* return CS */
+	jz .Lend_\@
+	.endif
+
+	/*
+	 * We run with user-%fs already loaded from pt_regs, so we don't
+	 * have access to per_cpu data anymore, and there is no swapgs
+	 * equivalent on x86_32.
+	 * We work around this by loading the kernel-%fs again and
+	 * reading the entry stack address from there. Then we restore
+	 * the user-%fs and return.
+	 */
+	pushl %fs
+	pushl %edi
+
+	/* Re-load kernel-%fs, after that we can use PER_CPU_VAR */
+	movl $(__KERNEL_PERCPU), %edi
+	movl %edi, %fs
+
+	/* Save old stack pointer to copy the return frame over if needed */
+	movl %esp, %edi
+	movl PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %esp
+
+	/* Now we are on the entry stack */
+
+	.if \iret_frame > 0
+	/* Stack frame: ss, esp, eflags, cs, eip, fs, edi */
+	pushl 6*4(%edi) /* ss */
+	pushl 5*4(%edi) /* esp */
+	pushl 4*4(%edi) /* eflags */
+	pushl 3*4(%edi) /* cs */
+	pushl 2*4(%edi) /* eip */
+	.endif
+
+	pushl 4(%edi)   /* fs */
+	
+	/* Restore user %edi and user %fs */
+	movl (%edi), %edi
+	popl %fs
+
+.Lend_\@:
+.endm
+
+/*
  * %eax: prev task
  * %edx: next task
  */
@@ -512,6 +567,8 @@ ENTRY(entry_SYSENTER_32)
 	btr	$X86_EFLAGS_IF_BIT, (%esp)
 	popfl
 
+	SWITCH_TO_ENTRY_STACK
+
 	/*
 	 * Return back to the vDSO, which will pop ecx and edx.
 	 * Don't bother with DS and ES (they already contain __USER_DS).
@@ -601,6 +658,7 @@ restore_all:
 .Lrestore_nocheck:
 	RESTORE_REGS 4				# skip orig_eax/error_code
 .Lirq_return:
+	SWITCH_TO_ENTRY_STACK iret_frame=1
 	INTERRUPT_RETURN
 
 .section .fixup, "ax"
diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index 7270dd834f4b..b628f898edd2 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -50,6 +50,7 @@ void foo(void)
 	DEFINE(TSS_sysenter_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp1) -
 	       offsetofend(struct cpu_entry_area, entry_stack_page.stack));
 
+	OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
 	OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
 
 #ifdef CONFIG_CC_STACKPROTECTOR
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [PATCH 04/16] x86/pti: Define X86_CR3_PTI_PCID_USER_BIT on x86_32
  2018-01-16 16:36 ` Joerg Roedel
@ 2018-01-16 16:36   ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-16 16:36 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Move it out of the X86_64-specific processor defines so
that it is visible for 32 bit too.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/processor-flags.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h
index 625a52a5594f..02c2cbda4a74 100644
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -39,10 +39,6 @@
 #define CR3_PCID_MASK	0xFFFull
 #define CR3_NOFLUSH	BIT_ULL(63)
 
-#ifdef CONFIG_PAGE_TABLE_ISOLATION
-# define X86_CR3_PTI_PCID_USER_BIT	11
-#endif
-
 #else
 /*
  * CR3_ADDR_MASK needs at least bits 31:5 set on PAE systems, and we save
@@ -53,4 +49,8 @@
 #define CR3_NOFLUSH	0
 #endif
 
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+# define X86_CR3_PTI_PCID_USER_BIT	11
+#endif
+
 #endif /* _ASM_X86_PROCESSOR_FLAGS_H */
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [PATCH 05/16] x86/pgtable: Move pgdp kernel/user conversion functions to pgtable.h
  2018-01-16 16:36 ` Joerg Roedel
@ 2018-01-16 16:36   ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-16 16:36 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Make them available on 32 bit and keep clone_pgd_range() happy.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
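Usage sketch (illustrative only, not part of the patch): with the
8k-aligned, 8k pgd the two conversion helpers just toggle bit 12
(PAGE_SHIFT) of the pointer:

	pgd_t *kernel_pgd = mm->pgd;                          /* lower 4k */
	pgd_t *user_pgd   = kernel_to_user_pgdp(kernel_pgd);  /* upper 4k */

	/* the round-trip gets us back to the kernel copy */
	BUG_ON(user_to_kernel_pgdp(user_pgd) != kernel_pgd);
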
 arch/x86/include/asm/pgtable.h    | 49 +++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/pgtable_64.h | 49 ---------------------------------------
 2 files changed, 49 insertions(+), 49 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e42b8943cb1a..0a9f746cbdc1 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1109,6 +1109,55 @@ static inline int pud_write(pud_t pud)
 	return pud_flags(pud) & _PAGE_RW;
 }
 
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+/*
+ * All top-level PAGE_TABLE_ISOLATION page tables are order-1 pages
+ * (8k-aligned and 8k in size).  The kernel one is at the beginning 4k and
+ * the user one is in the last 4k.  To switch between them, you
+ * just need to flip the 12th bit in their addresses.
+ */
+#define PTI_PGTABLE_SWITCH_BIT	PAGE_SHIFT
+
+/*
+ * This generates better code than the inline assembly in
+ * __set_bit().
+ */
+static inline void *ptr_set_bit(void *ptr, int bit)
+{
+	unsigned long __ptr = (unsigned long)ptr;
+
+	__ptr |= BIT(bit);
+	return (void *)__ptr;
+}
+static inline void *ptr_clear_bit(void *ptr, int bit)
+{
+	unsigned long __ptr = (unsigned long)ptr;
+
+	__ptr &= ~BIT(bit);
+	return (void *)__ptr;
+}
+
+static inline pgd_t *kernel_to_user_pgdp(pgd_t *pgdp)
+{
+	return ptr_set_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
+}
+
+static inline pgd_t *user_to_kernel_pgdp(pgd_t *pgdp)
+{
+	return ptr_clear_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
+}
+
+static inline p4d_t *kernel_to_user_p4dp(p4d_t *p4dp)
+{
+	return ptr_set_bit(p4dp, PTI_PGTABLE_SWITCH_BIT);
+}
+
+static inline p4d_t *user_to_kernel_p4dp(p4d_t *p4dp)
+{
+	return ptr_clear_bit(p4dp, PTI_PGTABLE_SWITCH_BIT);
+}
+#endif /* CONFIG_PAGE_TABLE_ISOLATION */
+
 /*
  * clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
  *
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 81462e9a34f6..58d7f10e937d 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -131,55 +131,6 @@ static inline pud_t native_pudp_get_and_clear(pud_t *xp)
 #endif
 }
 
-#ifdef CONFIG_PAGE_TABLE_ISOLATION
-/*
- * All top-level PAGE_TABLE_ISOLATION page tables are order-1 pages
- * (8k-aligned and 8k in size).  The kernel one is at the beginning 4k and
- * the user one is in the last 4k.  To switch between them, you
- * just need to flip the 12th bit in their addresses.
- */
-#define PTI_PGTABLE_SWITCH_BIT	PAGE_SHIFT
-
-/*
- * This generates better code than the inline assembly in
- * __set_bit().
- */
-static inline void *ptr_set_bit(void *ptr, int bit)
-{
-	unsigned long __ptr = (unsigned long)ptr;
-
-	__ptr |= BIT(bit);
-	return (void *)__ptr;
-}
-static inline void *ptr_clear_bit(void *ptr, int bit)
-{
-	unsigned long __ptr = (unsigned long)ptr;
-
-	__ptr &= ~BIT(bit);
-	return (void *)__ptr;
-}
-
-static inline pgd_t *kernel_to_user_pgdp(pgd_t *pgdp)
-{
-	return ptr_set_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
-}
-
-static inline pgd_t *user_to_kernel_pgdp(pgd_t *pgdp)
-{
-	return ptr_clear_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
-}
-
-static inline p4d_t *kernel_to_user_p4dp(p4d_t *p4dp)
-{
-	return ptr_set_bit(p4dp, PTI_PGTABLE_SWITCH_BIT);
-}
-
-static inline p4d_t *user_to_kernel_p4dp(p4d_t *p4dp)
-{
-	return ptr_clear_bit(p4dp, PTI_PGTABLE_SWITCH_BIT);
-}
-#endif /* CONFIG_PAGE_TABLE_ISOLATION */
-
 /*
  * Page table pages are page-aligned.  The lower half of the top
  * level is used for userspace and the top half for the kernel.
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [PATCH 06/16] x86/mm/ldt: Reserve high address-space range for the LDT
  2018-01-16 16:36 ` Joerg Roedel
@ 2018-01-16 16:36   ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-16 16:36 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Reserve 2MB/4MB of address space for mapping the LDT to
user-space.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
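The resulting layout at the top of the 32-bit address space then
looks roughly like this (each base address is defined relative to
the region above it):

	+---------------------+ FIXADDR_TOP
	|  fixmap             |
	+---------------------+ FIXADDR_START
	|  cpu_entry_area     |
	+---------------------+ CPU_ENTRY_AREA_BASE
	|  LDT mapping        |  <- new, one PMD (2MB with PAE, 4MB without)
	+---------------------+ LDT_BASE_ADDR
	|  pkmap              |
	+---------------------+ PKMAP_BASE
	|  vmalloc area       |
	+---------------------+ VMALLOC_END
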
 arch/x86/include/asm/pgtable_32_types.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable_32_types.h b/arch/x86/include/asm/pgtable_32_types.h
index ce245b0cdfca..3c30a7fcae68 100644
--- a/arch/x86/include/asm/pgtable_32_types.h
+++ b/arch/x86/include/asm/pgtable_32_types.h
@@ -47,9 +47,12 @@ extern bool __vmalloc_start_set; /* set once high_memory is set */
 #define CPU_ENTRY_AREA_BASE				\
 	((FIXADDR_START - PAGE_SIZE * (CPU_ENTRY_AREA_PAGES + 1)) & PMD_MASK)
 
-#define PKMAP_BASE		\
+#define LDT_BASE_ADDR		\
 	((CPU_ENTRY_AREA_BASE - PAGE_SIZE) & PMD_MASK)
 
+#define PKMAP_BASE		\
+	((LDT_BASE_ADDR - PAGE_SIZE) & PMD_MASK)
+
 #ifdef CONFIG_HIGHMEM
 # define VMALLOC_END	(PKMAP_BASE - 2 * PAGE_SIZE)
 #else
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [PATCH 07/16] x86/mm: Move two more functions from pgtable_64.h to pgtable.h
  2018-01-16 16:36 ` Joerg Roedel
@ 2018-01-16 16:36   ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-16 16:36 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

These two functions are required for PTI on 32 bit:

	* pgdp_maps_userspace()
	* pgd_large()

Also re-implement pgdp_maps_userspace() so that it will work
on 64 and 32 bit kernels.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
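Why the open-coded "(ptr & ~PAGE_MASK) < PAGE_SIZE / 2" check from
the 64-bit header is not usable on 32 bit (numbers assume the
default PAGE_OFFSET of 0xC0000000):

	legacy 2-level: sizeof(pgd_t) == 4, KERNEL_PGD_BOUNDARY == 768
	                -> user entries occupy bytes 0..3071 of the page
	PAE:            sizeof(pgd_t) == 8, KERNEL_PGD_BOUNDARY == 3
	                -> user entries occupy bytes 0..23

In neither case is the user/kernel split at half a page, so the
check is done in units of pgd entries against KERNEL_PGD_BOUNDARY
instead.
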
 arch/x86/include/asm/pgtable.h    | 16 ++++++++++++++++
 arch/x86/include/asm/pgtable_64.h | 15 ---------------
 2 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0a9f746cbdc1..abafe4d7fd3e 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1109,6 +1109,22 @@ static inline int pud_write(pud_t pud)
 	return pud_flags(pud) & _PAGE_RW;
 }
 
+/*
+ * Page table pages are page-aligned.  The lower half of the top
+ * level is used for userspace and the top half for the kernel.
+ *
+ * Returns true for parts of the PGD that map userspace and
+ * false for the parts that map the kernel.
+ */
+static inline bool pgdp_maps_userspace(void *__ptr)
+{
+	unsigned long ptr = (unsigned long)__ptr;
+
+	return (((ptr & ~PAGE_MASK) / sizeof(pgd_t)) < KERNEL_PGD_BOUNDARY);
+}
+
+static inline int pgd_large(pgd_t pgd) { return 0; }
+
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
 /*
  * All top-level PAGE_TABLE_ISOLATION page tables are order-1 pages
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 58d7f10e937d..3c5a73c8bb50 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -131,20 +131,6 @@ static inline pud_t native_pudp_get_and_clear(pud_t *xp)
 #endif
 }
 
-/*
- * Page table pages are page-aligned.  The lower half of the top
- * level is used for userspace and the top half for the kernel.
- *
- * Returns true for parts of the PGD that map userspace and
- * false for the parts that map the kernel.
- */
-static inline bool pgdp_maps_userspace(void *__ptr)
-{
-	unsigned long ptr = (unsigned long)__ptr;
-
-	return (ptr & ~PAGE_MASK) < (PAGE_SIZE / 2);
-}
-
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
 pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd);
 
@@ -208,7 +194,6 @@ extern void sync_global_pgds(unsigned long start, unsigned long end);
 /*
  * Level 4 access.
  */
-static inline int pgd_large(pgd_t pgd) { return 0; }
 #define mk_kernel_pgd(address) __pgd((address) | _KERNPG_TABLE)
 
 /* PUD - Level3 access */
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [PATCH 08/16] x86/pgtable/32: Allocate 8k page-tables when PTI is enabled
  2018-01-16 16:36 ` Joerg Roedel
@ 2018-01-16 16:36   ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-16 16:36 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Allocate a kernel and a user page-table root when PTI is
enabled. Also allocate a full page per root for PAE because
otherwise the bit to flip in cr3 to switch between them
would be non-constant, which creates a lot of hassle.
Keep that for a later optimization.
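
To illustrate what the constant bit buys, here is a small
user-space sketch of the address arithmetic. The PGD address is
made up; everything else follows from an 8k-aligned order-1
allocation:

#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

int main(void)
{
	/* One 8k-aligned order-1 allocation: kernel root in the
	 * first page, user root in the second.
	 */
	unsigned long kernel_pgd = 0x36a54000UL;	/* example, 8k aligned */
	unsigned long user_pgd   = kernel_pgd + PAGE_SIZE;

	/* Because the pair is 8k aligned, the two roots differ only
	 * in bit PAGE_SHIFT, so the entry code can switch cr3 with a
	 * constant or/and instead of loading a per-task value.
	 */
	printf("kernel cr3: %#lx\n", kernel_pgd);
	printf("user   cr3: %#lx\n", kernel_pgd |  (1UL << PAGE_SHIFT));
	printf("back:       %#lx\n", user_pgd   & ~(1UL << PAGE_SHIFT));

	return 0;
}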

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/kernel/head_32.S | 23 ++++++++++++++++++-----
 arch/x86/mm/pgtable.c     | 11 ++++++-----
 2 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index c29020907886..fc550559bf58 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -512,28 +512,41 @@ ENTRY(initial_code)
 ENTRY(setup_once_ref)
 	.long setup_once
 
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+#define	PGD_ALIGN	(2 * PAGE_SIZE)
+#define PTI_USER_PGD_FILL	1024
+#else
+#define	PGD_ALIGN	(PAGE_SIZE)
+#define PTI_USER_PGD_FILL	0
+#endif
 /*
  * BSS section
  */
 __PAGE_ALIGNED_BSS
-	.align PAGE_SIZE
+	.align PGD_ALIGN
 #ifdef CONFIG_X86_PAE
 .globl initial_pg_pmd
 initial_pg_pmd:
 	.fill 1024*KPMDS,4,0
+	.fill PTI_USER_PGD_FILL,4,0
 #else
 .globl initial_page_table
 initial_page_table:
 	.fill 1024,4,0
+	.fill PTI_USER_PGD_FILL,4,0
 #endif
+	.align PGD_ALIGN
 initial_pg_fixmap:
 	.fill 1024,4,0
-.globl empty_zero_page
-empty_zero_page:
-	.fill 4096,1,0
+	.fill PTI_USER_PGD_FILL,4,0
 .globl swapper_pg_dir
+	.align PGD_ALIGN
 swapper_pg_dir:
 	.fill 1024,4,0
+	.fill PTI_USER_PGD_FILL,4,0
+.globl empty_zero_page
+empty_zero_page:
+	.fill 4096,1,0
 EXPORT_SYMBOL(empty_zero_page)
 
 /*
@@ -542,7 +555,7 @@ EXPORT_SYMBOL(empty_zero_page)
 #ifdef CONFIG_X86_PAE
 __PAGE_ALIGNED_DATA
 	/* Page-aligned for the benefit of paravirt? */
-	.align PAGE_SIZE
+	.align PGD_ALIGN
 ENTRY(initial_page_table)
 	.long	pa(initial_pg_pmd+PGD_IDENT_ATTR),0	/* low identity map */
 # if KPMDS == 3
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 004abf9ebf12..48abefd95924 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -313,7 +313,7 @@ static int __init pgd_cache_init(void)
 	 * When PAE kernel is running as a Xen domain, it does not use
 	 * shared kernel pmd. And this requires a whole page for pgd.
 	 */
-	if (!SHARED_KERNEL_PMD)
+	if (static_cpu_has(X86_FEATURE_PTI) || !SHARED_KERNEL_PMD)
 		return 0;
 
 	/*
@@ -337,8 +337,9 @@ static inline pgd_t *_pgd_alloc(void)
 	 * If no SHARED_KERNEL_PMD, PAE kernel is running as a Xen domain.
 	 * We allocate one page for pgd.
 	 */
-	if (!SHARED_KERNEL_PMD)
-		return (pgd_t *)__get_free_page(PGALLOC_GFP);
+	if (static_cpu_has(X86_FEATURE_PTI) || !SHARED_KERNEL_PMD)
+		return (pgd_t *)__get_free_pages(PGALLOC_GFP,
+						 PGD_ALLOCATION_ORDER);
 
 	/*
 	 * Now PAE kernel is not running as a Xen domain. We can allocate
@@ -349,8 +350,8 @@ static inline pgd_t *_pgd_alloc(void)
 
 static inline void _pgd_free(pgd_t *pgd)
 {
-	if (!SHARED_KERNEL_PMD)
-		free_page((unsigned long)pgd);
+	if (static_cpu_has(X86_FEATURE_PTI) || !SHARED_KERNEL_PMD)
+		free_pages((unsigned long)pgd, PGD_ALLOCATION_ORDER);
 	else
 		kmem_cache_free(pgd_cache, pgd);
 }
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [PATCH 09/16] x86/mm/pti: Clone CPU_ENTRY_AREA on PMD level on x86_32
  2018-01-16 16:36 ` Joerg Roedel
@ 2018-01-16 16:36   ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-16 16:36 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Cloning on the P4D level would clone the complete kernel
address space into the user-space page-tables for PAE
kernels. Cloning on PMD level is fine for PAE and legacy
paging.
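
For a rough feeling of the granularity involved, a user-space
sketch; the shift values are the architectural ones for PAE
paging, and the size of the CPU_ENTRY_AREA itself is left out
because it depends on the configuration:

#include <stdio.h>

#define PAE_PGDIR_SHIFT	30	/* one top-level entry maps 1 GiB */
#define PAE_PMD_SHIFT	21	/* one PMD entry maps 2 MiB       */

int main(void)
{
	/* With only 1GB of kernel address space there is a single
	 * kernel top-level entry, so copying it would hand the whole
	 * kernel mapping to the user page-table.  Copying PMD
	 * entries shares just the 2MB pieces we actually need.
	 */
	printf("top-level entry covers %4lu MiB\n",
	       (1UL << PAE_PGDIR_SHIFT) >> 20);
	printf("PMD entry covers       %4lu MiB\n",
	       (1UL << PAE_PMD_SHIFT) >> 20);

	return 0;
}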

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/mm/pti.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index ce38f165489b..20be21301a59 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -308,6 +308,7 @@ pti_clone_pmds(unsigned long start, unsigned long end, pmdval_t clear)
 	}
 }
 
+#ifdef CONFIG_X86_64
 /*
  * Clone a single p4d (i.e. a top-level entry on 4-level systems and a
  * next-level entry on 5-level systems.
@@ -322,13 +323,29 @@ static void __init pti_clone_p4d(unsigned long addr)
 	kernel_p4d = p4d_offset(kernel_pgd, addr);
 	*user_p4d = *kernel_p4d;
 }
+#endif
 
 /*
  * Clone the CPU_ENTRY_AREA into the user space visible page table.
  */
 static void __init pti_clone_user_shared(void)
 {
+#ifdef CONFIG_X86_32
+	/*
+	 * On 32 bit PAE systems with 1GB of Kernel address space there is only
+	 * one pgd/p4d for the whole kernel. Cloning that would map the whole
+	 * address space into the user page-tables, making PTI useless. So clone
+	 * the page-table on the PMD level to prevent that.
+	 */
+	unsigned long start, end;
+
+	start = CPU_ENTRY_AREA_BASE;
+	end   = start + (PAGE_SIZE * CPU_ENTRY_AREA_PAGES);
+
+	pti_clone_pmds(start, end, _PAGE_GLOBAL);
+#else
 	pti_clone_p4d(CPU_ENTRY_AREA_BASE);
+#endif
 }
 
 /*
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [PATCH 10/16] x86/mm/pti: Populate valid user pud entries
  2018-01-16 16:36 ` Joerg Roedel
@ 2018-01-16 16:36   ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-16 16:36 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

With PAE paging we don't have PGD and P4D levels in the
page-table, instead the PUD level is the highest one.

In PAE page-tables at the top-level most bits we usually set
with _KERNPG_TABLE are reserved, resulting in a #GP when
they are loaded by the processor.

Work around this by populating PUD entries in the user
page-table only with _PAGE_PRESENT set.

I am pretty sure there is a cleaner way to do this, but
until I find it use this #ifdef solution.
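
For reference, a sketch of the flag values involved. Which bits a
PAE top-level entry rejects is my reading of the SDM's PDPTE
description (R/W, U/S, Accessed and Dirty are not allowed there),
so take it as an approximation rather than a definitive list:

#include <stdio.h>

/* Simplified flag values matching the x86 page-table bit layout. */
#define _PAGE_PRESENT	0x001UL
#define _PAGE_RW	0x002UL
#define _PAGE_ACCESSED	0x020UL
#define _PAGE_DIRTY	0x040UL
#define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | \
			 _PAGE_ACCESSED | _PAGE_DIRTY)

/* Approximation of what a PAE PDPTE does not accept. */
#define PAE_PDPTE_RESERVED	(_PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)

int main(void)
{
	printf("_KERNPG_TABLE           = %#lx\n", _KERNPG_TABLE);
	printf("rejected by a PAE PDPTE = %#lx\n",
	       _KERNPG_TABLE & PAE_PDPTE_RESERVED);
	printf("what the patch sets     = %#lx\n",
	       (unsigned long)_PAGE_PRESENT);

	return 0;
}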

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/mm/pti.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 20be21301a59..6b6bfd13350e 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -202,8 +202,12 @@ static __init pmd_t *pti_user_pagetable_walk_pmd(unsigned long address)
 		unsigned long new_pmd_page = __get_free_page(gfp);
 		if (!new_pmd_page)
 			return NULL;
-
+#ifdef CONFIG_X86_PAE
+		/* TODO: There must be a cleaner way to do this */
+		set_pud(pud, __pud(_PAGE_PRESENT | __pa(new_pmd_page)));
+#else
 		set_pud(pud, __pud(_KERNPG_TABLE | __pa(new_pmd_page)));
+#endif
 	}
 
 	return pmd_offset(pud, address);
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [PATCH 11/16] x86/mm/pgtable: Move pti_set_user_pgd() to pgtable.h
  2018-01-16 16:36 ` Joerg Roedel
@ 2018-01-16 16:36   ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-16 16:36 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

There it is also usable from 32 bit code.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/pgtable.h    | 23 +++++++++++++++++++++++
 arch/x86/include/asm/pgtable_64.h | 21 ---------------------
 2 files changed, 23 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index abafe4d7fd3e..248721971532 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -618,8 +618,31 @@ static inline int is_new_memtype_allowed(u64 paddr, unsigned long size,
 
 pmd_t *populate_extra_pmd(unsigned long vaddr);
 pte_t *populate_extra_pte(unsigned long vaddr);
+
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd);
+
+/*
+ * Take a PGD location (pgdp) and a pgd value that needs to be set there.
+ * Populates the user and returns the resulting PGD that must be set in
+ * the kernel copy of the page tables.
+ */
+static inline pgd_t pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
+{
+	if (!static_cpu_has(X86_FEATURE_PTI))
+		return pgd;
+	return __pti_set_user_pgd(pgdp, pgd);
+}
+#else   /* CONFIG_PAGE_TABLE_ISOLATION */
+static inline pgd_t pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
+{
+	return pgd;
+}
+#endif  /* CONFIG_PAGE_TABLE_ISOLATION */
+
 #endif	/* __ASSEMBLY__ */
 
+
 #ifdef CONFIG_X86_32
 # include <asm/pgtable_32.h>
 #else
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 3c5a73c8bb50..50a02a32a0b3 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -131,27 +131,6 @@ static inline pud_t native_pudp_get_and_clear(pud_t *xp)
 #endif
 }
 
-#ifdef CONFIG_PAGE_TABLE_ISOLATION
-pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd);
-
-/*
- * Take a PGD location (pgdp) and a pgd value that needs to be set there.
- * Populates the user and returns the resulting PGD that must be set in
- * the kernel copy of the page tables.
- */
-static inline pgd_t pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
-{
-	if (!static_cpu_has(X86_FEATURE_PTI))
-		return pgd;
-	return __pti_set_user_pgd(pgdp, pgd);
-}
-#else
-static inline pgd_t pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
-{
-	return pgd;
-}
-#endif
-
 static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
 {
 #if defined(CONFIG_PAGE_TABLE_ISOLATION) && !defined(CONFIG_X86_5LEVEL)
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [PATCH 12/16] x86/mm/pae: Populate the user page-table with user pgd's
  2018-01-16 16:36 ` Joerg Roedel
@ 2018-01-16 16:36   ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-16 16:36 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

This is the last part of the page-table setup for PAE
before we can add the CR3 switch to the entry code.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/pgtable-3level.h | 3 +++
 arch/x86/mm/pti.c                     | 7 +++++++
 2 files changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index bc4af5453802..910f0b35370e 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -98,6 +98,9 @@ static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
 
 static inline void native_set_pud(pud_t *pudp, pud_t pud)
 {
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+	pud.p4d.pgd = pti_set_user_pgd(&pudp->p4d.pgd, pud.p4d.pgd);
+#endif
 	set_64bit((unsigned long long *)(pudp), native_pud_val(pud));
 }
 
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 6b6bfd13350e..a561b5625d6c 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -122,6 +122,7 @@ pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
 	 */
 	kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd;
 
+#ifdef CONFIG_X86_64
 	/*
 	 * If this is normal user memory, make it NX in the kernel
 	 * pagetables so that, if we somehow screw up and return to
@@ -134,10 +135,16 @@ pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
 	 *     may execute from it
 	 *  - we don't have NX support
 	 *  - we're clearing the PGD (i.e. the new pgd is not present).
+	 *  - We run on a 32 bit kernel. 2-level paging doesn't support NX at
+	 *    all and PAE paging does not support it on the PGD level. We can
+	 *    set it in the PMD level there in the future, but that means we
+	 *    need to unshare the PMDs between the kernel and the user
+	 *    page-tables.
 	 */
 	if ((pgd.pgd & (_PAGE_USER|_PAGE_PRESENT)) == (_PAGE_USER|_PAGE_PRESENT) &&
 	    (__supported_pte_mask & _PAGE_NX))
 		pgd.pgd |= _PAGE_NX;
+#endif
 
 	/* return the copy of the PGD we want the kernel to use: */
 	return pgd;
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [PATCH 13/16] x86/mm/pti: Add an overflow check to pti_clone_pmds()
  2018-01-16 16:36 ` Joerg Roedel
@ 2018-01-16 16:36   ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-16 16:36 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

The addr counter will overflow if we clone the last PMD of
the address space, resulting in an endless loop.

Check for that and bail out of the loop when it happens.
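
A small user-space demonstration of the wrap-around, with
explicit 32-bit arithmetic so it behaves the same on a 64-bit
build machine; the address range is only an example whose last
PMD touches the top of the 32-bit address space:

#include <stdio.h>
#include <stdint.h>

#define PMD_SIZE	(1U << 21)	/* 2MB with PAE paging */

int main(void)
{
	uint32_t start = 0xffe00000U;	/* last PMD below 4GB */
	uint32_t end   = 0xffffffffU;	/* range ends at the very top */
	uint32_t addr;
	int cloned = 0, wrapped = 0;

	for (addr = start; addr < end; addr += PMD_SIZE) {
		/* The check added by the patch: once addr wraps past
		 * 4GB it becomes smaller than start again.  Without
		 * the check the loop restarts from 0 and never
		 * terminates, because addr stays below end.
		 */
		if (addr < start) {
			wrapped = 1;
			break;
		}
		cloned++;
	}

	printf("cloned %d PMD(s), wrapped: %s\n",
	       cloned, wrapped ? "yes" : "no");

	return 0;
}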

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/mm/pti.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index a561b5625d6c..faea5faeddc5 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -293,6 +293,10 @@ pti_clone_pmds(unsigned long start, unsigned long end, pmdval_t clear)
 		p4d_t *p4d;
 		pud_t *pud;
 
+		/* Overflow check */
+		if (addr < start)
+			break;
+
 		pgd = pgd_offset_k(addr);
 		if (WARN_ON(pgd_none(*pgd)))
 			return;
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [PATCH 14/16] x86/mm/legacy: Populate the user page-table with user pgd's
  2018-01-16 16:36 ` Joerg Roedel
@ 2018-01-16 16:36   ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-16 16:36 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Also populate the user-space pgd's in the user page-table.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/include/asm/pgtable-2level.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/include/asm/pgtable-2level.h b/arch/x86/include/asm/pgtable-2level.h
index 685ffe8a0eaf..d96486d23c58 100644
--- a/arch/x86/include/asm/pgtable-2level.h
+++ b/arch/x86/include/asm/pgtable-2level.h
@@ -19,6 +19,9 @@ static inline void native_set_pte(pte_t *ptep , pte_t pte)
 
 static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
 {
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+	pmd.pud.p4d.pgd = pti_set_user_pgd(&pmdp->pud.p4d.pgd, pmd.pud.p4d.pgd);
+#endif
 	*pmdp = pmd;
 }
 
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [PATCH 15/16] x86/entry/32: Switch between kernel and user cr3 on entry/exit
  2018-01-16 16:36 ` Joerg Roedel
@ 2018-01-16 16:36   ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-16 16:36 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Add the cr3 switches between the kernel and the user
page-table when PTI is enabled.
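
For illustration, a C model of the two macros; the real switch is
the assembly below, and the ALTERNATIVE/X86_FEATURE_PTI patching
is modelled with a plain flag here:

#include <stdio.h>

#define PAGE_SHIFT		12
#define PTI_SWITCH_MASK		(1UL << PAGE_SHIFT)

static int cpu_feature_pti = 1;		/* stand-in for X86_FEATURE_PTI */

static unsigned long switch_to_kernel_cr3(unsigned long cr3)
{
	if (!cpu_feature_pti)		/* ALTERNATIVE: jmp .Lend_\@ */
		return cr3;
	return cr3 & ~PTI_SWITCH_MASK;	/* kernel PGD is the lower page */
}

static unsigned long switch_to_user_cr3(unsigned long cr3)
{
	if (!cpu_feature_pti)
		return cr3;
	return cr3 | PTI_SWITCH_MASK;	/* user PGD is one page above */
}

int main(void)
{
	unsigned long cr3 = 0x36a54000UL;	/* made-up 8k-aligned pair */

	cr3 = switch_to_user_cr3(cr3);
	printf("user   cr3: %#lx\n", cr3);
	cr3 = switch_to_kernel_cr3(cr3);
	printf("kernel cr3: %#lx\n", cr3);

	return 0;
}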

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/entry/entry_32.S | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 14018eeb11c3..6a1d9f1e1f89 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -221,6 +221,25 @@
 	POP_GS_EX
 .endm
 
+#define PTI_SWITCH_MASK         (1 << PAGE_SHIFT)
+
+.macro SWITCH_TO_KERNEL_CR3
+        ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+        movl    %cr3, %edi
+        andl    $(~PTI_SWITCH_MASK), %edi
+        movl    %edi, %cr3
+.Lend_\@:
+.endm
+
+.macro SWITCH_TO_USER_CR3
+        ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+        mov     %cr3, %edi
+        /* Flip the PGD to the user version */
+        orl     $(PTI_SWITCH_MASK), %edi
+        mov     %edi, %cr3
+.Lend_\@:
+.endm
+
 /*
  * Switch from the entry-trampline stack to the kernel stack of the
  * running task.
@@ -240,6 +259,7 @@
 	.endif
 
 	pushl %edi
+	SWITCH_TO_KERNEL_CR3
 	movl  %esp, %edi
 
 	/*
@@ -309,9 +329,12 @@
 	.endif
 
 	pushl 4(%edi)   /* fs */
+	pushl  (%edi)   /* edi */
+
+	SWITCH_TO_USER_CR3
 	
 	/* Restore user %edi and user %fs */
-	movl (%edi), %edi
+	popl %edi
 	popl %fs
 
 .Lend_\@:
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 183+ messages in thread

* [PATCH 16/16] x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32
  2018-01-16 16:36 ` Joerg Roedel
@ 2018-01-16 16:36   ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-16 16:36 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, jroedel, joro

From: Joerg Roedel <jroedel@suse.de>

Allow PTI to be compiled on x86_32.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 security/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/security/Kconfig b/security/Kconfig
index b0cb9a5f9448..93d85fda0f54 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -57,7 +57,7 @@ config SECURITY_NETWORK
 config PAGE_TABLE_ISOLATION
 	bool "Remove the kernel mapping in user mode"
 	default y
-	depends on X86_64 && !UML
+	depends on X86 && !UML
 	help
 	  This feature reduces the number of hardware side channels by
 	  ensuring that the majority of kernel addresses are not mapped
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 183+ messages in thread

* Re: [PATCH 06/16] x86/mm/ldt: Reserve high address-space range for the LDT
  2018-01-16 16:36   ` Joerg Roedel
@ 2018-01-16 16:52     ` Peter Zijlstra
  -1 siblings, 0 replies; 183+ messages in thread
From: Peter Zijlstra @ 2018-01-16 16:52 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	linux-mm, Linus Torvalds, Andy Lutomirski, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

On Tue, Jan 16, 2018 at 05:36:49PM +0100, Joerg Roedel wrote:
> From: Joerg Roedel <jroedel@suse.de>
> 
> Reserve 2MB/4MB of address space for mapping the LDT to
> user-space.

LDT is 64k, we need 2 per CPU, and NR_CPUS <= 64 on 32bit, that gives
64K*2*64=8M > 2M.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 06/16] x86/mm/ldt: Reserve high address-space range for the LDT
  2018-01-16 16:52     ` Peter Zijlstra
@ 2018-01-16 17:13       ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-16 17:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	linux-mm, Linus Torvalds, Andy Lutomirski, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

Hi Peter,

On Tue, Jan 16, 2018 at 05:52:13PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 16, 2018 at 05:36:49PM +0100, Joerg Roedel wrote:
> > From: Joerg Roedel <jroedel@suse.de>
> > 
> > Reserve 2MB/4MB of address space for mapping the LDT to
> > user-space.
> 
> LDT is 64k, we need 2 per CPU, and NR_CPUS <= 64 on 32bit, that gives
> 64K*2*64=8M > 2M.

Thanks, I'll fix that in the next version.


	Joerg

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 06/16] x86/mm/ldt: Reserve high address-space range for the LDT
  2018-01-16 17:13       ` Joerg Roedel
@ 2018-01-16 17:31         ` Peter Zijlstra
  -1 siblings, 0 replies; 183+ messages in thread
From: Peter Zijlstra @ 2018-01-16 17:31 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	linux-mm, Linus Torvalds, Andy Lutomirski, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

On Tue, Jan 16, 2018 at 06:13:43PM +0100, Joerg Roedel wrote:
> Hi Peter,
> 
> On Tue, Jan 16, 2018 at 05:52:13PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 16, 2018 at 05:36:49PM +0100, Joerg Roedel wrote:
> > > From: Joerg Roedel <jroedel@suse.de>
> > > 
> > > Reserve 2MB/4MB of address space for mapping the LDT to
> > > user-space.
> > 
> > LDT is 64k, we need 2 per CPU, and NR_CPUS <= 64 on 32bit, that gives
> > 64K*2*64=8M > 2M.
> 
> Thanks, I'll fix that in the next version.

Just lower the max SMP setting until it fits or something. 32bit is too
address space starved for lots of CPU in any case, 64 CPUs on 32bit is
absolutely insane.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 06/16] x86/mm/ldt: Reserve high address-space range for the LDT
  2018-01-16 17:31         ` Peter Zijlstra
@ 2018-01-16 17:34           ` Waiman Long
  -1 siblings, 0 replies; 183+ messages in thread
From: Waiman Long @ 2018-01-16 17:34 UTC (permalink / raw)
  To: Peter Zijlstra, Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	linux-mm, Linus Torvalds, Andy Lutomirski, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

On 01/16/2018 12:31 PM, Peter Zijlstra wrote:
> On Tue, Jan 16, 2018 at 06:13:43PM +0100, Joerg Roedel wrote:
>> Hi Peter,
>>
>> On Tue, Jan 16, 2018 at 05:52:13PM +0100, Peter Zijlstra wrote:
>>> On Tue, Jan 16, 2018 at 05:36:49PM +0100, Joerg Roedel wrote:
>>>> From: Joerg Roedel <jroedel@suse.de>
>>>>
>>>> Reserve 2MB/4MB of address space for mapping the LDT to
>>>> user-space.
>>> LDT is 64k, we need 2 per CPU, and NR_CPUS <= 64 on 32bit, that gives
>>> 64K*2*64=8M > 2M.
>> Thanks, I'll fix that in the next version.
> Just lower the max SMP setting until it fits or something. 32bit is too
> address space starved for lots of CPU in any case, 64 CPUs on 32bit is
> absolutely insane.

Maybe we can just scale the amount of reserved space according to the
current NR_CPUS setting. In this way, we won't waste more memory than is
necessary.
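
Back-of-the-envelope for that, using the 64k-per-LDT and
two-LDTs-per-CPU numbers from earlier in the thread:

#include <stdio.h>

#define LDT_SIZE	(64 * 1024)
#define LDTS_PER_CPU	2

int main(void)
{
	int nr_cpus[] = { 1, 8, 64 };

	for (int i = 0; i < 3; i++) {
		unsigned long need = (unsigned long)LDT_SIZE *
				     LDTS_PER_CPU * nr_cpus[i];

		printf("NR_CPUS=%-3d -> %5lu KiB\n", nr_cpus[i], need >> 10);
	}

	return 0;
}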

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 07/16] x86/mm: Move two more functions from pgtable_64.h to pgtable.h
  2018-01-16 16:36   ` Joerg Roedel
@ 2018-01-16 18:03     ` Dave Hansen
  -1 siblings, 0 replies; 183+ messages in thread
From: Dave Hansen @ 2018-01-16 18:03 UTC (permalink / raw)
  To: Joerg Roedel, Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, aliguori,
	daniel.gruss, hughd, keescook, Andrea Arcangeli, Waiman Long,
	jroedel

On 01/16/2018 08:36 AM, Joerg Roedel wrote:
> +/*
> + * Page table pages are page-aligned.  The lower half of the top
> + * level is used for userspace and the top half for the kernel.
> + *
> + * Returns true for parts of the PGD that map userspace and
> + * false for the parts that map the kernel.
> + */
> +static inline bool pgdp_maps_userspace(void *__ptr)
> +{
> +	unsigned long ptr = (unsigned long)__ptr;
> +
> +	return (((ptr & ~PAGE_MASK) / sizeof(pgd_t)) < KERNEL_PGD_BOUNDARY);
> +}

One of the reasons to implement it the other way:

-	return (ptr & ~PAGE_MASK) < (PAGE_SIZE / 2);

is that the compiler can do this all quickly.  KERNEL_PGD_BOUNDARY
depends on PAGE_OFFSET which depends on a variable.  IOW, the compiler
can't do it.

How much worse is the code that this generates?
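
For reference, the two variants under discussion, written out as a standalone sketch (pgd_t, PAGE_MASK, PAGE_SIZE and KERNEL_PGD_BOUNDARY are the kernel's existing definitions; the _64 name below is made up here only for the comparison):

/* Variant from the patch: the boundary is derived from KERNEL_PGD_BOUNDARY,
 * which on 32 bit depends on the configured kernel/user split. */
static inline bool pgdp_maps_userspace(void *__ptr)
{
	unsigned long ptr = (unsigned long)__ptr;

	return ((ptr & ~PAGE_MASK) / sizeof(pgd_t)) < KERNEL_PGD_BOUNDARY;
}

/* 64-bit-only variant Dave refers to: the PGD page is always split in
 * the middle, so the bound is a compile-time constant. */
static inline bool pgdp_maps_userspace_64(void *__ptr)
{
	unsigned long ptr = (unsigned long)__ptr;

	return (ptr & ~PAGE_MASK) < (PAGE_SIZE / 2);
}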

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 10/16] x86/mm/pti: Populate valid user pud entries
  2018-01-16 16:36   ` Joerg Roedel
@ 2018-01-16 18:06     ` Dave Hansen
  -1 siblings, 0 replies; 183+ messages in thread
From: Dave Hansen @ 2018-01-16 18:06 UTC (permalink / raw)
  To: Joerg Roedel, Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, aliguori,
	daniel.gruss, hughd, keescook, Andrea Arcangeli, Waiman Long,
	jroedel

On 01/16/2018 08:36 AM, Joerg Roedel wrote:
> 
> In PAE page-tables at the top-level most bits we usually set
> with _KERNPG_TABLE are reserved, resulting in a #GP when
> they are loaded by the processor.

Can you save me the trip to the SDM and remind me which bits actually
cause trouble here?

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 12/16] x86/mm/pae: Populate the user page-table with user pgd's
  2018-01-16 16:36   ` Joerg Roedel
@ 2018-01-16 18:11     ` Dave Hansen
  -1 siblings, 0 replies; 183+ messages in thread
From: Dave Hansen @ 2018-01-16 18:11 UTC (permalink / raw)
  To: Joerg Roedel, Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, aliguori,
	daniel.gruss, hughd, keescook, Andrea Arcangeli, Waiman Long,
	jroedel

On 01/16/2018 08:36 AM, Joerg Roedel wrote:
> +#ifdef CONFIG_X86_64
>  	/*
>  	 * If this is normal user memory, make it NX in the kernel
>  	 * pagetables so that, if we somehow screw up and return to
> @@ -134,10 +135,16 @@ pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
>  	 *     may execute from it
>  	 *  - we don't have NX support
>  	 *  - we're clearing the PGD (i.e. the new pgd is not present).
> +	 *  - We run on a 32 bit kernel. 2-level paging doesn't support NX at
> +	 *    all and PAE paging does not support it on the PGD level. We can
> +	 *    set it in the PMD level there in the future, but that means we
> +	 *    need to unshare the PMDs between the kernel and the user
> +	 *    page-tables.
>  	 */
>  	if ((pgd.pgd & (_PAGE_USER|_PAGE_PRESENT)) == (_PAGE_USER|_PAGE_PRESENT) &&
>  	    (__supported_pte_mask & _PAGE_NX))
>  		pgd.pgd |= _PAGE_NX;
> +#endif

Ugh.  The ghosts of PAE have come back to haunt us.

Could we do:

static inline bool pgd_supports_nx(unsigned long)
{
#ifdef CONFIG_X86_64
	return (__supported_pte_mask & _PAGE_NX);
#else
	/* No 32-bit page tables support NX at PGD level */
	return 0;
#endif
}

Nobody will ever spot the #ifdef the way you laid it out.
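
A compile-ready form of that sketch could look like the following (pgd_supports_nx() is only proposed in this thread, not an existing kernel helper; the argument is unused here, mirroring the suggestion above):

static inline bool pgd_supports_nx(unsigned long pgd)
{
#ifdef CONFIG_X86_64
	return __supported_pte_mask & _PAGE_NX;
#else
	/* No 32-bit page tables support NX at the PGD level. */
	return false;
#endif
}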

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-16 16:36 ` Joerg Roedel
@ 2018-01-16 18:14   ` Dave Hansen
  -1 siblings, 0 replies; 183+ messages in thread
From: Dave Hansen @ 2018-01-16 18:14 UTC (permalink / raw)
  To: Joerg Roedel, Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, aliguori,
	daniel.gruss, hughd, keescook, Andrea Arcangeli, Waiman Long,
	jroedel

Joerg,

Very cool!

I really appreciate you putting this together.  I don't see any real
showstoppers or things that I think will *break* 64-bit.  I just hope
that we can merge this _slowly_ in case it breaks 64-bit along the way.

I didn't look at the assembly in too much detail.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 01/16] x86/entry/32: Rename TSS_sysenter_sp0 to TSS_sysenter_stack
  2018-01-16 16:36   ` Joerg Roedel
@ 2018-01-16 18:35     ` Thomas Gleixner
  -1 siblings, 0 replies; 183+ messages in thread
From: Thomas Gleixner @ 2018-01-16 18:35 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Ingo Molnar, H . Peter Anvin, x86, linux-kernel, linux-mm,
	Linus Torvalds, Andy Lutomirski, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

On Tue, 16 Jan 2018, Joerg Roedel wrote:

> From: Joerg Roedel <jroedel@suse.de>
> 
> The stack addresss doesn't need to be stored in tss.sp0 if
> we switch manually like on sysenter. Rename the offset so
> that it still makes sense when we its location.

-ENOSENTENCE

Other than that. Makes sense.

> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/entry/entry_32.S        | 2 +-
>  arch/x86/kernel/asm-offsets_32.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index a1f28a54f23a..eb8c5615777b 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -401,7 +401,7 @@ ENTRY(xen_sysenter_target)
>   * 0(%ebp) arg6
>   */
>  ENTRY(entry_SYSENTER_32)
> -	movl	TSS_sysenter_sp0(%esp), %esp
> +	movl	TSS_sysenter_stack(%esp), %esp
>  .Lsysenter_past_esp:
>  	pushl	$__USER_DS		/* pt_regs->ss */
>  	pushl	%ebp			/* pt_regs->sp (stashed in bp) */
> diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
> index fa1261eefa16..654229bac2fc 100644
> --- a/arch/x86/kernel/asm-offsets_32.c
> +++ b/arch/x86/kernel/asm-offsets_32.c
> @@ -47,7 +47,7 @@ void foo(void)
>  	BLANK();
>  
>  	/* Offset from the sysenter stack to tss.sp0 */
> -	DEFINE(TSS_sysenter_sp0, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
> +	DEFINE(TSS_sysenter_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
>  	       offsetofend(struct cpu_entry_area, entry_stack_page.stack));
>  
>  #ifdef CONFIG_CC_STACKPROTECTOR
> -- 
> 2.13.6
> 
> 

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-16 16:36 ` Joerg Roedel
@ 2018-01-16 18:59   ` Linus Torvalds
  -1 siblings, 0 replies; 183+ messages in thread
From: Linus Torvalds @ 2018-01-16 18:59 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-mm,
	Andy Lutomirski, Dave Hansen, Josh Poimboeuf, Juergen Gross,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	Brian Gerst, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg KH, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Joerg Roedel

On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <joro@8bytes.org> wrote:
>
> here is my current WIP code to enable PTI on x86-32. It is
> still in a pretty early state, but it successfully boots my
> KVM guest with PAE and with legacy paging. The existing PTI
> code for x86-64 already prepares a lot of the stuff needed
> for 32 bit too, thanks for that to all the people involved
> in its development :)

Yes, I'm very happy to see that this is actually not nearly as bad as
I feared it might be.

Some of those #ifdef's in the PTI code you added might want more
commentary about what the exact differences are. And maybe they could
be done more cleanly with some abstraction. But nothing looked
_horrible_.

> The code has not run on bare-metal yet, I'll test that in
> the next days once I setup a 32 bit box again. I also havn't
> tested Wine and DosEMU yet, so this might also be broken.

.. and please run all the segment and syscall selfchecks that Andy has written.

But yes, checking bare metal, and checking the "odd" applications like
Wine and dosemu (and kvm etc) within the PTI kernel is certainly a
good idea.

> One of the things that are surely broken is XEN_PV support.
> I'd appreciate any help with testing and bugfixing on that
> front.

Xen PV and PTI don't work together even on x86-64 afaik; the Xen
people apparently felt it wasn't worth it.  See the

        if (hypervisor_is_type(X86_HYPER_XEN_PV)) {
                pti_print_if_insecure("disabled on XEN PV.");
                return;
        }

in pti_check_boottime_disable().

            Linus

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-16 18:59   ` Linus Torvalds
@ 2018-01-16 19:02     ` Dave Hansen
  -1 siblings, 0 replies; 183+ messages in thread
From: Dave Hansen @ 2018-01-16 19:02 UTC (permalink / raw)
  To: Linus Torvalds, Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-mm,
	Andy Lutomirski, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, Liguori, Anthony, Daniel Gruss, Hugh Dickins,
	Kees Cook, Andrea Arcangeli, Waiman Long, Joerg Roedel

On 01/16/2018 10:59 AM, Linus Torvalds wrote:
>> The code has not run on bare-metal yet, I'll test that in
>> the next days once I setup a 32 bit box again. I also havn't
>> tested Wine and DosEMU yet, so this might also be broken.
> .. and please run all the segment and syscall selfchecks that Andy has written.
> 
> But yes, checking bare metal, and checking the "odd" applications like
> Wine and dosemu (and kvm etc) within the PTI kernel is certainly a
> good idea.

I tried to document a list of the "gotchas" that tripped us up during
the 64-bit effort under "Testing":

> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/pti&id=01c9b17bf673b05bb401b76ec763e9730ccf1376

NMIs were a biggie too.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 07/16] x86/mm: Move two more functions from pgtable_64.h to pgtable.h
  2018-01-16 18:03     ` Dave Hansen
@ 2018-01-16 19:11       ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-16 19:11 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	linux-mm, Linus Torvalds, Andy Lutomirski, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

On Tue, Jan 16, 2018 at 10:03:09AM -0800, Dave Hansen wrote:
> On 01/16/2018 08:36 AM, Joerg Roedel wrote:
> > +	return (((ptr & ~PAGE_MASK) / sizeof(pgd_t)) < KERNEL_PGD_BOUNDARY);
> > +}
> 
> One of the reasons to implement it the other way:
> 
> -	return (ptr & ~PAGE_MASK) < (PAGE_SIZE / 2);
> 
> is that the compiler can do this all quickly.  KERNEL_PGD_BOUNDARY
> depends on PAGE_OFFSET which depends on a variable.  IOW, the compiler
> can't do it.
> 
> How much worse is the code that this generates?

I haven't looked at the actual code this generates, but the
(PAGE_SIZE / 2) comparison doesn't work on 32 bit, where the address
space is not always evenly split. I'll look into a better way to check
this.

Thanks,

	Joerg

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-16 18:59   ` Linus Torvalds
@ 2018-01-16 19:21     ` Andrew Cooper
  -1 siblings, 0 replies; 183+ messages in thread
From: Andrew Cooper @ 2018-01-16 19:21 UTC (permalink / raw)
  To: Linus Torvalds, Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-mm,
	Andy Lutomirski, Dave Hansen, Josh Poimboeuf, Juergen Gross,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	Brian Gerst, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg KH, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Joerg Roedel, Juergen Gross, Jan Beulich

On 16/01/18 18:59, Linus Torvalds wrote:
> On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <joro@8bytes.org> wrote:
>> One of the things that are surely broken is XEN_PV support.
>> I'd appreciate any help with testing and bugfixing on that
>> front.
> Xen PV and PTI don't work together even on x86-64 afaik, the Xen
> people apparently felt it wasn't worth it.  See the
>
>         if (hypervisor_is_type(X86_HYPER_XEN_PV)) {
>                 pti_print_if_insecure("disabled on XEN PV.");
>                 return;
>         }

64bit PV guests under Xen already have split pagetables.  It is a base
and necessary part of the ABI, because segment limits stopped working in
64bit.

32bit PV guests aren't split, but by far the most efficient way of doing
this is to introduce a new enlightenment and have Xen switch all this
stuff (and IBRS, for that matter) on behalf of the guest kernel on
context switch.

~Andrew

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 07/16] x86/mm: Move two more functions from pgtable_64.h to pgtable.h
  2018-01-16 19:11       ` Joerg Roedel
@ 2018-01-16 19:34         ` Thomas Gleixner
  -1 siblings, 0 replies; 183+ messages in thread
From: Thomas Gleixner @ 2018-01-16 19:34 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Dave Hansen, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	linux-mm, Linus Torvalds, Andy Lutomirski, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

On Tue, 16 Jan 2018, Joerg Roedel wrote:

> On Tue, Jan 16, 2018 at 10:03:09AM -0800, Dave Hansen wrote:
> > On 01/16/2018 08:36 AM, Joerg Roedel wrote:
> > > +	return (((ptr & ~PAGE_MASK) / sizeof(pgd_t)) < KERNEL_PGD_BOUNDARY);
> > > +}
> > 
> > One of the reasons to implement it the other way:
> > 
> > -	return (ptr & ~PAGE_MASK) < (PAGE_SIZE / 2);
> > 
> > is that the compiler can do this all quickly.  KERNEL_PGD_BOUNDARY
> > depends on PAGE_OFFSET which depends on a variable.  IOW, the compiler
> > can't do it.
> > 
> > How much worse is the code that this generates?
> 
> I havn't looked at the actual code this generates, but the
> (PAGE_SIZE / 2) comparison doesn't work on 32 bit where the address
> space is not always evenly split. I'll look into a better way to check
> this.

It should be trivial enough to do

   return (ptr & ~PAGE_MASK) < PGD_SPLIT_SIZE;

and define it as PAGE_SIZE/2 for 64bit and, for PAE, make it depend on the
configured address space split.

Thanks,

	tglx
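
Spelled out, a sketch of that constant could be (PGD_SPLIT_SIZE is only proposed here; KERNEL_PGD_BOUNDARY and pgd_t are the kernel's existing definitions):

#ifdef CONFIG_X86_64
/* The PGD page is always split in the middle. */
# define PGD_SPLIT_SIZE	(PAGE_SIZE / 2)
#else
/* 32 bit: derive the boundary from the configured kernel/user split. */
# define PGD_SPLIT_SIZE	(KERNEL_PGD_BOUNDARY * sizeof(pgd_t))
#endif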

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 10/16] x86/mm/pti: Populate valid user pud entries
  2018-01-16 18:06     ` Dave Hansen
@ 2018-01-16 19:41       ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-16 19:41 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	linux-mm, Linus Torvalds, Andy Lutomirski, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

On Tue, Jan 16, 2018 at 10:06:48AM -0800, Dave Hansen wrote:
> On 01/16/2018 08:36 AM, Joerg Roedel wrote:
> > 
> > In PAE page-tables at the top-level most bits we usually set
> > with _KERNPG_TABLE are reserved, resulting in a #GP when
> > they are loaded by the processor.
> 
> Can you save me the trip to the SDM and remind me which bits actually
> cause trouble here?

Everything besides PRESENT, PCD, PWT and the actual physical address, so
RW and NX, for example, cause a #GP.


	Joerg
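
Expressed as a mask, a sketch of what that answer implies for a top-level PAE entry (the bit names are from pgtable_types.h; the mask name itself is made up here):

/* Bits a PAE top-level entry may carry; everything else is reserved
 * and causes a #GP when the entry is loaded by the processor. */
#define PAE_PGD_ALLOWED_MASK	(_PAGE_PRESENT | _PAGE_PWT | _PAGE_PCD | \
				 PTE_PFN_MASK)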

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 12/16] x86/mm/pae: Populate the user page-table with user pgd's
  2018-01-16 18:11     ` Dave Hansen
@ 2018-01-16 19:44       ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-16 19:44 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	linux-mm, Linus Torvalds, Andy Lutomirski, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

On Tue, Jan 16, 2018 at 10:11:14AM -0800, Dave Hansen wrote:
> 
> Ugh.  The ghosts of PAE have come back to haunt us.

:-) Yeah, PAE caused the most trouble for me while getting this running.

> 
> Could we do:
> 
> static inline bool pgd_supports_nx(unsigned long)
> {
> #ifdef CONFIG_X86_64
> 	return (__supported_pte_mask & _PAGE_NX);
> #else
> 	/* No 32-bit page tables support NX at PGD level */
> 	return 0;
> #endif
> }
> 
> Nobody will ever spot the #ifdef the way you laid it out.

Right, that's a better way to do it. I'll change it in the next version.

Thanks,

	Joerg

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-16 18:14   ` Dave Hansen
@ 2018-01-16 19:46     ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-16 19:46 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	linux-mm, Linus Torvalds, Andy Lutomirski, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

On Tue, Jan 16, 2018 at 10:14:19AM -0800, Dave Hansen wrote:
> Joerg,
> 
> Very cool!.

Thanks :)

> I really appreciate you putting this together.  I don't see any real
> showstoppers or things that I think will *break* 64-bit.  I just hope
> that we can merge this _slowly_ in case it breaks 64-bit along the way.

Sure, it needs a lot more testing and most likely fixing anyway. So
there is still some way to go before this is ready for merging.


	Joerg

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-16 18:59   ` Linus Torvalds
@ 2018-01-16 19:55     ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-16 19:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-mm,
	Andy Lutomirski, Dave Hansen, Josh Poimboeuf, Juergen Gross,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	Brian Gerst, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg KH, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Joerg Roedel

Hi Linus,

On Tue, Jan 16, 2018 at 10:59:01AM -0800, Linus Torvalds wrote:
> Yes, I'm very happy to see that this is actually not nearly as bad as
> I feared it might be,

Yeah, I was looking at the original PTI patches and my impression was
that a lot of the complicated stuff (like setting up the cpu_entry_area)
was already in there for 32 bit too. So it was mostly about the entry
code and some changes to the 32bit page-table code.

> Some of those #ifdef's in the PTI code you added might want more
> commentary about what the exact differences are. And maybe they could
> be done more cleanly with some abstraction. But nothing looked
> _horrible_.

I'll add more comments and better abstraction; Dave has already
suggested some improvements here. Reading some of my comments again,
they need a rework anyway.

> .. and please run all the segment and syscall selfchecks that Andy has written.

Didn't know about them yet, thanks. I will run them too in my testing.

> Xen PV and PTI don't work together even on x86-64 afaik, the Xen
> people apparently felt it wasn't worth it.  See the
> 
>         if (hypervisor_is_type(X86_HYPER_XEN_PV)) {
>                 pti_print_if_insecure("disabled on XEN PV.");
>                 return;
>         }
> 
> in pti_check_boottime_disable().

But I might have broken something for them anyway; honestly, I didn't pay
much attention to the XEN_PV case, as I was trying to get it running
here. My hope is that someone who knows Xen better than I do will help
out :)


Regards,

	Joerg

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack
  2018-01-16 16:36   ` Joerg Roedel
@ 2018-01-16 20:30     ` Thomas Gleixner
  -1 siblings, 0 replies; 183+ messages in thread
From: Thomas Gleixner @ 2018-01-16 20:30 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Ingo Molnar, H . Peter Anvin, x86, linux-kernel, linux-mm,
	Linus Torvalds, Andy Lutomirski, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

On Tue, 16 Jan 2018, Joerg Roedel wrote:
> @@ -89,13 +89,9 @@ static inline void refresh_sysenter_cs(struct thread_struct *thread)
>  /* This is used when switching tasks or entering/exiting vm86 mode. */
>  static inline void update_sp0(struct task_struct *task)
>  {
> -	/* On x86_64, sp0 always points to the entry trampoline stack, which is constant: */
> -#ifdef CONFIG_X86_32
> -	load_sp0(task->thread.sp0);
> -#else
> +	/* sp0 always points to the entry trampoline stack, which is constant: */
>  	if (static_cpu_has(X86_FEATURE_XENPV))
>  		load_sp0(task_top_of_stack(task));
> -#endif
>  }
>  
>  #endif /* _ASM_X86_SWITCH_TO_H */
> diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
> index 654229bac2fc..7270dd834f4b 100644
> --- a/arch/x86/kernel/asm-offsets_32.c
> +++ b/arch/x86/kernel/asm-offsets_32.c
> @@ -47,9 +47,11 @@ void foo(void)
>  	BLANK();
>  
>  	/* Offset from the sysenter stack to tss.sp0 */
> -	DEFINE(TSS_sysenter_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
> +	DEFINE(TSS_sysenter_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp1) -
>  	       offsetofend(struct cpu_entry_area, entry_stack_page.stack));
>  
> +	OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);

Can you please split out the change of TSS_sysenter_stack into a separate
patch?

Other than that, this looks good.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 09/16] x86/mm/pti: Clone CPU_ENTRY_AREA on PMD level on x86_32
  2018-01-16 16:36   ` Joerg Roedel
@ 2018-01-16 21:03     ` Thomas Gleixner
  -1 siblings, 0 replies; 183+ messages in thread
From: Thomas Gleixner @ 2018-01-16 21:03 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Ingo Molnar, H . Peter Anvin, x86, linux-kernel, linux-mm,
	Linus Torvalds, Andy Lutomirski, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

On Tue, 16 Jan 2018, Joerg Roedel wrote:
> +#ifdef CONFIG_X86_64
>  /*
>   * Clone a single p4d (i.e. a top-level entry on 4-level systems and a
>   * next-level entry on 5-level systems.
> @@ -322,13 +323,29 @@ static void __init pti_clone_p4d(unsigned long addr)
>  	kernel_p4d = p4d_offset(kernel_pgd, addr);
>  	*user_p4d = *kernel_p4d;
>  }
> +#endif
>  
>  /*
>   * Clone the CPU_ENTRY_AREA into the user space visible page table.
>   */
>  static void __init pti_clone_user_shared(void)
>  {
> +#ifdef CONFIG_X86_32
> +	/*
> +	 * On 32 bit PAE systems with 1GB of Kernel address space there is only
> +	 * one pgd/p4d for the whole kernel. Cloning that would map the whole
> +	 * address space into the user page-tables, making PTI useless. So clone
> +	 * the page-table on the PMD level to prevent that.
> +	 */
> +	unsigned long start, end;
> +
> +	start = CPU_ENTRY_AREA_BASE;
> +	end   = start + (PAGE_SIZE * CPU_ENTRY_AREA_PAGES);
> +
> +	pti_clone_pmds(start, end, _PAGE_GLOBAL);
> +#else
>  	pti_clone_p4d(CPU_ENTRY_AREA_BASE);
> +#endif
>  }

Just a minor nit. You already wrap pti_clone_p4d() into X86_64. So it would
be cleaner to do:

  	kernel_p4d = p4d_offset(kernel_pgd, addr);
  	*user_p4d = *kernel_p4d;
}

static void __init pti_clone_user_shared(void)
{
  	pti_clone_p4d(CPU_ENTRY_AREA_BASE);
}

#else /* CONFIG_X86_64 */

/*
 * Big fat comment.
 */
static void __init pti_clone_user_shared(void)
{
	....
}
#endif /* !CONFIG_X86_64 */

Thanks,

	tglx
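
Filled in with the 32-bit body from the patch, the suggested layout could look roughly like this (a restructuring sketch only, no functional change intended; all names are the ones used in the thread):

#ifdef CONFIG_X86_64
static void __init pti_clone_user_shared(void)
{
	pti_clone_p4d(CPU_ENTRY_AREA_BASE);
}
#else /* CONFIG_X86_64 */
/*
 * On 32 bit PAE systems with 1GB of kernel address space there is only
 * one pgd/p4d for the whole kernel, so clone the CPU_ENTRY_AREA on the
 * PMD level instead.
 */
static void __init pti_clone_user_shared(void)
{
	unsigned long start = CPU_ENTRY_AREA_BASE;
	unsigned long end   = start + (PAGE_SIZE * CPU_ENTRY_AREA_PAGES);

	pti_clone_pmds(start, end, _PAGE_GLOBAL);
}
#endif /* !CONFIG_X86_64 */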

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 10/16] x86/mm/pti: Populate valid user pud entries
  2018-01-16 16:36   ` Joerg Roedel
@ 2018-01-16 21:06     ` Thomas Gleixner
  -1 siblings, 0 replies; 183+ messages in thread
From: Thomas Gleixner @ 2018-01-16 21:06 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Ingo Molnar, H . Peter Anvin, x86, linux-kernel, linux-mm,
	Linus Torvalds, Andy Lutomirski, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

On Tue, 16 Jan 2018, Joerg Roedel wrote:

> From: Joerg Roedel <jroedel@suse.de>
> 
> With PAE paging we don't have PGD and P4D levels in the
> page-table, instead the PUD level is the highest one.
> 
> In PAE page-tables at the top-level most bits we usually set
> with _KERNPG_TABLE are reserved, resulting in a #GP when
> they are loaded by the processor.
> 
> Work around this by populating PUD entries in the user
> page-table only with _PAGE_PRESENT set.
> 
> I am pretty sure there is a cleaner way to do this, but
> until I find it use this #ifdef solution.

Stick something like

#define _KERNELPG_TABLE_PUD_ENTRY

into the 32 and 64 bit variants of some relevant header file.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 12/16] x86/mm/pae: Populate the user page-table with user pgd's
  2018-01-16 16:36   ` Joerg Roedel
@ 2018-01-16 21:10     ` Thomas Gleixner
  -1 siblings, 0 replies; 183+ messages in thread
From: Thomas Gleixner @ 2018-01-16 21:10 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Ingo Molnar, H . Peter Anvin, x86, linux-kernel, linux-mm,
	Linus Torvalds, Andy Lutomirski, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

On Tue, 16 Jan 2018, Joerg Roedel wrote:
>  
> +#ifdef CONFIG_X86_64
>  	/*
>  	 * If this is normal user memory, make it NX in the kernel
>  	 * pagetables so that, if we somehow screw up and return to
> @@ -134,10 +135,16 @@ pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
>  	 *     may execute from it
>  	 *  - we don't have NX support
>  	 *  - we're clearing the PGD (i.e. the new pgd is not present).
> +	 *  - We run on a 32 bit kernel. 2-level paging doesn't support NX at
> +	 *    all and PAE paging does not support it on the PGD level. We can
> +	 *    set it in the PMD level there in the future, but that means we
> +	 *    need to unshare the PMDs between the kernel and the user
> +	 *    page-tables.
>  	 */
>  	if ((pgd.pgd & (_PAGE_USER|_PAGE_PRESENT)) == (_PAGE_USER|_PAGE_PRESENT) &&
>  	    (__supported_pte_mask & _PAGE_NX))
>  		pgd.pgd |= _PAGE_NX;

I'd suggest to have:

static inline pteval_t supported_pgd_mask(void)
{
	if (IS_ENABLED(CONFIG_X86_64))
		return __supported_pte_mask;
	return __supported_pte_mask & ~_PAGE_NX;
}

and get rid of the ifdeffery completely.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 12/16] x86/mm/pae: Populate the user page-table with user pgd's
  2018-01-16 21:10     ` Thomas Gleixner
@ 2018-01-16 21:15       ` Dave Hansen
  -1 siblings, 0 replies; 183+ messages in thread
From: Dave Hansen @ 2018-01-16 21:15 UTC (permalink / raw)
  To: Thomas Gleixner, Joerg Roedel
  Cc: Ingo Molnar, H . Peter Anvin, x86, linux-kernel, linux-mm,
	Linus Torvalds, Andy Lutomirski, Josh Poimboeuf, Juergen Gross,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	Brian Gerst, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg KH, Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, jroedel

On 01/16/2018 01:10 PM, Thomas Gleixner wrote:
> 
> static inline pteval_t supported_pgd_mask(void)
> {
> 	if (IS_ENABLED(CONFIG_X86_64))
> 		return __supported_pte_mask;
> 	return __supported_pte_mask & ~_PAGE_NX;
> }
> 
> and get rid of the ifdeffery completely.

Heh, that's an entertaining way to do it.  Joerg, if you go do it this
way, it would be nice to add all the other gunk that we don't allow to
be set in the PAE pgd.
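
A possible shape for that, as a sketch only (exactly which extra bits
to strip is an assumption here; on PAE the top-level entries treat
RW/USER/ACCESSED/DIRTY as reserved and NX does not exist at that
level):

static inline pteval_t supported_pgd_mask(void)
{
	if (IS_ENABLED(CONFIG_X86_64))
		return __supported_pte_mask;

	/*
	 * PAE pgd (PDPTE): mask out everything the hardware treats
	 * as reserved at this level.
	 */
	return __supported_pte_mask &
		~(_PAGE_NX | _PAGE_RW | _PAGE_USER |
		  _PAGE_ACCESSED | _PAGE_DIRTY);
}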

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-16 16:36 ` Joerg Roedel
@ 2018-01-16 21:20   ` Thomas Gleixner
  -1 siblings, 0 replies; 183+ messages in thread
From: Thomas Gleixner @ 2018-01-16 21:20 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Ingo Molnar, H . Peter Anvin, x86, linux-kernel, linux-mm,
	Linus Torvalds, Andy Lutomirski, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

On Tue, 16 Jan 2018, Joerg Roedel wrote:
> here is my current WIP code to enable PTI on x86-32. It is
> still in a pretty early state, but it successfully boots my
> KVM guest with PAE and with legacy paging. The existing PTI
> code for x86-64 already prepares a lot of the stuff needed
> for 32 bit too, thanks for that to all the people involved
> in its development :)

>  16 files changed, 333 insertions(+), 123 deletions(-)

Impressively small and well done!

Can you please make that patch set against

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-pti-for-linus

so we immediately have it backportable for 4.14 stable? It's only a trivial
conflict in pgtable.h, but we'd like to make the life of stable as simple
as possible. They have enough headache with the pre 4.14 trees.

We can pick some of the simple patches which make defines and inlines
available out of the pile right away and apply them to x86/pti to shrink
the amount of stuff you have to worry about.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-16 16:36 ` Joerg Roedel
@ 2018-01-16 22:26   ` Andy Lutomirski
  -1 siblings, 0 replies; 183+ messages in thread
From: Andy Lutomirski @ 2018-01-16 22:26 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML, LKML,
	linux-mm, Linus Torvalds, Andy Lutomirski, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, Liguori,
	Anthony, Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <joro@8bytes.org> wrote:
> From: Joerg Roedel <jroedel@suse.de>
>
> Hi,
>
> here is my current WIP code to enable PTI on x86-32. It is
> still in a pretty early state, but it successfully boots my
> KVM guest with PAE and with legacy paging. The existing PTI
> code for x86-64 already prepares a lot of the stuff needed
> for 32 bit too, thanks for that to all the people involved
> in its development :)
>
> The patches are split as follows:
>
>         - 1-3 contain the entry-code changes to enter and
>           exit the kernel via the sysenter trampoline stack.
>
>         - 4-7 are fixes to get the code compile on 32 bit
>           with CONFIG_PAGE_TABLE_ISOLATION=y.
>
>         - 8-14 adapt the existing PTI code to work properly
>           on 32 bit and add the needed parts to 32 bit
>           page-table code.
>
>         - 15 switches PTI on by adding the CR3 switches to
>           kernel entry/exit.
>
>         - 16 enables the Kconfig for all of X86
>
> The code has not run on bare-metal yet, I'll test that in
> the next days once I setup a 32 bit box again. I also havn't
> tested Wine and DosEMU yet, so this might also be broken.
>

If you pass all the x86 selftests, then Wine and DOSEMU are pretty
likely to work :)

--Andy

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack
  2018-01-16 20:30     ` Thomas Gleixner
@ 2018-01-16 22:37       ` Andy Lutomirski
  -1 siblings, 0 replies; 183+ messages in thread
From: Andy Lutomirski @ 2018-01-16 22:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Joerg Roedel, Ingo Molnar, H . Peter Anvin, X86 ML, LKML,
	linux-mm, Linus Torvalds, Andy Lutomirski, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, Liguori,
	Anthony, Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

On Tue, Jan 16, 2018 at 12:30 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Tue, 16 Jan 2018, Joerg Roedel wrote:
>> @@ -89,13 +89,9 @@ static inline void refresh_sysenter_cs(struct thread_struct *thread)
>>  /* This is used when switching tasks or entering/exiting vm86 mode. */
>>  static inline void update_sp0(struct task_struct *task)
>>  {
>> -     /* On x86_64, sp0 always points to the entry trampoline stack, which is constant: */
>> -#ifdef CONFIG_X86_32
>> -     load_sp0(task->thread.sp0);
>> -#else
>> +     /* sp0 always points to the entry trampoline stack, which is constant: */
>>       if (static_cpu_has(X86_FEATURE_XENPV))
>>               load_sp0(task_top_of_stack(task));
>> -#endif
>>  }
>>
>>  #endif /* _ASM_X86_SWITCH_TO_H */
>> diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
>> index 654229bac2fc..7270dd834f4b 100644
>> --- a/arch/x86/kernel/asm-offsets_32.c
>> +++ b/arch/x86/kernel/asm-offsets_32.c
>> @@ -47,9 +47,11 @@ void foo(void)
>>       BLANK();
>>
>>       /* Offset from the sysenter stack to tss.sp0 */
>> -     DEFINE(TSS_sysenter_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
>> +     DEFINE(TSS_sysenter_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp1) -
>>              offsetofend(struct cpu_entry_area, entry_stack_page.stack));

I was going to say that this is just too magical.  The convention is
that STRUCT_member refers to "member" of "STRUCT".  Here you're
encoding a more complicated calculation.  How about putting just the
needed offsets in asm_offsets and putting the actual calculation in
the asm code or a header?
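
A rough sketch of that split (the symbol names below are made up for
illustration):

asm-offsets_32.c carries only plain values:

	OFFSET(CEA_tss_sp1, cpu_entry_area, tss.x86_tss.sp1);
	DEFINE(CEA_stack_end,
	       offsetofend(struct cpu_entry_area, entry_stack_page.stack));

and the "offset from the end of the entry stack" is computed where it
is used, e.g. in entry_32.S or a small header:

	/* tss.sp1, relative to the end of the entry stack */
	#define TSS_sysenter_stack	(CEA_tss_sp1 - CEA_stack_end)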

>>
>> +     OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);

This belongs in asm_offsets.c.  Just move the asm_offsets_64.c version
there and call it a day.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack
  2018-01-16 16:36   ` Joerg Roedel
@ 2018-01-16 22:45     ` Andy Lutomirski
  -1 siblings, 0 replies; 183+ messages in thread
From: Andy Lutomirski @ 2018-01-16 22:45 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML, LKML,
	linux-mm, Linus Torvalds, Andy Lutomirski, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, Liguori,
	Anthony, Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <joro@8bytes.org> wrote:
> From: Joerg Roedel <jroedel@suse.de>
>
> Use the sysenter stack as a trampoline stack to enter the
> kernel. The sysenter stack is already in the cpu_entry_area
> and will be mapped to userspace when PTI is enabled.
>
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/entry/entry_32.S        | 89 +++++++++++++++++++++++++++++++++++-----
>  arch/x86/include/asm/switch_to.h |  6 +--
>  arch/x86/kernel/asm-offsets_32.c |  4 +-
>  arch/x86/kernel/cpu/common.c     |  5 ++-
>  arch/x86/kernel/process.c        |  2 -
>  arch/x86/kernel/process_32.c     |  6 +++
>  6 files changed, 91 insertions(+), 21 deletions(-)
>
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index eb8c5615777b..5a7bdb73be9f 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -222,6 +222,47 @@
>  .endm
>
>  /*
> + * Switch from the entry-trampline stack to the kernel stack of the
> + * running task.
> + *
> + * nr_regs is the number of dwords to push from the entry stack to the
> + * task stack. If it is > 0 it expects an irq frame at the bottom of the
> + * stack.
> + *
> + * check_user != 0 it will add a check to only switch stacks if the
> + * kernel entry was from user-space.
> + */
> +.macro SWITCH_TO_KERNEL_STACK nr_regs=0 check_user=0

How about marking nr_regs with :req to force everyone to be explicit?
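
I.e. something like this (only the macro signature changes):

	.macro SWITCH_TO_KERNEL_STACK nr_regs:req check_user=0

gas then refuses to expand the macro when a caller leaves nr_regs out.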

> +
> +       .if \check_user > 0 && \nr_regs > 0
> +       testb $3, (\nr_regs - 4)*4(%esp)                /* CS */
> +       jz .Lend_\@
> +       .endif
> +
> +       pushl %edi
> +       movl  %esp, %edi
> +
> +       /*
> +        * TSS_sysenter_stack is the offset from the bottom of the
> +        * entry-stack
> +        */
> +       movl  TSS_sysenter_stack + ((\nr_regs + 1) * 4)(%esp), %esp

This is incomprehensible.  You're adding what appears to be the offset
of sysenter_stack within the TSS to something based on esp and
dereferencing that to get the new esp.  That's not actually what
you're doing, but please change asm_offsets.c (as in my previous
email) to avoid putting serious arithmetic in it and then do the
arithmetic right here so that it's possible to follow what's going on.

> +
> +       /* Copy the registers over */
> +       .if \nr_regs > 0
> +       i = 0
> +       .rept \nr_regs
> +       pushl (\nr_regs - i) * 4(%edi)
> +       i = i + 1
> +       .endr
> +       .endif
> +
> +       mov (%edi), %edi
> +
> +.Lend_\@:
> +.endm
> +
> +/*
>   * %eax: prev task
>   * %edx: next task
>   */
> @@ -401,7 +442,9 @@ ENTRY(xen_sysenter_target)
>   * 0(%ebp) arg6
>   */
>  ENTRY(entry_SYSENTER_32)
> -       movl    TSS_sysenter_stack(%esp), %esp
> +       /* Kernel stack is empty */
> +       SWITCH_TO_KERNEL_STACK

This would be more readable if you put nr_regs in here.

> +
>  .Lsysenter_past_esp:
>         pushl   $__USER_DS              /* pt_regs->ss */
>         pushl   %ebp                    /* pt_regs->sp (stashed in bp) */
> @@ -521,6 +564,10 @@ ENDPROC(entry_SYSENTER_32)
>  ENTRY(entry_INT80_32)
>         ASM_CLAC
>         pushl   %eax                    /* pt_regs->orig_ax */
> +
> +       /* Stack layout: ss, esp, eflags, cs, eip, orig_eax */
> +       SWITCH_TO_KERNEL_STACK nr_regs=6 check_user=1
> +

Why check_user?

> @@ -655,6 +702,10 @@ END(irq_entries_start)
>  common_interrupt:
>         ASM_CLAC
>         addl    $-0x80, (%esp)                  /* Adjust vector into the [-256, -1] range */
> +
> +       /* Stack layout: ss, esp, eflags, cs, eip, vector */
> +       SWITCH_TO_KERNEL_STACK nr_regs=6 check_user=1

LGTM.

>  ENTRY(nmi)
>         ASM_CLAC
> +
> +       /* Stack layout: ss, esp, eflags, cs, eip */
> +       SWITCH_TO_KERNEL_STACK nr_regs=5 check_user=1

This is wrong, I think.  If you get an nmi in kernel mode but while
still on the sysenter stack, you blow up.  IIRC we have some crazy
code already to handle this (for nmi and #DB), and maybe that's
already adequate or can be made adequate, but at the very least this
needs a big comment explaining why it's okay.

> diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
> index eb5f7999a893..20e5f7ab8260 100644
> --- a/arch/x86/include/asm/switch_to.h
> +++ b/arch/x86/include/asm/switch_to.h
> @@ -89,13 +89,9 @@ static inline void refresh_sysenter_cs(struct thread_struct *thread)
>  /* This is used when switching tasks or entering/exiting vm86 mode. */
>  static inline void update_sp0(struct task_struct *task)
>  {
> -       /* On x86_64, sp0 always points to the entry trampoline stack, which is constant: */
> -#ifdef CONFIG_X86_32
> -       load_sp0(task->thread.sp0);
> -#else
> +       /* sp0 always points to the entry trampoline stack, which is constant: */
>         if (static_cpu_has(X86_FEATURE_XENPV))
>                 load_sp0(task_top_of_stack(task));
> -#endif
>  }
>
>  #endif /* _ASM_X86_SWITCH_TO_H */
> diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
> index 654229bac2fc..7270dd834f4b 100644
> --- a/arch/x86/kernel/asm-offsets_32.c
> +++ b/arch/x86/kernel/asm-offsets_32.c
> @@ -47,9 +47,11 @@ void foo(void)
>         BLANK();
>
>         /* Offset from the sysenter stack to tss.sp0 */
> -       DEFINE(TSS_sysenter_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
> +       DEFINE(TSS_sysenter_stack, offsetof(struct cpu_entry_area, tss.x86_tss.sp1) -
>                offsetofend(struct cpu_entry_area, entry_stack_page.stack));



>
> +       OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
> +
>  #ifdef CONFIG_CC_STACKPROTECTOR
>         BLANK();
>         OFFSET(stack_canary_offset, stack_canary, canary);
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index ef29ad001991..20a71c914e59 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -1649,11 +1649,12 @@ void cpu_init(void)
>         enter_lazy_tlb(&init_mm, curr);
>
>         /*
> -        * Initialize the TSS.  Don't bother initializing sp0, as the initial
> -        * task never enters user mode.
> +        * Initialize the TSS.  sp0 points to the entry trampoline stack
> +        * regardless of what task is running.
>          */
>         set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
>         load_TR_desc();
> +       load_sp0((unsigned long)(cpu_entry_stack(cpu) + 1));

It's high time we unified the 32-bit and 64-bit versions of the code.
This isn't necessarily needed for your series, though.

> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> index 5224c6099184..452eeac00b80 100644
> --- a/arch/x86/kernel/process_32.c
> +++ b/arch/x86/kernel/process_32.c
> @@ -292,6 +292,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>         this_cpu_write(cpu_current_top_of_stack,
>                        (unsigned long)task_stack_page(next_p) +
>                        THREAD_SIZE);
> +       /*
> +        * TODO: Find a way to let cpu_current_top_of_stack point to
> +        * cpu_tss_rw.x86_tss.sp1. Doing so now results in stack corruption with
> +        * iret exceptions.
> +        */
> +       this_cpu_write(cpu_tss_rw.x86_tss.sp1, next_p->thread.sp0);

Do you know what the issue is?

As a general comment, the interaction between this patch and vm86 is a
bit scary.  In vm86 mode, the kernel gets entered with extra stuff on
the stack, which may screw up all your offsets.

--Andy

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 04/16] x86/pti: Define X86_CR3_PTI_PCID_USER_BIT on x86_32
  2018-01-16 16:36   ` Joerg Roedel
@ 2018-01-16 22:46     ` Andy Lutomirski
  -1 siblings, 0 replies; 183+ messages in thread
From: Andy Lutomirski @ 2018-01-16 22:46 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML, LKML,
	linux-mm, Linus Torvalds, Andy Lutomirski, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, Liguori,
	Anthony, Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <joro@8bytes.org> wrote:
> From: Joerg Roedel <jroedel@suse.de>
>
> Move it out of the X86_64 specific processor defines so
> that its visible for 32bit too.

Hmm.  This is okay, I guess, but any code that actually uses this
definition is inherently wrong, since 32-bit implies !PCID.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 03/16] x86/entry/32: Leave the kernel via the trampoline stack
  2018-01-16 16:36   ` Joerg Roedel
@ 2018-01-16 22:48     ` Andy Lutomirski
  -1 siblings, 0 replies; 183+ messages in thread
From: Andy Lutomirski @ 2018-01-16 22:48 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML, LKML,
	linux-mm, Linus Torvalds, Andy Lutomirski, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, Liguori,
	Anthony, Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <joro@8bytes.org> wrote:
> From: Joerg Roedel <jroedel@suse.de>
>
> Switch back to the trampoline stack before returning to
> userspace.
>
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/entry/entry_32.S        | 58 ++++++++++++++++++++++++++++++++++++++++
>  arch/x86/kernel/asm-offsets_32.c |  1 +
>  2 files changed, 59 insertions(+)
>
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index 5a7bdb73be9f..14018eeb11c3 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -263,6 +263,61 @@
>  .endm
>
>  /*
> + * Switch back from the kernel stack to the entry stack.
> + *
> + * iret_frame > 0 adds code to copie over an iret frame from the old to
> + *                the new stack. It also adds a check which bails out if
> + *                we are not returning to user-space.
> + *
> + * This macro is allowed not modify eflags when iret_frame == 0.
> + */
> +.macro SWITCH_TO_ENTRY_STACK iret_frame=0
> +       .if \iret_frame > 0
> +       /* Are we returning to userspace? */
> +       testb   $3, 4(%esp) /* return CS */
> +       jz .Lend_\@
> +       .endif
> +
> +       /*
> +        * We run with user-%fs already loaded from pt_regs, so we don't
> +        * have access to per_cpu data anymore, and there is no swapgs
> +        * equivalent on x86_32.
> +        * We work around this by loading the kernel-%fs again and
> +        * reading the entry stack address from there. Then we restore
> +        * the user-%fs and return.
> +        */
> +       pushl %fs
> +       pushl %edi
> +
> +       /* Re-load kernel-%fs, after that we can use PER_CPU_VAR */
> +       movl $(__KERNEL_PERCPU), %edi
> +       movl %edi, %fs
> +
> +       /* Save old stack pointer to copy the return frame over if needed */
> +       movl %esp, %edi
> +       movl PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %esp
> +
> +       /* Now we are on the entry stack */
> +
> +       .if \iret_frame > 0
> +       /* Stack frame: ss, esp, eflags, cs, eip, fs, edi */
> +       pushl 6*4(%edi) /* ss */
> +       pushl 5*4(%edi) /* esp */
> +       pushl 4*4(%edi) /* eflags */
> +       pushl 3*4(%edi) /* cs */
> +       pushl 2*4(%edi) /* eip */
> +       .endif
> +
> +       pushl 4(%edi)   /* fs */
> +
> +       /* Restore user %edi and user %fs */
> +       movl (%edi), %edi
> +       popl %fs

Yikes!  We're not *supposed* to be able to observe an asynchronous
descriptor table change, but if the LDT changes out from under you,
this is going to blow up badly.  It would be really nice if you could
pull this off without percpu access or without needing to do this
dance where you load user FS, then kernel FS, then user FS.  If that's
not doable, then you should at least add exception handling -- look at
the other 'pop %fs' instructions in entry_32.S.
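
For reference, the fixup pattern used there looks roughly like this
(a sketch from memory, labels are illustrative):

1:	popl	%fs
.pushsection .fixup, "ax"
2:	movl	$0, (%esp)	/* replace the bad selector with 0 */
	jmp	1b		/* and retry the pop */
.popsection
	_ASM_EXTABLE(1b, 2b)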

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 06/16] x86/mm/ldt: Reserve high address-space range for the LDT
  2018-01-16 16:52     ` Peter Zijlstra
@ 2018-01-16 22:51       ` Andy Lutomirski
  -1 siblings, 0 replies; 183+ messages in thread
From: Andy Lutomirski @ 2018-01-16 22:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joerg Roedel, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	X86 ML, LKML, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, Liguori,
	Anthony, Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

On Tue, Jan 16, 2018 at 8:52 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jan 16, 2018 at 05:36:49PM +0100, Joerg Roedel wrote:
>> From: Joerg Roedel <jroedel@suse.de>
>>
>> Reserve 2MB/4MB of address space for mapping the LDT to
>> user-space.
>
> LDT is 64k, we need 2 per CPU, and NR_CPUS <= 64 on 32bit, that gives
> 64K*2*64=8M > 2M.

If this works like it does on 64-bit, it only needs 128k regardless of
the number of CPUs.  The LDT mapping is specific to the mm.

How are you dealing with PAE here?  That is, what's your pagetable
layout?  What parts of the address space are owned by what code?

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack
  2018-01-16 16:36   ` Joerg Roedel
@ 2018-01-17  2:47     ` Boris Ostrovsky
  -1 siblings, 0 replies; 183+ messages in thread
From: Boris Ostrovsky @ 2018-01-17  2:47 UTC (permalink / raw)
  To: Joerg Roedel, Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, aliguori,
	daniel.gruss, hughd, keescook, Andrea Arcangeli, Waiman Long,
	jroedel



On 01/16/2018 11:36 AM, Joerg Roedel wrote:

>   
>   /*
> + * Switch from the entry-trampline stack to the kernel stack of the
> + * running task.
> + *
> + * nr_regs is the number of dwords to push from the entry stack to the
> + * task stack. If it is > 0 it expects an irq frame at the bottom of the
> + * stack.
> + *
> + * check_user != 0 it will add a check to only switch stacks if the
> + * kernel entry was from user-space.
> + */
> +.macro SWITCH_TO_KERNEL_STACK nr_regs=0 check_user=0


This (and the next patch's SWITCH_TO_ENTRY_STACK) needs an X86_FEATURE_PTI check.
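
A sketch of such a check, modeled on the 64-bit CR3-switch macros
(untested, and it assumes the macro keeps its .Lend_\@ label):

.macro SWITCH_TO_KERNEL_STACK nr_regs=0 check_user=0
	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
	/* ... existing stack-switch code ... */
.Lend_\@:
.endm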

With those macros fixed I was able to boot 32-bit Xen PV guest.

-boris

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 06/16] x86/mm/ldt: Reserve high address-space range for the LDT
  2018-01-16 22:51       ` Andy Lutomirski
@ 2018-01-17  7:59         ` Peter Zijlstra
  -1 siblings, 0 replies; 183+ messages in thread
From: Peter Zijlstra @ 2018-01-17  7:59 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Joerg Roedel, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	X86 ML, LKML, linux-mm, Linus Torvalds, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, Liguori, Anthony,
	Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

On Tue, Jan 16, 2018 at 02:51:45PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 16, 2018 at 8:52 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, Jan 16, 2018 at 05:36:49PM +0100, Joerg Roedel wrote:
> >> From: Joerg Roedel <jroedel@suse.de>
> >>
> >> Reserve 2MB/4MB of address space for mapping the LDT to
> >> user-space.
> >
> > LDT is 64k, we need 2 per CPU, and NR_CPUS <= 64 on 32bit, that gives
> > 64K*2*64=8M > 2M.
> 
> If this works like it does on 64-bit, it only needs 128k regardless of
> the number of CPUs.  The LDT mapping is specific to the mm.

Ah, then I got my LDT things confused again... which is certainly
possible, we had a few too many variants back then.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack
  2018-01-17  2:47     ` Boris Ostrovsky
@ 2018-01-17  9:02       ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-17  9:02 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	linux-mm, Linus Torvalds, Andy Lutomirski, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

Hi Boris,

thanks for testing this :)

On Tue, Jan 16, 2018 at 09:47:06PM -0500, Boris Ostrovsky wrote:
> On 01/16/2018 11:36 AM, Joerg Roedel wrote:
> >+.macro SWITCH_TO_KERNEL_STACK nr_regs=0 check_user=0
> 
> 
> This (and next patch's SWITCH_TO_ENTRY_STACK) need X86_FEATURE_PTI check.
> 
> With those macros fixed I was able to boot 32-bit Xen PV guest.

Hmm, on bare metal the stack switch happens regardless of the
X86_FEATURE_PTI feature being set, because we always program tss.sp0
with the sysenter stack. How is the kernel entry stack set up on xen-pv?
I think something is missing there instead.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack
  2018-01-16 22:45     ` Andy Lutomirski
@ 2018-01-17  9:18       ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-17  9:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML, LKML,
	linux-mm, Linus Torvalds, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, Liguori, Anthony,
	Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

On Tue, Jan 16, 2018 at 02:45:27PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <joro@8bytes.org> wrote:
> > +.macro SWITCH_TO_KERNEL_STACK nr_regs=0 check_user=0
> 
> How about marking nr_regs with :req to force everyone to be explicit?

Yeah, that's more readable, I'll change it.

> > +       /*
> > +        * TSS_sysenter_stack is the offset from the bottom of the
> > +        * entry-stack
> > +        */
> > +       movl  TSS_sysenter_stack + ((\nr_regs + 1) * 4)(%esp), %esp
> 
> This is incomprehensible.  You're adding what appears to be the offset
> of sysenter_stack within the TSS to something based on esp and
> dereferencing that to get the new esp.  That't not actually what
> you're doing, but please change asm_offsets.c (as in my previous
> email) to avoid putting serious arithmetic in it and then do the
> arithmetic right here so that it's possible to follow what's going on.

Probably this needs better comments. So TSS_sysenter_stack is the offset
of tss.sp0 (tss.sp1 later) from the _bottom_ of the entry stack. But in
this macro the stack might not be empty, it has a configurable (by
\nr_regs) number of dwords on it. Before this instruction we also do a
push %edi, so we need (\nr_regs + 1).

This can't be put into asm-offsets.c, as the actual offset depends on how
much is on the stack.
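
Written out, the intent is roughly this (illustration only, same
instruction as in the patch):

	/*
	 * %esp + (\nr_regs + 1) * 4   = bottom of the entry stack
	 *                               (\nr_regs words plus the saved
	 *                                %edi lie below it)
	 * bottom + TSS_sysenter_stack = tss.sp0 (tss.sp1 later), which
	 *                               holds the stack to switch to
	 */
	movl	TSS_sysenter_stack + ((\nr_regs + 1) * 4)(%esp), %esp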

> >  ENTRY(entry_INT80_32)
> >         ASM_CLAC
> >         pushl   %eax                    /* pt_regs->orig_ax */
> > +
> > +       /* Stack layout: ss, esp, eflags, cs, eip, orig_eax */
> > +       SWITCH_TO_KERNEL_STACK nr_regs=6 check_user=1
> > +
> 
> Why check_user?

You are right, check_user shouldn't be needed as INT80 is never called
from kernel mode.

> >  ENTRY(nmi)
> >         ASM_CLAC
> > +
> > +       /* Stack layout: ss, esp, eflags, cs, eip */
> > +       SWITCH_TO_KERNEL_STACK nr_regs=5 check_user=1
> 
> This is wrong, I think.  If you get an nmi in kernel mode but while
> still on the sysenter stack, you blow up.  IIRC we have some crazy
> code already to handle this (for nmi and #DB), and maybe that's
> already adequate or can be made adequate, but at the very least this
> needs a big comment explaining why it's okay.

If we get an nmi while still on the sysenter stack, then we are not
entering the handler from user-space and the above code will do
nothing and behave as before.
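
For reference, the check_user test boils down to something like this
(sketch only, not the literal patch code; the CS offset shown assumes a
bare iret frame with nothing pushed on top of it yet, and a vm86 entry
would additionally need an X86_EFLAGS_VM check, as discussed further
down):

	testl	$SEGMENT_RPL_MASK, 4(%esp)	/* saved CS of the iret frame */
	jz	.Lend_\@			/* RPL 0: interrupted the kernel, keep the stack */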

But you are right, it might blow up. There is a problem with the cr3
switch: the nmi can happen in kernel mode before the cr3 is switched,
and then this handler will not do the cr3 switch itself and will crash
the kernel. But the stack switching should be fine, I think.

> > +       /*
> > +        * TODO: Find a way to let cpu_current_top_of_stack point to
> > +        * cpu_tss_rw.x86_tss.sp1. Doing so now results in stack corruption with
> > +        * iret exceptions.
> > +        */
> > +       this_cpu_write(cpu_tss_rw.x86_tss.sp1, next_p->thread.sp0);
> 
> Do you know what the issue is?

No, not yet, I will look into that again. But first I want to get
this series stable enough as it is.

> As a general comment, the interaction between this patch and vm86 is a
> bit scary.  In vm86 mode, the kernel gets entered with extra stuff on
> the stack, which may screw up all your offsets.

I just read up on vm86 mode control transfers and the stack layout.
Looks like I need to check for eflags.vm=1 and copy four more registers
from/to the entry stack. Thanks for pointing that out.

Thanks,

	Joerg

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 03/16] x86/entry/32: Leave the kernel via the trampoline stack
  2018-01-16 22:48     ` Andy Lutomirski
@ 2018-01-17  9:24       ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-17  9:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML, LKML,
	linux-mm, Linus Torvalds, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, Liguori, Anthony,
	Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

On Tue, Jan 16, 2018 at 02:48:43PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <joro@8bytes.org> wrote:
> > +       /* Restore user %edi and user %fs */
> > +       movl (%edi), %edi
> > +       popl %fs
> 
> Yikes!  We're not *supposed* to be able to observe an asynchronous
> descriptor table change, but if the LDT changes out from under you,
> this is going to blow up badly.  It would be really nice if you could
> pull this off without percpu access or without needing to do this
> dance where you load user FS, then kernel FS, then user FS.  If that's
> not doable, then you should at least add exception handling -- look at
> the other 'pop %fs' instructions in entry_32.S.

You are right! This also means I need to do the 'popl %fs' before the
cr3-switch. I'll fix it in the next version.

I have no real idea on how to switch back to the entry stack without
access to per_cpu variables. I also can't access the cpu_entry_area for
the cpu yet, because for that we need to be on the entry stack already.


	Joerg

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 04/16] x86/pti: Define X86_CR3_PTI_PCID_USER_BIT on x86_32
  2018-01-16 22:46     ` Andy Lutomirski
@ 2018-01-17  9:26       ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-17  9:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML, LKML,
	linux-mm, Linus Torvalds, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, Liguori, Anthony,
	Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

On Tue, Jan 16, 2018 at 02:46:16PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <joro@8bytes.org> wrote:
> > From: Joerg Roedel <jroedel@suse.de>
> >
> > Move it out of the X86_64 specific processor defines so
> > that its visible for 32bit too.
> 
> Hmm.  This is okay, I guess, but any code that actually uses this
> definition is inherently wrong, since 32-bit implies !PCID.

Yes, I first tried another approach which just #ifdef'ed out the
relevant parts in tlbflush.h that use this bit. But that seemed to be
the wrong path, as there is more PCID code that is compiled in for 32
bit. So defining the bit for 32 bit seemed to be the cleaner solution
for now.


	Joerg

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-16 22:26   ` Andy Lutomirski
@ 2018-01-17  9:33     ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-17  9:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML, LKML,
	linux-mm, Linus Torvalds, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, Liguori, Anthony,
	Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

Hi Andy,

thanks a lot for your review and input, especially on the entry-code
changes!

On Tue, Jan 16, 2018 at 02:26:22PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <joro@8bytes.org> wrote:
> > The code has not run on bare-metal yet, I'll test that in
> > the next days once I setup a 32 bit box again. I also havn't
> > tested Wine and DosEMU yet, so this might also be broken.
> >
> 
> If you pass all the x86 selftests, then Wine and DOSEMU are pretty
> likely to work :)

Okay, good to know. I will definitely run them and make them pass :)


	Joerg

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-16 21:20   ` Thomas Gleixner
@ 2018-01-17  9:55     ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-17  9:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, H . Peter Anvin, x86, linux-kernel, linux-mm,
	Linus Torvalds, Andy Lutomirski, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

Hi Thomas,

thanks for your review, I'll work your suggestions into the next post.

On Tue, Jan 16, 2018 at 10:20:40PM +0100, Thomas Gleixner wrote:
> On Tue, 16 Jan 2018, Joerg Roedel wrote:

> >  16 files changed, 333 insertions(+), 123 deletions(-)
> 
> Impressively small and well done !

Thanks :)

> Can you please make that patch set against
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-pti-for-linus
> 
> so we immediately have it backportable for 4.14 stable? It's only a trivial
> conflict in pgtable.h, but we'd like to make the life of stable as simple
> as possible. They have enough headache with the pre 4.14 trees.

Sure, will do.

> We can pick some of the simple patches which make defines and inlines
> available out of the pile right away and apply them to x86/pti to shrink
> the amount of stuff you have to worry about.

This should be patches 4, 5, 7, 11, and I think 13 is also simple
enough. Feel free to take them, but I can also carry them forward if
needed.

Thanks,

	Joerg

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 03/16] x86/entry/32: Leave the kernel via the trampoline stack
  2018-01-17  9:24       ` Joerg Roedel
@ 2018-01-17 13:57         ` Brian Gerst
  -1 siblings, 0 replies; 183+ messages in thread
From: Brian Gerst @ 2018-01-17 13:57 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	X86 ML, LKML, Linux-MM, Linus Torvalds, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, Liguori, Anthony,
	Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

On Wed, Jan 17, 2018 at 1:24 AM, Joerg Roedel <joro@8bytes.org> wrote:
> On Tue, Jan 16, 2018 at 02:48:43PM -0800, Andy Lutomirski wrote:
>> On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <joro@8bytes.org> wrote:
>> > +       /* Restore user %edi and user %fs */
>> > +       movl (%edi), %edi
>> > +       popl %fs
>>
>> Yikes!  We're not *supposed* to be able to observe an asynchronous
>> descriptor table change, but if the LDT changes out from under you,
>> this is going to blow up badly.  It would be really nice if you could
>> pull this off without percpu access or without needing to do this
>> dance where you load user FS, then kernel FS, then user FS.  If that's
>> not doable, then you should at least add exception handling -- look at
>> the other 'pop %fs' instructions in entry_32.S.
>
> You are right! This also means I need to do the 'popl %fs' before the
> cr3-switch. I'll fix it in the next version.
>
> I have no real idea on how to switch back to the entry stack without
> access to per_cpu variables. I also can't access the cpu_entry_area for
> the cpu yet, because for that we need to be on the entry stack already.

Switch to the trampoline stack before loading user segments.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 03/16] x86/entry/32: Leave the kernel via the trampoline stack
  2018-01-17 13:57         ` Brian Gerst
@ 2018-01-17 14:00           ` Brian Gerst
  -1 siblings, 0 replies; 183+ messages in thread
From: Brian Gerst @ 2018-01-17 14:00 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	X86 ML, LKML, Linux-MM, Linus Torvalds, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, Liguori, Anthony,
	Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

On Wed, Jan 17, 2018 at 5:57 AM, Brian Gerst <brgerst@gmail.com> wrote:
> On Wed, Jan 17, 2018 at 1:24 AM, Joerg Roedel <joro@8bytes.org> wrote:
>> On Tue, Jan 16, 2018 at 02:48:43PM -0800, Andy Lutomirski wrote:
>>> On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <joro@8bytes.org> wrote:
>>> > +       /* Restore user %edi and user %fs */
>>> > +       movl (%edi), %edi
>>> > +       popl %fs
>>>
>>> Yikes!  We're not *supposed* to be able to observe an asynchronous
>>> descriptor table change, but if the LDT changes out from under you,
>>> this is going to blow up badly.  It would be really nice if you could
>>> pull this off without percpu access or without needing to do this
>>> dance where you load user FS, then kernel FS, then user FS.  If that's
>>> not doable, then you should at least add exception handling -- look at
>>> the other 'pop %fs' instructions in entry_32.S.
>>
>> You are right! This also means I need to do the 'popl %fs' before the
>> cr3-switch. I'll fix it in the next version.
>>
>> I have no real idea on how to switch back to the entry stack without
>> access to per_cpu variables. I also can't access the cpu_entry_area for
>> the cpu yet, because for that we need to be on the entry stack already.
>
> Switch to the trampoline stack before loading user segments.

But then again, you could take a fault on the trampoline stack if you
get a bad segment.  Perhaps just pushing the new stack pointer onto
the process stack before user segment loads will be the right move.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack
  2018-01-17  9:02       ` Joerg Roedel
@ 2018-01-17 14:04         ` Andrew Cooper
  -1 siblings, 0 replies; 183+ messages in thread
From: Andrew Cooper @ 2018-01-17 14:04 UTC (permalink / raw)
  To: Joerg Roedel, Boris Ostrovsky
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	linux-mm, Linus Torvalds, Andy Lutomirski, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

On 17/01/18 09:02, Joerg Roedel wrote:
> Hi Boris,
>
> thanks for testing this :)
>
> On Tue, Jan 16, 2018 at 09:47:06PM -0500, Boris Ostrovsky wrote:
>> On 01/16/2018 11:36 AM, Joerg Roedel wrote:
>>> +.macro SWITCH_TO_KERNEL_STACK nr_regs=0 check_user=0
>>
>> This (and next patch's SWITCH_TO_ENTRY_STACK) need X86_FEATURE_PTI check.
>>
>> With those macros fixed I was able to boot 32-bit Xen PV guest.
> Hmm, on bare metal the stack switch happens regardless of the
> X86_FEATURE_PTI feature being set, because we always program tss.sp0
> with the systenter stack. How is the kernel entry stack setup on xen-pv?
> I think something is missing there instead.

There is one single stack registered with Xen, on which you get a normal
exception frame in all cases, even via the registered (virtual)
syscall/sysenter/failsafe handlers.

~Andrew

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 03/16] x86/entry/32: Leave the kernel via the trampoline stack
  2018-01-17 13:57         ` Brian Gerst
@ 2018-01-17 14:10           ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-17 14:10 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	X86 ML, LKML, Linux-MM, Linus Torvalds, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, Liguori, Anthony,
	Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

On Wed, Jan 17, 2018 at 05:57:53AM -0800, Brian Gerst wrote:
> On Wed, Jan 17, 2018 at 1:24 AM, Joerg Roedel <joro@8bytes.org> wrote:

> > I have no real idea on how to switch back to the entry stack without
> > access to per_cpu variables. I also can't access the cpu_entry_area for
> > the cpu yet, because for that we need to be on the entry stack already.
> 
> Switch to the trampoline stack before loading user segments.

That requires copying most of pt_regs from the task stack to the
trampoline stack; not sure if that is faster than temporarily restoring
the kernel %fs.
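
Roughly something like this on the exit path (sketch only; register
choice and the dword count are just illustrative):

	/* %esi = pt_regs on the task stack, %edi = slot on the entry stack */
	movl	$17, %ecx	/* dwords in a 32-bit pt_regs */
	cld
	rep	movsl
	/* ...then point %esp at the copied frame before the pops */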


	Joerg

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 03/16] x86/entry/32: Leave the kernel via the trampoline stack
  2018-01-17 14:00           ` Brian Gerst
@ 2018-01-17 14:14             ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-17 14:14 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	X86 ML, LKML, Linux-MM, Linus Torvalds, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, Liguori, Anthony,
	Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

On Wed, Jan 17, 2018 at 06:00:07AM -0800, Brian Gerst wrote:
> On Wed, Jan 17, 2018 at 5:57 AM, Brian Gerst <brgerst@gmail.com> wrote:
> But then again, you could take a fault on the trampoline stack if you
> get a bad segment.  Perhaps just pushing the new stack pointer onto
> the process stack before user segment loads will be the right move.

User segment loads pop from the stack, so having anything on top also
doesn't work.

Maybe I can leave some space at the bottom of the task-stack at entry
time and store the pointer there on exit, if that doesn't confuse the
stack unwinder too much.


	Joerg

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 03/16] x86/entry/32: Leave the kernel via the trampoline stack
  2018-01-17 14:14             ` Joerg Roedel
@ 2018-01-17 14:45               ` Josh Poimboeuf
  -1 siblings, 0 replies; 183+ messages in thread
From: Josh Poimboeuf @ 2018-01-17 14:45 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Brian Gerst, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	H . Peter Anvin, X86 ML, LKML, Linux-MM, Linus Torvalds,
	Dave Hansen, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, Liguori, Anthony,
	Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

On Wed, Jan 17, 2018 at 03:14:18PM +0100, Joerg Roedel wrote:
> On Wed, Jan 17, 2018 at 06:00:07AM -0800, Brian Gerst wrote:
> > On Wed, Jan 17, 2018 at 5:57 AM, Brian Gerst <brgerst@gmail.com> wrote:
> > But then again, you could take a fault on the trampoline stack if you
> > get a bad segment.  Perhaps just pushing the new stack pointer onto
> > the process stack before user segment loads will be the right move.
> 
> User segment loads pop from the stack, so having anything on-top also
> doesn't work.
> 
> Maybe I can leave some space at the bottom of the task-stack at entry
> time and store the pointer there on exit, if that doesn't confuse the
> stack unwinder too much.

If you put it at the end of the stack page, I _think_ all you'd have to
do is just adjust TOP_OF_KERNEL_STACK_PADDING.

-- 
Josh

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack
  2018-01-17 14:04         ` Andrew Cooper
@ 2018-01-17 15:22           ` Boris Ostrovsky
  -1 siblings, 0 replies; 183+ messages in thread
From: Boris Ostrovsky @ 2018-01-17 15:22 UTC (permalink / raw)
  To: Andrew Cooper, Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	linux-mm, Linus Torvalds, Andy Lutomirski, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

On 01/17/2018 09:04 AM, Andrew Cooper wrote:
> On 17/01/18 09:02, Joerg Roedel wrote:
>> Hi Boris,
>>
>> thanks for testing this :)
>>
>> On Tue, Jan 16, 2018 at 09:47:06PM -0500, Boris Ostrovsky wrote:
>>> On 01/16/2018 11:36 AM, Joerg Roedel wrote:
>>>> +.macro SWITCH_TO_KERNEL_STACK nr_regs=0 check_user=0
>>> This (and next patch's SWITCH_TO_ENTRY_STACK) need X86_FEATURE_PTI check.
>>>
>>> With those macros fixed I was able to boot 32-bit Xen PV guest.
>> Hmm, on bare metal the stack switch happens regardless of the
>> X86_FEATURE_PTI feature being set, because we always program tss.sp0
>> with the systenter stack. How is the kernel entry stack setup on xen-pv?
>> I think something is missing there instead.
> There is one single stack registered with Xen, on which you get a normal
> exception frame in all cases, even via the registered (virtual)
> syscall/sysenter/failsafe handlers.

And so the check should be at least against X86_FEATURE_XENPV, not
necessarily X86_FEATURE_PTI.

But I guess you can still check against X86_FEATURE_PTI since without it
there is not much reason to switch stacks?
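
For illustration, the guard could look roughly like this (untested
sketch, reusing the ALTERNATIVE pattern of the 64-bit CR3-switch
macros; an X86_FEATURE_XENPV variant would work the same way):

.macro SWITCH_TO_KERNEL_STACK nr_regs=0 check_user=0
	/* becomes a no-op when PTI is not enabled */
	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI

	/* ... existing stack-switch code ... */

.Lend_\@:
.endm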

-boris

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack
  2018-01-17  9:18       ` Joerg Roedel
@ 2018-01-17 18:10         ` Andy Lutomirski
  -1 siblings, 0 replies; 183+ messages in thread
From: Andy Lutomirski @ 2018-01-17 18:10 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	X86 ML, LKML, linux-mm, Linus Torvalds, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, Liguori,
	Anthony, Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

On Wed, Jan 17, 2018 at 1:18 AM, Joerg Roedel <joro@8bytes.org> wrote:
> On Tue, Jan 16, 2018 at 02:45:27PM -0800, Andy Lutomirski wrote:
>> On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <joro@8bytes.org> wrote:
>> > +.macro SWITCH_TO_KERNEL_STACK nr_regs=0 check_user=0
>>
>> How about marking nr_regs with :req to force everyone to be explicit?
>
> Yeah, that's more readable, I'll change it.
>
>> > +       /*
>> > +        * TSS_sysenter_stack is the offset from the bottom of the
>> > +        * entry-stack
>> > +        */
>> > +       movl  TSS_sysenter_stack + ((\nr_regs + 1) * 4)(%esp), %esp
>>
>> This is incomprehensible.  You're adding what appears to be the offset
>> of sysenter_stack within the TSS to something based on esp and
>> dereferencing that to get the new esp.  That't not actually what
>> you're doing, but please change asm_offsets.c (as in my previous
>> email) to avoid putting serious arithmetic in it and then do the
>> arithmetic right here so that it's possible to follow what's going on.
>
> Probably this needs better comments. So TSS_sysenter_stack is the offset
> from to tss.sp0 (tss.sp1 later) from the _bottom_ of the stack. But in
> this macro the stack might not be empty, it has a configurable (by
> \nr_regs) number of dwords on it. Before this instruction we also do a
> push %edi, so we need (\nr_regs + 1).
>
> This can't be put into asm_offset.c, as the actual offset depends on how
> much is on the stack.
>
>> >  ENTRY(entry_INT80_32)
>> >         ASM_CLAC
>> >         pushl   %eax                    /* pt_regs->orig_ax */
>> > +
>> > +       /* Stack layout: ss, esp, eflags, cs, eip, orig_eax */
>> > +       SWITCH_TO_KERNEL_STACK nr_regs=6 check_user=1
>> > +
>>
>> Why check_user?
>
> You are right, check_user shouldn't ne needed as INT80 is never called
> from kernel mode.
>
>> >  ENTRY(nmi)
>> >         ASM_CLAC
>> > +
>> > +       /* Stack layout: ss, esp, eflags, cs, eip */
>> > +       SWITCH_TO_KERNEL_STACK nr_regs=5 check_user=1
>>
>> This is wrong, I think.  If you get an nmi in kernel mode but while
>> still on the sysenter stack, you blow up.  IIRC we have some crazy
>> code already to handle this (for nmi and #DB), and maybe that's
>> already adequate or can be made adequate, but at the very least this
>> needs a big comment explaining why it's okay.
>
> If we get an nmi while still on the sysenter stack, then we are not
> entering the handler from user-space and the above code will do
> nothing and behave as before.
>
> But you are right, it might blow up. There is a problem with the cr3
> switch, because the nmi can happen in kernel mode before the cr3 is
> switched, then this handler will not do the cr3 switch itself and crash
> the kernel. But the stack switching should be fine, I think.
>
>> > +       /*
>> > +        * TODO: Find a way to let cpu_current_top_of_stack point to
>> > +        * cpu_tss_rw.x86_tss.sp1. Doing so now results in stack corruption with
>> > +        * iret exceptions.
>> > +        */
>> > +       this_cpu_write(cpu_tss_rw.x86_tss.sp1, next_p->thread.sp0);
>>
>> Do you know what the issue is?
>
> No, not yet, I will look into that again. But first I want to get
> this series stable enough as it is.
>
>> As a general comment, the interaction between this patch and vm86 is a
>> bit scary.  In vm86 mode, the kernel gets entered with extra stuff on
>> the stack, which may screw up all your offsets.
>
> Just read up on vm86 mode control transfers and the stack layout then.
> Looks like I need to check for eflags.vm=1 and copy four more registers
> from/to the entry stack. Thanks for pointing that out.

You could just copy those slots unconditionally.  After all, you're
slowing down entries by an epic amount due to writing CR3 with PCID
off, so four words copied should be entirely lost in the noise.  OTOH,
checking for VM86 mode is just a single bt against EFLAGS.

With the modern (rewritten a year or two ago by Brian Gerst) vm86
code, all the slots (those actually in pt_regs) are in the same
location regardless of whether we're in VM86 mode or not, but we're
still fiddling with the bottom of the stack.  Since you're controlling
the switch to the kernel thread stack, you can easily just write the
frame to the correct location, so you should not need to context
switch sp1 -- you can do it sanely and leave sp1 as the actual bottom
of the kernel stack no matter what.  In fact, you could probably avoid
context switching sp0 as well, which would be a nice cleanup.

So I recommend the following.  Keep sp0 as the bottom of the sysenter
stack no matter what.  Then do:

bt $X86_EFLAGS_VM_BIT
jc .Lfrom_vm_\@

push 5 regs to real stack, starting at four-word offset (so they're in
the right place)
update %esp
...
.Lupdate_esp_\@

.Lfrom_vm_\@:
push 9 regs to real stack, starting at the bottom
jmp .Lupdate_esp_\@

Does that seem reasonable?  It's arguably much nicer than what we have now.
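
Or, spelled out a little more (rough and untested, copy loops elided;
the EFLAGS offset assumes a bare 5-word iret frame):

	btl	$X86_EFLAGS_VM_BIT, 8(%esp)	/* EFLAGS of the iret frame */
	jc	.Lfrom_vm_\@

	/* protected mode: 5-word frame (eip, cs, eflags, esp, ss),
	 * written four words higher so pt_regs ends at the usual place */
	/* ... copy 5 words to the task stack ... */
	jmp	.Lupdate_esp_\@

.Lfrom_vm_\@:
	/* vm86: the same five words plus es, ds, fs, gs above them */
	/* ... copy 9 words to the task stack ... */

.Lupdate_esp_\@:
	/* ... load the new %esp and continue as usual ... */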

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 03/16] x86/entry/32: Leave the kernel via the trampoline stack
  2018-01-17 14:10           ` Joerg Roedel
@ 2018-01-17 18:12             ` Andy Lutomirski
  -1 siblings, 0 replies; 183+ messages in thread
From: Andy Lutomirski @ 2018-01-17 18:12 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Brian Gerst, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	H . Peter Anvin, X86 ML, LKML, Linux-MM, Linus Torvalds,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, Liguori,
	Anthony, Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

On Wed, Jan 17, 2018 at 6:10 AM, Joerg Roedel <joro@8bytes.org> wrote:
> On Wed, Jan 17, 2018 at 05:57:53AM -0800, Brian Gerst wrote:
>> On Wed, Jan 17, 2018 at 1:24 AM, Joerg Roedel <joro@8bytes.org> wrote:
>
>> > I have no real idea on how to switch back to the entry stack without
>> > access to per_cpu variables. I also can't access the cpu_entry_area for
>> > the cpu yet, because for that we need to be on the entry stack already.
>>
>> Switch to the trampoline stack before loading user segments.
>
> That requires copying most of pt_regs from the task stack to the
> trampoline stack; not sure if that is faster than temporarily restoring
> kernel %fs.
>

I would optimize for simplicity, not speed.  You're already planning
to write to CR3, which is serializing, blows away the TLB, *and* takes
the absurdly large amount of time that the microcode needs to blow
away the TLB.

(For whatever reason, Intel doesn't seem to have hardware that can
quickly wipe the TLB.  I suspect that the actual implementation does
it in a loop and wipes little pieces at a time.  Whatever it actually
does, the CR3 write itself is very slow.)

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 14/16] x86/mm/legacy: Populate the user page-table with user pgd's
  2018-01-16 16:36   ` Joerg Roedel
@ 2018-01-17 23:41     ` Andy Lutomirski
  -1 siblings, 0 replies; 183+ messages in thread
From: Andy Lutomirski @ 2018-01-17 23:41 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML, LKML,
	Linux-MM, Linus Torvalds, Andy Lutomirski, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, Liguori,
	Anthony, Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <joro@8bytes.org> wrote:
> From: Joerg Roedel <jroedel@suse.de>
>
> Also populate the user-space pgd's in the user page-table.
>
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/include/asm/pgtable-2level.h | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/x86/include/asm/pgtable-2level.h b/arch/x86/include/asm/pgtable-2level.h
> index 685ffe8a0eaf..d96486d23c58 100644
> --- a/arch/x86/include/asm/pgtable-2level.h
> +++ b/arch/x86/include/asm/pgtable-2level.h
> @@ -19,6 +19,9 @@ static inline void native_set_pte(pte_t *ptep , pte_t pte)
>
>  static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
>  {
> +#ifdef CONFIG_PAGE_TABLE_ISOLATION
> +       pmd.pud.p4d.pgd = pti_set_user_pgd(&pmdp->pud.p4d.pgd, pmd.pud.p4d.pgd);
> +#endif
>         *pmdp = pmd;
>  }
>

Nothing against your patch, but this seems like a perfectly fine place
to rant: I *hate* the way we deal with page table folding.  Grr.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 08/16] x86/pgtable/32: Allocate 8k page-tables when PTI is enabled
  2018-01-16 16:36   ` Joerg Roedel
@ 2018-01-17 23:43     ` Andy Lutomirski
  -1 siblings, 0 replies; 183+ messages in thread
From: Andy Lutomirski @ 2018-01-17 23:43 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML, LKML,
	Linux-MM, Linus Torvalds, Andy Lutomirski, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, Liguori,
	Anthony, Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <joro@8bytes.org> wrote:
> From: Joerg Roedel <jroedel@suse.de>
>
> Allocate a kernel and a user page-table root when PTI is
> enabled. Also allocate a full page per root for PAE, because
> otherwise the bit to flip in cr3 to switch between them
> would be non-constant, which creates a lot of hassle.
> Keep that for a later optimization.
>
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/kernel/head_32.S | 23 ++++++++++++++++++-----
>  arch/x86/mm/pgtable.c     | 11 ++++++-----
>  2 files changed, 24 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
> index c29020907886..fc550559bf58 100644
> --- a/arch/x86/kernel/head_32.S
> +++ b/arch/x86/kernel/head_32.S
> @@ -512,28 +512,41 @@ ENTRY(initial_code)
>  ENTRY(setup_once_ref)
>         .long setup_once
>
> +#ifdef CONFIG_PAGE_TABLE_ISOLATION
> +#define        PGD_ALIGN       (2 * PAGE_SIZE)
> +#define PTI_USER_PGD_FILL      1024
> +#else
> +#define        PGD_ALIGN       (PAGE_SIZE)
> +#define PTI_USER_PGD_FILL      0
> +#endif
>  /*
>   * BSS section
>   */
>  __PAGE_ALIGNED_BSS
> -       .align PAGE_SIZE
> +       .align PGD_ALIGN
>  #ifdef CONFIG_X86_PAE
>  .globl initial_pg_pmd
>  initial_pg_pmd:
>         .fill 1024*KPMDS,4,0
> +       .fill PTI_USER_PGD_FILL,4,0

Couldn't this be simplified to just .align PGD_ALIGN, 0 without the .fill?
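
For the non-PAE initial_page_table the suggestion would look roughly like
the sketch below.  One caveat worth noting: .align only pads up to the next
PGD_ALIGN boundary, while the explicit .fill always reserves a full extra
page, so the two are only equivalent when the table being padded ends on an
odd 4k boundary (as the one-page initial_page_table does).

__PAGE_ALIGNED_BSS
	.align PGD_ALIGN
.globl initial_page_table
initial_page_table:
	.fill 1024,4,0
	/* pad to the next PGD_ALIGN boundary; with PTI enabled this
	   reserves the 4k user PGD right behind the kernel PGD */
	.align PGD_ALIGN, 0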

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack
  2018-01-17 18:10         ` Andy Lutomirski
@ 2018-01-19  9:55           ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-19  9:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML, LKML,
	linux-mm, Linus Torvalds, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, Liguori, Anthony,
	Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

Hey Andy,

On Wed, Jan 17, 2018 at 10:10:23AM -0800, Andy Lutomirski wrote:
> On Wed, Jan 17, 2018 at 1:18 AM, Joerg Roedel <joro@8bytes.org> wrote:

> > Just read up on vm86 mode control transfers and the stack layout then.
> > Looks like I need to check for eflags.vm=1 and copy four more registers
> > from/to the entry stack. Thanks for pointing that out.
> 
> You could just copy those slots unconditionally.  After all, you're
> slowing down entries by an epic amount due to writing CR3 on with PCID
> off, so four words copied should be entirely lost in the noise.  OTOH,
> checking for VM86 mode is just a single bt against EFLAGS.
> 
> With the modern (rewritten a year or two ago by Brian Gerst) vm86
> code, all the slots (those actually in pt_regs) are in the same
> location regardless of whether we're in VM86 mode or not, but we're
> still fiddling with the bottom of the stack.  Since you're controlling
> the switch to the kernel thread stack, you can easily just write the
> frame to the correct location, so you should not need to context
> switch sp1 -- you can do it sanely and leave sp1 as the actual bottom
> of the kernel stack no matter what.  In fact, you could probably avoid
> context switching sp0, either, which would be a nice cleanup.

I am not sure what you mean by "not context switching sp0/sp1" ...

> So I recommend the following.  Keep sp0 as the bottom of the sysenter
> stack no matter what.  Then do:
> 
> bt $X86_EFLAGS_VM_BIT
> jc .Lfrom_vm_\@
> 
> push 5 regs to real stack, starting at four-word offset (so they're in
> the right place)
> update %esp
> ...
> .Lupdate_esp_\@
> 
> .Lfrom_vm_\@:
> push 9 regs to real stack, starting at the bottom
> jmp .Lupdate_esp_\@
> 
> Does that seem reasonable?  It's arguably much nicer than what we have
> now.

But that looks like a good idea. Having a consistent stack with and
without vm86 is certainly a nice cleanup.


Regards,

	Joerg

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 03/16] x86/entry/32: Leave the kernel via the trampoline stack
  2018-01-17 18:12             ` Andy Lutomirski
@ 2018-01-19  9:57               ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-19  9:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Brian Gerst, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	X86 ML, LKML, Linux-MM, Linus Torvalds, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, Liguori, Anthony,
	Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

On Wed, Jan 17, 2018 at 10:12:32AM -0800, Andy Lutomirski wrote:
> I would optimize for simplicity, not speed.  You're already planning
> to write to CR3, which is serializing, blows away the TLB, *and* takes
> the absurdly large amount of time that the microcode needs to blow
> away the TLB.

Okay, so I am going to do the stack-switch before pt_regs is restored.
This is at least better than playing games with hiding the entry/exit
%esp somewhere in stack-memory.
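
Concretely, that could be a small macro along the lines of the sketch below.
PTREGS_SIZE and TSS_sp0 as asm-offsets names, the free use of
%esi/%edi/%ecx (their user values are restored afterwards from the copied
frame), and the omission of the extra vm86 segment slots are assumptions
for illustration, not the actual patch:

/*
 * Sketch only: while the kernel %fs is still loaded (so per-cpu data
 * is reachable), copy pt_regs from the task stack to the entry stack
 * and switch %esp; the normal register restore, the cr3 switch and
 * iret then run from the entry stack.
 */
.macro SWITCH_TO_ENTRY_STACK
	cld
	movl	%esp, %esi				/* pt_regs on the task stack */
	movl	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %edi	/* top of the entry stack */
	subl	$PTREGS_SIZE, %edi			/* destination frame */
	movl	$(PTREGS_SIZE/4), %ecx
	rep movsl

	/* %esp now points at the copied frame on the entry stack */
	movl	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %esp
	subl	$PTREGS_SIZE, %esp
.endm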


Thanks,

	Joerg

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 08/16] x86/pgtable/32: Allocate 8k page-tables when PTI is enabled
  2018-01-17 23:43     ` Andy Lutomirski
@ 2018-01-19  9:57       ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-19  9:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML, LKML,
	Linux-MM, Linus Torvalds, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, Liguori, Anthony,
	Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

On Wed, Jan 17, 2018 at 03:43:14PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <joro@8bytes.org> wrote:
> >  #ifdef CONFIG_X86_PAE
> >  .globl initial_pg_pmd
> >  initial_pg_pmd:
> >         .fill 1024*KPMDS,4,0
> > +       .fill PTI_USER_PGD_FILL,4,0
> 
> Couldn't this be simplified to just .align PGD_ALIGN, 0 without the .fill?

You are right, will change that.

Thanks,

	Joerg

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-16 16:36 ` Joerg Roedel
                   ` (20 preceding siblings ...)
  (?)
@ 2018-01-19 10:55 ` Pavel Machek
  2018-01-19 11:07     ` Joerg Roedel
  -1 siblings, 1 reply; 183+ messages in thread
From: Pavel Machek @ 2018-01-19 10:55 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	linux-mm, Linus Torvalds, Andy Lutomirski, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, aliguori,
	daniel.gruss, hughd, keescook, Andrea Arcangeli, Waiman Long,
	jroedel


Hi!

> From: Joerg Roedel <jroedel@suse.de>
> 
> Hi,
> 
> here is my current WIP code to enable PTI on x86-32. It is
> still in a pretty early state, but it successfully boots my
> KVM guest with PAE and with legacy paging. The existing PTI
> code for x86-64 already prepares a lot of the stuff needed
> for 32 bit too, thanks for that to all the people involved
> in its development :)

Thanks for doing the work.

I tried applying it on top of -next, and that did not succeed. Let me
try Linus' tree...

> The code has not run on bare-metal yet, I'll test that in
> the next days once I setup a 32 bit box again. I also havn't
> tested Wine and DosEMU yet, so this might also be broken.

Um. Ok, testing is something I can do. At least I have an excuse to power
on the T40p.

Ok... Testing is something I can do... If I can get it to compile.

  CC      arch/x86/mm/dump_pagetables.o
  arch/x86/mm/dump_pagetables.c: In function ‘ptdump_walk_user_pgd_level_checkwx’:
  arch/x86/mm/dump_pagetables.c:546:26: error: ‘init_top_pgt’ undeclared (first use in this function)
    pgd_t *pgd = (pgd_t *) &init_top_pgt;
                              ^
  arch/x86/mm/dump_pagetables.c:546:26: note: each undeclared identifier is reported only once for each function it appears in
  scripts/Makefile.build:316: recipe for target 'arch/x86/mm/dump_pagetables.o' failed
  make[2]: *** [arch/x86/mm/dump_pagetables.o] Error 1
  scripts/Makefile.build:575: recipe for target 'arch/x86/mm' failed
  make[1]: *** [arch/x86/mm] Error 2
  make[1]: *** Waiting for unfinished jobs....
    CC      arch/x86/platform/intel/iosf_mbi.o

Ok, I guess I can disable some config option...
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-19 10:55 ` Pavel Machek
@ 2018-01-19 11:07     ` Joerg Roedel
  0 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-19 11:07 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Joerg Roedel, Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86,
	linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long

Hey Pavel,

On Fri, Jan 19, 2018 at 11:55:28AM +0100, Pavel Machek wrote:
> Thanks for doing the work.
> 
> I tried applying it on top of -next, and that did not succeed. Let me
> try Linus tree...

Thanks for your help with testing this patch-set, but I recommend waiting
for the next version, as review already found a couple of bugs that might
crash your system. For example, there are NMI cases that might crash your
machine because the NMI happens in kernel mode before the cr3 switch. VM86
mode is also definitely broken.

I am about to fix that and will send a new version, if all goes well, at
some point next week.


Thanks,

	Joerg

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-19 11:07     ` Joerg Roedel
  (?)
@ 2018-01-19 12:58     ` Pavel Machek
  -1 siblings, 0 replies; 183+ messages in thread
From: Pavel Machek @ 2018-01-19 12:58 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Joerg Roedel, Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86,
	linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long


On Fri 2018-01-19 12:07:26, Joerg Roedel wrote:
> Hey Pavel,
> 
> On Fri, Jan 19, 2018 at 11:55:28AM +0100, Pavel Machek wrote:
> > Thanks for doing the work.
> > 
> > I tried applying it on top of -next, and that did not succeed. Let me
> > try Linus tree...
> 
> Thanks for your help with testing this patch-set, but I recommend to
> wait for the next version, as review already found a couple of bugs that
> might crash your system. For example there are NMI cases that might
> crash your machine because the NMI happens in kernel mode before the cr3
> switch. VM86 mode is also definitly broken.

Thanks for the heads-up. I guess I can disable NMI and avoid VM86.

CONFIG_X86_PTDUMP_CORE seems to be responsible for the build failure.
Disabling it is not at all easy, as CONFIG_EMBEDDED selects CONFIG_EXPERT,
which selects CONFIG_DEBUG_KERNEL, which selects CONFIG_X86_PTDUMP_CORE.
(Crazy, if you ask me.) You may want to test with that option enabled. The
patch below might fix it. (Signed-off-by: me)

Tests so far: the kernel boots in qemu. The whole system boots on a Thinkpad
T40p, and vulnerabilities/meltdown says "mitigation: PTI", so I guess it
works.

Tested-by: me. :-)

Best regards,
								Pavel


diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 2a4849e..896b53b 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -543,7 +543,11 @@ EXPORT_SYMBOL_GPL(ptdump_walk_pgd_level_debugfs);
 static void ptdump_walk_user_pgd_level_checkwx(void)
 {
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
+#ifdef CONFIG_X86_64
 	pgd_t *pgd = (pgd_t *) &init_top_pgt;
+#else
+	pgd_t *pgd = swapper_pg_dir;
+#endif
 
 	if (!static_cpu_has(X86_FEATURE_PTI))
 		return;

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


^ permalink raw reply related	[flat|nested] 183+ messages in thread

* Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack
  2018-01-19  9:55           ` Joerg Roedel
@ 2018-01-19 16:30             ` Andy Lutomirski
  -1 siblings, 0 replies; 183+ messages in thread
From: Andy Lutomirski @ 2018-01-19 16:30 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	X86 ML, LKML, Linux-MM, Linus Torvalds, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, Liguori,
	Anthony, Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

On Fri, Jan 19, 2018 at 1:55 AM, Joerg Roedel <joro@8bytes.org> wrote:
> Hey Andy,
>
> On Wed, Jan 17, 2018 at 10:10:23AM -0800, Andy Lutomirski wrote:
>> On Wed, Jan 17, 2018 at 1:18 AM, Joerg Roedel <joro@8bytes.org> wrote:
>
>> > Just read up on vm86 mode control transfers and the stack layout then.
>> > Looks like I need to check for eflags.vm=1 and copy four more registers
>> > from/to the entry stack. Thanks for pointing that out.
>>
>> You could just copy those slots unconditionally.  After all, you're
>> slowing down entries by an epic amount due to writing CR3 on with PCID
>> off, so four words copied should be entirely lost in the noise.  OTOH,
>> checking for VM86 mode is just a single bt against EFLAGS.
>>
>> With the modern (rewritten a year or two ago by Brian Gerst) vm86
>> code, all the slots (those actually in pt_regs) are in the same
>> location regardless of whether we're in VM86 mode or not, but we're
>> still fiddling with the bottom of the stack.  Since you're controlling
>> the switch to the kernel thread stack, you can easily just write the
>> frame to the correct location, so you should not need to context
>> switch sp1 -- you can do it sanely and leave sp1 as the actual bottom
>> of the kernel stack no matter what.  In fact, you could probably avoid
>> context switching sp0, either, which would be a nice cleanup.
>
> I am not sure what you mean by "not context switching sp0/sp1" ...

You're supposed to read what I meant, not what I said...

I meant that we could give sp0 a genuinely constant value per CPU.
That means that the entry trampoline ends up with EIP etc. in a
different place depending on whether VM86 was in use, but the entry
trampoline code should be able to handle that.  sp1 would have a value
that varies by task, but it could just point to the top of the stack
instead of being changed depending on whether VM86 is in use.  Instead,
the entry trampoline would offset the registers as needed to keep
pt_regs in the right place.

I think you already figured all of that out, though :)
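
In code, the context-switch side of that would shrink to roughly the sketch
below; it reuses the sp1 write from the patch, while update_task_stack() is
a made-up helper name and sp0 is assumed to be set up once per CPU during
boot:

/* Sketch, not the actual patch: sp0 stays constant per CPU (it always
 * points at the entry/trampoline stack in the cpu_entry_area), so the
 * context switch only has to tell the entry code where the next task's
 * kernel stack lives. */
static inline void update_task_stack(struct task_struct *next_p)
{
	/* sp1 = bottom of next_p's kernel stack, as in the patch */
	this_cpu_write(cpu_tss_rw.x86_tss.sp1, next_p->thread.sp0);

	/* no write to cpu_tss_rw.x86_tss.sp0 here: it never changes */
}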

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-16 16:36 ` Joerg Roedel
@ 2018-01-21 20:13   ` Nadav Amit
  -1 siblings, 0 replies; 183+ messages in thread
From: Nadav Amit @ 2018-01-21 20:13 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, LKML, open list:MEMORY MANAGEMENT,
	Linus Torvalds, Andy Lutomirski, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

I am looking at PTI on x86-32, but I did not manage to get the PoC to work on
this setup (KASLR disabled; a similar setup works on 64-bit).

Did you use any PoC to “test” the protection?

Thanks,
Nadav


Joerg Roedel <joro@8bytes.org> wrote:

> From: Joerg Roedel <jroedel@suse.de>
> 
> Hi,
> 
> here is my current WIP code to enable PTI on x86-32. It is
> still in a pretty early state, but it successfully boots my
> KVM guest with PAE and with legacy paging. The existing PTI
> code for x86-64 already prepares a lot of the stuff needed
> for 32 bit too, thanks for that to all the people involved
> in its development :)
> 
> The patches are split as follows:
> 
> 	- 1-3 contain the entry-code changes to enter and
> 	  exit the kernel via the sysenter trampoline stack.
> 
> 	- 4-7 are fixes to get the code compile on 32 bit
> 	  with CONFIG_PAGE_TABLE_ISOLATION=y.
> 
> 	- 8-14 adapt the existing PTI code to work properly
> 	  on 32 bit and add the needed parts to 32 bit
> 	  page-table code.
> 
> 	- 15 switches PTI on by adding the CR3 switches to
> 	  kernel entry/exit.
> 
> 	- 16 enables the Kconfig for all of X86
> 
> The code has not run on bare-metal yet, I'll test that in
> the next days once I setup a 32 bit box again. I also havn't
> tested Wine and DosEMU yet, so this might also be broken.
> 
> With that post I'd like to ask for all kinds of constructive
> feedback on the approaches I have taken and of course the
> many things I broke with it :)
> 
> One of the things that are surely broken is XEN_PV support.
> I'd appreciate any help with testing and bugfixing on that
> front.
> 
> So please review and let me know your thoughts.
> 
> Thanks,
> 
> 	Joerg
> 
> Joerg Roedel (16):
>  x86/entry/32: Rename TSS_sysenter_sp0 to TSS_sysenter_stack
>  x86/entry/32: Enter the kernel via trampoline stack
>  x86/entry/32: Leave the kernel via the trampoline stack
>  x86/pti: Define X86_CR3_PTI_PCID_USER_BIT on x86_32
>  x86/pgtable: Move pgdp kernel/user conversion functions to pgtable.h
>  x86/mm/ldt: Reserve high address-space range for the LDT
>  x86/mm: Move two more functions from pgtable_64.h to pgtable.h
>  x86/pgtable/32: Allocate 8k page-tables when PTI is enabled
>  x86/mm/pti: Clone CPU_ENTRY_AREA on PMD level on x86_32
>  x86/mm/pti: Populate valid user pud entries
>  x86/mm/pgtable: Move pti_set_user_pgd() to pgtable.h
>  x86/mm/pae: Populate the user page-table with user pgd's
>  x86/mm/pti: Add an overflow check to pti_clone_pmds()
>  x86/mm/legacy: Populate the user page-table with user pgd's
>  x86/entry/32: Switch between kernel and user cr3 on entry/exit
>  x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32
> 
> arch/x86/entry/entry_32.S               | 170 +++++++++++++++++++++++++++++---
> arch/x86/include/asm/pgtable-2level.h   |   3 +
> arch/x86/include/asm/pgtable-3level.h   |   3 +
> arch/x86/include/asm/pgtable.h          |  88 +++++++++++++++++
> arch/x86/include/asm/pgtable_32_types.h |   5 +-
> arch/x86/include/asm/pgtable_64.h       |  85 ----------------
> arch/x86/include/asm/processor-flags.h  |   8 +-
> arch/x86/include/asm/switch_to.h        |   6 +-
> arch/x86/kernel/asm-offsets_32.c        |   5 +-
> arch/x86/kernel/cpu/common.c            |   5 +-
> arch/x86/kernel/head_32.S               |  23 ++++-
> arch/x86/kernel/process.c               |   2 -
> arch/x86/kernel/process_32.c            |   6 ++
> arch/x86/mm/pgtable.c                   |  11 ++-
> arch/x86/mm/pti.c                       |  34 ++++++-
> security/Kconfig                        |   2 +-
> 16 files changed, 333 insertions(+), 123 deletions(-)
> 
> -- 
> 2.13.6
> 

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-21 20:13   ` Nadav Amit
@ 2018-01-21 20:44     ` Nadav Amit
  -1 siblings, 0 replies; 183+ messages in thread
From: Nadav Amit @ 2018-01-21 20:44 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, LKML, open list:MEMORY MANAGEMENT,
	Linus Torvalds, Andy Lutomirski, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

Please ignore my previous email. I got it working… Sorry for the spam.


Nadav Amit <nadav.amit@gmail.com> wrote:

> I am looking on PTI on x86-32, but I did not mange to get the PoC to work on
> this setup (kaslr disabled, similar setup works on 64-bit).
> 
> Did you use any PoC to “test” the protection?
> 
> Thanks,
> Nadav
> 
> 
> Joerg Roedel <joro@8bytes.org> wrote:
> 
>> From: Joerg Roedel <jroedel@suse.de>
>> 
>> Hi,
>> 
>> here is my current WIP code to enable PTI on x86-32. It is
>> still in a pretty early state, but it successfully boots my
>> KVM guest with PAE and with legacy paging. The existing PTI
>> code for x86-64 already prepares a lot of the stuff needed
>> for 32 bit too, thanks for that to all the people involved
>> in its development :)
>> 
>> The patches are split as follows:
>> 
>> 	- 1-3 contain the entry-code changes to enter and
>> 	  exit the kernel via the sysenter trampoline stack.
>> 
>> 	- 4-7 are fixes to get the code compile on 32 bit
>> 	  with CONFIG_PAGE_TABLE_ISOLATION=y.
>> 
>> 	- 8-14 adapt the existing PTI code to work properly
>> 	  on 32 bit and add the needed parts to 32 bit
>> 	  page-table code.
>> 
>> 	- 15 switches PTI on by adding the CR3 switches to
>> 	  kernel entry/exit.
>> 
>> 	- 16 enables the Kconfig for all of X86
>> 
>> The code has not run on bare-metal yet, I'll test that in
>> the next days once I setup a 32 bit box again. I also havn't
>> tested Wine and DosEMU yet, so this might also be broken.
>> 
>> With that post I'd like to ask for all kinds of constructive
>> feedback on the approaches I have taken and of course the
>> many things I broke with it :)
>> 
>> One of the things that are surely broken is XEN_PV support.
>> I'd appreciate any help with testing and bugfixing on that
>> front.
>> 
>> So please review and let me know your thoughts.
>> 
>> Thanks,
>> 
>> 	Joerg
>> 
>> Joerg Roedel (16):
>> x86/entry/32: Rename TSS_sysenter_sp0 to TSS_sysenter_stack
>> x86/entry/32: Enter the kernel via trampoline stack
>> x86/entry/32: Leave the kernel via the trampoline stack
>> x86/pti: Define X86_CR3_PTI_PCID_USER_BIT on x86_32
>> x86/pgtable: Move pgdp kernel/user conversion functions to pgtable.h
>> x86/mm/ldt: Reserve high address-space range for the LDT
>> x86/mm: Move two more functions from pgtable_64.h to pgtable.h
>> x86/pgtable/32: Allocate 8k page-tables when PTI is enabled
>> x86/mm/pti: Clone CPU_ENTRY_AREA on PMD level on x86_32
>> x86/mm/pti: Populate valid user pud entries
>> x86/mm/pgtable: Move pti_set_user_pgd() to pgtable.h
>> x86/mm/pae: Populate the user page-table with user pgd's
>> x86/mm/pti: Add an overflow check to pti_clone_pmds()
>> x86/mm/legacy: Populate the user page-table with user pgd's
>> x86/entry/32: Switch between kernel and user cr3 on entry/exit
>> x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32
>> 
>> arch/x86/entry/entry_32.S               | 170 +++++++++++++++++++++++++++++---
>> arch/x86/include/asm/pgtable-2level.h   |   3 +
>> arch/x86/include/asm/pgtable-3level.h   |   3 +
>> arch/x86/include/asm/pgtable.h          |  88 +++++++++++++++++
>> arch/x86/include/asm/pgtable_32_types.h |   5 +-
>> arch/x86/include/asm/pgtable_64.h       |  85 ----------------
>> arch/x86/include/asm/processor-flags.h  |   8 +-
>> arch/x86/include/asm/switch_to.h        |   6 +-
>> arch/x86/kernel/asm-offsets_32.c        |   5 +-
>> arch/x86/kernel/cpu/common.c            |   5 +-
>> arch/x86/kernel/head_32.S               |  23 ++++-
>> arch/x86/kernel/process.c               |   2 -
>> arch/x86/kernel/process_32.c            |   6 ++
>> arch/x86/mm/pgtable.c                   |  11 ++-
>> arch/x86/mm/pti.c                       |  34 ++++++-
>> security/Kconfig                        |   2 +-
>> 16 files changed, 333 insertions(+), 123 deletions(-)
>> 
>> -- 
>> 2.13.6
>> 

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
@ 2018-01-21 20:44     ` Nadav Amit
  0 siblings, 0 replies; 183+ messages in thread
From: Nadav Amit @ 2018-01-21 20:44 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, LKML, open list:MEMORY MANAGEMENT,
	Linus Torvalds, Andy Lutomirski, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

Please ignore my previous email. I got it working… Sorry for the spam.


Nadav Amit <nadav.amit@gmail.com> wrote:

> I am looking on PTI on x86-32, but I did not mange to get the PoC to work on
> this setup (kaslr disabled, similar setup works on 64-bit).
> 
> Did you use any PoC to “test” the protection?
> 
> Thanks,
> Nadav
> 
> 
> Joerg Roedel <joro@8bytes.org> wrote:
> 
>> From: Joerg Roedel <jroedel@suse.de>
>> 
>> Hi,
>> 
>> here is my current WIP code to enable PTI on x86-32. It is
>> still in a pretty early state, but it successfully boots my
>> KVM guest with PAE and with legacy paging. The existing PTI
>> code for x86-64 already prepares a lot of the stuff needed
>> for 32 bit too, thanks for that to all the people involved
>> in its development :)
>> 
>> The patches are split as follows:
>> 
>> 	- 1-3 contain the entry-code changes to enter and
>> 	  exit the kernel via the sysenter trampoline stack.
>> 
>> 	- 4-7 are fixes to get the code compile on 32 bit
>> 	  with CONFIG_PAGE_TABLE_ISOLATION=y.
>> 
>> 	- 8-14 adapt the existing PTI code to work properly
>> 	  on 32 bit and add the needed parts to 32 bit
>> 	  page-table code.
>> 
>> 	- 15 switches PTI on by adding the CR3 switches to
>> 	  kernel entry/exit.
>> 
>> 	- 16 enables the Kconfig for all of X86
>> 
>> The code has not run on bare-metal yet, I'll test that in
>> the next days once I setup a 32 bit box again. I also havn't
>> tested Wine and DosEMU yet, so this might also be broken.
>> 
>> With that post I'd like to ask for all kinds of constructive
>> feedback on the approaches I have taken and of course the
>> many things I broke with it :)
>> 
>> One of the things that are surely broken is XEN_PV support.
>> I'd appreciate any help with testing and bugfixing on that
>> front.
>> 
>> So please review and let me know your thoughts.
>> 
>> Thanks,
>> 
>> 	Joerg
>> 
>> Joerg Roedel (16):
>> x86/entry/32: Rename TSS_sysenter_sp0 to TSS_sysenter_stack
>> x86/entry/32: Enter the kernel via trampoline stack
>> x86/entry/32: Leave the kernel via the trampoline stack
>> x86/pti: Define X86_CR3_PTI_PCID_USER_BIT on x86_32
>> x86/pgtable: Move pgdp kernel/user conversion functions to pgtable.h
>> x86/mm/ldt: Reserve high address-space range for the LDT
>> x86/mm: Move two more functions from pgtable_64.h to pgtable.h
>> x86/pgtable/32: Allocate 8k page-tables when PTI is enabled
>> x86/mm/pti: Clone CPU_ENTRY_AREA on PMD level on x86_32
>> x86/mm/pti: Populate valid user pud entries
>> x86/mm/pgtable: Move pti_set_user_pgd() to pgtable.h
>> x86/mm/pae: Populate the user page-table with user pgd's
>> x86/mm/pti: Add an overflow check to pti_clone_pmds()
>> x86/mm/legacy: Populate the user page-table with user pgd's
>> x86/entry/32: Switch between kernel and user cr3 on entry/exit
>> x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32
>> 
>> arch/x86/entry/entry_32.S               | 170 +++++++++++++++++++++++++++++---
>> arch/x86/include/asm/pgtable-2level.h   |   3 +
>> arch/x86/include/asm/pgtable-3level.h   |   3 +
>> arch/x86/include/asm/pgtable.h          |  88 +++++++++++++++++
>> arch/x86/include/asm/pgtable_32_types.h |   5 +-
>> arch/x86/include/asm/pgtable_64.h       |  85 ----------------
>> arch/x86/include/asm/processor-flags.h  |   8 +-
>> arch/x86/include/asm/switch_to.h        |   6 +-
>> arch/x86/kernel/asm-offsets_32.c        |   5 +-
>> arch/x86/kernel/cpu/common.c            |   5 +-
>> arch/x86/kernel/head_32.S               |  23 ++++-
>> arch/x86/kernel/process.c               |   2 -
>> arch/x86/kernel/process_32.c            |   6 ++
>> arch/x86/mm/pgtable.c                   |  11 ++-
>> arch/x86/mm/pti.c                       |  34 ++++++-
>> security/Kconfig                        |   2 +-
>> 16 files changed, 333 insertions(+), 123 deletions(-)
>> 
>> -- 
>> 2.13.6
>> 

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-21 20:44     ` Nadav Amit
@ 2018-01-21 23:46       ` Nadav Amit
  -1 siblings, 0 replies; 183+ messages in thread
From: Nadav Amit @ 2018-01-21 23:46 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, LKML, open list:MEMORY MANAGEMENT,
	Linus Torvalds, Andy Lutomirski, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

I wanted to see whether segments protection can be a replacement for PTI
(yes, excluding SMEP emulation), or whether speculative execution “ignores”
limit checks, similarly to the way paging protection is skipped.

It does seem that segmentation provides sufficient protection from Meltdown.
The “reliability” test of Gratz PoC fails if the segment limit is set to
prevent access to the kernel memory. [ It passes if the limit is not set,
even if the DS is reloaded. ] My test is enclosed below.

So my question: wouldn’t it be much more efficient to use segmentation
protection for x86-32, and allow users to choose whether they want SMEP-like
protection if needed (and then enable PTI)?

[ There might be some corner cases in which setting a segment limit
introduces a problem, for example when modify_ldt() is used to set an invalid
limit, but I presume that these are relatively uncommon, can be detected at
runtime, and PTI can then be used as a fallback mechanism. ]

Thanks,
Nadav

-- >8 --
Subject: [PATCH] Test segmentation protection

---
 libkdump/libkdump.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/libkdump/libkdump.c b/libkdump/libkdump.c
index c590391..db5bac3 100644
--- a/libkdump/libkdump.c
+++ b/libkdump/libkdump.c
@@ -10,6 +10,9 @@
 #include <stdarg.h>
 #include <stdlib.h>
 #include <unistd.h>
+#include <assert.h>
+#include <sys/types.h>
+#include <asm/ldt.h>
 
 libkdump_config_t libkdump_auto_config = {0};
 
@@ -500,6 +503,31 @@ int __attribute__((optimize("-Os"), noinline)) libkdump_read_tsx() {
   return 0;
 }
 
+extern int modify_ldt(int, void*, unsigned long);
+
+void change_ds(void)
+{
+	int r;
+	struct user_desc desc = {
+		.entry_number = 1,
+		.base_addr = 0,
+#ifdef NO_SEGMENTS
+		.limit = 0xffffeu,
+#else
+		.limit = 0xbffffu,
+#endif
+		.seg_32bit = 1,
+		.contents = 0,
+		.read_exec_only = 0,
+		.limit_in_pages = 1,
+		.seg_not_present = 0,
+	};
+
+	r = modify_ldt(1 /* write */, &desc, sizeof(desc));
+	assert(r == 0);
+	asm volatile ("mov %0, %%ds\n\t" : : "r"((1 << 3) | (1 << 2) | 3));
+}
+
 // ---------------------------------------------------------------------------
 int __attribute__((optimize("-Os"), noinline)) libkdump_read_signal_handler() {
   size_t retries = config.retries + 1;
@@ -507,6 +535,9 @@ int __attribute__((optimize("-Os"), noinline)) libkdump_read_signal_handler() {
 
   while (retries--) {
     if (!setjmp(buf)) {
+      /* longjmp reloads the original DS... */
+      change_ds();
+
       MELTDOWN;
     }

Nadav Amit <nadav.amit@gmail.com> wrote:

> Please ignore my previous email. I got it working… Sorry for the spam.
> 
> 
> Nadav Amit <nadav.amit@gmail.com> wrote:
> 
>> I am looking on PTI on x86-32, but I did not mange to get the PoC to work on
>> this setup (kaslr disabled, similar setup works on 64-bit).
>> 
>> Did you use any PoC to “test” the protection?
>> 
>> Thanks,
>> Nadav
>> 
>> 
>> Joerg Roedel <joro@8bytes.org> wrote:
>> 
>>> From: Joerg Roedel <jroedel@suse.de>
>>> 
>>> Hi,
>>> 
>>> here is my current WIP code to enable PTI on x86-32. It is
>>> still in a pretty early state, but it successfully boots my
>>> KVM guest with PAE and with legacy paging. The existing PTI
>>> code for x86-64 already prepares a lot of the stuff needed
>>> for 32 bit too, thanks for that to all the people involved
>>> in its development :)
>>> 
>>> The patches are split as follows:
>>> 
>>> 	- 1-3 contain the entry-code changes to enter and
>>> 	  exit the kernel via the sysenter trampoline stack.
>>> 
>>> 	- 4-7 are fixes to get the code compile on 32 bit
>>> 	  with CONFIG_PAGE_TABLE_ISOLATION=y.
>>> 
>>> 	- 8-14 adapt the existing PTI code to work properly
>>> 	  on 32 bit and add the needed parts to 32 bit
>>> 	  page-table code.
>>> 
>>> 	- 15 switches PTI on by adding the CR3 switches to
>>> 	  kernel entry/exit.
>>> 
>>> 	- 16 enables the Kconfig for all of X86
>>> 
>>> The code has not run on bare-metal yet, I'll test that in
>>> the next days once I setup a 32 bit box again. I also havn't
>>> tested Wine and DosEMU yet, so this might also be broken.
>>> 
>>> With that post I'd like to ask for all kinds of constructive
>>> feedback on the approaches I have taken and of course the
>>> many things I broke with it :)
>>> 
>>> One of the things that are surely broken is XEN_PV support.
>>> I'd appreciate any help with testing and bugfixing on that
>>> front.
>>> 
>>> So please review and let me know your thoughts.
>>> 
>>> Thanks,
>>> 
>>> 	Joerg
>>> 
>>> Joerg Roedel (16):
>>> x86/entry/32: Rename TSS_sysenter_sp0 to TSS_sysenter_stack
>>> x86/entry/32: Enter the kernel via trampoline stack
>>> x86/entry/32: Leave the kernel via the trampoline stack
>>> x86/pti: Define X86_CR3_PTI_PCID_USER_BIT on x86_32
>>> x86/pgtable: Move pgdp kernel/user conversion functions to pgtable.h
>>> x86/mm/ldt: Reserve high address-space range for the LDT
>>> x86/mm: Move two more functions from pgtable_64.h to pgtable.h
>>> x86/pgtable/32: Allocate 8k page-tables when PTI is enabled
>>> x86/mm/pti: Clone CPU_ENTRY_AREA on PMD level on x86_32
>>> x86/mm/pti: Populate valid user pud entries
>>> x86/mm/pgtable: Move pti_set_user_pgd() to pgtable.h
>>> x86/mm/pae: Populate the user page-table with user pgd's
>>> x86/mm/pti: Add an overflow check to pti_clone_pmds()
>>> x86/mm/legacy: Populate the user page-table with user pgd's
>>> x86/entry/32: Switch between kernel and user cr3 on entry/exit
>>> x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32
>>> 
>>> arch/x86/entry/entry_32.S               | 170 +++++++++++++++++++++++++++++---
>>> arch/x86/include/asm/pgtable-2level.h   |   3 +
>>> arch/x86/include/asm/pgtable-3level.h   |   3 +
>>> arch/x86/include/asm/pgtable.h          |  88 +++++++++++++++++
>>> arch/x86/include/asm/pgtable_32_types.h |   5 +-
>>> arch/x86/include/asm/pgtable_64.h       |  85 ----------------
>>> arch/x86/include/asm/processor-flags.h  |   8 +-
>>> arch/x86/include/asm/switch_to.h        |   6 +-
>>> arch/x86/kernel/asm-offsets_32.c        |   5 +-
>>> arch/x86/kernel/cpu/common.c            |   5 +-
>>> arch/x86/kernel/head_32.S               |  23 ++++-
>>> arch/x86/kernel/process.c               |   2 -
>>> arch/x86/kernel/process_32.c            |   6 ++
>>> arch/x86/mm/pgtable.c                   |  11 ++-
>>> arch/x86/mm/pti.c                       |  34 ++++++-
>>> security/Kconfig                        |   2 +-
>>> 16 files changed, 333 insertions(+), 123 deletions(-)
>>> 
>>> -- 
>>> 2.13.6
>>> 

^ permalink raw reply related	[flat|nested] 183+ messages in thread
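
[ A minimal sketch, not from the thread: glibc does not provide a
  modify_ldt() wrapper, so the extern declaration in the patch above
  assumes one is supplied elsewhere.  The same descriptor setup can be
  done self-contained through syscall(2), reusing the limit and selector
  values from the patch: ]

    #include <asm/ldt.h>        /* struct user_desc */
    #include <assert.h>
    #include <string.h>
    #include <sys/syscall.h>    /* SYS_modify_ldt */
    #include <unistd.h>

    /* Install LDT entry 1 as a flat data segment whose limit stops at
     * 3 GiB (0xbffff 4 KiB pages), then load it into DS.  The selector
     * (1 << 3) | (1 << 2) | 3 is index 1, TI=1 (LDT), RPL=3, as in the
     * patch above. */
    static void load_limited_ds(void)
    {
            struct user_desc desc;
            long ret;

            memset(&desc, 0, sizeof(desc));
            desc.entry_number   = 1;
            desc.base_addr      = 0;
            desc.limit          = 0xbffffu;   /* covers 0 .. 3 GiB - 1 */
            desc.seg_32bit      = 1;
            desc.limit_in_pages = 1;

            ret = syscall(SYS_modify_ldt, 1 /* write */, &desc, sizeof(desc));
            assert(ret == 0);

            asm volatile("mov %0, %%ds" : : "r"((1 << 3) | (1 << 2) | 3));
    }

[ Whether this actually blocks the speculative load on a given CPU is
  exactly the microarchitectural question discussed in the rest of the
  thread. ]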

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-21 23:46       ` Nadav Amit
@ 2018-01-22  2:11         ` Linus Torvalds
  -1 siblings, 0 replies; 183+ messages in thread
From: Linus Torvalds @ 2018-01-22  2:11 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Joerg Roedel, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, LKML, open list:MEMORY MANAGEMENT,
	Andy Lutomirski, Dave Hansen, Josh Poimboeuf, Juergen Gross,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	Brian Gerst, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg KH, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Joerg Roedel

On Sun, Jan 21, 2018 at 3:46 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> I wanted to see whether segments protection can be a replacement for PTI
> (yes, excluding SMEP emulation), or whether speculative execution “ignores”
> limit checks, similarly to the way paging protection is skipped.
>
> It does seem that segmentation provides sufficient protection from Meltdown.
> The “reliability” test of Gratz PoC fails if the segment limit is set to
> prevent access to the kernel memory. [ It passes if the limit is not set,
> even if the DS is reloaded. ] My test is enclosed below.

Interesting. It might not be entirely reliable for all
microarchitectures, though.

> So my question: wouldn’t it be much more efficient to use segmentation
> protection for x86-32, and allow users to choose whether they want SMEP-like
> protection if needed (and then enable PTI)?

That's what we did long long ago, with user space segments actually
using the limit (in fact, if you go back far enough, the kernel even
used the base).

You'd have to make sure that the LDT loading etc do not allow CPL3
segments with base+limit past TASK_SIZE, so that people can't generate
their own.  And the TLS segments also need to be limited (and
remember, the limit has to be TASK_SIZE-base, not just TASK_SIZE).

And we should check with Intel that segment limit checking really is
guaranteed to be done before any access.

Too bad x86-64 got rid of the segments ;)

               Linus

^ permalink raw reply	[flat|nested] 183+ messages in thread
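
[ A rough illustration, not from the thread, of the base+limit constraint
  described above, written as a standalone check on a struct user_desc.
  TASK_SIZE_32 is an assumed constant for the usual 3G/1G split and the
  helper name is made up: ]

    #include <asm/ldt.h>      /* struct user_desc */
    #include <stdbool.h>

    #define TASK_SIZE_32 0xC0000000UL   /* assumed 3G/1G split */

    /* A CPL3 data segment is only acceptable if base + effective limit
     * stays below TASK_SIZE, i.e. the permitted limit is TASK_SIZE - base,
     * not TASK_SIZE itself. */
    static bool desc_stays_below_task_size(const struct user_desc *d)
    {
            unsigned long span = d->limit;

            if (d->limit_in_pages)
                    span = (span << 12) | 0xfff;   /* page-granular limit */

            if (d->base_addr >= TASK_SIZE_32)
                    return false;

            return span <= TASK_SIZE_32 - 1 - d->base_addr;
    }

[ The LDT, TLS and GDT paths would all need an equivalent check for the
  scheme to hold. ]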

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-22  2:11         ` Linus Torvalds
@ 2018-01-22  2:20           ` hpa
  -1 siblings, 0 replies; 183+ messages in thread
From: hpa @ 2018-01-22  2:20 UTC (permalink / raw)
  To: Linus Torvalds, Nadav Amit
  Cc: Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, LKML, open list:MEMORY MANAGEMENT,
	Andy Lutomirski, Dave Hansen, Josh Poimboeuf, Juergen Gross,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	Brian Gerst, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg KH, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Joerg Roedel

On January 21, 2018 6:11:07 PM PST, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Sun, Jan 21, 2018 at 3:46 PM, Nadav Amit <nadav.amit@gmail.com>
>wrote:
>> I wanted to see whether segments protection can be a replacement for
>PTI
>> (yes, excluding SMEP emulation), or whether speculative execution
>“ignores”
>> limit checks, similarly to the way paging protection is skipped.
>>
>> It does seem that segmentation provides sufficient protection from
>Meltdown.
>> The “reliability” test of Gratz PoC fails if the segment limit is set
>to
>> prevent access to the kernel memory. [ It passes if the limit is not
>set,
>> even if the DS is reloaded. ] My test is enclosed below.
>
>Interesting. It might not be entirely reliable for all
>microarchitectures, though.
>
>> So my question: wouldn’t it be much more efficient to use
>segmentation
>> protection for x86-32, and allow users to choose whether they want
>SMEP-like
>> protection if needed (and then enable PTI)?
>
>That's what we did long long ago, with user space segments actually
>using the limit (in fact, if you go back far enough, the kernel even
>used the base).
>
>You'd have to make sure that the LDT loading etc do not allow CPL3
>segments with base+limit past TASK_SIZE, so that people can't generate
>their own.  And the TLS segments also need to be limited (and
>remember, the limit has to be TASK_SIZE-base, not just TASK_SIZE).
>
>And we should check with Intel that segment limit checking really is
>guaranteed to be done before any access.
>
>Too bad x86-64 got rid of the segments ;)
>
>               Linus

No idea about Intel, but at least on Transmeta CPUs the limit check was asynchronous with the access.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-22  2:11         ` Linus Torvalds
@ 2018-01-22  2:27           ` Nadav Amit
  -1 siblings, 0 replies; 183+ messages in thread
From: Nadav Amit @ 2018-01-22  2:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Joerg Roedel, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, LKML, open list:MEMORY MANAGEMENT,
	Andy Lutomirski, Dave Hansen, Josh Poimboeuf, Juergen Gross,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	Brian Gerst, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg KH, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Joerg Roedel

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sun, Jan 21, 2018 at 3:46 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>> I wanted to see whether segments protection can be a replacement for PTI
>> (yes, excluding SMEP emulation), or whether speculative execution “ignores”
>> limit checks, similarly to the way paging protection is skipped.
>> 
>> It does seem that segmentation provides sufficient protection from Meltdown.
>> The “reliability” test of Gratz PoC fails if the segment limit is set to
>> prevent access to the kernel memory. [ It passes if the limit is not set,
>> even if the DS is reloaded. ] My test is enclosed below.
> 
> Interesting. It might not be entirely reliable for all
> microarchitectures, though.
> 
>> So my question: wouldn’t it be much more efficient to use segmentation
>> protection for x86-32, and allow users to choose whether they want SMEP-like
>> protection if needed (and then enable PTI)?
> 
> That's what we did long long ago, with user space segments actually
> using the limit (in fact, if you go back far enough, the kernel even
> used the base).
> 
> You'd have to make sure that the LDT loading etc do not allow CPL3
> segments with base+limit past TASK_SIZE, so that people can't generate
> their own.  And the TLS segments also need to be limited (and
> remember, the limit has to be TASK_SIZE-base, not just TASK_SIZE).
> 
> And we should check with Intel that segment limit checking really is
> guaranteed to be done before any access.

Thanks. I’ll try to check with the Intel liaison people at VMware (my
employer), but any feedback will be appreciated.

> Too bad x86-64 got rid of the segments ;)

Actually, as I noted in a different thread, running 32-bit binaries on
x86_64 in legacy-mode, without PTI, performs considerably better than x86_64
binaries with PTI for workloads that are hit the most (e.g., Redis). By
dynamically removing the 64-bit user-CS from the GDT, this mode should be
safe, as long as CS load is not done speculatively.

Regards,
Nadav

^ permalink raw reply	[flat|nested] 183+ messages in thread
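
[ A very rough sketch, not from the thread, of what "dynamically removing
  the 64-bit user-CS from the GDT" could mean mechanically.  It ignores
  all the real per-task switching, signal and IRET details;
  get_current_gdt_rw() and GDT_ENTRY_DEFAULT_USER_CS are existing kernel
  symbols, the helper name is made up, and this is only an illustration
  of the idea: ]

    #include <asm/desc.h>      /* get_current_gdt_rw(), struct desc_struct */
    #include <asm/segment.h>   /* GDT_ENTRY_DEFAULT_USER_CS */

    /* Mark the 64-bit user code segment not-present in this CPU's GDT so
     * a compat-mode task cannot load it; putting it back at the right
     * entry/exit points is left entirely open here. */
    static void hide_user_cs64(void)
    {
            struct desc_struct *gdt = get_current_gdt_rw();

            gdt[GDT_ENTRY_DEFAULT_USER_CS].p = 0;
    }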

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-21 23:46       ` Nadav Amit
@ 2018-01-22  8:56         ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-22  8:56 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, LKML, open list:MEMORY MANAGEMENT,
	Linus Torvalds, Andy Lutomirski, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

Hey Nadav,

On Sun, Jan 21, 2018 at 03:46:24PM -0800, Nadav Amit wrote:
> It does seem that segmentation provides sufficient protection from Meltdown.

Thanks for testing this. If it turns out to be true for all affected
uarchs, it would be a great way of protection, better than enabling
PTI.

But I'd like an official statement from Intel on that one, as their
recommended fix is still to use PTI.

And as you said, if it turns out that this works only on some Intel
uarchs, we can also detect it at runtime and then choose the fastest
Meltdown protection mechanism.


Thanks,

	Joerg

^ permalink raw reply	[flat|nested] 183+ messages in thread

* RE: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-21 23:46       ` Nadav Amit
                         ` (2 preceding siblings ...)
  (?)
@ 2018-01-22  9:55       ` David Laight
  2018-01-22 10:04           ` Joerg Roedel
  -1 siblings, 1 reply; 183+ messages in thread
From: David Laight @ 2018-01-22  9:55 UTC (permalink / raw)
  To: 'Nadav Amit', Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, LKML, open list:MEMORY MANAGEMENT,
	Linus Torvalds, Andy Lutomirski, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, Denys Vlasenko, Eduardo Valentin,
	Greg KH, Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, jroedel

From: Nadav Amit
> Sent: 21 January 2018 23:46
> 
> I wanted to see whether segments protection can be a replacement for PTI
> (yes, excluding SMEP emulation), or whether speculative execution “ignores”
> limit checks, similarly to the way paging protection is skipped.

That's made me remember something about segment limits applying in 64bit mode.
I really can't remember the details at all.
I'm sure it had something to do with one of the VM implementations restricting
memory accesses.

	David


^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-22  9:55       ` David Laight
@ 2018-01-22 10:04           ` Joerg Roedel
  0 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-22 10:04 UTC (permalink / raw)
  To: David Laight
  Cc: 'Nadav Amit',
	Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, LKML, open list:MEMORY MANAGEMENT,
	Linus Torvalds, Andy Lutomirski, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, Denys Vlasenko, Eduardo Valentin,
	Greg KH, Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, jroedel

On Mon, Jan 22, 2018 at 09:55:31AM +0000, David Laight wrote:
> That's made me remember something about segment limits applying in 64bit mode.
> I really can't remember the details at all.
> I'm sure it had something to do with one of the VM implementations restricting
> memory accesses.

Some AMD chips have long-mode segment limits, not sure if Intel has them
too. But they are useless here because the limit is 32 bit and can only
protect the upper 4GB of virtual address space. The limits also don't
apply to GS and CS segments.


Regards,

	Joerg

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack
  2018-01-19 16:30             ` Andy Lutomirski
@ 2018-01-22 10:11               ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-22 10:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML, LKML,
	Linux-MM, Linus Torvalds, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, Liguori, Anthony,
	Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

Hey Andy,

On Fri, Jan 19, 2018 at 08:30:33AM -0800, Andy Lutomirski wrote:
> I meant that we could have sp0 have a genuinely constant value per
> cpu.  That means that the entry trampoline ends up with RIP, etc in a
> different place depending on whether VM was in use, but the entry
> trampoline code should be able to handle that.  sp1 would have a value
> that varies by task, but it could just point to the top of the stack
> instead of being changed depending on whether VM is in use.  Instead,
> the entry trampoline would offset the registers as needed to keep
> pt_regs in the right place.
> 
> I think you already figured all of that out, though :)

Yes, and after looking into it for a while, it would make a nice cleanup for
the entry code. On the other hand, it would change the layout of the
in-kernel 'struct pt_regs', so that the user-visible pt_regs ends up
with a different layout than the one we use in the kernel.

This can certainly be all worked out, but it makes this nice entry-code
cleanup not so nice and clean anymore. At least the work required to
make it work without breaking user-space is not in the scope of this
patch-set.


Regards,

	Joerg

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack
  2018-01-22 10:11               ` Joerg Roedel
@ 2018-01-22 17:46                 ` Andy Lutomirski
  -1 siblings, 0 replies; 183+ messages in thread
From: Andy Lutomirski @ 2018-01-22 17:46 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	X86 ML, LKML, Linux-MM, Linus Torvalds, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, Liguori,
	Anthony, Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

On Mon, Jan 22, 2018 at 2:11 AM, Joerg Roedel <joro@8bytes.org> wrote:
> Hey Andy,
>
> On Fri, Jan 19, 2018 at 08:30:33AM -0800, Andy Lutomirski wrote:
>> I meant that we could have sp0 have a genuinely constant value per
>> cpu.  That means that the entry trampoline ends up with RIP, etc in a
>> different place depending on whether VM was in use, but the entry
>> trampoline code should be able to handle that.  sp1 would have a value
>> that varies by task, but it could just point to the top of the stack
>> instead of being changed depending on whether VM is in use.  Instead,
>> the entry trampoline would offset the registers as needed to keep
>> pt_regs in the right place.
>>
>> I think you already figured all of that out, though :)
>
> Yes, and after looking a while into it, it would make a nice cleanup for
> the entry code. On the other side, it would change the layout for the
> in-kernel 'struct pt_regs', so that the user-visible pt_regs ends up
> with a different layout than the one we use in the the kernel.

I don't think this is necessarily the case.  We end up with four more
fields that are logically there at the end of pt_regs (which is
already kind-of-sort-of the case), but we don't actually need to put
them in struct pt_regs.  We just end up with (regs + 1) != "top of
task stack", but even that has precedent -- it's already true for
tasks in vm86 mode.

>
> This can certainly be all worked out, but it makes this nice entry-code
> cleanup not so nice and clean anymore. At least the work required to
> make it work without breaking user-space is not in the scope of this
> patch-set.

Agreed.  This should probably be saved for later.  Except that your
patch set still needs to come up with some way to function correctly
on vm86.

^ permalink raw reply	[flat|nested] 183+ messages in thread
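
[ For reference, the vm86 precedent mentioned above is visible in the
  in-kernel register frame, which already carries extra segment slots
  past the end of pt_regs.  Roughly, from arch/x86/include/asm/vm86.h: ]

    /* The vm86 entry frame is a pt_regs followed by four more segment
     * registers, so (regs + 1) is already not the top of the task stack
     * for tasks in vm86 mode. */
    struct kernel_vm86_regs {
            struct pt_regs pt;
            unsigned short es, __esh;
            unsigned short ds, __dsh;
            unsigned short fs, __fsh;
            unsigned short gs, __gsh;
    };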

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-22  2:20           ` hpa
@ 2018-01-22 20:14             ` Linus Torvalds
  -1 siblings, 0 replies; 183+ messages in thread
From: Linus Torvalds @ 2018-01-22 20:14 UTC (permalink / raw)
  To: Peter Anvin
  Cc: Nadav Amit, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, LKML, open list:MEMORY MANAGEMENT,
	Andy Lutomirski, Dave Hansen, Josh Poimboeuf, Juergen Gross,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	Brian Gerst, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg KH, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Joerg Roedel

On Sun, Jan 21, 2018 at 6:20 PM,  <hpa@zytor.com> wrote:
>
> No idea about Intel, but at least on Transmeta CPUs the limit check was asynchronous with the access.

Yes, but TMTA had a really odd uarch and didn't check segment limits natively.

When you do it in hardware, the limit check is actually fairly natural
to do early rather than late (since it acts on the effective address,
_before_ the base add and TLB lookup).

So it's not like it can't be done late, but there are reasons why a
traditional microarchitecture might always end up doing the limit
check early and so segmentation might be a good defense against
meltdown on 32-bit Intel.

             Linus

^ permalink raw reply	[flat|nested] 183+ messages in thread
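
[ A concrete picture of the ordering being discussed, as an illustration
  only and not from the thread: the limit check compares the effective
  address, which exists before the segment base is added to form the
  linear address used for the TLB and cache lookup.  In C, for an
  expand-up segment: ]

    #include <stdint.h>

    /* The three AGU sources (base register, scaled index, displacement)
     * sum to the effective address; adding the segment base yields the
     * linear address.  The limit check applies to the effective address,
     * so in principle it can be resolved before the base add and the
     * TLB lookup. */
    static inline int segment_access_ok(uint32_t seg_base, uint32_t seg_limit,
                                        uint32_t base_reg, uint32_t index_reg,
                                        unsigned int scale_shift, uint32_t disp,
                                        uint32_t *linear)
    {
            uint32_t effective = base_reg + (index_reg << scale_shift) + disp;

            *linear = seg_base + effective;   /* used by the TLB/D$ lookup */
            return effective <= seg_limit;    /* violating it raises #GP */
    }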

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-22 20:14             ` Linus Torvalds
@ 2018-01-22 21:10               ` H. Peter Anvin
  -1 siblings, 0 replies; 183+ messages in thread
From: H. Peter Anvin @ 2018-01-22 21:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nadav Amit, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	the arch/x86 maintainers, LKML, open list:MEMORY MANAGEMENT,
	Andy Lutomirski, Dave Hansen, Josh Poimboeuf, Juergen Gross,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	Brian Gerst, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg KH, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Joerg Roedel

On 01/22/18 12:14, Linus Torvalds wrote:
> On Sun, Jan 21, 2018 at 6:20 PM,  <hpa@zytor.com> wrote:
>>
>> No idea about Intel, but at least on Transmeta CPUs the limit check was asynchronous with the access.
> 
> Yes, but TMTA had a really odd uarch and didn't check segment limits natively.
> 

Only on TM3000 ("Wilma") and TM5000 ("Fred"), not on TM8000 ("Astro").
Astro might in fact have been more synchronous than most modern machines
(see below.)

> When you do it in hardware. the limit check is actually fairly natural
> to do early rather than late (since it acts on the linear address
> _before_ base add and TLB lookup).
> 
> So it's not like it can't be done late, but there are reasons why a
> traditional microarchitecture might always end up doing the limit
> check early and so segmentation might be a good defense against
> meltdown on 32-bit Intel.

I will try to investigate, but as you can imagine the amount of
bandwidth I might be able to get on this is definitely going to be limited.

All of the below is generic discussion that almost certainly can be
found in some form in Hennessy & Patterson, and so I don't have to worry
about giving away Intel secrets:

It isn't really true that it is natural to check this early.  One of the
most fundamental frequency limiters in a modern CPU architecture
(meaning anything from the last 20 years or so) has been the
data-dependent AGU-D$-AGU loop.  Note that this doesn't even include the
TLB: the TLB is looked up in parallel with the D$, and if the result was
*either* a cache-TLB mismatch or a TLB miss the result is prevented from
committing.

In the case of the x86, the AGU receives up to three sources plus the
segment base, and if possible given the target process and gates
available might be designed to have a unified 4-input adder, with the
3-input case for limit checks being done separately.

Misses, and even more so exceptions (which are far less frequent than
misses), are demoted to a slower path where the goal is to prevent commit
rather than trying to race to be in the data path.  So although it is
natural to *issue* the load and the limit check at the same time, the
limit check is still going to be deferred.  Whether or not it is
permitted to be fully asynchronous with the load is probably a tradeoff
of timing requirements vs complexity.  At least theoretically one could
imagine a machine which would take the trap after the speculative
machine had already chased the pointer loop several levels down; this
would most likely mean separate uops to allow for the existing
out-of-order machine to do the bookkeeping.

	-hpa

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-22 21:10               ` H. Peter Anvin
@ 2018-01-23 14:38                 ` Alan Cox
  -1 siblings, 0 replies; 183+ messages in thread
From: Alan Cox @ 2018-01-23 14:38 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Nadav Amit, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, the arch/x86 maintainers, LKML,
	open list:MEMORY MANAGEMENT, Andy Lutomirski, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, Liguori,
	Anthony, Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Joerg Roedel

> of timing requirements vs complexity.  At least theoretically one could
> imagine a machine which would take the trap after the speculative
> machine had already chased the pointer loop several levels down; this
> would most likely mean separate uops to allow for the existing
> out-of-order machine to do the bookkeeping.

It's not quite the same, but in the IA-64 case you can write Itanium code
that does exactly that. The speculation is expressed in software, not
hardware (because you can trigger a load, then check later whether it
worked out and respond appropriately).

Alan

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-22  8:56         ` Joerg Roedel
@ 2018-01-23 14:57           ` Alan Cox
  -1 siblings, 0 replies; 183+ messages in thread
From: Alan Cox @ 2018-01-23 14:57 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Nadav Amit, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, LKML, open list:MEMORY MANAGEMENT,
	Linus Torvalds, Andy Lutomirski, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

On Mon, 22 Jan 2018 09:56:25 +0100
Joerg Roedel <joro@8bytes.org> wrote:

> Hey Nadav,
> 
> On Sun, Jan 21, 2018 at 03:46:24PM -0800, Nadav Amit wrote:
> > It does seem that segmentation provides sufficient protection from Meltdown.  
> 
> Thanks for testing this, if this turns out to be true for all affected
> uarchs it would be a great and better way of protection than enabling
> PTI.
> 
> But I'd like an official statement from Intel on that one, as their
> recommended fix is still to use PTI.
> 
> And as you said, if it turns out that this works only on some Intel
> uarchs, we can also detect it at runtime and then choose the fastest
> Meltdown protection mechanism.

I'll follow this up and get an official statement.

Alan

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-16 16:36 ` Joerg Roedel
@ 2018-01-24 18:58   ` Krzysztof Mazur
  -1 siblings, 0 replies; 183+ messages in thread
From: Krzysztof Mazur @ 2018-01-24 18:58 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	linux-mm, Linus Torvalds, Andy Lutomirski, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, aliguori,
	daniel.gruss, hughd, keescook, Andrea Arcangeli

On Tue, Jan 16, 2018 at 05:36:43PM +0100, Joerg Roedel wrote:
> From: Joerg Roedel <jroedel@suse.de>
> 
> Hi,
> 
> here is my current WIP code to enable PTI on x86-32. It is
> still in a pretty early state, but it successfully boots my
> KVM guest with PAE and with legacy paging. The existing PTI
> code for x86-64 already prepares a lot of the stuff needed
> for 32 bit too, thanks for that to all the people involved
> in its development :)

Hi,

I've waited for these patches for a long time, until I tried to
exploit Meltdown on some old 32-bit CPUs and failed. The Pentium M
seems to speculatively execute the second load with eax
always equal to 0:

	movzx (%[addr]), %%eax
	shl $12, %%eax
	movzx (%[target], %%eax), %%eax

And on a Pentium 4-based Xeon the second load seems never to be executed,
even without the shift (shifts are slow on some or all Pentium 4s). Maybe
not all P6 and NetBurst CPUs are affected, but I'm not sure. Maybe the
kernel, at least on 32-bit, should try to exploit Meltdown to test whether
the CPU is really affected.
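
The probe side of such a test is a flush-and-time loop over a 256-entry,
page-strided array; a minimal user-space sketch (illustrative only -- the
stride, the clflush-based flushing and the cycle threshold are assumptions,
not the exact code used here):

--- cut here ---
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define STRIDE 4096
static unsigned char probe[256 * STRIDE];

int main(void)
{
	uint64_t t0, t1;
	int i;

	/* flush every probe line; the speculative gadget would run here */
	for (i = 0; i < 256; i++)
		_mm_clflush(&probe[i * STRIDE]);
	_mm_mfence();

	/* time one load per line; a fast load means that line was cached */
	for (i = 0; i < 256; i++) {
		volatile unsigned char *p = &probe[i * STRIDE];
		t0 = __rdtsc();
		(void)*p;
		t1 = __rdtsc();
		if (t1 - t0 < 100)	/* threshold is machine-dependent */
			printf("%02x %llu\n", i, (unsigned long long)(t1 - t0));
	}
	return 0;
}
--- cut here ---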


The series boots on the Pentium M (and crashes when I use perf,
but that is an already known issue). However, I don't like
the performance regression with CONFIG_PAGE_TABLE_ISOLATION=n
(about 7.2%) in a trivial "benchmark":

--- cut here ---
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
	unsigned long i;
	int fd;

	fd = open("/dev/null", O_WRONLY);
	for (i = 0; i < 10000000; i++) {
		char x = 0;
		write(fd, &x, 1);
	}
	return 0;
}
--- cut here ---
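
Presumably built and measured along these lines (the exact compiler flags
and the perf invocation are assumptions; the perf output below is from 5
runs):

	gcc -O2 -o bench bench.c
	perf stat -r 5 ./bench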

Time (on Pentium M 1.73 GHz):

baseline (4.15.0-rc8-gdda3e152):		2.415 s (+/- 0.64%)
patched, CONFIG_PAGE_TABLE_ISOLATION=n          2.588 s (+/- 0.01%)
patched, nopti					2.597 s (+/- 0.31%)
patched, pti					18.272 s
(some older kernel, pre 4.15)			2.378 s
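
(For reference: 2.588 s / 2.415 s ~= 1.072, i.e. the ~7.2% regression
mentioned above, and 18.272 s / 2.415 s ~= 7.6, so booting with pti makes
this syscall loop roughly 7.5 times slower than the baseline.)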

Thanks,
Krzysiek
--
perf results:

baseline:

 Performance counter stats for './bench' (5 runs):

       2401.539139 task-clock:HG             #    0.995 CPUs utilized            ( +-  0.23% )
                23 context-switches:HG       #    0.009 K/sec                    ( +-  4.02% )
                 0 cpu-migrations:HG         #    0.000 K/sec                  
                30 page-faults:HG            #    0.013 K/sec                    ( +-  1.24% )
        4142375834 cycles:HG                 #    1.725 GHz                      ( +-  0.23% ) [39.99%]
         385110908 stalled-cycles-frontend:HG #    9.30% frontend cycles idle     ( +-  0.06% ) [40.01%]
   <not supported> stalled-cycles-backend:HG
        4142489274 instructions:HG           #    1.00  insns per cycle        
                                             #    0.09  stalled cycles per insn  ( +-  0.00% ) [40.00%]
         802270380 branches:HG               #  334.065 M/sec                    ( +-  0.00% ) [40.00%]
             34278 branch-misses:HG          #    0.00% of all branches          ( +-  1.94% ) [40.00%]

       2.414741497 seconds time elapsed                                          ( +-  0.64% )

patched, CONFIG_PAGE_TABLE_ISOLATION=n

 Performance counter stats for './bench' (5 runs):

       2587.026405 task-clock:HG             #    1.000 CPUs utilized            ( +-  0.01% )
                28 context-switches:HG       #    0.011 K/sec                    ( +-  5.95% )
                 0 cpu-migrations:HG         #    0.000 K/sec                  
                31 page-faults:HG            #    0.012 K/sec                    ( +-  1.21% )
        4462401079 cycles:HG                 #    1.725 GHz                      ( +-  0.01% ) [39.98%]
         388646121 stalled-cycles-frontend:HG #    8.71% frontend cycles idle     ( +-  0.05% ) [40.01%]
   <not supported> stalled-cycles-backend:HG
        4283638646 instructions:HG           #    0.96  insns per cycle        
                                             #    0.09  stalled cycles per insn  ( +-  0.00% ) [40.03%]
         822484311 branches:HG               #  317.927 M/sec                    ( +-  0.00% ) [40.01%]
             39372 branch-misses:HG          #    0.00% of all branches          ( +-  2.33% ) [39.98%]

       2.587818354 seconds time elapsed                                          ( +-  0.01% )

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-22  8:56         ` Joerg Roedel
@ 2018-01-25 17:09           ` Alan Cox
  -1 siblings, 0 replies; 183+ messages in thread
From: Alan Cox @ 2018-01-25 17:09 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Nadav Amit, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, LKML, open list:MEMORY MANAGEMENT,
	Linus Torvalds, Andy Lutomirski, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

On Mon, 22 Jan 2018 09:56:25 +0100
Joerg Roedel <joro@8bytes.org> wrote:

> Hey Nadav,
> 
> On Sun, Jan 21, 2018 at 03:46:24PM -0800, Nadav Amit wrote:
> > It does seem that segmentation provides sufficient protection from Meltdown.  
> 
> Thanks for testing this, if this turns out to be true for all affected
> uarchs it would be a great and better way of protection than enabling
> PTI.
> 
> But I'd like an official statement from Intel on that one, as their
> recommended fix is still to use PTI.

It is: we don't think segmentation works on all processors as a defence.

Alan

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-24 18:58   ` Krzysztof Mazur
@ 2018-01-25 22:09     ` Nadav Amit
  -1 siblings, 0 replies; 183+ messages in thread
From: Nadav Amit @ 2018-01-25 22:09 UTC (permalink / raw)
  To: Krzysztof Mazur
  Cc: Joerg Roedel, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, LKML, open list:MEMORY MANAGEMENT,
	Linus Torvalds, Andy Lutomirski, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli

Krzysztof Mazur <krzysiek@podlesie.net> wrote:

> On Tue, Jan 16, 2018 at 05:36:43PM +0100, Joerg Roedel wrote:
>> From: Joerg Roedel <jroedel@suse.de>
>> 
>> Hi,
>> 
>> here is my current WIP code to enable PTI on x86-32. It is
>> still in a pretty early state, but it successfully boots my
>> KVM guest with PAE and with legacy paging. The existing PTI
>> code for x86-64 already prepares a lot of the stuff needed
>> for 32 bit too, thanks for that to all the people involved
>> in its development :)
> 
> Hi,
> 
> I've waited for this patches for a long time, until I've tried to
> exploit meltdown on some old 32-bit CPUs and failed. Pentium M
> seems to speculatively execute the second load with eax
> always equal to 0:
> 
> 	movzx (%[addr]), %%eax
> 	shl $12, %%eax
> 	movzx (%[target], %%eax), %%eax
> 
> And on Pentium 4-based Xeon the second load seems to be never executed,
> even without shift (shifts are slow on some or all Pentium 4's). Maybe
> not all P6 and Netbursts CPUs are affected, but I'm not sure. Maybe the
> kernel, at least on 32-bit, should try to exploit meltdown to test if
> the CPU is really affected.

The PoC apparently does not work with 3GB of memory or more on 32-bit. Does
your setup have more? Can you try the attack while setting max_addr=1G?

Thanks,
Nadav

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-25 22:09     ` Nadav Amit
@ 2018-01-26  9:28       ` Krzysztof Mazur
  -1 siblings, 0 replies; 183+ messages in thread
From: Krzysztof Mazur @ 2018-01-26  9:28 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Joerg Roedel, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, LKML, open list:MEMORY MANAGEMENT,
	Linus Torvalds, Andy Lutomirski, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli

On Thu, Jan 25, 2018 at 02:09:40PM -0800, Nadav Amit wrote:
> The PoC apparently does not work with 3GB of memory or more on 32-bit. Does
> you setup has more? Can you try the attack while setting max_addr=1G ?

No, I tested on:

Pentium M (Dothan): 1.5 GB RAM, PAE for NX, 2GB/2GB split

CONFIG_NOHIGHMEM=y
CONFIG_VMSPLIT_2G=y
CONFIG_PAGE_OFFSET=0x80000000
CONFIG_X86_PAE=y

and

Xeon (Pentium 4): 2 GB RAM, no PAE, 1.75GB/2.25GB split
CONFIG_NOHIGHMEM=y
CONFIG_VMSPLIT_2G_OPT=y
CONFIG_PAGE_OFFSET=0x78000000


Now I'm testing with standard settings on
Pentium M: 1.5 GB RAM, no PAE, 3GB/1GB split, ~890 MB RAM available

CONFIG_NOHIGHMEM=y
CONFIG_PAGE_OFFSET=0xc0000000
CONFIG_X86_PAE=n

and it still does not work.

The reliability tool from https://github.com/IAIK/meltdown reports 0.38%
(1/256 = 0.39%, i.e. "true" random), and the other libkdump tools do not work.

https://github.com/paboldin/meltdown-exploit (on linux_proc_banner
symbol) reports:
cached = 46, uncached = 515, threshold 153
read c0897020 = ff   (score=0/1000)
read c0897021 = ff   (score=0/1000)
read c0897022 = ff   (score=0/1000)
read c0897023 = ff   (score=0/1000)
read c0897024 = ff   (score=0/1000)
NOT VULNERABLE

and my exploit with:

	for (i = 0; i < 256; i++) {
		unsigned char *px = p + (i << 12);

		t = rdtsc();
		readb(px);
		t = rdtsc() - t;
		if (t < 100)
			printf("%02x %lld\n", i, t);
	}

the loop returns only "00 45". When I change the exploit code (now based
on the paboldin code, to be sure) to:

	movzx (%[addr]), %%eax
	movl $0xaa, %%eax
	shl $12, %%eax
	movzx (%[target], %%eax), %%eax

I always get "0xaa 51", so the CPU is speculatively executing the second
load with (0xaa << 12) in eax, and without the movl instruction, eax seems
to be always 0. I even tried to remove the shift:

	movzx (%[addr]), %%eax
	movzx (%[target], %%eax), %%eax

and I've been reading a known value (from /dev/mem, for instance 0x20)
and modifying the target array offset, and the CPU is still touching the
"wrong" cacheline, i.e. eax == 0 instead of 0x20. I've also tested movl
instead of movzx (with an additional and $0xff).


On a Core 2 Quad in 64-bit mode everything works as expected and it is
vulnerable to Meltdown (I did not test it in 32-bit mode). I don't have
any Core "1" to test.

On that Pentium M the syscall slowdown caused by PTI is huge: 7.5 times
slower (7 times compared to the patched kernel with PTI disabled). On
Skylake with PCID the same trivial benchmark is "only" 3.5 times slower
(and 5.2 times slower without PCID).

Krzysiek

^ permalink raw reply	[flat|nested] 183+ messages in thread

* Re: [RFC PATCH 00/16] PTI support for x86-32
  2018-01-25 17:09           ` Alan Cox
@ 2018-01-26 12:36             ` Joerg Roedel
  -1 siblings, 0 replies; 183+ messages in thread
From: Joerg Roedel @ 2018-01-26 12:36 UTC (permalink / raw)
  To: Alan Cox
  Cc: Nadav Amit, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers, LKML, open list:MEMORY MANAGEMENT,
	Linus Torvalds, Andy Lutomirski, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, aliguori, daniel.gruss,
	hughd, keescook, Andrea Arcangeli, Waiman Long, jroedel

Hi Alan,

On Thu, Jan 25, 2018 at 05:09:25PM +0000, Alan Cox wrote:
> On Mon, 22 Jan 2018 09:56:25 +0100
> Joerg Roedel <joro@8bytes.org> wrote:
> 
> > Hey Nadav,
> > 
> > On Sun, Jan 21, 2018 at 03:46:24PM -0800, Nadav Amit wrote:
> > > It does seem that segmentation provides sufficient protection from Meltdown.  
> > 
> > Thanks for testing this, if this turns out to be true for all affected
> > uarchs it would be a great and better way of protection than enabling
> > PTI.
> > 
> > But I'd like an official statement from Intel on that one, as their
> > recommended fix is still to use PTI.
> 
> It is: we don't think segmentation works on all processors as a defence.

Thanks for checking and for the official statement. So the official
mitigation recommendation is still to use PTI.


Regards,

	Joerg

^ permalink raw reply	[flat|nested] 183+ messages in thread

end of thread, other threads:[~2018-01-26 12:36 UTC | newest]

Thread overview: 183+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-16 16:36 [RFC PATCH 00/16] PTI support for x86-32 Joerg Roedel
2018-01-16 16:36 ` Joerg Roedel
2018-01-16 16:36 ` [PATCH 01/16] x86/entry/32: Rename TSS_sysenter_sp0 to TSS_sysenter_stack Joerg Roedel
2018-01-16 16:36   ` Joerg Roedel
2018-01-16 18:35   ` Thomas Gleixner
2018-01-16 18:35     ` Thomas Gleixner
2018-01-16 16:36 ` [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack Joerg Roedel
2018-01-16 16:36   ` Joerg Roedel
2018-01-16 20:30   ` Thomas Gleixner
2018-01-16 20:30     ` Thomas Gleixner
2018-01-16 22:37     ` Andy Lutomirski
2018-01-16 22:37       ` Andy Lutomirski
2018-01-16 22:45   ` Andy Lutomirski
2018-01-16 22:45     ` Andy Lutomirski
2018-01-17  9:18     ` Joerg Roedel
2018-01-17  9:18       ` Joerg Roedel
2018-01-17 18:10       ` Andy Lutomirski
2018-01-17 18:10         ` Andy Lutomirski
2018-01-19  9:55         ` Joerg Roedel
2018-01-19  9:55           ` Joerg Roedel
2018-01-19 16:30           ` Andy Lutomirski
2018-01-19 16:30             ` Andy Lutomirski
2018-01-22 10:11             ` Joerg Roedel
2018-01-22 10:11               ` Joerg Roedel
2018-01-22 17:46               ` Andy Lutomirski
2018-01-22 17:46                 ` Andy Lutomirski
2018-01-17  2:47   ` Boris Ostrovsky
2018-01-17  2:47     ` Boris Ostrovsky
2018-01-17  9:02     ` Joerg Roedel
2018-01-17  9:02       ` Joerg Roedel
2018-01-17 14:04       ` Andrew Cooper
2018-01-17 14:04         ` Andrew Cooper
2018-01-17 15:22         ` Boris Ostrovsky
2018-01-17 15:22           ` Boris Ostrovsky
2018-01-16 16:36 ` [PATCH 03/16] x86/entry/32: Leave the kernel via the " Joerg Roedel
2018-01-16 16:36   ` Joerg Roedel
2018-01-16 22:48   ` Andy Lutomirski
2018-01-16 22:48     ` Andy Lutomirski
2018-01-17  9:24     ` Joerg Roedel
2018-01-17  9:24       ` Joerg Roedel
2018-01-17 13:57       ` Brian Gerst
2018-01-17 13:57         ` Brian Gerst
2018-01-17 14:00         ` Brian Gerst
2018-01-17 14:00           ` Brian Gerst
2018-01-17 14:14           ` Joerg Roedel
2018-01-17 14:14             ` Joerg Roedel
2018-01-17 14:45             ` Josh Poimboeuf
2018-01-17 14:45               ` Josh Poimboeuf
2018-01-17 14:10         ` Joerg Roedel
2018-01-17 14:10           ` Joerg Roedel
2018-01-17 18:12           ` Andy Lutomirski
2018-01-17 18:12             ` Andy Lutomirski
2018-01-19  9:57             ` Joerg Roedel
2018-01-19  9:57               ` Joerg Roedel
2018-01-16 16:36 ` [PATCH 04/16] x86/pti: Define X86_CR3_PTI_PCID_USER_BIT on x86_32 Joerg Roedel
2018-01-16 16:36   ` Joerg Roedel
2018-01-16 22:46   ` Andy Lutomirski
2018-01-16 22:46     ` Andy Lutomirski
2018-01-17  9:26     ` Joerg Roedel
2018-01-17  9:26       ` Joerg Roedel
2018-01-16 16:36 ` [PATCH 05/16] x86/pgtable: Move pgdp kernel/user conversion functions to pgtable.h Joerg Roedel
2018-01-16 16:36   ` Joerg Roedel
2018-01-16 16:36 ` [PATCH 06/16] x86/mm/ldt: Reserve high address-space range for the LDT Joerg Roedel
2018-01-16 16:36   ` Joerg Roedel
2018-01-16 16:52   ` Peter Zijlstra
2018-01-16 16:52     ` Peter Zijlstra
2018-01-16 17:13     ` Joerg Roedel
2018-01-16 17:13       ` Joerg Roedel
2018-01-16 17:31       ` Peter Zijlstra
2018-01-16 17:31         ` Peter Zijlstra
2018-01-16 17:34         ` Waiman Long
2018-01-16 17:34           ` Waiman Long
2018-01-16 22:51     ` Andy Lutomirski
2018-01-16 22:51       ` Andy Lutomirski
2018-01-17  7:59       ` Peter Zijlstra
2018-01-17  7:59         ` Peter Zijlstra
2018-01-16 16:36 ` [PATCH 07/16] x86/mm: Move two more functions from pgtable_64.h to pgtable.h Joerg Roedel
2018-01-16 16:36   ` Joerg Roedel
2018-01-16 18:03   ` Dave Hansen
2018-01-16 18:03     ` Dave Hansen
2018-01-16 19:11     ` Joerg Roedel
2018-01-16 19:11       ` Joerg Roedel
2018-01-16 19:34       ` Thomas Gleixner
2018-01-16 19:34         ` Thomas Gleixner
2018-01-16 16:36 ` [PATCH 08/16] x86/pgtable/32: Allocate 8k page-tables when PTI is enabled Joerg Roedel
2018-01-16 16:36   ` Joerg Roedel
2018-01-17 23:43   ` Andy Lutomirski
2018-01-17 23:43     ` Andy Lutomirski
2018-01-19  9:57     ` Joerg Roedel
2018-01-19  9:57       ` Joerg Roedel
2018-01-16 16:36 ` [PATCH 09/16] x86/mm/pti: Clone CPU_ENTRY_AREA on PMD level on x86_32 Joerg Roedel
2018-01-16 16:36   ` Joerg Roedel
2018-01-16 21:03   ` Thomas Gleixner
2018-01-16 21:03     ` Thomas Gleixner
2018-01-16 16:36 ` [PATCH 10/16] x86/mm/pti: Populate valid user pud entries Joerg Roedel
2018-01-16 16:36   ` Joerg Roedel
2018-01-16 18:06   ` Dave Hansen
2018-01-16 18:06     ` Dave Hansen
2018-01-16 19:41     ` Joerg Roedel
2018-01-16 19:41       ` Joerg Roedel
2018-01-16 21:06   ` Thomas Gleixner
2018-01-16 21:06     ` Thomas Gleixner
2018-01-16 16:36 ` [PATCH 11/16] x86/mm/pgtable: Move pti_set_user_pgd() to pgtable.h Joerg Roedel
2018-01-16 16:36   ` Joerg Roedel
2018-01-16 16:36 ` [PATCH 12/16] x86/mm/pae: Populate the user page-table with user pgd's Joerg Roedel
2018-01-16 16:36   ` Joerg Roedel
2018-01-16 18:11   ` Dave Hansen
2018-01-16 18:11     ` Dave Hansen
2018-01-16 19:44     ` Joerg Roedel
2018-01-16 19:44       ` Joerg Roedel
2018-01-16 21:10   ` Thomas Gleixner
2018-01-16 21:10     ` Thomas Gleixner
2018-01-16 21:15     ` Dave Hansen
2018-01-16 21:15       ` Dave Hansen
2018-01-16 16:36 ` [PATCH 13/16] x86/mm/pti: Add an overflow check to pti_clone_pmds() Joerg Roedel
2018-01-16 16:36   ` Joerg Roedel
2018-01-16 16:36 ` [PATCH 14/16] x86/mm/legacy: Populate the user page-table with user pgd's Joerg Roedel
2018-01-16 16:36   ` Joerg Roedel
2018-01-17 23:41   ` Andy Lutomirski
2018-01-17 23:41     ` Andy Lutomirski
2018-01-16 16:36 ` [PATCH 15/16] x86/entry/32: Switch between kernel and user cr3 on entry/exit Joerg Roedel
2018-01-16 16:36   ` Joerg Roedel
2018-01-16 16:36 ` [PATCH 16/16] x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32 Joerg Roedel
2018-01-16 16:36   ` Joerg Roedel
2018-01-16 18:14 ` [RFC PATCH 00/16] PTI support for x86-32 Dave Hansen
2018-01-16 18:14   ` Dave Hansen
2018-01-16 19:46   ` Joerg Roedel
2018-01-16 19:46     ` Joerg Roedel
2018-01-16 18:59 ` Linus Torvalds
2018-01-16 18:59   ` Linus Torvalds
2018-01-16 19:02   ` Dave Hansen
2018-01-16 19:02     ` Dave Hansen
2018-01-16 19:21   ` Andrew Cooper
2018-01-16 19:21     ` Andrew Cooper
2018-01-16 19:55   ` Joerg Roedel
2018-01-16 19:55     ` Joerg Roedel
2018-01-16 21:20 ` Thomas Gleixner
2018-01-16 21:20   ` Thomas Gleixner
2018-01-17  9:55   ` Joerg Roedel
2018-01-17  9:55     ` Joerg Roedel
2018-01-16 22:26 ` Andy Lutomirski
2018-01-16 22:26   ` Andy Lutomirski
2018-01-17  9:33   ` Joerg Roedel
2018-01-17  9:33     ` Joerg Roedel
2018-01-19 10:55 ` Pavel Machek
2018-01-19 11:07   ` Joerg Roedel
2018-01-19 11:07     ` Joerg Roedel
2018-01-19 12:58     ` Pavel Machek
2018-01-21 20:13 ` Nadav Amit
2018-01-21 20:13   ` Nadav Amit
2018-01-21 20:44   ` Nadav Amit
2018-01-21 20:44     ` Nadav Amit
2018-01-21 23:46     ` Nadav Amit
2018-01-21 23:46       ` Nadav Amit
2018-01-22  2:11       ` Linus Torvalds
2018-01-22  2:11         ` Linus Torvalds
2018-01-22  2:20         ` hpa
2018-01-22  2:20           ` hpa
2018-01-22 20:14           ` Linus Torvalds
2018-01-22 20:14             ` Linus Torvalds
2018-01-22 21:10             ` H. Peter Anvin
2018-01-22 21:10               ` H. Peter Anvin
2018-01-23 14:38               ` Alan Cox
2018-01-23 14:38                 ` Alan Cox
2018-01-22  2:27         ` Nadav Amit
2018-01-22  2:27           ` Nadav Amit
2018-01-22  8:56       ` Joerg Roedel
2018-01-22  8:56         ` Joerg Roedel
2018-01-23 14:57         ` Alan Cox
2018-01-23 14:57           ` Alan Cox
2018-01-25 17:09         ` Alan Cox
2018-01-25 17:09           ` Alan Cox
2018-01-26 12:36           ` Joerg Roedel
2018-01-26 12:36             ` Joerg Roedel
2018-01-22  9:55       ` David Laight
2018-01-22 10:04         ` Joerg Roedel
2018-01-22 10:04           ` Joerg Roedel
2018-01-24 18:58 ` Krzysztof Mazur
2018-01-24 18:58   ` Krzysztof Mazur
2018-01-25 22:09   ` Nadav Amit
2018-01-25 22:09     ` Nadav Amit
2018-01-26  9:28     ` Krzysztof Mazur
2018-01-26  9:28       ` Krzysztof Mazur
