linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/3] PTI for x86-32 Fixes and Updates
@ 2018-07-20 16:22 Joerg Roedel
  2018-07-20 16:22 ` [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables Joerg Roedel
                   ` (4 more replies)
  0 siblings, 5 replies; 34+ messages in thread
From: Joerg Roedel @ 2018-07-20 16:22 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek,
	David H . Gutteridge, jroedel, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, joro

Hi,

here are 3 patches which update the PTI-x86-32 patches recently merged
into the tip-tree. The patches are ordered by importance:

	Patch 1: Very important, it fixes a vmalloc-fault in NMI context
		 when PTI is enabled. This is pretty unlikely to hit
		 when starting perf on an idle machine, which is why I
		 didn't find it earlier in my testing. I always started
		 'perf top' first :/ But when I start 'perf top' last,
		 when the kernel-compile is already running, it hits
		 almost immediately.

	Patch 2: Fix the 'from-kernel-check' in SWITCH_TO_KERNEL_STACK
	         to also take VM86 into account. This is not strictly
		 necessary because the slow-path also works for VM86
		 mode but it is not how the code was intended to work.
		 And it breaks when Patch 3 is applied on top.

	Patch 3: Implement the reduced copying in the paranoid
		 entry/exit path as suggested by Andy Lutomirski while
		 reviewing version 7 of the original patches.

I have had the x86/tip branch with these patches on top running my test
for 6h now, with no issues so far. So for now it looks like there are no
scheduling points or irq-enabled sections reached from the paranoid
entry/exit paths and we always return to the entry-stack we came from.

I will keep the test running at least over the weekend.

Please review.

[ If Patch 1 looks good to the maintainers I suggest applying it soon,
  before too many linux-next testers run into this issue. It is actually
  the reason why I am sending out the patches _now_ instead of waiting
  until next week, when the other two patches would have gotten more
  testing from my side. ]

Thanks,

	Joerg

Joerg Roedel (3):
  perf/core: Make sure the ring-buffer is mapped in all page-tables
  x86/entry/32: Check for VM86 mode in slow-path check
  x86/entry/32: Copy only ptregs on paranoid entry/exit path

 arch/x86/entry/entry_32.S   | 82 ++++++++++++++++++++++++++-------------------
 kernel/events/ring_buffer.c | 10 ++++++
 2 files changed, 58 insertions(+), 34 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables
  2018-07-20 16:22 [PATCH 0/3] PTI for x86-32 Fixes and Updates Joerg Roedel
@ 2018-07-20 16:22 ` Joerg Roedel
  2018-07-20 17:06   ` Andy Lutomirski
                     ` (2 more replies)
  2018-07-20 16:22 ` [PATCH 2/3] x86/entry/32: Check for VM86 mode in slow-path check Joerg Roedel
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 34+ messages in thread
From: Joerg Roedel @ 2018-07-20 16:22 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek,
	David H . Gutteridge, jroedel, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, joro

From: Joerg Roedel <jroedel@suse.de>

The ring-buffer is accessed in the NMI handler, so it is
better to avoid faulting on it. Sync the vmalloc range with
all page-tables in the system to make sure everyone has it
mapped.

This fixes a WARN_ON_ONCE() that can be triggered with PTI
enabled on x86-32:

	WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230

This triggers because with PTI enabled on a PAE kernel the
PMDs are no longer shared between the page-tables, so the
vmalloc changes do not propagate automatically.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 kernel/events/ring_buffer.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 5d3cf40..7b0e9aa 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -814,6 +814,9 @@ static void rb_free_work(struct work_struct *work)
 
 	vfree(base);
 	kfree(rb);
+
+	/* Make sure buffer is unmapped in all page-tables */
+	vmalloc_sync_all();
 }
 
 void rb_free(struct ring_buffer *rb)
@@ -840,6 +843,13 @@ struct ring_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
 	if (!all_buf)
 		goto fail_all_buf;
 
+	/*
+	 * The buffer is accessed in NMI handlers, make sure it is
+	 * mapped in all page-tables in the system so that we don't
+	 * fault on the range in an NMI handler.
+	 */
+	vmalloc_sync_all();
+
 	rb->user_page = all_buf;
 	rb->data_pages[0] = all_buf + PAGE_SIZE;
 	if (nr_pages) {
-- 
2.7.4



* [PATCH 2/3] x86/entry/32: Check for VM86 mode in slow-path check
  2018-07-20 16:22 [PATCH 0/3] PTI for x86-32 Fixes and Updates Joerg Roedel
  2018-07-20 16:22 ` [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables Joerg Roedel
@ 2018-07-20 16:22 ` Joerg Roedel
  2018-07-20 19:37   ` [tip:x86/pti] " tip-bot for Joerg Roedel
                     ` (2 more replies)
  2018-07-20 16:22 ` [PATCH 3/3] x86/entry/32: Copy only ptregs on paranoid entry/exit path Joerg Roedel
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 34+ messages in thread
From: Joerg Roedel @ 2018-07-20 16:22 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek,
	David H . Gutteridge, jroedel, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, joro

From: Joerg Roedel <jroedel@suse.de>

The SWITCH_TO_KERNEL_STACK macro only checks for CPL == 0 to
go down the slow and paranoid entry path. The problem is
that this check also returns true when coming from VM86
mode. This is not a problem by itself, as the paranoid path
handles VM86 stack-frames just fine, but it is not necessary
as the normal code path handles VM86 mode as well (and
faster).

Extend the check to include VM86 mode. This also makes an
optimization of the paranoid path possible.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/entry/entry_32.S | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 010cdb4..2767c62 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -414,8 +414,16 @@
 	andl	$(0x0000ffff), PT_CS(%esp)
 
 	/* Special case - entry from kernel mode via entry stack */
-	testl	$SEGMENT_RPL_MASK, PT_CS(%esp)
-	jz	.Lentry_from_kernel_\@
+#ifdef CONFIG_VM86
+	movl	PT_EFLAGS(%esp), %ecx		# mix EFLAGS and CS
+	movb	PT_CS(%esp), %cl
+	andl	$(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %ecx
+#else
+	movl	PT_CS(%esp), %ecx
+	andl	$SEGMENT_RPL_MASK, %ecx
+#endif
+	cmpl	$USER_RPL, %ecx
+	jb	.Lentry_from_kernel_\@
 
 	/* Bytes to copy */
 	movl	$PTREGS_SIZE, %ecx
-- 
2.7.4



* [PATCH 3/3] x86/entry/32: Copy only ptregs on paranoid entry/exit path
  2018-07-20 16:22 [PATCH 0/3] PTI for x86-32 Fixes and Updates Joerg Roedel
  2018-07-20 16:22 ` [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables Joerg Roedel
  2018-07-20 16:22 ` [PATCH 2/3] x86/entry/32: Check for VM86 mode in slow-path check Joerg Roedel
@ 2018-07-20 16:22 ` Joerg Roedel
  2018-07-20 17:09   ` Andy Lutomirski
  2018-07-23  3:49 ` [PATCH 0/3] PTI for x86-32 Fixes and Updates David H. Gutteridge
  2018-07-23 14:09 ` Pavel Machek
  4 siblings, 1 reply; 34+ messages in thread
From: Joerg Roedel @ 2018-07-20 16:22 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek,
	David H . Gutteridge, jroedel, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, joro

From: Joerg Roedel <jroedel@suse.de>

The code that switches from entry- to task-stack when we
enter from kernel-mode copies the full entry-stack contents
to the task-stack.

That is because we don't trust the contents of the
entry-stack. But actually we can trust its contents if we
are not scheduled out between entry and exit.

So do less copying and move only the ptregs over to the
task-stack in this code-path.

Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 arch/x86/entry/entry_32.S | 70 +++++++++++++++++++++++++----------------------
 1 file changed, 38 insertions(+), 32 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 2767c62..90166b2 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -469,33 +469,48 @@
 	 * segment registers on the way back to user-space or when the
 	 * sysenter handler runs with eflags.tf set.
 	 *
-	 * When we switch to the task-stack here, we can't trust the
-	 * contents of the entry-stack anymore, as the exception handler
-	 * might be scheduled out or moved to another CPU. Therefore we
-	 * copy the complete entry-stack to the task-stack and set a
-	 * marker in the iret-frame (bit 31 of the CS dword) to detect
-	 * what we've done on the iret path.
+	 * When we switch to the task-stack here, we extend the
+	 * stack-frame we copy to include the entry-stack %esp and a
+	 * pseudo %ss value so that we have a full ptregs struct on the
+	 * stack. We set a marker in the frame (bit 31 of the CS dword).
 	 *
-	 * On the iret path we copy everything back and switch to the
-	 * entry-stack, so that the interrupted kernel code-path
-	 * continues on the same stack it was interrupted with.
+	 * On the iret path we read %esp from the PT_OLDESP slot on the
+	 * stack and copy ptregs (except oldesp and oldss) to it, when
+	 * we find the marker set. Then we switch to the %esp we read,
+	 * so that the interrupted kernel code-path continues on the
+	 * same stack it was interrupted with.
 	 *
 	 * Be aware that an NMI can happen anytime in this code.
 	 *
+	 * Register values here are:
+	 *
 	 * %esi: Entry-Stack pointer (same as %esp)
 	 * %edi: Top of the task stack
 	 * %eax: CR3 on kernel entry
 	 */
 
-	/* Calculate number of bytes on the entry stack in %ecx */
-	movl	%esi, %ecx
+	/* Allocate full pt_regs on task-stack */
+	subl	$PTREGS_SIZE, %edi
+
+	/* Switch to task-stack */
+	movl	%edi, %esp
 
-	/* %ecx to the top of entry-stack */
-	andl	$(MASK_entry_stack), %ecx
-	addl	$(SIZEOF_entry_stack), %ecx
+	/* Populate pt_regs on task-stack */
+	movl	$__KERNEL_DS, PT_OLDSS(%esp)	/* Check: Is this needed? */
 
-	/* Number of bytes on the entry stack to %ecx */
-	sub	%esi, %ecx
+	/*
+	 * Save entry-stack pointer on task-stack so that we can switch back to
+	 * it on the iret path.
+	 */
+	movl	%esi, PT_OLDESP(%esp)
+
+	/* sizeof(pt_regs) minus space for %esp and %ss to %ecx */
+	movl	$(PTREGS_SIZE - 8), %ecx
+
+	/* Copy rest */
+	shrl	$2, %ecx
+	cld
+	rep movsl
 
 	/* Mark stackframe as coming from entry stack */
 	orl	$CS_FROM_ENTRY_STACK, PT_CS(%esp)
@@ -505,16 +520,9 @@
 	 * so that we can switch back to it before iret.
 	 */
 	testl	$PTI_SWITCH_MASK, %eax
-	jz	.Lcopy_pt_regs_\@
+	jz	.Lend_\@
 	orl	$CS_FROM_USER_CR3, PT_CS(%esp)
 
-	/*
-	 * %esi and %edi are unchanged, %ecx contains the number of
-	 * bytes to copy. The code at .Lcopy_pt_regs_\@ will allocate
-	 * the stack-frame on task-stack and copy everything over
-	 */
-	jmp .Lcopy_pt_regs_\@
-
 .Lend_\@:
 .endm
 
@@ -594,16 +602,14 @@
 	/* Clear marker from stack-frame */
 	andl	$(~CS_FROM_ENTRY_STACK), PT_CS(%esp)
 
-	/* Copy the remaining task-stack contents to entry-stack */
+	/*
+	 * Copy the remaining 'struct pt_regs' to entry-stack. Leave out
+	 * OLDESP and OLDSS as we didn't copy that over on entry.
+	 */
 	movl	%esp, %esi
-	movl	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %edi
+	movl	PT_OLDESP(%esp), %edi
 
-	/* Bytes on the task-stack to ecx */
-	movl	PER_CPU_VAR(cpu_tss_rw + TSS_sp1), %ecx
-	subl	%esi, %ecx
-
-	/* Allocate stack-frame on entry-stack */
-	subl	%ecx, %edi
+	movl	$(PTREGS_SIZE - 8), %ecx
 
 	/*
 	 * Save future stack-pointer, we must not switch until the
-- 
2.7.4



* Re: [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables
  2018-07-20 16:22 ` [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables Joerg Roedel
@ 2018-07-20 17:06   ` Andy Lutomirski
  2018-07-20 17:48     ` Joerg Roedel
  2018-07-20 19:27     ` Thomas Gleixner
  2018-07-20 19:37   ` [tip:x86/pti] " tip-bot for Joerg Roedel
  2018-07-20 20:36   ` tip-bot for Joerg Roedel
  2 siblings, 2 replies; 34+ messages in thread
From: Andy Lutomirski @ 2018-07-20 17:06 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML, LKML,
	Linux-MM, Linus Torvalds, Andy Lutomirski, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, Liguori,
	Anthony, Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Pavel Machek, David H . Gutteridge, Joerg Roedel,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim

> On Jul 20, 2018, at 6:22 AM, Joerg Roedel <joro@8bytes.org> wrote:
>
> From: Joerg Roedel <jroedel@suse.de>
>
> The ring-buffer is accessed in the NMI handler, so we better
> avoid faulting on it. Sync the vmalloc range with all
> page-tables in system to make sure everyone has it mapped.
>
> This fixes a WARN_ON_ONCE() that can be triggered with PTI
> enabled on x86-32:
>
>    WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
>
> This triggers because with PTI enabled on a PAE kernel the
> PMDs are no longer shared between the page-tables, so the
> vmalloc changes do not propagate automatically.

It seems like it would be much more robust to fix the vmalloc_fault()
code instead.


* Re: [PATCH 3/3] x86/entry/32: Copy only ptregs on paranoid entry/exit path
  2018-07-20 16:22 ` [PATCH 3/3] x86/entry/32: Copy only ptregs on paranoid entry/exit path Joerg Roedel
@ 2018-07-20 17:09   ` Andy Lutomirski
  2018-07-20 21:42     ` Joerg Roedel
  0 siblings, 1 reply; 34+ messages in thread
From: Andy Lutomirski @ 2018-07-20 17:09 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML, LKML,
	Linux-MM, Linus Torvalds, Andy Lutomirski, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, Liguori,
	Anthony, Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Pavel Machek, David H . Gutteridge, Joerg Roedel,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim

On Fri, Jul 20, 2018 at 9:22 AM, Joerg Roedel <joro@8bytes.org> wrote:
> From: Joerg Roedel <jroedel@suse.de>
>
> The code that switches from entry- to task-stack when we
> enter from kernel-mode copies the full entry-stack contents
> to the task-stack.
>
> That is because we don't trust that the entry-stack
> contents. But actually we can trust its contents if we are
> not scheduled between entry and exit.
>
> So do less copying and move only the ptregs over to the
> task-stack in this code-path.
>
> Suggested-by: Andy Lutomirski <luto@amacapital.net>
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/entry/entry_32.S | 70 +++++++++++++++++++++++++----------------------
>  1 file changed, 38 insertions(+), 32 deletions(-)
>
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index 2767c62..90166b2 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -469,33 +469,48 @@
>          * segment registers on the way back to user-space or when the
>          * sysenter handler runs with eflags.tf set.
>          *
> -        * When we switch to the task-stack here, we can't trust the
> -        * contents of the entry-stack anymore, as the exception handler
> -        * might be scheduled out or moved to another CPU. Therefore we
> -        * copy the complete entry-stack to the task-stack and set a
> -        * marker in the iret-frame (bit 31 of the CS dword) to detect
> -        * what we've done on the iret path.
> +        * When we switch to the task-stack here, we extend the
> +        * stack-frame we copy to include the entry-stack %esp and a
> +        * pseudo %ss value so that we have a full ptregs struct on the
> +        * stack. We set a marker in the frame (bit 31 of the CS dword).
>          *
> -        * On the iret path we copy everything back and switch to the
> -        * entry-stack, so that the interrupted kernel code-path
> -        * continues on the same stack it was interrupted with.
> +        * On the iret path we read %esp from the PT_OLDESP slot on the
> +        * stack and copy ptregs (except oldesp and oldss) to it, when
> +        * we find the marker set. Then we switch to the %esp we read,
> +        * so that the interrupted kernel code-path continues on the
> +        * same stack it was interrupted with.


Can you give an example of the exact scenario in which any of this
copying happens and why it's needed?  IMO you should just be able to
*run* on the entry stack without copying anything at all.


* Re: [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables
  2018-07-20 17:06   ` Andy Lutomirski
@ 2018-07-20 17:48     ` Joerg Roedel
  2018-07-20 19:32       ` Andy Lutomirski
  2018-07-20 19:27     ` Thomas Gleixner
  1 sibling, 1 reply; 34+ messages in thread
From: Joerg Roedel @ 2018-07-20 17:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML, LKML,
	Linux-MM, Linus Torvalds, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, Liguori, Anthony,
	Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Pavel Machek, David H . Gutteridge, Joerg Roedel,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim

On Fri, Jul 20, 2018 at 10:06:54AM -0700, Andy Lutomirski wrote:
> > On Jul 20, 2018, at 6:22 AM, Joerg Roedel <joro@8bytes.org> wrote:
> >
> > From: Joerg Roedel <jroedel@suse.de>
> >
> > The ring-buffer is accessed in the NMI handler, so we better
> > avoid faulting on it. Sync the vmalloc range with all
> > page-tables in system to make sure everyone has it mapped.
> >
> > This fixes a WARN_ON_ONCE() that can be triggered with PTI
> > enabled on x86-32:
> >
> >    WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
> >
> > This triggers because with PTI enabled on a PAE kernel the
> > PMDs are no longer shared between the page-tables, so the
> > vmalloc changes do not propagate automatically.
> 
> It seems like it would be much more robust to fix the vmalloc_fault()
> code instead.

The question is whether the NMI path is nesting-safe, then we can remove
the WARN_ON_ONCE(in_nmi()) in the vmalloc_fault path. It should be
nesting-safe on x86-32 because of the way the stack-switch happens
there. If it's also nesting-safe on x86-64, the warning there can be
removed.

Or did you think of something else to fix there?


Thanks,

	Joerg



* Re: [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables
  2018-07-20 17:06   ` Andy Lutomirski
  2018-07-20 17:48     ` Joerg Roedel
@ 2018-07-20 19:27     ` Thomas Gleixner
  2018-07-20 19:33       ` Andy Lutomirski
  1 sibling, 1 reply; 34+ messages in thread
From: Thomas Gleixner @ 2018-07-20 19:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Joerg Roedel, Ingo Molnar, H . Peter Anvin, X86 ML, LKML,
	Linux-MM, Linus Torvalds, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, Liguori, Anthony,
	Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Pavel Machek, David H . Gutteridge, Joerg Roedel,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim

On Fri, 20 Jul 2018, Andy Lutomirski wrote:
> > On Jul 20, 2018, at 6:22 AM, Joerg Roedel <joro@8bytes.org> wrote:
> >
> > From: Joerg Roedel <jroedel@suse.de>
> >
> > The ring-buffer is accessed in the NMI handler, so we better
> > avoid faulting on it. Sync the vmalloc range with all
> > page-tables in system to make sure everyone has it mapped.
> >
> > This fixes a WARN_ON_ONCE() that can be triggered with PTI
> > enabled on x86-32:
> >
> >    WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
> >
> > This triggers because with PTI enabled on a PAE kernel the
> > PMDs are no longer shared between the page-tables, so the
> > vmalloc changes do not propagate automatically.
> 
> It seems like it would be much more robust to fix the vmalloc_fault()
> code instead.

Right, but for now the obvious fix for the issue at hand is this. We surely
should revisit this.

Thanks,

	tglx



* Re: [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables
  2018-07-20 17:48     ` Joerg Roedel
@ 2018-07-20 19:32       ` Andy Lutomirski
  2018-07-20 21:37         ` Joerg Roedel
  0 siblings, 1 reply; 34+ messages in thread
From: Andy Lutomirski @ 2018-07-20 19:32 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	X86 ML, LKML, Linux-MM, Linus Torvalds, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, Liguori,
	Anthony, Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Pavel Machek, David H . Gutteridge, Joerg Roedel,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim

On Fri, Jul 20, 2018 at 10:48 AM, Joerg Roedel <joro@8bytes.org> wrote:
> On Fri, Jul 20, 2018 at 10:06:54AM -0700, Andy Lutomirski wrote:
>> > On Jul 20, 2018, at 6:22 AM, Joerg Roedel <joro@8bytes.org> wrote:
>> >
>> > From: Joerg Roedel <jroedel@suse.de>
>> >
>> > The ring-buffer is accessed in the NMI handler, so we better
>> > avoid faulting on it. Sync the vmalloc range with all
>> > page-tables in system to make sure everyone has it mapped.
>> >
>> > This fixes a WARN_ON_ONCE() that can be triggered with PTI
>> > enabled on x86-32:
>> >
>> >    WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
>> >
>> > This triggers because with PTI enabled on a PAE kernel the
>> > PMDs are no longer shared between the page-tables, so the
>> > vmalloc changes do not propagate automatically.
>>
>> It seems like it would be much more robust to fix the vmalloc_fault()
>> code instead.
>
> The question is whether the NMI path is nesting-safe, then we can remove
> the WARN_ON_ONCE(in_nmi()) in the vmalloc_fault path. It should be
> nesting-safe on x86-32 because of the way the stack-switch happens
> there. If its also nesting-safe on x86-64 the warning there can be
> removed.
>
> Or did you think of something else to fix there?

I'm just reading your changelog, and you said the PMDs are no longer
shared between the page tables.  So this presumably means that
vmalloc_fault() no longer actually works correctly on PTI systems.  I
didn't read the code to figure out *why* it doesn't work, but throwing
random vmalloc_sync_all() calls around is wrong.

Or maybe the bug really just is the warning.  The warning can probably go.

>
>
> Thanks,
>
>         Joerg
>


* Re: [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables
  2018-07-20 19:27     ` Thomas Gleixner
@ 2018-07-20 19:33       ` Andy Lutomirski
  2018-07-20 19:43         ` Thomas Gleixner
  0 siblings, 1 reply; 34+ messages in thread
From: Andy Lutomirski @ 2018-07-20 19:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Joerg Roedel, Ingo Molnar, H . Peter Anvin,
	X86 ML, LKML, Linux-MM, Linus Torvalds, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, Liguori,
	Anthony, Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Pavel Machek, David H . Gutteridge, Joerg Roedel,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim

On Fri, Jul 20, 2018 at 12:27 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, 20 Jul 2018, Andy Lutomirski wrote:
>> > On Jul 20, 2018, at 6:22 AM, Joerg Roedel <joro@8bytes.org> wrote:
>> >
>> > From: Joerg Roedel <jroedel@suse.de>
>> >
>> > The ring-buffer is accessed in the NMI handler, so we better
>> > avoid faulting on it. Sync the vmalloc range with all
>> > page-tables in system to make sure everyone has it mapped.
>> >
>> > This fixes a WARN_ON_ONCE() that can be triggered with PTI
>> > enabled on x86-32:
>> >
>> >    WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
>> >
>> > This triggers because with PTI enabled on a PAE kernel the
>> > PMDs are no longer shared between the page-tables, so the
>> > vmalloc changes do not propagate automatically.
>>
>> It seems like it would be much more robust to fix the vmalloc_fault()
>> code instead.
>
> Right, but now the obvious fix for the issue at hand is this. We surely
> should revisit this.

If you commit this under this reasoning, then please at least make it say:

/* XXX: The vmalloc_fault() code is buggy on PTI+PAE systems, and this
is a workaround. */

Let's not have code in the kernel that pretends to make sense but is
actually voodoo magic that works around bugs elsewhere.  It's no fun
to maintain down the road.


* [tip:x86/pti] perf/core: Make sure the ring-buffer is mapped in all page-tables
  2018-07-20 16:22 ` [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables Joerg Roedel
  2018-07-20 17:06   ` Andy Lutomirski
@ 2018-07-20 19:37   ` tip-bot for Joerg Roedel
  2018-07-20 20:36   ` tip-bot for Joerg Roedel
  2 siblings, 0 replies; 34+ messages in thread
From: tip-bot for Joerg Roedel @ 2018-07-20 19:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: will.deacon, acme, namhyung, jpoimboe, jgross, bp, dhgutteridge,
	pavel, David.Laight, boris.ostrovsky, gregkh, jolsa, brgerst,
	jkosina, eduval, aarcange, torvalds, alexander.shishkin,
	linux-kernel, hpa, luto, peterz, mingo, tglx, dvlasenk, jroedel,
	llong, dave.hansen

Commit-ID:  cdbaf0a372db2bc3c3127e8b63fd15bd6e6757ee
Gitweb:     https://git.kernel.org/tip/cdbaf0a372db2bc3c3127e8b63fd15bd6e6757ee
Author:     Joerg Roedel <jroedel@suse.de>
AuthorDate: Fri, 20 Jul 2018 18:22:22 +0200
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 21:32:08 +0200

perf/core: Make sure the ring-buffer is mapped in all page-tables

The ring-buffer is accessed in the NMI handler, so it's better to avoid
faulting on it. Sync the vmalloc range with all page-tables in the
system to make sure everyone has it mapped.

This fixes a WARN_ON_ONCE() that can be triggered with PTI enabled on
x86-32:

  WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230

This triggers because with PTI enabled on a PAE kernel the PMDs are no
longer shared between the page-tables, so the vmalloc changes do not
propagate automatically.

Note: Andy rightly said that we should try to fix the vmalloc code for
that case, but that is not a hot fix for the issue at hand.

Fixes: 7757d607c6b3 ("x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1532103744-31902-2-git-send-email-joro@8bytes.org

---
 kernel/events/ring_buffer.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 5d3cf407e374..7b0e9aafafdf 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -814,6 +814,9 @@ static void rb_free_work(struct work_struct *work)
 
 	vfree(base);
 	kfree(rb);
+
+	/* Make sure buffer is unmapped in all page-tables */
+	vmalloc_sync_all();
 }
 
 void rb_free(struct ring_buffer *rb)
@@ -840,6 +843,13 @@ struct ring_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
 	if (!all_buf)
 		goto fail_all_buf;
 
+	/*
+	 * The buffer is accessed in NMI handlers, make sure it is
+	 * mapped in all page-tables in the system so that we don't
+	 * fault on the range in an NMI handler.
+	 */
+	vmalloc_sync_all();
+
 	rb->user_page = all_buf;
 	rb->data_pages[0] = all_buf + PAGE_SIZE;
 	if (nr_pages) {


* [tip:x86/pti] x86/entry/32: Check for VM86 mode in slow-path check
  2018-07-20 16:22 ` [PATCH 2/3] x86/entry/32: Check for VM86 mode in slow-path check Joerg Roedel
@ 2018-07-20 19:37   ` tip-bot for Joerg Roedel
  2018-07-20 20:37   ` tip-bot for Joerg Roedel
  2018-07-21 16:06   ` [PATCH 2/3] " Pavel Machek
  2 siblings, 0 replies; 34+ messages in thread
From: tip-bot for Joerg Roedel @ 2018-07-20 19:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, jpoimboe, bp, dhgutteridge, luto, aarcange,
	torvalds, dvlasenk, pavel, brgerst, jgross, hpa, mingo, jroedel,
	gregkh, peterz, David.Laight, jolsa, boris.ostrovsky, acme,
	namhyung, will.deacon, jkosina, alexander.shishkin, eduval, tglx,
	dave.hansen, llong

Commit-ID:  9cd342705877526f387cfcb5df8a964ab5873deb
Gitweb:     https://git.kernel.org/tip/9cd342705877526f387cfcb5df8a964ab5873deb
Author:     Joerg Roedel <jroedel@suse.de>
AuthorDate: Fri, 20 Jul 2018 18:22:23 +0200
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 21:32:08 +0200

x86/entry/32: Check for VM86 mode in slow-path check

The SWITCH_TO_KERNEL_STACK macro only checks for CPL == 0 to go down the
slow and paranoid entry path. The problem is that this check also returns
true when coming from VM86 mode. This is not a problem by itself, as the
paranoid path handles VM86 stack-frames just fine, but it is not necessary
as the normal code path handles VM86 mode as well (and faster).

Extend the check to include VM86 mode. This also makes an optimization of
the paranoid path possible.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1532103744-31902-3-git-send-email-joro@8bytes.org

---
 arch/x86/entry/entry_32.S | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 010cdb41e3c7..2767c625a52c 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -414,8 +414,16 @@
 	andl	$(0x0000ffff), PT_CS(%esp)
 
 	/* Special case - entry from kernel mode via entry stack */
-	testl	$SEGMENT_RPL_MASK, PT_CS(%esp)
-	jz	.Lentry_from_kernel_\@
+#ifdef CONFIG_VM86
+	movl	PT_EFLAGS(%esp), %ecx		# mix EFLAGS and CS
+	movb	PT_CS(%esp), %cl
+	andl	$(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %ecx
+#else
+	movl	PT_CS(%esp), %ecx
+	andl	$SEGMENT_RPL_MASK, %ecx
+#endif
+	cmpl	$USER_RPL, %ecx
+	jb	.Lentry_from_kernel_\@
 
 	/* Bytes to copy */
 	movl	$PTREGS_SIZE, %ecx
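
In C terms, the new check is roughly the following sketch (the constants match the usual x86 definitions; the helper name and test values are illustrative, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

#define X86_EFLAGS_VM    0x00020000u /* EFLAGS.VM: virtual-8086 mode */
#define SEGMENT_RPL_MASK 0x00000003u /* requested privilege level bits of CS */
#define USER_RPL         0x00000003u /* user mode runs at RPL 3 */

/*
 * Mirror of the CONFIG_VM86 branch above: mix EFLAGS.VM with the CS RPL
 * bits and take the slow "entry from kernel" path only when the result
 * is below USER_RPL. A set VM flag makes the mixed value far larger than
 * USER_RPL, so VM86 entries select the fast path even though their CS
 * RPL bits are meaningless real-mode values.
 */
static bool entry_from_kernel(unsigned int eflags, unsigned int cs)
{
	unsigned int mixed = (eflags & X86_EFLAGS_VM) | (cs & SEGMENT_RPL_MASK);

	return mixed < USER_RPL;
}
```

A frame with RPL 0 and VM clear takes the kernel path; RPL 3 or a set VM flag does not.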

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables
  2018-07-20 19:33       ` Andy Lutomirski
@ 2018-07-20 19:43         ` Thomas Gleixner
  2018-07-20 19:53           ` Thomas Gleixner
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Gleixner @ 2018-07-20 19:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Joerg Roedel, Ingo Molnar, H . Peter Anvin, X86 ML, LKML,
	Linux-MM, Linus Torvalds, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, Liguori, Anthony,
	Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Pavel Machek, David H . Gutteridge, Joerg Roedel,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim

On Fri, 20 Jul 2018, Andy Lutomirski wrote:
> On Fri, Jul 20, 2018 at 12:27 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Fri, 20 Jul 2018, Andy Lutomirski wrote:
> >> > On Jul 20, 2018, at 6:22 AM, Joerg Roedel <joro@8bytes.org> wrote:
> >> >
> >> > From: Joerg Roedel <jroedel@suse.de>
> >> >
> >> > The ring-buffer is accessed in the NMI handler, so we better
> >> > avoid faulting on it. Sync the vmalloc range with all
> >> > page-tables in the system to make sure everyone has it mapped.
> >> >
> >> > This fixes a WARN_ON_ONCE() that can be triggered with PTI
> >> > enabled on x86-32:
> >> >
> >> >    WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
> >> >
> >> > This triggers because with PTI enabled on a PAE kernel the
> >> > PMDs are no longer shared between the page-tables, so the
> >> > vmalloc changes do not propagate automatically.
> >>
> >> It seems like it would be much more robust to fix the vmalloc_fault()
> >> code instead.
> >
> > Right, but now the obvious fix for the issue at hand is this. We surely
> > should revisit this.
> 
> If you commit this under this reasoning, then please at least make it say:
> 
> /* XXX: The vmalloc_fault() code is buggy on PTI+PAE systems, and this
> is a workaround. */
> 
> Let's not have code in the kernel that pretends to make sense but is
> actually voodoo magic that works around bugs elsewhere.  It's no fun
> to maintain down the road.

Fair enough. Lemme amend it. Joerg is looking into it, but I surely want to
get that stuff some exposure in next ASAP.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables
  2018-07-20 19:43         ` Thomas Gleixner
@ 2018-07-20 19:53           ` Thomas Gleixner
  0 siblings, 0 replies; 34+ messages in thread
From: Thomas Gleixner @ 2018-07-20 19:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Joerg Roedel, Ingo Molnar, H . Peter Anvin, X86 ML, LKML,
	Linux-MM, Linus Torvalds, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, Liguori, Anthony,
	Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Pavel Machek, David H . Gutteridge, Joerg Roedel,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim

On Fri, 20 Jul 2018, Thomas Gleixner wrote:
> On Fri, 20 Jul 2018, Andy Lutomirski wrote:
> > On Fri, Jul 20, 2018 at 12:27 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > > On Fri, 20 Jul 2018, Andy Lutomirski wrote:
> > >> > On Jul 20, 2018, at 6:22 AM, Joerg Roedel <joro@8bytes.org> wrote:
> > >> >
> > >> > From: Joerg Roedel <jroedel@suse.de>
> > >> >
> > >> > The ring-buffer is accessed in the NMI handler, so we better
> > >> > avoid faulting on it. Sync the vmalloc range with all
> > >> > page-tables in the system to make sure everyone has it mapped.
> > >> >
> > >> > This fixes a WARN_ON_ONCE() that can be triggered with PTI
> > >> > enabled on x86-32:
> > >> >
> > >> >    WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
> > >> >
> > >> > This triggers because with PTI enabled on a PAE kernel the
> > >> > PMDs are no longer shared between the page-tables, so the
> > >> > vmalloc changes do not propagate automatically.
> > >>
> > >> It seems like it would be much more robust to fix the vmalloc_fault()
> > >> code instead.
> > >
> > > Right, but now the obvious fix for the issue at hand is this. We surely
> > > should revisit this.
> > 
> > If you commit this under this reasoning, then please at least make it say:
> > 
> > /* XXX: The vmalloc_fault() code is buggy on PTI+PAE systems, and this
> > is a workaround. */
> > 
> > Let's not have code in the kernel that pretends to make sense but is
> > actually voodoo magic that works around bugs elsewhere.  It's no fun
> > to maintain down the road.
> 
> Fair enough. Lemme amend it. Joerg is looking into it, but I surely want to
> get that stuff some exposure in next ASAP.

Delta patch below.

Thanks.

	tglx

8<-------------
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -815,8 +815,12 @@ static void rb_free_work(struct work_str
 	vfree(base);
 	kfree(rb);
 
-	/* Make sure buffer is unmapped in all page-tables */
-	vmalloc_sync_all();
+	/*
+	 * FIXME: PAE workaround for vmalloc_fault(): Make sure buffer is
+	 * unmapped in all page-tables.
+	 */
+	if (IS_ENABLED(CONFIG_X86_PAE))
+		vmalloc_sync_all();
 }
 
 void rb_free(struct ring_buffer *rb)
@@ -844,11 +848,13 @@ struct ring_buffer *rb_alloc(int nr_page
 		goto fail_all_buf;
 
 	/*
-	 * The buffer is accessed in NMI handlers, make sure it is
-	 * mapped in all page-tables in the system so that we don't
-	 * fault on the range in an NMI handler.
+	 * FIXME: PAE workaround for vmalloc_fault(): The buffer is
+	 * accessed in NMI handlers, make sure it is mapped in all
+	 * page-tables in the system so that we don't fault on the range in
+	 * an NMI handler.
 	 */
-	vmalloc_sync_all();
+	if (IS_ENABLED(CONFIG_X86_PAE))
+		vmalloc_sync_all();
 
 	rb->user_page = all_buf;
 	rb->data_pages[0] = all_buf + PAGE_SIZE;

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [tip:x86/pti] perf/core: Make sure the ring-buffer is mapped in all page-tables
  2018-07-20 16:22 ` [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables Joerg Roedel
  2018-07-20 17:06   ` Andy Lutomirski
  2018-07-20 19:37   ` [tip:x86/pti] " tip-bot for Joerg Roedel
@ 2018-07-20 20:36   ` tip-bot for Joerg Roedel
  2 siblings, 0 replies; 34+ messages in thread
From: tip-bot for Joerg Roedel @ 2018-07-20 20:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: brgerst, torvalds, jroedel, hpa, jkosina, acme, tglx, eduval,
	jpoimboe, dhgutteridge, namhyung, llong, boris.ostrovsky, jolsa,
	dave.hansen, mingo, bp, jgross, luto, aarcange, dvlasenk,
	linux-kernel, gregkh, pavel, will.deacon, David.Laight, peterz,
	alexander.shishkin

Commit-ID:  77754cfa09a6c528c38cbca9ee4cc4f7cf6ad6f2
Gitweb:     https://git.kernel.org/tip/77754cfa09a6c528c38cbca9ee4cc4f7cf6ad6f2
Author:     Joerg Roedel <jroedel@suse.de>
AuthorDate: Fri, 20 Jul 2018 18:22:22 +0200
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 22:33:41 +0200

perf/core: Make sure the ring-buffer is mapped in all page-tables

The ring-buffer is accessed in the NMI handler, so it's better to avoid
faulting on it. Sync the vmalloc range with all page-tables in the system
to make sure everyone has it mapped.
make sure everyone has it mapped.

This fixes a WARN_ON_ONCE() that can be triggered with PTI enabled on
x86-32:

  WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230

This triggers because with PTI enabled on a PAE kernel the PMDs are no
longer shared between the page-tables, so the vmalloc changes do not
propagate automatically.

Note: Andy rightly said that we should try to fix the vmalloc code for
that case, but that is not a hotfix for the issue at hand.

Fixes: 7757d607c6b3 ("x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1532103744-31902-2-git-send-email-joro@8bytes.org

---
 kernel/events/ring_buffer.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 5d3cf407e374..df2d8cf0072c 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -814,6 +814,13 @@ static void rb_free_work(struct work_struct *work)
 
 	vfree(base);
 	kfree(rb);
+
+	/*
+	 * FIXME: PAE workaround for vmalloc_fault(): Make sure buffer is
+	 * unmapped in all page-tables.
+	 */
+	if (IS_ENABLED(CONFIG_X86_PAE))
+		vmalloc_sync_all();
 }
 
 void rb_free(struct ring_buffer *rb)
@@ -840,6 +847,15 @@ struct ring_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
 	if (!all_buf)
 		goto fail_all_buf;
 
+	/*
+	 * FIXME: PAE workaround for vmalloc_fault(): The buffer is
+	 * accessed in NMI handlers, make sure it is mapped in all
+	 * page-tables in the system so that we don't fault on the range in
+	 * an NMI handler.
+	 */
+	if (IS_ENABLED(CONFIG_X86_PAE))
+		vmalloc_sync_all();
+
 	rb->user_page = all_buf;
 	rb->data_pages[0] = all_buf + PAGE_SIZE;
 	if (nr_pages) {

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [tip:x86/pti] x86/entry/32: Check for VM86 mode in slow-path check
  2018-07-20 16:22 ` [PATCH 2/3] x86/entry/32: Check for VM86 mode in slow-path check Joerg Roedel
  2018-07-20 19:37   ` [tip:x86/pti] " tip-bot for Joerg Roedel
@ 2018-07-20 20:37   ` tip-bot for Joerg Roedel
  2018-07-21 16:06   ` [PATCH 2/3] " Pavel Machek
  2 siblings, 0 replies; 34+ messages in thread
From: tip-bot for Joerg Roedel @ 2018-07-20 20:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, jpoimboe, David.Laight, dvlasenk, alexander.shishkin, acme,
	jolsa, linux-kernel, gregkh, jgross, mingo, brgerst, bp,
	dave.hansen, dhgutteridge, luto, hpa, jroedel, peterz, pavel,
	llong, jkosina, eduval, aarcange, will.deacon, namhyung,
	torvalds, boris.ostrovsky

Commit-ID:  d5e84c21dbf5ea458897f88346dc979909eed913
Gitweb:     https://git.kernel.org/tip/d5e84c21dbf5ea458897f88346dc979909eed913
Author:     Joerg Roedel <jroedel@suse.de>
AuthorDate: Fri, 20 Jul 2018 18:22:23 +0200
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 20 Jul 2018 22:33:41 +0200

x86/entry/32: Check for VM86 mode in slow-path check

The SWITCH_TO_KERNEL_STACK macro only checks for CPL == 0 to go down the
slow and paranoid entry path. The problem is that this check also returns
true when coming from VM86 mode. This is not a problem by itself, as the
paranoid path handles VM86 stack-frames just fine, but it is not necessary
as the normal code path handles VM86 mode as well (and faster).

Extend the check to include VM86 mode. This also makes an optimization of
the paranoid path possible.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1532103744-31902-3-git-send-email-joro@8bytes.org

---
 arch/x86/entry/entry_32.S | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 010cdb41e3c7..2767c625a52c 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -414,8 +414,16 @@
 	andl	$(0x0000ffff), PT_CS(%esp)
 
 	/* Special case - entry from kernel mode via entry stack */
-	testl	$SEGMENT_RPL_MASK, PT_CS(%esp)
-	jz	.Lentry_from_kernel_\@
+#ifdef CONFIG_VM86
+	movl	PT_EFLAGS(%esp), %ecx		# mix EFLAGS and CS
+	movb	PT_CS(%esp), %cl
+	andl	$(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %ecx
+#else
+	movl	PT_CS(%esp), %ecx
+	andl	$SEGMENT_RPL_MASK, %ecx
+#endif
+	cmpl	$USER_RPL, %ecx
+	jb	.Lentry_from_kernel_\@
 
 	/* Bytes to copy */
 	movl	$PTREGS_SIZE, %ecx

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables
  2018-07-20 19:32       ` Andy Lutomirski
@ 2018-07-20 21:37         ` Joerg Roedel
  2018-07-20 22:20           ` Andy Lutomirski
  0 siblings, 1 reply; 34+ messages in thread
From: Joerg Roedel @ 2018-07-20 21:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Joerg Roedel, Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	X86 ML, LKML, Linux-MM, Linus Torvalds, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, Liguori,
	Anthony, Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Pavel Machek, David H . Gutteridge,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim

On Fri, Jul 20, 2018 at 12:32:10PM -0700, Andy Lutomirski wrote:
> I'm just reading your changelog, and you said the PMDs are no longer
> shared between the page tables.  So this presumably means that
> vmalloc_fault() no longer actually works correctly on PTI systems.  I
> didn't read the code to figure out *why* it doesn't work, but throwing
> random vmalloc_sync_all() calls around is wrong.

Hmm, so the whole point of vmalloc_fault() is to sync changes from
swapper_pg_dir to process page-tables when the relevant parts of the
kernel page-table are not shared, no?
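
As a toy model of that sync (plain C arrays standing in for swapper_pg_dir and a per-process PGD; not kernel code): on a vmalloc fault, copy the missing top-level entry from the reference table into the faulting task's table.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PGD_ENTRIES 4 /* toy size only */

static int dummy_pmd; /* stands in for a PMD page shared via the PGD entry */

/* Reference table (swapper_pg_dir stand-in) with entry 1 populated. */
static void *reference_pgd[TOY_PGD_ENTRIES] = { NULL, &dummy_pmd };
static void *task_pgd[TOY_PGD_ENTRIES];   /* process table, initially empty */

/*
 * Toy vmalloc_fault(): if the task's slot is empty but the reference
 * table has an entry, propagate it and report success; if neither table
 * has an entry, it is a genuine fault.
 */
static int toy_vmalloc_fault(size_t index)
{
	if (task_pgd[index])
		return 0;
	if (!reference_pgd[index])
		return -1;
	task_pgd[index] = reference_pgd[index];
	return 0;
}
```

In this picture, vmalloc_sync_all() corresponds to doing the copy eagerly for every slot of every table instead of lazily on the fault.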

That is also the reason we don't see this on 64 bit, because there these
parts *are* shared.

So with that reasoning vmalloc_fault() works as designed, except that
a warning is issued when it happens in the NMI path. That warning comes
from

	ebc8827f75954 x86: Barf when vmalloc and kmemcheck faults happen in NMI

which went into 2.6.37 and was added because the NMI handlers were not
nesting-safe back then. Reason probably was that the handler on 64 bit
has to use an IST stack and a nested NMI would overwrite the stack of
the upper handler.  We don't have this problem on 32 bit as a nested NMI
will not do another stack-switch there.

I am not sure about 64 bit, but there is a lot of assembly magic to make
NMIs nesting-safe, so I guess the problem should be gone there too.


Regards,

	Joerg

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/3] x86/entry/32: Copy only ptregs on paranoid entry/exit path
  2018-07-20 17:09   ` Andy Lutomirski
@ 2018-07-20 21:42     ` Joerg Roedel
  0 siblings, 0 replies; 34+ messages in thread
From: Joerg Roedel @ 2018-07-20 21:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, X86 ML, LKML,
	Linux-MM, Linus Torvalds, Dave Hansen, Josh Poimboeuf,
	Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
	Boris Ostrovsky, Brian Gerst, David Laight, Denys Vlasenko,
	Eduardo Valentin, Greg KH, Will Deacon, Liguori, Anthony,
	Daniel Gruss, Hugh Dickins, Kees Cook, Andrea Arcangeli,
	Waiman Long, Pavel Machek, David H . Gutteridge, Joerg Roedel,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim

[ Re-sending because I accidentally replied only to Andy ]

On Fri, Jul 20, 2018 at 10:09:26AM -0700, Andy Lutomirski wrote:
> Can you give an example of the exact scenario in which any of this
> copying happens and why it's needed?  IMO you should just be able to
> *run* on the entry stack without copying anything at all.

So for example when we execute RESTORE_REGS on the path back to
user-space and get an exception while loading the user segment
registers.

When that happens we are already on the entry-stack and on user-cr3.
There is no question that when we return from the exception we need to
get back to entry-stack and user-cr3, even though we are returning to kernel
mode. Otherwise we enter user-space with kernel-cr3 or get a page-fault
and panic.

The exception runs through the common_exception path, and finally ends
up calling C code. And correct me if I am wrong, but calling into C code
from the entry-stack is a bad idea for multiple reasons.

First reason is the size of the stack. We can make it larger, but how
large does it need to be?

Next problem is that current_pt_regs doesn't work in the C code when
pt_regs are on the entry-stack.
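
The reason is that task_pt_regs() computes the frame address from the top of the task stack, roughly like this simplified model (struct and sizes are illustrative, not the real kernel layout):

```c
#include <assert.h>
#include <stdint.h>

#define TOY_THREAD_SIZE 8192u /* illustrative task-stack size */

struct toy_pt_regs {
	unsigned long regs[16]; /* stand-in for the saved register frame */
};

static unsigned char toy_stack[TOY_THREAD_SIZE]; /* stand-in task stack */

/*
 * Model of task_pt_regs(): assume the register frame sits right at the
 * top of the task stack. When the frame actually lives on the per-CPU
 * entry stack, this computed address points into unrelated memory.
 */
static struct toy_pt_regs *toy_task_pt_regs(unsigned char *stack_page)
{
	return (struct toy_pt_regs *)(stack_page + TOY_THREAD_SIZE) - 1;
}
```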

These problems can all be solved, but it wouldn't be a robust solution
because when changes to the C code are made they are usually not tested
while on the entry-stack. That case is hard to trigger, so it can easily
break again.

For me, only the x86 selftests triggered all these corner-cases, but not
all developers run them on 32 bit when making changes to generic x86
code.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables
  2018-07-20 21:37         ` Joerg Roedel
@ 2018-07-20 22:20           ` Andy Lutomirski
  2018-07-21 21:06             ` Linus Torvalds
  0 siblings, 1 reply; 34+ messages in thread
From: Andy Lutomirski @ 2018-07-20 22:20 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Andy Lutomirski, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	H . Peter Anvin, X86 ML, LKML, Linux-MM, Linus Torvalds,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, Liguori, Anthony, Daniel Gruss, Hugh Dickins,
	Kees Cook, Andrea Arcangeli, Waiman Long, Pavel Machek,
	David H . Gutteridge, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim


> On Jul 20, 2018, at 11:37 AM, Joerg Roedel <jroedel@suse.de> wrote:
> 
>> On Fri, Jul 20, 2018 at 12:32:10PM -0700, Andy Lutomirski wrote:
>> I'm just reading your changelog, and you said the PMDs are no longer
>> shared between the page tables.  So this presumably means that
>> vmalloc_fault() no longer actually works correctly on PTI systems.  I
>> didn't read the code to figure out *why* it doesn't work, but throwing
>> random vmalloc_sync_all() calls around is wrong.
> 
> Hmm, so the whole point of vmalloc_fault() is to sync changes from
> swapper_pg_dir to process page-tables when the relevant parts of the
> kernel page-table are not shared, no?
> 
> That is also the reason we don't see this on 64 bit, because there these
> parts *are* shared.
> 
> So with that reasoning vmalloc_fault() works as designed, except that
> a warning is issued when it happens in the NMI path. That warning comes
> from
> 
>    ebc8827f75954 x86: Barf when vmalloc and kmemcheck faults happen in NMI
> 
> which went into 2.6.37 and was added because the NMI handlers were not
> nesting-safe back then. Reason probably was that the handler on 64 bit
> has to use an IST stack and a nested NMI would overwrite the stack of
> the upper handler.  We don't have this problem on 32 bit as a nested NMI
> will not do another stack-switch there.
> 

Thanks for digging!  The problem was presumably that vmalloc_fault() will IRET and re-enable NMIs on the way out.  But we’ve supported page faults on user memory in NMI handlers on 32-bit and 64-bit for quite a while, and it’s fine now.

I would remove the warning, re-test, and revert the other patch.

The one case we can’t handle in vmalloc_fault() is a fault on a stack access. I don’t expect this to be a problem for PTI. It was a problem for CONFIG_VMAP_STACK, though.

> I am not sure about 64 bit, but there is a lot of assembly magic to make
> NMIs nesting-safe, so I guess the problem should be gone there too.
> 
> 
> Regards,
> 
>    Joerg

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3] x86/entry/32: Check for VM86 mode in slow-path check
  2018-07-20 16:22 ` [PATCH 2/3] x86/entry/32: Check for VM86 mode in slow-path check Joerg Roedel
  2018-07-20 19:37   ` [tip:x86/pti] " tip-bot for Joerg Roedel
  2018-07-20 20:37   ` tip-bot for Joerg Roedel
@ 2018-07-21 16:06   ` Pavel Machek
  2 siblings, 0 replies; 34+ messages in thread
From: Pavel Machek @ 2018-07-21 16:06 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	linux-mm, Linus Torvalds, Andy Lutomirski, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, aliguori,
	daniel.gruss, hughd, keescook, Andrea Arcangeli, Waiman Long,
	David H . Gutteridge, jroedel, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim


Hi!

> The SWITCH_TO_KERNEL_STACK macro only checks for CPL == 0 to
> go down the slow and paranoid entry path. The problem is
> that this check also returns true when coming from VM86
> mode. This is not a problem by itself, as the paranoid path
> handles VM86 stack-frames just fine, but it is not necessary
> as the normal code path handles VM86 mode as well (and
> faster).
> 
> Extend the check to include VM86 mode. This also makes an
> optimization of the paranoid path possible.
> 
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  arch/x86/entry/entry_32.S | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index 010cdb4..2767c62 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -414,8 +414,16 @@
>  	andl	$(0x0000ffff), PT_CS(%esp)
>  
>  	/* Special case - entry from kernel mode via entry stack */
> -	testl	$SEGMENT_RPL_MASK, PT_CS(%esp)
> -	jz	.Lentry_from_kernel_\@
> +#ifdef CONFIG_VM86
> +	movl	PT_EFLAGS(%esp), %ecx		# mix EFLAGS and CS
> +	movb	PT_CS(%esp), %cl
> +	andl	$(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %ecx
> +#else
> +	movl	PT_CS(%esp), %ecx
> +	andl	$SEGMENT_RPL_MASK, %ecx
> +#endif
> +	cmpl	$USER_RPL, %ecx
> +	jb	.Lentry_from_kernel_\@

Would it make sense to jump to the slow path as we did, and then jump
back if VM86 is detected?

Because VM86 is not really used often these days, and moving partial
registers results in short code, but IIRC it will be rather slow.

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables
  2018-07-20 22:20           ` Andy Lutomirski
@ 2018-07-21 21:06             ` Linus Torvalds
  0 siblings, 0 replies; 34+ messages in thread
From: Linus Torvalds @ 2018-07-21 21:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Joerg Roedel, Andrew Lutomirski, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Peter Anvin, the arch/x86 maintainers,
	Linux Kernel Mailing List, linux-mm, Dave Hansen, Josh Poimboeuf,
	Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	Brian Gerst, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	Pavel Machek, David H . Gutteridge, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim

On Fri, Jul 20, 2018 at 3:20 PM Andy Lutomirski <luto@amacapital.net> wrote:
> Thanks for digging!  The problem was presumably that vmalloc_fault() will IRET and re-enable NMIs on the way out.
>  But we’ve supported page faults on user memory in NMI handlers on 32-bit and 64-bit for quite a while, and it’s fine now.
>
> I would remove the warning, re-test, and revert the other patch.

Agreed. I don't think we have any issues with page faults during NMI
any more.  Afaik the kprobe people depend on it.

That said, 64-bit mode has that scary PV-op case
(arch_flush_lazy_mmu_mode). Being PV mode, I can't find it in myself
to worry about it, I'm assuming it's ok.

                Linus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
  2018-07-20 16:22 [PATCH 0/3] PTI for x86-32 Fixes and Updates Joerg Roedel
                   ` (2 preceding siblings ...)
  2018-07-20 16:22 ` [PATCH 3/3] x86/entry/32: Copy only ptregs on paranoid entry/exit path Joerg Roedel
@ 2018-07-23  3:49 ` David H. Gutteridge
  2018-07-23  7:29   ` Joerg Roedel
  2018-07-23 14:09 ` Pavel Machek
  4 siblings, 1 reply; 34+ messages in thread
From: David H. Gutteridge @ 2018-07-23  3:49 UTC (permalink / raw)
  To: Joerg Roedel, Thomas Gleixner, Ingo Molnar, H . Peter Anvin
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek, jroedel,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim

On Fri, 2018-07-20 at 18:22 +0200, Joerg Roedel wrote:
> Hi,
> 
> here are 3 patches which update the PTI-x86-32 patches recently merged
> into the tip-tree. The patches are ordered by importance:
> 
> 	Patch 1: Very important, it fixes a vmalloc-fault in NMI context
> 		 when PTI is enabled. This is pretty unlikely to hit
> 		 when starting perf on an idle machine, which is why I
> 		 didn't find it earlier in my testing. I always started
> 		 'perf top' first :/ But when I start 'perf top' last
> 		 when the kernel-compile already runs, it hits almost
> 		 immediatly.
> 
> 	Patch 2: Fix the 'from-kernel-check' in SWITCH_TO_KERNEL_STACK
> 	         to also take VM86 into account. This is not strictly
> 		 necessary because the slow-path also works for VM86
> 		 mode but it is not how the code was intended to work.
> 		 And it breaks when Patch 3 is applied on-top.
> 
> 	Patch 3: Implement the reduced copying in the paranoid
> 		 entry/exit path as suggested by Andy Lutomirski while
> 		 reviewing version 7 of the original patches.
> 
> I have the x86/tip branch with these patches on-top running my test for
> 6h now, with no issues so far. So for now it looks like there are no
> scheduling points or irq-enabled sections reached from the paranoid
> entry/exit paths and we always return to the entry-stack we came from.
> 
> I keep the test running over the weekend at least.
> 
> Please review.
> 
> [ If Patch 1 looks good to the maintainers I suggest applying it soon,
>   before too many linux-next testers run into this issue. It is actually
>   the reason why I sent out the patches _now_ and didn't wait until next
>   week when the other two patches got more testing from my side. ]
> 
> Thanks,
> 
> 	Joerg
> 
> Joerg Roedel (3):
>   perf/core: Make sure the ring-buffer is mapped in all page-tables
>   x86/entry/32: Check for VM86 mode in slow-path check
>   x86/entry/32: Copy only ptregs on paranoid entry/exit path
> 
>  arch/x86/entry/entry_32.S   | 82 ++++++++++++++++++++++++++-------------------
>  kernel/events/ring_buffer.c | 10 ++++++
>  2 files changed, 58 insertions(+), 34 deletions(-)

Hi Joerg,

I tested again using the tip "x86/pti" branch (with two of the three
patches in your change set already applied), and manually applied
your third patch on top of it (I see there was some debate about it,
but I thought I'd include it), plus I had to manually apply the patch
to fix booting (d1b47a7c9efcf3c3384b70f6e3c8f1423b44d8c7: "mm: don't
do zero_resv_unavail if memmap is not allocated"), since "x86/pti"
doesn't include it yet.

Unfortunately, I can trigger a bug in KVM+QEMU with the Bochs VGA
driver. (This is the same VM definition I shared with you in a PM
back on Feb. 20th, except note that 4.18 kernels won't successfully
boot with QEMU's IDE device, so I'm using SATA instead. That's a
regression totally unrelated to your change sets, or to the general
booting issue with 4.18 RC5, since it occurs in vanilla RC4 as well.)

[drm] Found bochs VGA, ID 0xb0c0.
[drm] Framebuffer size 16384 kB @ 0xfd000000, mmio @ 0xfebd4000.
[TTM] Zone  kernel: Available graphics memory: 390536 kiB
[TTM] Zone highmem: Available graphics memory: 4659530 kiB
[TTM] Initializing pool allocator
[TTM] Initializing DMA pool allocator
------------[ cut here ]------------
kernel BUG at arch/x86/mm/fault.c:269!
invalid opcode: 0000 [#1] SMP PTI
CPU: 0 PID: 349 Comm: systemd-udevd Tainted: G        W         4.18.0-rc4+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
EIP: vmalloc_fault+0x1d7/0x200
Code: 00 f0 1f 00 81 ea 00 00 20 00 21 d0 8b 55 e8 89 c6 81 e2 ff 0f 00
00 0f ac d6 0c 8d 04 b6 c1 e0 03 39 45 ec 0f 84 37 ff ff ff <0f> 0b 8d
b4 26 00 00 00 00 83 c4 0c b8 ff ff ff ff 5b 5e 5f 5d c3 
EAX: 02788000 EBX: c85b6de8 ECX: 00000080 EDX: 00000000
ESI: 000fd000 EDI: fd0000f3 EBP: f4743994 ESP: f474397c
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010083
CR0: 80050033 CR2: f7a00000 CR3: 347f6000 CR4: 000006f0
Call Trace:
 __do_page_fault+0x340/0x4c0
 do_page_fault+0x25/0xf0
 ? ttm_mem_reg_ioremap+0xe5/0x100 [ttm]
 ? kvm_async_pf_task_wait+0x1b0/0x1b0
 do_async_page_fault+0x55/0x80
 common_exception+0x13f/0x146
EIP: memset+0xb/0x20
Code: f9 01 72 0b 8a 0e 88 0f 8d b4 26 00 00 00 00 8b 45 f0 83 c4 04 5b
5e 5f 5d c3 90 8d 74 26 00 55 89 e5 57 89 c7 53 89 c3 89 d0 <f3> aa 89
d8 5b 5f 5d c3 90 90 90 90 90 90 90 90 90 90 90 90 90 66 
EAX: 00000000 EBX: f7a00000 ECX: 00300000 EDX: 00000000
ESI: f4743b9c EDI: f7a00000 EBP: f4743a4c ESP: f4743a44
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010206
 ttm_bo_move_memcpy+0x49a/0x4c0 [ttm]
 ? _cond_resched+0x17/0x40
 ttm_bo_handle_move_mem+0x554/0x570 [ttm]
 ? ttm_bo_mem_space+0x211/0x440 [ttm]
 ttm_bo_validate+0xf5/0x110 [ttm]
 bochs_bo_pin+0xde/0x1c0 [bochs_drm]
 bochsfb_create+0xce/0x310 [bochs_drm]
 __drm_fb_helper_initial_config_and_unlock+0x1cc/0x470 [drm_kms_helper]
 drm_fb_helper_initial_config+0x35/0x40 [drm_kms_helper]
 bochs_fbdev_init+0x74/0x80 [bochs_drm]
 bochs_load+0x7a/0x90 [bochs_drm]
 drm_dev_register+0x103/0x180 [drm]
 drm_get_pci_dev+0x80/0x160 [drm]
 bochs_pci_probe+0xcb/0x100 [bochs_drm]
 ? bochs_load+0x90/0x90 [bochs_drm]
 pci_device_probe+0xc7/0x160
 driver_probe_device+0x2dc/0x470
 __driver_attach+0xe1/0x110
 ? driver_probe_device+0x470/0x470
 bus_for_each_dev+0x5a/0x90
 driver_attach+0x19/0x20
 ? driver_probe_device+0x470/0x470
 bus_add_driver+0x12f/0x230
 ? pci_bus_num_vf+0x20/0x20
 driver_register+0x56/0xf0
 ? 0xf789c000
 __pci_register_driver+0x3d/0x40
 bochs_init+0x41/0x1000 [bochs_drm]
 do_one_initcall+0x42/0x1a9
 ? free_unref_page_commit+0x6f/0xe0
 ? _cond_resched+0x17/0x40
 ? kmem_cache_alloc_trace+0x3b/0x1f0
 ? do_init_module+0x21/0x1dc
 ? do_init_module+0x21/0x1dc
 do_init_module+0x50/0x1dc
 load_module+0x22ae/0x2940
 sys_finit_module+0x8a/0xe0
 do_fast_syscall_32+0x7f/0x1b0
 entry_SYSENTER_32+0x79/0xda
EIP: 0xb7eecd09
Code: 08 8b 80 64 cd ff ff 85 d2 74 02 89 02 5d c3 8b 04 24 c3 8b 0c 24
c3 8b 1c 24 c3 8b 3c 24 c3 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59
c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76 
EAX: ffffffda EBX: 00000010 ECX: b7afee75 EDX: 00000000
ESI: 019a91e0 EDI: 0199a6d0 EBP: 0199a760 ESP: bfba3d8c
DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000246
Modules linked in: bochs_drm(+) drm_kms_helper ttm virtio_console drm
8139too virtio_pci serio_raw crc32c_intel virtio_ring 8139cp ata_generic
qemu_fw_cfg virtio mii floppy pata_acpi
---[ end trace 9fc4a94c280952eb ]---
EIP: vmalloc_fault+0x1d7/0x200
Code: 00 f0 1f 00 81 ea 00 00 20 00 21 d0 8b 55 e8 89 c6 81 e2 ff 0f 00
00 0f ac d6 0c 8d 04 b6 c1 e0 03 39 45 ec 0f 84 37 ff ff ff <0f> 0b 8d
b4 26 00 00 00 00 83 c4 0c b8 ff ff ff ff 5b 5e 5f 5d c3 
EAX: 02788000 EBX: c85b6de8 ECX: 00000080 EDX: 00000000
ESI: 000fd000 EDI: fd0000f3 EBP: f4743994 ESP: c85c1ddc
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010083
CR0: 80050033 CR2: f7a00000 CR3: 347f6000 CR4: 000006f0

Regards,

Dave



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
  2018-07-23  3:49 ` [PATCH 0/3] PTI for x86-32 Fixes and Updates David H. Gutteridge
@ 2018-07-23  7:29   ` Joerg Roedel
  2018-07-26  3:47     ` David H. Gutteridge
  0 siblings, 1 reply; 34+ messages in thread
From: Joerg Roedel @ 2018-07-23  7:29 UTC (permalink / raw)
  To: David H. Gutteridge
  Cc: Joerg Roedel, Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86,
	linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim

Hey David,

On Sun, Jul 22, 2018 at 11:49:00PM -0400, David H. Gutteridge wrote:
> Unfortunately, I can trigger a bug in KVM+QEMU with the Bochs VGA
> driver. (This is the same VM definition I shared with you in a PM
> back on Feb. 20th, except note that 4.18 kernels won't successfully
> boot with QEMU's IDE device, so I'm using SATA instead. That's a
> regression totally unrelated to your change sets, or to the general
> booting issue with 4.18 RC5, since it occurs in vanilla RC4 as well.)

Yes, this needs the fixes in the tip/x86/mm branch as well. Can you pull
that branch in and test again, please?


Thanks,

	Joerg


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
  2018-07-20 16:22 [PATCH 0/3] PTI for x86-32 Fixes and Updates Joerg Roedel
                   ` (3 preceding siblings ...)
  2018-07-23  3:49 ` [PATCH 0/3] PTI for x86-32 Fixes and Updates David H. Gutteridge
@ 2018-07-23 14:09 ` Pavel Machek
  2018-07-23 19:00   ` Linus Torvalds
  4 siblings, 1 reply; 34+ messages in thread
From: Pavel Machek @ 2018-07-23 14:09 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86, linux-kernel,
	linux-mm, Linus Torvalds, Andy Lutomirski, Dave Hansen,
	Josh Poimboeuf, Juergen Gross, Peter Zijlstra, Borislav Petkov,
	Jiri Kosina, Boris Ostrovsky, Brian Gerst, David Laight,
	Denys Vlasenko, Eduardo Valentin, Greg KH, Will Deacon, aliguori,
	daniel.gruss, hughd, keescook, Andrea Arcangeli, Waiman Long,
	David H . Gutteridge, jroedel, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim

[-- Attachment #1: Type: text/plain, Size: 2489 bytes --]

Hi!

> here are 3 patches which update the PTI-x86-32 patches recently merged
> into the tip-tree. The patches are ordered by importance:

It seems PTI is now in -next. I'll test that soon.

Meanwhile... it looks like gcc is not slowed down significantly, but
other stuff sees 30% .. 40% slowdowns... which is rather
significant.

Would it be possible to have per-process control of kpti? I have
some processes where trading off speed for security would make sense.

Best regards,
								Pavel

cd ~/g/tui/nowcast
time ./nowcast -x (30%)
KPTI: 139.25user 73.65system 269.90 (4m29.901s) elapsed 78.88%CPU
      133.35user 73.15system 228.80 (3m48.802s) elapsed 90.25%CPU
      140.51user 74.21system 218.33 (3m38.338s) elapsed 98.34%CPU
      133.85user 75.89system 212.02 (3m32.026s) elapsed 98.93%CPU (no chromium)
      139.34user 75.00system 235.75 (3m55.752s) elapsed 90.92%CPU
      
4.18: 116.99user 43.79system 217.65 (3m37.653s) elapsed 73.87%CPU
      115.14user 43.97system 178.85 (2m58.855s) elapsed 88.96%CPU
      128.47user 47.22system 178.24 (2m58.245s) elapsed 98.57%CPU
      132.30user 49.27system 184.40 (3m4.408s) elapsed 98.46%CPU
      134.88user 48.59system 186.67 (3m6.673s) elapsed 98.29%CPU
      132.15user 48.65system 524.68 (8m44.684s) elapsed 34.46%CPU
      120.38user 45.45system 168.72 (2m48.720s) elapsed 98.29%CPU
      
time cat /dev/urandom | head -c 10000000 |  bzip2 -9 - | wc -c (40%)
v4.18: 4.57user 0.23system 4.64 (0m4.644s) elapsed 103.53%CPU
       4.86user 0.23system 4.95 (0m4.952s) elapsed 102.81%CPU
       5.13user 0.22system 5.19 (0m5.190s) elapsed 103.14%CPU
KPTI:  6.39user 0.48system 6.74 (0m6.747s) elapsed 101.96%CPU
       6.66user 0.41system 6.91 (0m6.912s) elapsed 102.51%CPU
       6.53user 0.51system 6.91 (0m6.919s) elapsed 101.99%CPU

v4l-utils: make clean, time make
v4.18: 191.93user 11.00system 211.19 (3m31.191s) elapsed 96.09%CPU
       221.21user 14.69system 248.73 (4m8.734s) elapsed 94.84%CPU
       198.35user 11.61system 211.39 (3m31.392s) elapsed 99.32%CPU
       204.87user 11.69system 217.97 (3m37.971s) elapsed 99.35%CPU
       203.68user 11.88system 217.29 (3m37.291s) elapsed 99.20%CPU
KPTI:  156.45user 40.08system 204.77 (3m24.777s) elapsed 95.97%CPU
       183.32user 38.64system 225.03 (3m45.031s) elapsed 98.63%CPU

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
  2018-07-23 14:09 ` Pavel Machek
@ 2018-07-23 19:00   ` Linus Torvalds
  2018-07-23 21:38     ` Pavel Machek
  2018-07-24 13:39     ` Pavel Machek
  0 siblings, 2 replies; 34+ messages in thread
From: Linus Torvalds @ 2018-07-23 19:00 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Joerg Roedel, Thomas Gleixner, Ingo Molnar, Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-mm,
	Andrew Lutomirski, Dave Hansen, Josh Poimboeuf,
	Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	Brian Gerst, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	David H . Gutteridge, Joerg Roedel, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim

On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek <pavel@ucw.cz> wrote:
>
> Meanwhile... it looks like gcc is not slowed down significantly, but
> other stuff sees 30% .. 40% slowdowns... which is rather
> significant.

That is more or less expected.

Gcc spends about 90+% of its time in user space, and the system calls
it *does* do tend to be "real work" (open/read/etc). And modern gcc's
no longer have the pipe between cpp and cc1, so they don't have that
issue either (which would have shown the PTI slowdown a lot more).

Some other loads will spend a lot more time traversing the user/kernel
boundary, and in 32-bit mode you won't be able to take advantage of
the address space ID's, so you really get the full effect.

> Would it be possible to have per-process control of kpti? I have
> some processes where trading off speed for security would make sense.

That was pretty extensively discussed, and no sane model for it was
ever agreed upon.  Some people wanted it per-thread, others per-mm,
and it wasn't clear how to set it either and how it should inherit
across fork/exec, and what the namespace rules etc should be.

You absolutely need to inherit it (so that you can say "I trust this
session" or whatever), but at the same time you *don't* want to
inherit if you have a server you trust that then spawns user processes
(think "I want systemd to not have the overhead, but the user
processes it spawns obviously do need protection").

It was just a morass. Nothing came out of it.  I guess people can
discuss it again, but it's not simple.

               Linus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
  2018-07-23 19:00   ` Linus Torvalds
@ 2018-07-23 21:38     ` Pavel Machek
  2018-07-23 21:50       ` Andy Lutomirski
  2018-07-23 21:59       ` Josh Poimboeuf
  2018-07-24 13:39     ` Pavel Machek
  1 sibling, 2 replies; 34+ messages in thread
From: Pavel Machek @ 2018-07-23 21:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Joerg Roedel, Thomas Gleixner, Ingo Molnar, Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-mm,
	Andrew Lutomirski, Dave Hansen, Josh Poimboeuf,
	Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	Brian Gerst, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	David H . Gutteridge, Joerg Roedel, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim

[-- Attachment #1: Type: text/plain, Size: 4789 bytes --]

On Mon 2018-07-23 12:00:08, Linus Torvalds wrote:
> On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek <pavel@ucw.cz> wrote:
> >
> > Meanwhile... it looks like gcc is not slowed down significantly, but
> > other stuff sees 30% .. 40% slowdowns... which is rather
> > significant.
> 
> That is more or less expected.
> 
> Gcc spends about 90+% of its time in user space, and the system calls
> it *does* do tend to be "real work" (open/read/etc). And modern gcc's
> no longer have the pipe between cpp and cc1, so they don't have that
> issue either (which would have shown the PTI slowdown a lot more).
> 
> Some other loads will spend a lot more time traversing the user/kernel
> boundary, and in 32-bit mode you won't be able to take advantage of
> the address space ID's, so you really get the full effect.

Understood. Just -- bzip2 should include quite a lot of time in
userspace, too. 

> > Would it be possible to have per-process control of kpti? I have
> > some processes where trading off speed for security would make sense.
> 
> That was pretty extensively discussed, and no sane model for it was
> ever agreed upon.  Some people wanted it per-thread, others per-mm,
> and it wasn't clear how to set it either and how it should inherit
> across fork/exec, and what the namespace rules etc should be.
> 
> You absolutely need to inherit it (so that you can say "I trust this
> session" or whatever), but at the same time you *don't* want to
> inherit if you have a server you trust that then spawns user processes
> (think "I want systemd to not have the overhead, but the user
> processes it spawns obviously do need protection").
> 
> It was just a morass. Nothing came out of it.  I guess people can
> discuss it again, but it's not simple.

I agree it is not easy. OTOH -- 30% of user-visible performance is a
_lot_. That is worth spending man-years on...  Ok, the problem is not as
severe on modern CPUs with address space IDs, but...

What I want is "if A can ptrace B, and B has pti disabled, A can have
pti disabled as well". Now... I see someone may want to have it
per-thread, because for stuff like a javascript JIT, the thread may have
rights to call ptrace, but is unable to call ptrace because the JIT
removed that ability... hmm...

But for now I'd like at least "global" option of turning pti on/off
during runtime for benchmarking. Let me see...

Something like this, or is it going to be way more complex? Does
anyone have a patch, by chance?

									Pavel

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index dfb975b..719e39a 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -162,6 +162,9 @@
 .macro SWITCH_TO_USER_CR3 scratch_reg:req
 	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
 
+	cmpl	$1, PER_CPU_VAR(pti_enabled)
+	jne	.Lend_\@
+	
 	movl	%cr3, \scratch_reg
 	orl	$PTI_SWITCH_MASK, \scratch_reg
 	movl	\scratch_reg, %cr3
@@ -176,6 +179,8 @@
 	testl	$SEGMENT_RPL_MASK, PT_CS(%esp)
 	jz	.Lend_\@
 	.endif
+	cmpl	$1, PER_CPU_VAR(pti_enabled)
+	jne	.Lend_\@
 	/* On user-cr3? */
 	movl	%cr3, %eax
 	testl	$PTI_SWITCH_MASK, %eax
@@ -192,6 +197,10 @@
  */
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
 	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+
+	cmpl	$1, PER_CPU_VAR(pti_enabled)
+	jne	.Lend_\@
+
 	movl	%cr3, \scratch_reg
 	/* Test if we are already on kernel CR3 */
 	testl	$PTI_SWITCH_MASK, \scratch_reg
@@ -302,6 +311,9 @@
 	 */
 	ALTERNATIVE "jmp .Lswitched_\@", "", X86_FEATURE_PTI
 
+	cmpl	$1, PER_CPU_VAR(pti_enabled)
+	jne	.Lswitched_\@
+
 	testl	$PTI_SWITCH_MASK, \cr3_reg
 	jz	.Lswitched_\@
 
diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 4a7884b..8c92ae2 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -59,6 +59,7 @@ struct cpu_entry_area {
 #define CPU_ENTRY_AREA_TOT_SIZE	(CPU_ENTRY_AREA_SIZE * NR_CPUS)
 
 DECLARE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
+DECLARE_PER_CPU(int, pti_enabled);
 
 extern void setup_cpu_entry_areas(void);
 extern void cea_set_pte(void *cea_vaddr, phys_addr_t pa, pgprot_t flags);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f73fa6f..da34a21 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -507,6 +507,9 @@ void load_percpu_segment(int cpu)
 DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
 #endif
 
+DEFINE_PER_CPU(int, pti_enabled);
+
+
 #ifdef CONFIG_X86_64
 /*
  * Special IST stacks which the CPU switches to when it calls

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
  2018-07-23 21:38     ` Pavel Machek
@ 2018-07-23 21:50       ` Andy Lutomirski
  2018-07-23 21:55         ` Pavel Machek
  2018-07-24 21:18         ` Pavel Machek
  2018-07-23 21:59       ` Josh Poimboeuf
  1 sibling, 2 replies; 34+ messages in thread
From: Andy Lutomirski @ 2018-07-23 21:50 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Peter Anvin, the arch/x86 maintainers, Linux Kernel Mailing List,
	linux-mm, Andrew Lutomirski, Dave Hansen, Josh Poimboeuf,
	Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	Brian Gerst, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	David H . Gutteridge, Joerg Roedel, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim



> On Jul 23, 2018, at 2:38 PM, Pavel Machek <pavel@ucw.cz> wrote:
> 
>> On Mon 2018-07-23 12:00:08, Linus Torvalds wrote:
>>> On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek <pavel@ucw.cz> wrote:
>>> 
>>> Meanwhile... it looks like gcc is not slowed down significantly, but
>>> other stuff sees 30% .. 40% slowdowns... which is rather
>>> significant.
>> 
>> That is more or less expected.
>> 
>> Gcc spends about 90+% of its time in user space, and the system calls
>> it *does* do tend to be "real work" (open/read/etc). And modern gcc's
>> no longer have the pipe between cpp and cc1, so they don't have that
>> issue either (which would have shown the PTI slowdown a lot more).
>> 
>> Some other loads will spend a lot more time traversing the user/kernel
>> boundary, and in 32-bit mode you won't be able to take advantage of
>> the address space ID's, so you really get the full effect.
> 
> Understood. Just -- bzip2 should include quite a lot of time in
> userspace, too. 
> 
>>> Would it be possible to have per-process control of kpti? I have
>>> some processes where trading off speed for security would make sense.
>> 
>> That was pretty extensively discussed, and no sane model for it was
>> ever agreed upon.  Some people wanted it per-thread, others per-mm,
>> and it wasn't clear how to set it either and how it should inherit
>> across fork/exec, and what the namespace rules etc should be.
>> 
>> You absolutely need to inherit it (so that you can say "I trust this
>> session" or whatever), but at the same time you *don't* want to
>> inherit if you have a server you trust that then spawns user processes
>> (think "I want systemd to not have the overhead, but the user
>> processes it spawns obviously do need protection").
>> 
>> It was just a morass. Nothing came out of it.  I guess people can
>> discuss it again, but it's not simple.
> 
> I agree it is not easy. OTOH -- 30% of user-visible performance is a
> _lot_. That is worth spending man-years on...  Ok, the problem is not as
> severe on modern CPUs with address space IDs, but...
> 
> What I want is "if A can ptrace B, and B has pti disabled, A can have
> pti disabled as well". Now... I see someone may want to have it
> per-thread, because for stuff like a javascript JIT, the thread may have
> rights to call ptrace, but is unable to call ptrace because the JIT
> removed that ability... hmm...

No, you don’t want that. The problem is that Meltdown isn’t a problem that exists in isolation. It’s very plausible that JavaScript code could trigger a speculation attack that, with PTI off, could read kernel memory.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
  2018-07-23 21:50       ` Andy Lutomirski
@ 2018-07-23 21:55         ` Pavel Machek
  2018-07-24 21:18         ` Pavel Machek
  1 sibling, 0 replies; 34+ messages in thread
From: Pavel Machek @ 2018-07-23 21:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Peter Anvin, the arch/x86 maintainers, Linux Kernel Mailing List,
	linux-mm, Andrew Lutomirski, Dave Hansen, Josh Poimboeuf,
	Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	Brian Gerst, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	David H . Gutteridge, Joerg Roedel, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim

[-- Attachment #1: Type: text/plain, Size: 1112 bytes --]

Hi!

> > What I want is "if A can ptrace B, and B has pti disabled, A can have
> > pti disabled as well". Now... I see someone may want to have it
> > per-thread, because for stuff like a javascript JIT, the thread may have
> > rights to call ptrace, but is unable to call ptrace because the JIT
> > removed that ability... hmm...
> 
> No, you don’t want that. The problem is that Meltdown isn’t a problem that exists in isolation. It’s very plausible that JavaScript code could trigger a speculation attack that, with PTI off, could read kernel memory.

Yeah, the web browser threads that run javascript code should have PTI
on. But maybe I want the rest of the web browser with PTI off.

So... yes, I see why someone may want it per-thread (and not
per-process).

I guess per-process would be good enough for me. Actually, maybe even
per-uid. I don't have any fancy security here, so anything running uid
0 and 1000 is close enough to trusted.

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
  2018-07-23 21:38     ` Pavel Machek
  2018-07-23 21:50       ` Andy Lutomirski
@ 2018-07-23 21:59       ` Josh Poimboeuf
  2018-07-23 22:07         ` Dave Hansen
  1 sibling, 1 reply; 34+ messages in thread
From: Josh Poimboeuf @ 2018-07-23 21:59 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Peter Anvin, the arch/x86 maintainers, Linux Kernel Mailing List,
	linux-mm, Andrew Lutomirski, Dave Hansen, Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	Brian Gerst, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	David H . Gutteridge, Joerg Roedel, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim

On Mon, Jul 23, 2018 at 11:38:30PM +0200, Pavel Machek wrote:
> But for now I'd like at least "global" option of turning pti on/off
> during runtime for benchmarking. Let me see...
> 
> Something like this, or is it going to be way more complex? Does
> anyone have a patch, by chance?

RHEL/CentOS has a global PTI enable/disable, which uses stop_machine().

-- 
Josh

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
  2018-07-23 21:59       ` Josh Poimboeuf
@ 2018-07-23 22:07         ` Dave Hansen
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Hansen @ 2018-07-23 22:07 UTC (permalink / raw)
  To: Josh Poimboeuf, Pavel Machek
  Cc: Linus Torvalds, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Peter Anvin, the arch/x86 maintainers, Linux Kernel Mailing List,
	linux-mm, Andrew Lutomirski, Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	Brian Gerst, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	David H . Gutteridge, Joerg Roedel, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim

On 07/23/2018 02:59 PM, Josh Poimboeuf wrote:
> On Mon, Jul 23, 2018 at 11:38:30PM +0200, Pavel Machek wrote:
>> But for now I'd like at least "global" option of turning pti on/off
>> during runtime for benchmarking. Let me see...
>>
>> Something like this, or is it going to be way more complex? Does
>> anyone have a patch, by chance?
> RHEL/CentOS has a global PTI enable/disable, which uses stop_machine().

Let's not forget PTI's NX-for-userspace in the kernel page tables.  That
mitigates Spectre v2 as well as Meltdown.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
  2018-07-23 19:00   ` Linus Torvalds
  2018-07-23 21:38     ` Pavel Machek
@ 2018-07-24 13:39     ` Pavel Machek
  2018-07-24 14:39       ` Andy Lutomirski
  1 sibling, 1 reply; 34+ messages in thread
From: Pavel Machek @ 2018-07-24 13:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Joerg Roedel, Thomas Gleixner, Ingo Molnar, Peter Anvin,
	the arch/x86 maintainers, Linux Kernel Mailing List, linux-mm,
	Andrew Lutomirski, Dave Hansen, Josh Poimboeuf,
	Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	Brian Gerst, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	David H . Gutteridge, Joerg Roedel, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim

[-- Attachment #1: Type: text/plain, Size: 3456 bytes --]

On Mon 2018-07-23 12:00:08, Linus Torvalds wrote:
> On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek <pavel@ucw.cz> wrote:
> >
> > Meanwhile... it looks like gcc is not slowed down significantly, but
> > other stuff sees 30% .. 40% slowdowns... which is rather
> > significant.
> 
> That is more or less expected.

Ok, so I was wrong. bzip2 showed a 30% slowdown, but running the test
in a loop, I get that on v4.18, too.

That tells me that something is wrong with the machine I'm using for
benchmarking. Whether KPTI is enabled can still be measured with the
bzip2 pipeline, but the effect is far more subtle.

							Pavel

pavel@amd:~$ while true; do time cat /dev/urandom | head -c 10000000 |
bzip2 -9 - | wc -c ; done
10044031
3.87user 0.91system 4.62 (0m4.622s) elapsed 103.48%CPU
10044234
4.03user 0.82system 4.68 (0m4.688s) elapsed 103.67%CPU
10043664
4.28user 0.85system 4.99 (0m4.994s) elapsed 102.90%CPU
10045959
4.43user 0.85system 5.12 (0m5.121s) elapsed 103.44%CPU
10043829
4.50user 0.89system 5.22 (0m5.228s) elapsed 103.22%CPU
10044296
4.65user 0.93system 5.39 (0m5.398s) elapsed 103.61%CPU
10045311
4.76user 0.93system 5.47 (0m5.479s) elapsed 103.98%CPU
10043819
4.81user 0.93system 5.55 (0m5.556s) elapsed 103.37%CPU
10045097
4.72user 1.04system 5.59 (0m5.597s) elapsed 103.01%CPU
10044012
4.86user 0.97system 5.68 (0m5.684s) elapsed 102.79%CPU
10044569
4.93user 0.96system 5.72 (0m5.728s) elapsed 102.92%CPU
10044141
4.94user 0.98system 5.75 (0m5.752s) elapsed 102.97%CPU
10043695
4.97user 0.95system 5.76 (0m5.768s) elapsed 102.87%CPU
10045690
5.12user 0.94system 5.90 (0m5.901s) elapsed 102.79%CPU
10045153
5.06user 1.00system 5.88 (0m5.883s) elapsed 103.21%CPU
10044560
5.10user 1.01system 5.92 (0m5.927s) elapsed 103.31%CPU
10044845
5.17user 0.99system 5.96 (0m5.960s) elapsed 103.44%CPU
10043884
5.15user 1.03system 6.00 (0m6.004s) elapsed 103.14%CPU
10044286
5.18user 1.01system 6.00 (0m6.002s) elapsed 103.40%CPU
10045749
5.00user 1.22system 6.04 (0m6.044s) elapsed 102.98%CPU
10044098
5.22user 1.02system 6.05 (0m6.053s) elapsed 103.21%CPU
10045326
5.20user 1.01system 6.04 (0m6.048s) elapsed 102.72%CPU
10042365
5.22user 1.03system 6.06 (0m6.061s) elapsed 103.30%CPU
10043952
5.24user 1.00system 6.06 (0m6.069s) elapsed 102.97%CPU
10044569
5.30user 1.00system 6.09 (0m6.099s) elapsed 103.46%CPU
10043241
5.26user 1.00system 6.09 (0m6.097s) elapsed 102.79%CPU
10044797
5.30user 1.01system 6.11 (0m6.114s) elapsed 103.46%CPU
10043711
5.25user 1.02system 6.09 (0m6.093s) elapsed 103.03%CPU
10043882
5.31user 1.01system 6.13 (0m6.131s) elapsed 103.28%CPU
10043571
5.26user 1.05system 6.13 (0m6.133s) elapsed 103.06%CPU
10044742
5.29user 1.03system 6.12 (0m6.122s) elapsed 103.25%CPU
10044170
5.35user 1.04system 6.18 (0m6.183s) elapsed 103.60%CPU
10043542
5.22user 1.12system 6.17 (0m6.172s) elapsed 102.89%CPU
10042985
5.25user 1.13system 6.19 (0m6.193s) elapsed 103.09%CPU
10044102
5.36user 1.01system 6.17 (0m6.177s) elapsed 103.17%CPU
10044609
5.48user 0.99system 6.28 (0m6.284s) elapsed 103.11%CPU
10045185
5.40user 1.03system 6.23 (0m6.236s) elapsed 103.29%CPU
10044444
5.41user 1.06system 6.25 (0m6.255s) elapsed 103.55%CPU
10044859
5.35user 1.04system 6.20 (0m6.201s) elapsed 103.17%CPU
10045613

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
  2018-07-24 13:39     ` Pavel Machek
@ 2018-07-24 14:39       ` Andy Lutomirski
  0 siblings, 0 replies; 34+ messages in thread
From: Andy Lutomirski @ 2018-07-24 14:39 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Peter Anvin, the arch/x86 maintainers, Linux Kernel Mailing List,
	linux-mm, Andrew Lutomirski, Dave Hansen, Josh Poimboeuf,
	Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	Brian Gerst, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	David H . Gutteridge, Joerg Roedel, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim



> On Jul 24, 2018, at 6:39 AM, Pavel Machek <pavel@ucw.cz> wrote:
> 
>> On Mon 2018-07-23 12:00:08, Linus Torvalds wrote:
>>> On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek <pavel@ucw.cz> wrote:
>>> 
>>> Meanwhile... it looks like gcc is not slowed down significantly, but
>>> other stuff sees 30% .. 40% slowdowns... which is rather
>>> significant.
>> 
>> That is more or less expected.
> 
> Ok, so I was wrong. bzip2 showed a 30% slowdown, but running the test
> in a loop, I get that (on v4.18), too.
> 
> 

...

The obvious cause would be thermal issues, which are increasingly common in laptops.  You could get cycle counts from perf stat, perhaps.
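A minimal sketch of the perf-stat approach suggested above (the sleep workload, repeat count, and the counts fed to the IPC arithmetic are all illustrative placeholders, not from this thread):

```shell
# Sketch: compare counters across repeated runs; a thermally throttled CPU
# shows roughly the same instruction count but more elapsed time per run.
if command -v perf >/dev/null 2>&1; then
    # -r 3 repeats the workload and reports mean +/- stddev per event
    perf stat -r 3 -e cycles,instructions -- sleep 0.1 2>&1 | tail -n 10
fi

# Instructions-per-cycle from two raw counts (placeholder numbers,
# just to show the computation on the perf output):
ipc=$(awk 'BEGIN { printf "%.2f", 20000000 / 10000000 }')
echo "IPC: $ipc"
```

If the IPC stays flat while wall-clock time grows across runs, throttling (rather than a kernel change) is the likely cause.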

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
  2018-07-23 21:50       ` Andy Lutomirski
  2018-07-23 21:55         ` Pavel Machek
@ 2018-07-24 21:18         ` Pavel Machek
  1 sibling, 0 replies; 34+ messages in thread
From: Pavel Machek @ 2018-07-24 21:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Peter Anvin, the arch/x86 maintainers, Linux Kernel Mailing List,
	linux-mm, Andrew Lutomirski, Dave Hansen, Josh Poimboeuf,
	Jürgen Groß,
	Peter Zijlstra, Borislav Petkov, Jiri Kosina, Boris Ostrovsky,
	Brian Gerst, David Laight, Denys Vlasenko, Eduardo Valentin,
	Greg Kroah-Hartman, Will Deacon, Liguori, Anthony, Daniel Gruss,
	Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
	David H . Gutteridge, Joerg Roedel, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim

[-- Attachment #1: Type: text/plain, Size: 3589 bytes --]

On Mon 2018-07-23 14:50:50, Andy Lutomirski wrote:
> 
> 
> > On Jul 23, 2018, at 2:38 PM, Pavel Machek <pavel@ucw.cz> wrote:
> > 
> >> On Mon 2018-07-23 12:00:08, Linus Torvalds wrote:
> >>> On Mon, Jul 23, 2018 at 7:09 AM Pavel Machek <pavel@ucw.cz> wrote:
> >>> 
> >>> Meanwhile... it looks like gcc is not slowed down significantly, but
> >>> other stuff sees 30% .. 40% slowdowns... which is rather
> >>> significant.
> >> 
> >> That is more or less expected.
> >> 
> >> Gcc spends about 90+% of its time in user space, and the system calls
> >> it *does* do tend to be "real work" (open/read/etc). And modern gcc's
> >> no longer have the pipe between cpp and cc1, so they don't have that
> >> issue either (which would have shown the PTI slowdown a lot more)
> >> 
> >> Some other loads will do a lot more time traversing the user/kernel
> >> boundary, and in 32-bit mode you won't be able to take advantage of
> >> the address space ID's, so you really get the full effect.
> > 
> > Understood. Just -- bzip2 should include quite a lot of time in
> > userspace, too. 
> > 
> >>> Would it be possible to have per-process control of kpti? I have
> >>> some processes where trading of speed for security would make sense.
> >> 
> >> That was pretty extensively discussed, and no sane model for it was
> >> ever agreed upon.  Some people wanted it per-thread, others per-mm,
> >> and it wasn't clear how to set it either and how it should inherit
> >> across fork/exec, and what the namespace rules etc should be.
> >> 
> >> You absolutely need to inherit it (so that you can say "I trust this
> >> session" or whatever), but at the same time you *don't* want to
> >> inherit if you have a server you trust that then spawns user processes
> >> (think "I want systemd to not have the overhead, but the user
> >> processes it spawns obviously do need protection").
> >> 
> >> It was just a morass. Nothing came out of it.  I guess people can
> >> discuss it again, but it's not simple.
> > 
> > I agree it is not easy. OTOH -- 30% of user-visible performance is a
> > _lot_. That is worth spending man-years on...  Ok, problem is not as
> > severe on modern CPUs with address space ID's, but...
> > 
> > What I want is "if A can ptrace B, and B has pti disabled, A can have
> > pti disabled as well". Now.. I see someone may want to have it
> > per-thread, because for stuff like javascript JIT, thread may have
> > rights to call ptrace, but is unable to call ptrace because JIT
> > removed that ability... hmm...
> 
> No, you don’t want that. The problem is that Meltdown isn’t a problem that exists in isolation. It’s very plausible that JavaScript code could trigger a speculation attack that, with PTI off, could read kernel memory.

Ok, you are right. It is trickier than I thought.

Still, I probably don't need to run grep's and cat's with PTI
on. Chromium (etc.) probably needs it. A Python interpreter running my
own code probably does not.

Yes, my Thinkpad X60 is probably thermal-throttled. It is not a
really new machine. I switched to a T40p for now :-).

What is the "worst" case people are seeing?
time dd if=/dev/zero of=/dev/null bs=1 count=10000000
can reproduce a 3x slowdown, but that's basically a microbenchmark.
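As a rough sketch (assuming a Linux box with coreutils), the dd microbenchmark above can be looped to smooth out run-to-run noise:

```shell
# bs=1 forces one read+write syscall pair per byte, so nearly all of the
# time is spent crossing the user/kernel boundary -- exactly the path
# PTI makes more expensive. A few iterations smooth out noise.
# (count is reduced here for illustration.)
for i in 1 2 3; do
    dd if=/dev/zero of=/dev/null bs=1 count=100000 2>&1 | tail -n 1
done
```

Comparing the reported throughput with PTI on and off (pti=off on the kernel command line) gives an upper bound on the entry/exit overhead, since a real workload does far fewer syscalls per unit of work.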

A root-only per-process enable/disable of KPTI should not be too hard
to do. Would there be interest in that?

Best regards,

								Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] PTI for x86-32 Fixes and Updates
  2018-07-23  7:29   ` Joerg Roedel
@ 2018-07-26  3:47     ` David H. Gutteridge
  0 siblings, 0 replies; 34+ messages in thread
From: David H. Gutteridge @ 2018-07-26  3:47 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Joerg Roedel, Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86,
	linux-kernel, linux-mm, Linus Torvalds, Andy Lutomirski,
	Dave Hansen, Josh Poimboeuf, Juergen Gross, Peter Zijlstra,
	Borislav Petkov, Jiri Kosina, Boris Ostrovsky, Brian Gerst,
	David Laight, Denys Vlasenko, Eduardo Valentin, Greg KH,
	Will Deacon, aliguori, daniel.gruss, hughd, keescook,
	Andrea Arcangeli, Waiman Long, Pavel Machek,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim

On Mon, 2018-07-23 at 09:29 +0200, Joerg Roedel wrote:
> Hey David,
> 
> On Sun, Jul 22, 2018 at 11:49:00PM -0400, David H. Gutteridge wrote:
> > Unfortunately, I can trigger a bug in KVM+QEMU with the Bochs VGA
> > driver. (This is the same VM definition I shared with you in a PM
> > back on Feb. 20th, except note that 4.18 kernels won't successfully
> > boot with QEMU's IDE device, so I'm using SATA instead. That's a
> > regression totally unrelated to your change sets, or to the general
> > booting issue with 4.18 RC5, since it occurs in vanilla RC4 as
> > well.)
> 
> Yes, this needs the fixes in the tip/x86/mm branch as well. Can you
> pull that branch in and test again, please?

Sorry, I didn't realize I needed those changes, too. I've re-tested
with those applied and haven't encountered any issues. I'm now
re-testing again with your newer patch set from the 25th. No issues
so far with those, either; I'll confirm in that email thread after
the laptop has seen some more use.

Dave



^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2018-07-26  3:47 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-20 16:22 [PATCH 0/3] PTI for x86-32 Fixes and Updates Joerg Roedel
2018-07-20 16:22 ` [PATCH 1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables Joerg Roedel
2018-07-20 17:06   ` Andy Lutomirski
2018-07-20 17:48     ` Joerg Roedel
2018-07-20 19:32       ` Andy Lutomirski
2018-07-20 21:37         ` Joerg Roedel
2018-07-20 22:20           ` Andy Lutomirski
2018-07-21 21:06             ` Linus Torvalds
2018-07-20 19:27     ` Thomas Gleixner
2018-07-20 19:33       ` Andy Lutomirski
2018-07-20 19:43         ` Thomas Gleixner
2018-07-20 19:53           ` Thomas Gleixner
2018-07-20 19:37   ` [tip:x86/pti] " tip-bot for Joerg Roedel
2018-07-20 20:36   ` tip-bot for Joerg Roedel
2018-07-20 16:22 ` [PATCH 2/3] x86/entry/32: Check for VM86 mode in slow-path check Joerg Roedel
2018-07-20 19:37   ` [tip:x86/pti] " tip-bot for Joerg Roedel
2018-07-20 20:37   ` tip-bot for Joerg Roedel
2018-07-21 16:06   ` [PATCH 2/3] " Pavel Machek
2018-07-20 16:22 ` [PATCH 3/3] x86/entry/32: Copy only ptregs on paranoid entry/exit path Joerg Roedel
2018-07-20 17:09   ` Andy Lutomirski
2018-07-20 21:42     ` Joerg Roedel
2018-07-23  3:49 ` [PATCH 0/3] PTI for x86-32 Fixes and Updates David H. Gutteridge
2018-07-23  7:29   ` Joerg Roedel
2018-07-26  3:47     ` David H. Gutteridge
2018-07-23 14:09 ` Pavel Machek
2018-07-23 19:00   ` Linus Torvalds
2018-07-23 21:38     ` Pavel Machek
2018-07-23 21:50       ` Andy Lutomirski
2018-07-23 21:55         ` Pavel Machek
2018-07-24 21:18         ` Pavel Machek
2018-07-23 21:59       ` Josh Poimboeuf
2018-07-23 22:07         ` Dave Hansen
2018-07-24 13:39     ` Pavel Machek
2018-07-24 14:39       ` Andy Lutomirski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).