All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC 0/4] Per-task PTI activation
@ 2018-01-08 16:12 Willy Tarreau
  2018-01-08 16:12 ` [PATCH RFC 1/4] x86/thread_info: add TIF_NOPTI to disable PTI per task Willy Tarreau
                   ` (5 more replies)
  0 siblings, 6 replies; 53+ messages in thread
From: Willy Tarreau @ 2018-01-08 16:12 UTC (permalink / raw)
  To: linux-kernel, x86; +Cc: tglx, gnomes, torvalds, Willy Tarreau

Hi!

I could experiment a bit with the possibility to enable/disable PTI per
task. Please keep in mind that it's not my area of experitise at all, but
doing so I could recover the initial performance without disabling PTI on
the whole system.

So what I did in this series consists in the following :
  - addition of a new per-task TIF_NOPTI flag. Please note that I'm not
    proud of the way I did it, as 32 flags were already taken. The flags
    are declared as "long" so there are 32 more flags available on x86_64
    but C and asm disagree on the type of 1<<32 so I had to declare the
    hex value by hand... By the way I even suspect that _TIF_FSCHECK is
    wrong once cast to a long, I think it causes sign extension into the
    32 upper bits since it's supposed to be signed.

  - addition of a set of arch_prctl() calls (ARCH_GET_NOPTI and
    ARCH_SET_NOPTI), to check and change the activation of the
    protection. The change requires CAP_SYS_RAWIO and can be done in
    a wrapper (that's how I tested)

  - the user PGD was marked with _PAGE_NX to prevent an accidental leak
    of CR3 from not being detected. I obviously had to disable this since
    in this case we do want such a user task to run without switching the
    PGD. I think this could be performed per-task maybe. Another approach
    might consist in dealing with 3 PGDs and using a different one for
    unprotected tasks but that really starts to sound overkill.

  - upon return to userspace, I check if the task's flags contain the
    new TIF_NOPTI or not. If it does contain it, then we don't switch
    the CR3.

  - upon entry into the kernel from userspace, we can't access the task's
    flags but we can already check if CR3 points to the kernel or user PGD,
    and we refrain from switching if it's already the system one.

By doing so I could recover the initial performance of haproxy in a VM,
going from 12400 connections per second to 21000 once started with this
trivial wrapper :

  #include <asm/prctl.h>
  #include <sys/prctl.h>
  
  #ifndef ARCH_SET_NOPTI
  #define ARCH_SET_NOPTI 0x1022
  #endif
  
  int main(int argc, char **argv)
  {
          arch_prctl(ARCH_SET_NOPTI, 1);
          argv++;
          return execvp(argv[0], argv);
  }

I have not yet run it on real hardware. Before trying to go a bit further
I'd like to know if such an approach is acceptable or if I'm doing anything
stupid and looking in the wrong direction.

Thanks!
Willy


Willy Tarreau (4):
  x86/thread_info: add TIF_NOPTI to disable PTI per task
  x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to
    enable/disable PTI
  x86/pti: don't mark the user PGD with _PAGE_NX.
  x86/entry/pti: don't switch PGD on tasks holding flag TIF_NOPTI

 arch/x86/entry/calling.h           | 23 +++++++++++++++++++++++
 arch/x86/include/asm/thread_info.h |  8 ++++++++
 arch/x86/include/uapi/asm/prctl.h  |  3 +++
 arch/x86/kernel/process_64.c       | 24 ++++++++++++++++++++++++
 arch/x86/mm/pti.c                  |  2 ++
 5 files changed, 60 insertions(+)

-- 
1.7.12.1

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH RFC 1/4] x86/thread_info: add TIF_NOPTI to disable PTI per task
  2018-01-08 16:12 [PATCH RFC 0/4] Per-task PTI activation Willy Tarreau
@ 2018-01-08 16:12 ` Willy Tarreau
  2018-01-08 16:57   ` Thomas Gleixner
  2018-01-08 16:12 ` [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI Willy Tarreau
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 53+ messages in thread
From: Willy Tarreau @ 2018-01-08 16:12 UTC (permalink / raw)
  To: linux-kernel, x86; +Cc: tglx, gnomes, torvalds, Willy Tarreau

This flag indicates that the task will not use isolated page tables.

Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 arch/x86/include/asm/thread_info.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 0022333..2f92cf1 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -126,6 +126,14 @@ struct thread_info {
 #define _TIF_X32		(1 << TIF_X32)
 #define _TIF_FSCHECK		(1 << TIF_FSCHECK)
 
+/* The following flags only exist on x86-64. We can't use the shift anymore
+ * due to C using signed ints by default and asm using unsigned longs.
+ */
+#ifdef CONFIG_X86_64
+# define  TIF_NOPTI		32	/* disable PTI for this task */
+# define _TIF_NOPTI		0x0000000100000000
+#endif
+
 /*
  * work to do in syscall_trace_enter().  Also includes TIF_NOHZ for
  * enter_from_user_mode()
-- 
1.7.12.1

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI
  2018-01-08 16:12 [PATCH RFC 0/4] Per-task PTI activation Willy Tarreau
  2018-01-08 16:12 ` [PATCH RFC 1/4] x86/thread_info: add TIF_NOPTI to disable PTI per task Willy Tarreau
@ 2018-01-08 16:12 ` Willy Tarreau
  2018-01-08 16:49   ` Peter Zijlstra
                     ` (4 more replies)
  2018-01-08 16:12 ` [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX Willy Tarreau
                   ` (3 subsequent siblings)
  5 siblings, 5 replies; 53+ messages in thread
From: Willy Tarreau @ 2018-01-08 16:12 UTC (permalink / raw)
  To: linux-kernel, x86; +Cc: tglx, gnomes, torvalds, Willy Tarreau

This allows to report the current state of the PTI protection and to
enable or disable it for the current task.

Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 arch/x86/include/uapi/asm/prctl.h |  3 +++
 arch/x86/kernel/process_64.c      | 24 ++++++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 5a6aac9..1f1b5bc 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -10,6 +10,9 @@
 #define ARCH_GET_CPUID		0x1011
 #define ARCH_SET_CPUID		0x1012
 
+#define ARCH_GET_NOPTI		0x1021
+#define ARCH_SET_NOPTI		0x1022
+
 #define ARCH_MAP_VDSO_X32	0x2001
 #define ARCH_MAP_VDSO_32	0x2002
 #define ARCH_MAP_VDSO_64	0x2003
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index c754662..1686d3d 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -654,6 +654,30 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
 		ret = put_user(base, (unsigned long __user *)arg2);
 		break;
 	}
+	case ARCH_GET_NOPTI: {
+		unsigned long flag;
+
+		printk(KERN_DEBUG "get1: task=%p ti=%p fl=%16lx\n", task, task_thread_info(task), task_thread_info(task)->flags);
+		flag = !!(task_thread_info(task)->flags & _TIF_NOPTI);
+		ret = put_user(flag, (unsigned long __user *)arg2);
+		break;
+	}
+
+	case ARCH_SET_NOPTI:
+		if (!capable(CAP_SYS_RAWIO))
+			return -EPERM;
+
+		printk(KERN_DEBUG "set1: task=%p ti=%p fl=%16lx doit=%d arg2=%ld\n", task, task_thread_info(task), task_thread_info(task)->flags, doit, arg2);
+
+		if (doit) {
+			if (arg2)
+				task_thread_info(task)->flags |= _TIF_NOPTI;
+			else
+				task_thread_info(task)->flags &= ~_TIF_NOPTI;
+
+			printk(KERN_DEBUG "set2: task=%p ti=%p fl=%16lx\n", task, task_thread_info(task), task_thread_info(task)->flags);
+		}
+		break;
 
 #ifdef CONFIG_CHECKPOINT_RESTORE
 # ifdef CONFIG_X86_X32_ABI
-- 
1.7.12.1

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-08 16:12 [PATCH RFC 0/4] Per-task PTI activation Willy Tarreau
  2018-01-08 16:12 ` [PATCH RFC 1/4] x86/thread_info: add TIF_NOPTI to disable PTI per task Willy Tarreau
  2018-01-08 16:12 ` [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI Willy Tarreau
@ 2018-01-08 16:12 ` Willy Tarreau
  2018-01-08 17:03   ` Dave Hansen
  2018-01-08 17:05   ` Thomas Gleixner
  2018-01-08 16:12 ` [PATCH RFC 4/4] x86/entry/pti: don't switch PGD on tasks holding flag TIF_NOPTI Willy Tarreau
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 53+ messages in thread
From: Willy Tarreau @ 2018-01-08 16:12 UTC (permalink / raw)
  To: linux-kernel, x86; +Cc: tglx, gnomes, torvalds, Willy Tarreau

Since we're going to keep running on the same PGD when returning to
userspace for certain performance-critical tasks, we'll need the user
pages to be executable. So this code disables the extra protection
that was added consisting in marking user pages _PAGE_NX so that this
pgd remains usable for userspace.

Note: it isn't necessarily the best approach, but one way or another
      if we want to be able to return to userspace from the kernel,
      we'll have to have this executable anyway. Another approach
      might consist in using another pgd for userland+kernel but
      the current core really looks like an extra careful measure
      to catch early bugs if any.

Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 arch/x86/mm/pti.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 43d4a4a..9e2dca0 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -135,9 +135,11 @@ pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
 	 *  - we don't have NX support
 	 *  - we're clearing the PGD (i.e. the new pgd is not present).
 	 */
+#if 0
 	if ((pgd.pgd & (_PAGE_USER|_PAGE_PRESENT)) == (_PAGE_USER|_PAGE_PRESENT) &&
 	    (__supported_pte_mask & _PAGE_NX))
 		pgd.pgd |= _PAGE_NX;
+#endif
 
 	/* return the copy of the PGD we want the kernel to use: */
 	return pgd;
-- 
1.7.12.1

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH RFC 4/4] x86/entry/pti: don't switch PGD on tasks holding flag TIF_NOPTI
  2018-01-08 16:12 [PATCH RFC 0/4] Per-task PTI activation Willy Tarreau
                   ` (2 preceding siblings ...)
  2018-01-08 16:12 ` [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX Willy Tarreau
@ 2018-01-08 16:12 ` Willy Tarreau
  2018-01-08 17:11   ` Ingo Molnar
                     ` (2 more replies)
  2018-01-08 16:59 ` [PATCH RFC 0/4] Per-task PTI activation Dave Hansen
  2018-01-09 15:31 ` Eric W. Biederman
  5 siblings, 3 replies; 53+ messages in thread
From: Willy Tarreau @ 2018-01-08 16:12 UTC (permalink / raw)
  To: linux-kernel, x86; +Cc: tglx, gnomes, torvalds, Willy Tarreau

If a task has the TIF_NOPTI flag set, it doesn't want to experience
page table isolation. In this case, returns from kernel to user will
not switch the CR3, leaving it to the kernel one which already maps
both user and kernel pages. Upon entry in the kernel, we can't check
this flag so we simply check if CR3 was pointing to the kernel's PGD,
indicating an earlier absence of switch, and in this case we don't
change it.

Thanks to these changes, haproxy running under KVM went back from
12400 conn/s to 21000 once loaded after calling prctl().

Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 arch/x86/entry/calling.h | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 45a63e0..054b8b7 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #include <linux/jump_label.h>
+#include <asm/thread_info.h>
 #include <asm/unwind_hints.h>
 #include <asm/cpufeatures.h>
 #include <asm/page_types.h>
@@ -214,6 +215,11 @@
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
 	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
 	mov	%cr3, \scratch_reg
+
+	/* if we're already on the kernel PGD, we don't switch */
+	testq $(PTI_SWITCH_PGTABLES_MASK), \scratch_reg
+	jz .Lend_\@
+
 	ADJUST_KERNEL_CR3 \scratch_reg
 	mov	\scratch_reg, %cr3
 .Lend_\@:
@@ -224,6 +230,12 @@
 
 .macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
 	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+
+	/* "NOPTI" taskflag avoids the switch */
+	movq	PER_CPU_VAR(current_task), \scratch_reg
+	btq	$TIF_NOPTI, TASK_TI_flags(\scratch_reg)
+	jc	.Lend_\@
+
 	mov	%cr3, \scratch_reg
 
 	ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
@@ -262,6 +274,13 @@
 	ALTERNATIVE "jmp .Ldone_\@", "", X86_FEATURE_PTI
 	movq	%cr3, \scratch_reg
 	movq	\scratch_reg, \save_reg
+
+	/* if we're already on the kernel PGD, we don't switch,
+	 * we just save the current cr3.
+	 */
+	testq $(PTI_SWITCH_PGTABLES_MASK), \scratch_reg
+	jz .Ldone_\@
+
 	/*
 	 * Is the "switch mask" all zero?  That means that both of
 	 * these are zero:
@@ -284,6 +303,10 @@
 .macro RESTORE_CR3 scratch_reg:req save_reg:req
 	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
 
+	/* if we saved a kernel context, we didn't switch so we don't switch */
+	testq $(PTI_SWITCH_PGTABLES_MASK), \save_reg
+	jz .Lend_\@
+
 	ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
 
 	/*
-- 
1.7.12.1

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI
  2018-01-08 16:12 ` [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI Willy Tarreau
@ 2018-01-08 16:49   ` Peter Zijlstra
  2018-01-08 16:56     ` Willy Tarreau
  2018-01-08 17:02   ` Thomas Gleixner
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2018-01-08 16:49 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: linux-kernel, x86, tglx, gnomes, torvalds

On Mon, Jan 08, 2018 at 05:12:17PM +0100, Willy Tarreau wrote:
> +		if (doit) {
> +			if (arg2)
> +				task_thread_info(task)->flags |= _TIF_NOPTI;
> +			else
> +				task_thread_info(task)->flags &= ~_TIF_NOPTI;

{set,clear}_thread_flag() please, the above is not SMP safe.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI
  2018-01-08 16:49   ` Peter Zijlstra
@ 2018-01-08 16:56     ` Willy Tarreau
  0 siblings, 0 replies; 53+ messages in thread
From: Willy Tarreau @ 2018-01-08 16:56 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, x86, tglx, gnomes, torvalds

On Mon, Jan 08, 2018 at 05:49:00PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 08, 2018 at 05:12:17PM +0100, Willy Tarreau wrote:
> > +		if (doit) {
> > +			if (arg2)
> > +				task_thread_info(task)->flags |= _TIF_NOPTI;
> > +			else
> > +				task_thread_info(task)->flags &= ~_TIF_NOPTI;
> 
> {set,clear}_thread_flag() please, the above is not SMP safe.

Oops, thank you. I initially thought about it initially but didn't
know the names of the functions to use, and have left it there
thinking I'd fix it later. Usual source of bugs :-/

Bah and I left my debugging printk() in the patch as well! I'll
rework this a bit.

Thanks,
Willy

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 1/4] x86/thread_info: add TIF_NOPTI to disable PTI per task
  2018-01-08 16:12 ` [PATCH RFC 1/4] x86/thread_info: add TIF_NOPTI to disable PTI per task Willy Tarreau
@ 2018-01-08 16:57   ` Thomas Gleixner
  2018-01-08 17:03     ` Willy Tarreau
  0 siblings, 1 reply; 53+ messages in thread
From: Thomas Gleixner @ 2018-01-08 16:57 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: linux-kernel, x86, gnomes, torvalds

On Mon, 8 Jan 2018, Willy Tarreau wrote:

> This flag indicates that the task will not use isolated page tables.
> 
> Signed-off-by: Willy Tarreau <w@1wt.eu>
> ---
>  arch/x86/include/asm/thread_info.h | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index 0022333..2f92cf1 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -126,6 +126,14 @@ struct thread_info {
>  #define _TIF_X32		(1 << TIF_X32)
>  #define _TIF_FSCHECK		(1 << TIF_FSCHECK)
>  
> +/* The following flags only exist on x86-64. We can't use the shift anymore

Please do not use this horrible comment syle

And what's wrong with (1UL << 32)?

> + * due to C using signed ints by default and asm using unsigned longs.
> + */
> +#ifdef CONFIG_X86_64
> +# define  TIF_NOPTI		32	/* disable PTI for this task */

No tail comments please.

> +# define _TIF_NOPTI		0x0000000100000000
> +#endif
> +

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/4] Per-task PTI activation
  2018-01-08 16:12 [PATCH RFC 0/4] Per-task PTI activation Willy Tarreau
                   ` (3 preceding siblings ...)
  2018-01-08 16:12 ` [PATCH RFC 4/4] x86/entry/pti: don't switch PGD on tasks holding flag TIF_NOPTI Willy Tarreau
@ 2018-01-08 16:59 ` Dave Hansen
  2018-01-08 17:06   ` Willy Tarreau
  2018-01-08 17:13   ` Ingo Molnar
  2018-01-09 15:31 ` Eric W. Biederman
  5 siblings, 2 replies; 53+ messages in thread
From: Dave Hansen @ 2018-01-08 16:59 UTC (permalink / raw)
  To: Willy Tarreau, linux-kernel, x86; +Cc: tglx, gnomes, torvalds, Andy Lutomirski

On 01/08/2018 08:12 AM, Willy Tarreau wrote:
> I could experiment a bit with the possibility to enable/disable PTI per
> task. Please keep in mind that it's not my area of experitise at all, but
> doing so I could recover the initial performance without disabling PTI on
> the whole system.

This cc list is way too small.  Please at *least* include me and Andy on
this kind of stuff.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI
  2018-01-08 16:12 ` [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI Willy Tarreau
  2018-01-08 16:49   ` Peter Zijlstra
@ 2018-01-08 17:02   ` Thomas Gleixner
  2018-01-08 17:10     ` Willy Tarreau
  2018-01-08 17:17     ` Ingo Molnar
  2018-01-08 17:05   ` Ingo Molnar
                     ` (2 subsequent siblings)
  4 siblings, 2 replies; 53+ messages in thread
From: Thomas Gleixner @ 2018-01-08 17:02 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: linux-kernel, x86, gnomes, torvalds

On Mon, 8 Jan 2018, Willy Tarreau wrote:
> This allows to report the current state of the PTI protection and to
> enable or disable it for the current task.
> 
> Signed-off-by: Willy Tarreau <w@1wt.eu>
> ---
>  arch/x86/include/uapi/asm/prctl.h |  3 +++
>  arch/x86/kernel/process_64.c      | 24 ++++++++++++++++++++++++
>  2 files changed, 27 insertions(+)
> 
> diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
> index 5a6aac9..1f1b5bc 100644
> --- a/arch/x86/include/uapi/asm/prctl.h
> +++ b/arch/x86/include/uapi/asm/prctl.h
> @@ -10,6 +10,9 @@
>  #define ARCH_GET_CPUID		0x1011
>  #define ARCH_SET_CPUID		0x1012
>  
> +#define ARCH_GET_NOPTI		0x1021
> +#define ARCH_SET_NOPTI		0x1022
> +
>  #define ARCH_MAP_VDSO_X32	0x2001
>  #define ARCH_MAP_VDSO_32	0x2002
>  #define ARCH_MAP_VDSO_64	0x2003
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index c754662..1686d3d 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -654,6 +654,30 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
>  		ret = put_user(base, (unsigned long __user *)arg2);
>  		break;
>  	}
> +	case ARCH_GET_NOPTI: {
> +		unsigned long flag;
> +
> +		printk(KERN_DEBUG "get1: task=%p ti=%p fl=%16lx\n", task, task_thread_info(task), task_thread_info(task)->flags);
> +		flag = !!(task_thread_info(task)->flags & _TIF_NOPTI);
> +		ret = put_user(flag, (unsigned long __user *)arg2);
> +		break;

Per task is really an odd choice. That should be per process I think, but
that of course needs synchronization of some form. Aside of that we need to
think about fork().

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-08 16:12 ` [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX Willy Tarreau
@ 2018-01-08 17:03   ` Dave Hansen
  2018-01-08 17:17     ` Willy Tarreau
                       ` (2 more replies)
  2018-01-08 17:05   ` Thomas Gleixner
  1 sibling, 3 replies; 53+ messages in thread
From: Dave Hansen @ 2018-01-08 17:03 UTC (permalink / raw)
  To: Willy Tarreau, linux-kernel, x86; +Cc: tglx, gnomes, torvalds

On 01/08/2018 08:12 AM, Willy Tarreau wrote:
> Since we're going to keep running on the same PGD when returning to
> userspace for certain performance-critical tasks, we'll need the user
> pages to be executable. So this code disables the extra protection
> that was added consisting in marking user pages _PAGE_NX so that this
> pgd remains usable for userspace.
> 
> Note: it isn't necessarily the best approach, but one way or another
>       if we want to be able to return to userspace from the kernel,
>       we'll have to have this executable anyway. Another approach
>       might consist in using another pgd for userland+kernel but
>       the current core really looks like an extra careful measure
>       to catch early bugs if any.

I don't like this.

I think the prctl() should apply to an entire process, not to a thread.
If it applies to a process, you can unpoison the PGD.  I even had code
to do this in an earlier version of the (whole system) runtime PTI
on/off stuff.

Why are you even posting half-baked hacks like this now?  Is there
something super-pressing about this set that we need to lock in a new
ABI now?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 1/4] x86/thread_info: add TIF_NOPTI to disable PTI per task
  2018-01-08 16:57   ` Thomas Gleixner
@ 2018-01-08 17:03     ` Willy Tarreau
  2018-01-08 17:14       ` Ingo Molnar
  0 siblings, 1 reply; 53+ messages in thread
From: Willy Tarreau @ 2018-01-08 17:03 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: linux-kernel, x86, gnomes, torvalds

On Mon, Jan 08, 2018 at 05:57:11PM +0100, Thomas Gleixner wrote:
> On Mon, 8 Jan 2018, Willy Tarreau wrote:
> 
> > This flag indicates that the task will not use isolated page tables.
> > 
> > Signed-off-by: Willy Tarreau <w@1wt.eu>
> > ---
> >  arch/x86/include/asm/thread_info.h | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> > index 0022333..2f92cf1 100644
> > --- a/arch/x86/include/asm/thread_info.h
> > +++ b/arch/x86/include/asm/thread_info.h
> > @@ -126,6 +126,14 @@ struct thread_info {
> >  #define _TIF_X32		(1 << TIF_X32)
> >  #define _TIF_FSCHECK		(1 << TIF_FSCHECK)
> >  
> > +/* The following flags only exist on x86-64. We can't use the shift anymore
> 
> Please do not use this horrible comment syle

You mean, the fact that there is no '/*' alone on the first line or
anything else ?

> And what's wrong with (1UL << 32)?

It fails when inherited in assembly parts. Initially the test was based
on "testq $(_TIF_NOPTI), reg" and 1UL doesn't parse there, which is why
I had to abandon it. In the latest patch I dropped "testq" for "bt" so
I didn't need the mask anymore and it wouldn't be a problem... until
someone wants to use it again.


> > + * due to C using signed ints by default and asm using unsigned longs.
> > + */
> > +#ifdef CONFIG_X86_64
> > +# define  TIF_NOPTI		32	/* disable PTI for this task */
> 
> No tail comments please.

OK but I did exactly like is done for all other flags above :-/

Thanks,
Willy

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI
  2018-01-08 16:12 ` [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI Willy Tarreau
  2018-01-08 16:49   ` Peter Zijlstra
  2018-01-08 17:02   ` Thomas Gleixner
@ 2018-01-08 17:05   ` Ingo Molnar
  2018-01-08 17:19     ` Peter Zijlstra
  2018-01-08 17:50   ` Borislav Petkov
  2018-01-08 17:54   ` Linus Torvalds
  4 siblings, 1 reply; 53+ messages in thread
From: Ingo Molnar @ 2018-01-08 17:05 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: linux-kernel, x86, tglx, gnomes, torvalds, Peter Zijlstra,
	Borislav Petkov, Josh Poimboeuf, Andy Lutomirski, Dave Hansen


(Expanded the Cc: list.)

* Willy Tarreau <w@1wt.eu> wrote:

> This allows to report the current state of the PTI protection and to
> enable or disable it for the current task.
> 
> Signed-off-by: Willy Tarreau <w@1wt.eu>
> ---
>  arch/x86/include/uapi/asm/prctl.h |  3 +++
>  arch/x86/kernel/process_64.c      | 24 ++++++++++++++++++++++++
>  2 files changed, 27 insertions(+)
> 
> diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
> index 5a6aac9..1f1b5bc 100644
> --- a/arch/x86/include/uapi/asm/prctl.h
> +++ b/arch/x86/include/uapi/asm/prctl.h
> @@ -10,6 +10,9 @@
>  #define ARCH_GET_CPUID		0x1011
>  #define ARCH_SET_CPUID		0x1012
>  
> +#define ARCH_GET_NOPTI		0x1021
> +#define ARCH_SET_NOPTI		0x1022
> +
>  #define ARCH_MAP_VDSO_X32	0x2001
>  #define ARCH_MAP_VDSO_32	0x2002
>  #define ARCH_MAP_VDSO_64	0x2003
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index c754662..1686d3d 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -654,6 +654,30 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
>  		ret = put_user(base, (unsigned long __user *)arg2);
>  		break;
>  	}
> +	case ARCH_GET_NOPTI: {
> +		unsigned long flag;
> +
> +		printk(KERN_DEBUG "get1: task=%p ti=%p fl=%16lx\n", task, task_thread_info(task), task_thread_info(task)->flags);
> +		flag = !!(task_thread_info(task)->flags & _TIF_NOPTI);
> +		ret = put_user(flag, (unsigned long __user *)arg2);
> +		break;
> +	}
> +
> +	case ARCH_SET_NOPTI:
> +		if (!capable(CAP_SYS_RAWIO))
> +			return -EPERM;
> +
> +		printk(KERN_DEBUG "set1: task=%p ti=%p fl=%16lx doit=%d arg2=%ld\n", task, task_thread_info(task), task_thread_info(task)->flags, doit, arg2);
> +
> +		if (doit) {
> +			if (arg2)
> +				task_thread_info(task)->flags |= _TIF_NOPTI;
> +			else
> +				task_thread_info(task)->flags &= ~_TIF_NOPTI;
> +
> +			printk(KERN_DEBUG "set2: task=%p ti=%p fl=%16lx\n", task, task_thread_info(task), task_thread_info(task)->flags);
> +		}
> +		break;

Btw., we could enforce the CAP_SYS_RAWIO permission check only if it's _clearing_ 
the PTI flag.

I.e. this would allow apps and runtime environments to opt into PTI, without 
having to rely on external security frameworks getting it right.

Note that there is somewhat of a fuzzy detail regarding AMD CPUs which are marked 
as 'Meltdown safe': should an explicit request to turn on PTI be honored by the 
kernel? Should that be some sort of separate 'force PTI on' attribute?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-08 16:12 ` [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX Willy Tarreau
  2018-01-08 17:03   ` Dave Hansen
@ 2018-01-08 17:05   ` Thomas Gleixner
  2018-01-08 17:28     ` Dave Hansen
  1 sibling, 1 reply; 53+ messages in thread
From: Thomas Gleixner @ 2018-01-08 17:05 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: linux-kernel, x86, gnomes, torvalds

On Mon, 8 Jan 2018, Willy Tarreau wrote:
> Since we're going to keep running on the same PGD when returning to
> userspace for certain performance-critical tasks, we'll need the user
> pages to be executable. So this code disables the extra protection
> that was added consisting in marking user pages _PAGE_NX so that this
> pgd remains usable for userspace.
> 
> Note: it isn't necessarily the best approach, but one way or another
>       if we want to be able to return to userspace from the kernel,
>       we'll have to have this executable anyway. Another approach
>       might consist in using another pgd for userland+kernel but
>       the current core really looks like an extra careful measure
>       to catch early bugs if any.

I surely want to keep that as a safety measure. The entry code is simple to
get wrong and running with the wrong pagetables by a silly mistake and
thereby undoing the protection is surely not what we want.

Need to find a free time slot to think about that.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/4] Per-task PTI activation
  2018-01-08 16:59 ` [PATCH RFC 0/4] Per-task PTI activation Dave Hansen
@ 2018-01-08 17:06   ` Willy Tarreau
  2018-01-08 17:17     ` Dave Hansen
  2018-01-08 17:13   ` Ingo Molnar
  1 sibling, 1 reply; 53+ messages in thread
From: Willy Tarreau @ 2018-01-08 17:06 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, x86, tglx, gnomes, torvalds, Andy Lutomirski

On Mon, Jan 08, 2018 at 08:59:54AM -0800, Dave Hansen wrote:
> On 01/08/2018 08:12 AM, Willy Tarreau wrote:
> > I could experiment a bit with the possibility to enable/disable PTI per
> > task. Please keep in mind that it's not my area of experitise at all, but
> > doing so I could recover the initial performance without disabling PTI on
> > the whole system.
> 
> This cc list is way too small.  Please at *least* include me and Andy on
> this kind of stuff.

You're welcome Dave. In fact you were the two I initially copied from
another thread and seeing the huge amount of CCs I preferred to stick
to recent discussions and cc x86 thinking all people involved were there.

Feel free to suggest a wider list. Maybe all the CCs of the latest PTI
patch ?

Thanks,
Willy

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI
  2018-01-08 17:02   ` Thomas Gleixner
@ 2018-01-08 17:10     ` Willy Tarreau
  2018-01-08 17:17     ` Ingo Molnar
  1 sibling, 0 replies; 53+ messages in thread
From: Willy Tarreau @ 2018-01-08 17:10 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: linux-kernel, x86, gnomes, torvalds

On Mon, Jan 08, 2018 at 06:02:35PM +0100, Thomas Gleixner wrote:
> On Mon, 8 Jan 2018, Willy Tarreau wrote:
> > This allows to report the current state of the PTI protection and to
> > enable or disable it for the current task.
> > 
> > Signed-off-by: Willy Tarreau <w@1wt.eu>
> > ---
> >  arch/x86/include/uapi/asm/prctl.h |  3 +++
> >  arch/x86/kernel/process_64.c      | 24 ++++++++++++++++++++++++
> >  2 files changed, 27 insertions(+)
> > 
> > diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
> > index 5a6aac9..1f1b5bc 100644
> > --- a/arch/x86/include/uapi/asm/prctl.h
> > +++ b/arch/x86/include/uapi/asm/prctl.h
> > @@ -10,6 +10,9 @@
> >  #define ARCH_GET_CPUID		0x1011
> >  #define ARCH_SET_CPUID		0x1012
> >  
> > +#define ARCH_GET_NOPTI		0x1021
> > +#define ARCH_SET_NOPTI		0x1022
> > +
> >  #define ARCH_MAP_VDSO_X32	0x2001
> >  #define ARCH_MAP_VDSO_32	0x2002
> >  #define ARCH_MAP_VDSO_64	0x2003
> > diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> > index c754662..1686d3d 100644
> > --- a/arch/x86/kernel/process_64.c
> > +++ b/arch/x86/kernel/process_64.c
> > @@ -654,6 +654,30 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
> >  		ret = put_user(base, (unsigned long __user *)arg2);
> >  		break;
> >  	}
> > +	case ARCH_GET_NOPTI: {
> > +		unsigned long flag;
> > +
> > +		printk(KERN_DEBUG "get1: task=%p ti=%p fl=%16lx\n", task, task_thread_info(task), task_thread_info(task)->flags);
> > +		flag = !!(task_thread_info(task)->flags & _TIF_NOPTI);
> > +		ret = put_user(flag, (unsigned long __user *)arg2);
> > +		break;
> 
> Per task is really an odd choice. That should be per process I think, but
> that of course needs synchronization of some form. Aside of that we need to
> think about fork().

I also wondered how to do it and had no idea, but I wanted to start with
something, also keeping in mind that I didn't want to risk losing in
extra checks what we managed to previously save. I agree that it's not
the best we can have. That said, other features seem to work like this,
like TIF_NOTSC and TIF_NOCPUID, so it didn't look too odd at first glance.

Willy

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 4/4] x86/entry/pti: don't switch PGD on tasks holding flag TIF_NOPTI
  2018-01-08 16:12 ` [PATCH RFC 4/4] x86/entry/pti: don't switch PGD on tasks holding flag TIF_NOPTI Willy Tarreau
@ 2018-01-08 17:11   ` Ingo Molnar
  2018-01-08 17:20   ` Dave Hansen
  2018-01-08 23:01   ` Andy Lutomirski
  2 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2018-01-08 17:11 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: linux-kernel, x86, tglx, gnomes, torvalds, Andy Lutomirski,
	Borislav Petkov, Peter Zijlstra, Josh Poimboeuf, Dave Hansen


* Willy Tarreau <w@1wt.eu> wrote:

> If a task has the TIF_NOPTI flag set, it doesn't want to experience
> page table isolation. In this case, returns from kernel to user will
> not switch the CR3, leaving it to the kernel one which already maps
> both user and kernel pages. Upon entry in the kernel, we can't check
> this flag so we simply check if CR3 was pointing to the kernel's PGD,
> indicating an earlier absence of switch, and in this case we don't
> change it.
> 
> Thanks to these changes, haproxy running under KVM went back from
> 12400 conn/s to 21000 once loaded after calling prctl().
> 
> Signed-off-by: Willy Tarreau <w@1wt.eu>
> ---
>  arch/x86/entry/calling.h | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)

Just a few nits:

> 
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index 45a63e0..054b8b7 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -1,5 +1,6 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>  #include <linux/jump_label.h>
> +#include <asm/thread_info.h>
>  #include <asm/unwind_hints.h>
>  #include <asm/cpufeatures.h>
>  #include <asm/page_types.h>
> @@ -214,6 +215,11 @@
>  .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
>  	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
>  	mov	%cr3, \scratch_reg
> +
> +	/* if we're already on the kernel PGD, we don't switch */
> +	testq $(PTI_SWITCH_PGTABLES_MASK), \scratch_reg
> +	jz .Lend_\@
> +
>  	ADJUST_KERNEL_CR3 \scratch_reg
>  	mov	\scratch_reg, %cr3
>  .Lend_\@:
> @@ -224,6 +230,12 @@
>  
>  .macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
>  	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
> +
> +	/* "NOPTI" taskflag avoids the switch */
> +	movq	PER_CPU_VAR(current_task), \scratch_reg
> +	btq	$TIF_NOPTI, TASK_TI_flags(\scratch_reg)
> +	jc	.Lend_\@

s/"NOPTI" taskflag
 /The "NOPTI" task flag

> +
>  	mov	%cr3, \scratch_reg
>  
>  	ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
> @@ -262,6 +274,13 @@
>  	ALTERNATIVE "jmp .Ldone_\@", "", X86_FEATURE_PTI
>  	movq	%cr3, \scratch_reg
>  	movq	\scratch_reg, \save_reg
> +
> +	/* if we're already on the kernel PGD, we don't switch,
> +	 * we just save the current cr3.
> +	 */
> +	testq $(PTI_SWITCH_PGTABLES_MASK), \scratch_reg
> +	jz .Ldone_\@

Proper kernel comment style please. Also:

s/cr3/CR3

> +
>  	/*
>  	 * Is the "switch mask" all zero?  That means that both of
>  	 * these are zero:
> @@ -284,6 +303,10 @@
>  .macro RESTORE_CR3 scratch_reg:req save_reg:req
>  	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
>  
> +	/* if we saved a kernel context, we didn't switch so we don't switch */
> +	testq $(PTI_SWITCH_PGTABLES_MASK), \save_reg
> +	jz .Lend_\@

Maybe clarify this a bit? How about:

	/*
	 * If we saved a kernel context on entry, we didn't switch the CR3,
	 * so we don't need to restore it on the way out either:
	 */

or so?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/4] Per-task PTI activation
  2018-01-08 16:59 ` [PATCH RFC 0/4] Per-task PTI activation Dave Hansen
  2018-01-08 17:06   ` Willy Tarreau
@ 2018-01-08 17:13   ` Ingo Molnar
  1 sibling, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2018-01-08 17:13 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Willy Tarreau, linux-kernel, x86, tglx, gnomes, torvalds,
	Andy Lutomirski


* Dave Hansen <dave.hansen@intel.com> wrote:

> On 01/08/2018 08:12 AM, Willy Tarreau wrote:
> > I could experiment a bit with the possibility to enable/disable PTI per
> > task. Please keep in mind that it's not my area of experitise at all, but
> > doing so I could recover the initial performance without disabling PTI on
> > the whole system.
> 
> This cc list is way too small.  Please at *least* include me and Andy on
> this kind of stuff.

Yeah. The full x86-PTI Cc list that I typically add to new PTI commits is:

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 1/4] x86/thread_info: add TIF_NOPTI to disable PTI per task
  2018-01-08 17:03     ` Willy Tarreau
@ 2018-01-08 17:14       ` Ingo Molnar
  0 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2018-01-08 17:14 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Thomas Gleixner, linux-kernel, x86, gnomes, torvalds


* Willy Tarreau <w@1wt.eu> wrote:

> On Mon, Jan 08, 2018 at 05:57:11PM +0100, Thomas Gleixner wrote:
> > On Mon, 8 Jan 2018, Willy Tarreau wrote:
> > 
> > > This flag indicates that the task will not use isolated page tables.
> > > 
> > > Signed-off-by: Willy Tarreau <w@1wt.eu>
> > > ---
> > >  arch/x86/include/asm/thread_info.h | 8 ++++++++
> > >  1 file changed, 8 insertions(+)
> > > 
> > > diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> > > index 0022333..2f92cf1 100644
> > > --- a/arch/x86/include/asm/thread_info.h
> > > +++ b/arch/x86/include/asm/thread_info.h
> > > @@ -126,6 +126,14 @@ struct thread_info {
> > >  #define _TIF_X32		(1 << TIF_X32)
> > >  #define _TIF_FSCHECK		(1 << TIF_FSCHECK)
> > >  
> > > +/* The following flags only exist on x86-64. We can't use the shift anymore
> > 
> > Please do not use this horrible comment syle
> 
> You mean, the fact that there is no '/*' alone on the first line or
> anything else ?

please use the customary (multi-line) comment style:

  /*
   * Comment .....
   * ...... goes here.
   */

specified in Documentation/CodingStyle.

Thanks,

        Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-08 17:03   ` Dave Hansen
@ 2018-01-08 17:17     ` Willy Tarreau
  2018-01-08 17:23       ` Dave Hansen
  2018-01-08 17:21     ` Ingo Molnar
  2018-01-08 23:05     ` Andy Lutomirski
  2 siblings, 1 reply; 53+ messages in thread
From: Willy Tarreau @ 2018-01-08 17:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, x86, tglx, gnomes, torvalds, Ingo Molnar,
	Peter Zijlstra, Borislav Petkov, Josh Poimboeuf, Andy Lutomirski

[ expanded the Cc list a bit ]

On Mon, Jan 08, 2018 at 09:03:36AM -0800, Dave Hansen wrote:
> On 01/08/2018 08:12 AM, Willy Tarreau wrote:
> > Since we're going to keep running on the same PGD when returning to
> > userspace for certain performance-critical tasks, we'll need the user
> > pages to be executable. So this code disables the extra protection
> > that was added consisting in marking user pages _PAGE_NX so that this
> > pgd remains usable for userspace.
> > 
> > Note: it isn't necessarily the best approach, but one way or another
> >       if we want to be able to return to userspace from the kernel,
> >       we'll have to have this executable anyway. Another approach
> >       might consist in using another pgd for userland+kernel but
> >       the current core really looks like an extra careful measure
> >       to catch early bugs if any.
> 
> I don't like this.

This is the purpose of the review.

> I think the prctl() should apply to an entire process, not to a thread.

As I mentionned in another mail, I didn't know how to do it, even less
how to do it fast enough so that we didn't add more cycles to the syscall
code.

> If it applies to a process, you can unpoison the PGD.  I even had code
> to do this in an earlier version of the (whole system) runtime PTI
> on/off stuff.
> 
> Why are you even posting half-baked hacks like this now?  Is there
> something super-pressing about this set that we need to lock in a new
> ABI now?

No need to lock in or whatever. It's just that a number of us simply
cannot use the current protection due to the huge cost it comes with
for their specific workload, and having to choose between performance
or protection remains a problem. Having a bit more available time and
being directly concerned by this problem I tried to propose something
to 1) see if there was any hope and 2) possibly help things move forward
in this direction. The patches are marked RFC, they're for discussing,
not for merging.

Willy

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/4] Per-task PTI activation
  2018-01-08 17:06   ` Willy Tarreau
@ 2018-01-08 17:17     ` Dave Hansen
  0 siblings, 0 replies; 53+ messages in thread
From: Dave Hansen @ 2018-01-08 17:17 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: linux-kernel, x86, tglx, gnomes, torvalds, Andy Lutomirski

On 01/08/2018 09:06 AM, Willy Tarreau wrote:
> On Mon, Jan 08, 2018 at 08:59:54AM -0800, Dave Hansen wrote:
>> On 01/08/2018 08:12 AM, Willy Tarreau wrote:
>>> I could experiment a bit with the possibility to enable/disable PTI per
>>> task. Please keep in mind that it's not my area of experitise at all, but
>>> doing so I could recover the initial performance without disabling PTI on
>>> the whole system.
>> This cc list is way too small.  Please at *least* include me and Andy on
>> this kind of stuff.
> You're welcome Dave. In fact you were the two I initially copied from
> another thread and seeing the huge amount of CCs I preferred to stick
> to recent discussions and cc x86 thinking all people involved were there.

x86@ is only the maintainers.  It isn't a general mailing list.

> Feel free to suggest a wider list. Maybe all the CCs of the latest PTI
> patch ?

get_maintainer.pl is your friend.  For patch 4/4 it gives this:

> Thomas Gleixner <tglx@linutronix.de> (maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),commit_signer:6/8=75%,authored:1/8=12%)
> Ingo Molnar <mingo@redhat.com> (maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),commit_signer:7/8=88%)
> "H. Peter Anvin" <hpa@zytor.com> (maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT))
> x86@kernel.org (maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT))
> Borislav Petkov <bp@suse.de> (commit_signer:2/8=25%)
> Andy Lutomirski <luto@kernel.org> (commit_signer:2/8=25%,authored:2/8=25%,added_lines:21/187=11%,removed_lines:52/67=78%)
> Peter Zijlstra <peterz@infradead.org> (commit_signer:2/8=25%,authored:2/8=25%,added_lines:87/187=47%,removed_lines:15/67=22%)
> Dave Hansen <dave.hansen@linux.intel.com> (authored:1/8=12%,added_lines:66/187=35%)
> Josh Poimboeuf <jpoimboe@redhat.com> (authored:1/8=12%)
> linux-kernel@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT))

Which is shockingly close to what Ingo suggested.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI
  2018-01-08 17:02   ` Thomas Gleixner
  2018-01-08 17:10     ` Willy Tarreau
@ 2018-01-08 17:17     ` Ingo Molnar
  2018-01-08 17:26       ` Thomas Gleixner
  1 sibling, 1 reply; 53+ messages in thread
From: Ingo Molnar @ 2018-01-08 17:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Willy Tarreau, linux-kernel, x86, gnomes, torvalds, Dave Hansen,
	Andy Lutomirski, Borislav Petkov, Peter Zijlstra, Josh Poimboeuf


* Thomas Gleixner <tglx@linutronix.de> wrote:

> On Mon, 8 Jan 2018, Willy Tarreau wrote:
> > This allows to report the current state of the PTI protection and to
> > enable or disable it for the current task.
> > 
> > Signed-off-by: Willy Tarreau <w@1wt.eu>
> > ---
> >  arch/x86/include/uapi/asm/prctl.h |  3 +++
> >  arch/x86/kernel/process_64.c      | 24 ++++++++++++++++++++++++
> >  2 files changed, 27 insertions(+)
> > 
> > diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
> > index 5a6aac9..1f1b5bc 100644
> > --- a/arch/x86/include/uapi/asm/prctl.h
> > +++ b/arch/x86/include/uapi/asm/prctl.h
> > @@ -10,6 +10,9 @@
> >  #define ARCH_GET_CPUID		0x1011
> >  #define ARCH_SET_CPUID		0x1012
> >  
> > +#define ARCH_GET_NOPTI		0x1021
> > +#define ARCH_SET_NOPTI		0x1022
> > +
> >  #define ARCH_MAP_VDSO_X32	0x2001
> >  #define ARCH_MAP_VDSO_32	0x2002
> >  #define ARCH_MAP_VDSO_64	0x2003
> > diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> > index c754662..1686d3d 100644
> > --- a/arch/x86/kernel/process_64.c
> > +++ b/arch/x86/kernel/process_64.c
> > @@ -654,6 +654,30 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
> >  		ret = put_user(base, (unsigned long __user *)arg2);
> >  		break;
> >  	}
> > +	case ARCH_GET_NOPTI: {
> > +		unsigned long flag;
> > +
> > +		printk(KERN_DEBUG "get1: task=%p ti=%p fl=%16lx\n", task, task_thread_info(task), task_thread_info(task)->flags);
> > +		flag = !!(task_thread_info(task)->flags & _TIF_NOPTI);
> > +		ret = put_user(flag, (unsigned long __user *)arg2);
> > +		break;
> 
> Per task is really an odd choice. That should be per process I think, but
> that of course needs synchronization of some form. Aside of that we need to
> think about fork().

So per task (thread) is the most natural approach to low level asm flaggery.

Making it per thread also makes some sense conceptually: in a complex 
multi-threaded runtime implementation some threads might never execute
'untrusted' code, some might. No need to penalize the 'server' threads.

Not sure we want that complexity though, and while it _should_ work I think, 
mostly, there might be some unexpected implications.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI
  2018-01-08 17:05   ` Ingo Molnar
@ 2018-01-08 17:19     ` Peter Zijlstra
  2018-01-08 17:26       ` Ingo Molnar
  0 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2018-01-08 17:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Willy Tarreau, linux-kernel, x86, tglx, gnomes, torvalds,
	Borislav Petkov, Josh Poimboeuf, Andy Lutomirski, Dave Hansen

On Mon, Jan 08, 2018 at 06:05:31PM +0100, Ingo Molnar wrote:
> Note that there is somewhat of a fuzzy detail regarding AMD CPUs which are marked 
> as 'Meltdown safe': should an explicit request to turn on PTI be honored by the 
> kernel? Should that be some sort of separate 'force PTI on' attribute?

AMD should not have FEATURE_PTI enabled, and thus not end up in any code
that cares about TIF_NOPTI.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 4/4] x86/entry/pti: don't switch PGD on tasks holding flag TIF_NOPTI
  2018-01-08 16:12 ` [PATCH RFC 4/4] x86/entry/pti: don't switch PGD on tasks holding flag TIF_NOPTI Willy Tarreau
  2018-01-08 17:11   ` Ingo Molnar
@ 2018-01-08 17:20   ` Dave Hansen
  2018-01-08 18:12     ` Willy Tarreau
  2018-01-08 23:01   ` Andy Lutomirski
  2 siblings, 1 reply; 53+ messages in thread
From: Dave Hansen @ 2018-01-08 17:20 UTC (permalink / raw)
  To: Willy Tarreau, linux-kernel, x86; +Cc: tglx, gnomes, torvalds, Andy Lutomirski

Please cc Andy on this stuff.  I can't imagine patching entry_64.S at
this point without cc'ing him.  *Surely* you didn't even bother to run
get_maintainer.pl on this.

> @@ -214,6 +215,11 @@
>  .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
>  	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
>  	mov	%cr3, \scratch_reg
> +
> +	/* if we're already on the kernel PGD, we don't switch */
> +	testq $(PTI_SWITCH_PGTABLES_MASK), \scratch_reg
> +	jz .Lend_\@
> +
>  	ADJUST_KERNEL_CR3 \scratch_reg
>  	mov	\scratch_reg, %cr3
>  .Lend_\@:

This is an optimization that we can do generally without your feature.
Actually, it would be a welcome bit of benchmarking if you could see if
just this hunk helps your workload.

You touched on it in the description, but this is a *very* clever way to
do what you need without needing to look at the task flag at
user->kernel entry which also happens to be a place you don't have
task_struct mapped.  It *greatly* simplifies what this would have to do
otherwise.

That needs calling out specifically though.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-08 17:03   ` Dave Hansen
  2018-01-08 17:17     ` Willy Tarreau
@ 2018-01-08 17:21     ` Ingo Molnar
  2018-01-08 23:05     ` Andy Lutomirski
  2 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2018-01-08 17:21 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Willy Tarreau, linux-kernel, x86, tglx, gnomes, torvalds,
	Borislav Petkov, Andy Lutomirski, Josh Poimboeuf, Peter Zijlstra


* Dave Hansen <dave.hansen@intel.com> wrote:

> On 01/08/2018 08:12 AM, Willy Tarreau wrote:
> > Since we're going to keep running on the same PGD when returning to
> > userspace for certain performance-critical tasks, we'll need the user
> > pages to be executable. So this code disables the extra protection
> > that was added consisting in marking user pages _PAGE_NX so that this
> > pgd remains usable for userspace.
> > 
> > Note: it isn't necessarily the best approach, but one way or another
> >       if we want to be able to return to userspace from the kernel,
> >       we'll have to have this executable anyway. Another approach
> >       might consist in using another pgd for userland+kernel but
> >       the current core really looks like an extra careful measure
> >       to catch early bugs if any.
> 
> I don't like this.
> 
> I think the prctl() should apply to an entire process, not to a thread.
> If it applies to a process, you can unpoison the PGD.  I even had code
> to do this in an earlier version of the (whole system) runtime PTI
> on/off stuff.
> 
> Why are you even posting half-baked hacks like this now?  Is there
> something super-pressing about this set that we need to lock in a new
> ABI now?

Arguably it was posted as an RFC patch-set, to get feedback early on.

The motivation is clear enough from the announcement I think: to speed up the 
haproxy performance almost two-fold, without sacrificing the overall security 
given by PTI against the Meltdown attack. haproxy does not require PTI, as it 
never executes untrusted code.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-08 17:17     ` Willy Tarreau
@ 2018-01-08 17:23       ` Dave Hansen
  2018-01-08 17:30         ` Peter Zijlstra
  0 siblings, 1 reply; 53+ messages in thread
From: Dave Hansen @ 2018-01-08 17:23 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: linux-kernel, x86, tglx, gnomes, torvalds, Ingo Molnar,
	Peter Zijlstra, Borislav Petkov, Josh Poimboeuf, Andy Lutomirski

On 01/08/2018 09:17 AM, Willy Tarreau wrote:
>> I think the prctl() should apply to an entire process, not to a thread.
> 
> As I mentionned in another mail, I didn't know how to do it, even less
> how to do it fast enough so that we didn't add more cycles to the syscall
> code.

You can _implement_ it with a task thread if you want.  Just spray it
across all threads at the prctl()-time instead of a single thread.
It'll take a wee bit of locking.

I just don't think the API should apply to a single thread.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI
  2018-01-08 17:17     ` Ingo Molnar
@ 2018-01-08 17:26       ` Thomas Gleixner
  2018-01-08 17:46         ` Ingo Molnar
  0 siblings, 1 reply; 53+ messages in thread
From: Thomas Gleixner @ 2018-01-08 17:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Willy Tarreau, linux-kernel, x86, gnomes, torvalds, Dave Hansen,
	Andy Lutomirski, Borislav Petkov, Peter Zijlstra, Josh Poimboeuf

On Mon, 8 Jan 2018, Ingo Molnar wrote:
> * Thomas Gleixner <tglx@linutronix.de> wrote:
> > Per task is really an odd choice. That should be per process I think, but
> > that of course needs synchronization of some form. Aside of that we need to
> > think about fork().
> 
> So per task (thread) is the most natural approach to low level asm flaggery.

Well, yes and no. PTI is a property of the mm/pgdir and that's process
wide.
 
> Making it per thread also makes some sense conceptually: in a complex 
> multi-threaded runtime implementation some threads might never execute
> 'untrusted' code, some might. No need to penalize the 'server' threads.

If one thread runs untrusted code then your 'trusted' thread is not longer
trusted either as they share everything.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI
  2018-01-08 17:19     ` Peter Zijlstra
@ 2018-01-08 17:26       ` Ingo Molnar
  0 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2018-01-08 17:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Willy Tarreau, linux-kernel, x86, tglx, gnomes, torvalds,
	Borislav Petkov, Josh Poimboeuf, Andy Lutomirski, Dave Hansen


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Jan 08, 2018 at 06:05:31PM +0100, Ingo Molnar wrote:
> > Note that there is somewhat of a fuzzy detail regarding AMD CPUs which are marked 
> > as 'Meltdown safe': should an explicit request to turn on PTI be honored by the 
> > kernel? Should that be some sort of separate 'force PTI on' attribute?
> 
> AMD should not have FEATURE_PTI enabled, and thus not end up in any code
> that cares about TIF_NOPTI.

I know, this is the status quo.

Nevertheless:

 - if someone disbelieves AMD's claims and wants to force-enable it, should it be 
   possible without patching the kernel?

 - or if someone wants to test it on AMD to increase test coverage. pti=on will 
   already be force-enable it on AMD CPUs.

Likewise, there's the counter part on the app level PTI disabling/enabling 
ABI functionality as well:

 - should there be a way for sysadmins to force PTI enabled, even on apps that 
   want to turn it off?

 - should there be a way for sysadmins to force PTI disabled, even for apps that 
   want to turn it on?

If we decide that we want to allow fine-grained, per app control of PTI, then all 
of these look valid scenarios to me.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-08 17:05   ` Thomas Gleixner
@ 2018-01-08 17:28     ` Dave Hansen
  2018-01-08 17:50       ` Ingo Molnar
  0 siblings, 1 reply; 53+ messages in thread
From: Dave Hansen @ 2018-01-08 17:28 UTC (permalink / raw)
  To: Thomas Gleixner, Willy Tarreau; +Cc: linux-kernel, x86, gnomes, torvalds

On 01/08/2018 09:05 AM, Thomas Gleixner wrote:
> On Mon, 8 Jan 2018, Willy Tarreau wrote:
>> Since we're going to keep running on the same PGD when returning to
>> userspace for certain performance-critical tasks, we'll need the user
>> pages to be executable. So this code disables the extra protection
>> that was added consisting in marking user pages _PAGE_NX so that this
>> pgd remains usable for userspace.
>>
>> Note: it isn't necessarily the best approach, but one way or another
>>       if we want to be able to return to userspace from the kernel,
>>       we'll have to have this executable anyway. Another approach
>>       might consist in using another pgd for userland+kernel but
>>       the current core really looks like an extra careful measure
>>       to catch early bugs if any.
> 
> I surely want to keep that as a safety measure. The entry code is simple to
> get wrong and running with the wrong pagetables by a silly mistake and
> thereby undoing the protection is surely not what we want.
> 
> Need to find a free time slot to think about that.

This does get immensely easier if we choose a mode at exec() (or fork()
even) and never change it.  The prctl() _could_ just be a flag to tell
what your children should do.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-08 17:23       ` Dave Hansen
@ 2018-01-08 17:30         ` Peter Zijlstra
  2018-01-08 17:49           ` Willy Tarreau
  0 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2018-01-08 17:30 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Willy Tarreau, linux-kernel, x86, tglx, gnomes, torvalds,
	Ingo Molnar, Borislav Petkov, Josh Poimboeuf, Andy Lutomirski

On Mon, Jan 08, 2018 at 09:23:49AM -0800, Dave Hansen wrote:
> On 01/08/2018 09:17 AM, Willy Tarreau wrote:
> >> I think the prctl() should apply to an entire process, not to a thread.
> > 
> > As I mentionned in another mail, I didn't know how to do it, even less
> > how to do it fast enough so that we didn't add more cycles to the syscall
> > code.
> 
> You can _implement_ it with a task thread if you want.  Just spray it
> across all threads at the prctl()-time instead of a single thread.
> It'll take a wee bit of locking.
> 
> I just don't think the API should apply to a single thread.

It is surprisingly hard to find all tasks that share an mm. Finding all
threads in a threadgroup is easy, but we have CLONE_THREAD and CLONE_VM
as separate bits.

In any case, aside from that, setting this remotely is indeed
'intersting'.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI
  2018-01-08 17:26       ` Thomas Gleixner
@ 2018-01-08 17:46         ` Ingo Molnar
  0 siblings, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2018-01-08 17:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Willy Tarreau, linux-kernel, x86, gnomes, torvalds, Dave Hansen,
	Andy Lutomirski, Borislav Petkov, Peter Zijlstra, Josh Poimboeuf


* Thomas Gleixner <tglx@linutronix.de> wrote:

> On Mon, 8 Jan 2018, Ingo Molnar wrote:
> > * Thomas Gleixner <tglx@linutronix.de> wrote:
> > > Per task is really an odd choice. That should be per process I think, but
> > > that of course needs synchronization of some form. Aside of that we need to
> > > think about fork().
> > 
> > So per task (thread) is the most natural approach to low level asm flaggery.
> 
> Well, yes and no. PTI is a property of the mm/pgdir and that's process
> wide.

Yes and no: the contents of CR3 is a property of the _thread_ (CPU), so we do have 
the implementational degree of freedom of executing user-space with either the 
user+kernel PGD or the user-only PGD.

The question I'm asking is whether it makes sense to offer this - the 
_possibility_ is clearly there unless I'm missing something.

> > Making it per thread also makes some sense conceptually: in a complex 
> > multi-threaded runtime implementation some threads might never execute
> > 'untrusted' code, some might. No need to penalize the 'server' threads.
> 
> If one thread runs untrusted code then your 'trusted' thread is not longer
> trusted either as they share everything.

I think you are missing the wider context here: a number of runtime environments 
(browsers, JIT compilers, etc.) are able to run "untrusted code" while still 
applying heavy sandboxing constraints.

In some of those runtime models it's very much possible to have untrusted code in 
the same address space as trusted code. BPF is an example for such a runtime 
model: we can run user-defined BPF JIT-ed code in kernel context.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-08 17:30         ` Peter Zijlstra
@ 2018-01-08 17:49           ` Willy Tarreau
  0 siblings, 0 replies; 53+ messages in thread
From: Willy Tarreau @ 2018-01-08 17:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, linux-kernel, x86, tglx, gnomes, torvalds,
	Ingo Molnar, Borislav Petkov, Josh Poimboeuf, Andy Lutomirski

On Mon, Jan 08, 2018 at 06:30:32PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 08, 2018 at 09:23:49AM -0800, Dave Hansen wrote:
> > On 01/08/2018 09:17 AM, Willy Tarreau wrote:
> > >> I think the prctl() should apply to an entire process, not to a thread.
> > > 
> > > As I mentionned in another mail, I didn't know how to do it, even less
> > > how to do it fast enough so that we didn't add more cycles to the syscall
> > > code.
> > 
> > You can _implement_ it with a task thread if you want.  Just spray it
> > across all threads at the prctl()-time instead of a single thread.
> > It'll take a wee bit of locking.
> > 
> > I just don't think the API should apply to a single thread.
> 
> It is surprisingly hard to find all tasks that share an mm. Finding all
> threads in a threadgroup is easy, but we have CLONE_THREAD and CLONE_VM
> as separate bits.
> 
> In any case, aside from that, setting this remotely is indeed
> 'intersting'.

Then couldn't we instead detect that there's more than one thread in
the process and refuse to apply prctl() to prevent the behaviour from
becoming inconsistent ? This would seem reasonable after all, we want
to do this very early upon startup, it probably doesn't make sense to
change one's mind after threads have been created (or maybe only to
re-enable protection on some of them ?).

Willy

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-08 17:28     ` Dave Hansen
@ 2018-01-08 17:50       ` Ingo Molnar
  2018-01-08 18:25         ` Alan Cox
  2018-01-08 18:44         ` Dave Hansen
  0 siblings, 2 replies; 53+ messages in thread
From: Ingo Molnar @ 2018-01-08 17:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Willy Tarreau, linux-kernel, x86, gnomes, torvalds


* Dave Hansen <dave.hansen@intel.com> wrote:

> On 01/08/2018 09:05 AM, Thomas Gleixner wrote:
> > On Mon, 8 Jan 2018, Willy Tarreau wrote:
> >> Since we're going to keep running on the same PGD when returning to
> >> userspace for certain performance-critical tasks, we'll need the user
> >> pages to be executable. So this code disables the extra protection
> >> that was added consisting in marking user pages _PAGE_NX so that this
> >> pgd remains usable for userspace.
> >>
> >> Note: it isn't necessarily the best approach, but one way or another
> >>       if we want to be able to return to userspace from the kernel,
> >>       we'll have to have this executable anyway. Another approach
> >>       might consist in using another pgd for userland+kernel but
> >>       the current core really looks like an extra careful measure
> >>       to catch early bugs if any.
> > 
> > I surely want to keep that as a safety measure. The entry code is simple to
> > get wrong and running with the wrong pagetables by a silly mistake and
> > thereby undoing the protection is surely not what we want.
> > 
> > Need to find a free time slot to think about that.
> 
> This does get immensely easier if we choose a mode at exec() (or fork()
> even) and never change it.  The prctl() _could_ just be a flag to tell
> what your children should do.

Switching PTI on/off for a whole process would be nightmarish.

The simplest model is indeed child inheritance tree propagation - plus perhaps the 
ability for a thread to change its *own* PTI status, which obviously doesn't 
create any deep "process lookup" or cross-CPU complications.

( Note that here I only mean "simple to implement" - we might decide to not offer
  the ABI. )

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI
  2018-01-08 16:12 ` [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI Willy Tarreau
                     ` (2 preceding siblings ...)
  2018-01-08 17:05   ` Ingo Molnar
@ 2018-01-08 17:50   ` Borislav Petkov
  2018-01-08 17:54   ` Linus Torvalds
  4 siblings, 0 replies; 53+ messages in thread
From: Borislav Petkov @ 2018-01-08 17:50 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: linux-kernel, x86, tglx, gnomes, torvalds

On Mon, Jan 08, 2018 at 05:12:17PM +0100, Willy Tarreau wrote:
> This allows to report the current state of the PTI protection and to
> enable or disable it for the current task.
> 
> Signed-off-by: Willy Tarreau <w@1wt.eu>
> ---
>  arch/x86/include/uapi/asm/prctl.h |  3 +++
>  arch/x86/kernel/process_64.c      | 24 ++++++++++++++++++++++++
>  2 files changed, 27 insertions(+)
> 
> diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
> index 5a6aac9..1f1b5bc 100644
> --- a/arch/x86/include/uapi/asm/prctl.h
> +++ b/arch/x86/include/uapi/asm/prctl.h
> @@ -10,6 +10,9 @@
>  #define ARCH_GET_CPUID		0x1011
>  #define ARCH_SET_CPUID		0x1012
>  
> +#define ARCH_GET_NOPTI		0x1021
> +#define ARCH_SET_NOPTI		0x1022
> +
>  #define ARCH_MAP_VDSO_X32	0x2001
>  #define ARCH_MAP_VDSO_32	0x2002
>  #define ARCH_MAP_VDSO_64	0x2003
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index c754662..1686d3d 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -654,6 +654,30 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
>  		ret = put_user(base, (unsigned long __user *)arg2);
>  		break;
>  	}
> +	case ARCH_GET_NOPTI: {

Please add a CONFIG_ item for this so that people can disable those.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI
  2018-01-08 16:12 ` [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI Willy Tarreau
                     ` (3 preceding siblings ...)
  2018-01-08 17:50   ` Borislav Petkov
@ 2018-01-08 17:54   ` Linus Torvalds
  2018-01-08 18:22     ` Willy Tarreau
  2018-01-08 20:35     ` Willy Tarreau
  4 siblings, 2 replies; 53+ messages in thread
From: Linus Torvalds @ 2018-01-08 17:54 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Linux Kernel Mailing List, the arch/x86 maintainers,
	Thomas Gleixner, One Thousand Gnomes

On Mon, Jan 8, 2018 at 8:12 AM, Willy Tarreau <w@1wt.eu> wrote:
> This allows to report the current state of the PTI protection and to
> enable or disable it for the current task.

So I really think that this needs to be done up-front to avoid a lot
of complexity. And per mm.

If the process is already threaded (so the mm has multiple users),
it's too late to start playing games with PTI.

In fact, maybe the whole thing needs to be controlled before "exec"
happens, so that we have the knowledge as we build up the mm, rather
than being "runtime" dynamic at all.

But in no case should you even try to handle the multi-threaded case -
just error out for trying to change the PTI setting.

So make the thing per-mm, and then at task switch time as you switch
mms, you set the bit in a percpu variable for testing at kernel entry.

                    Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 4/4] x86/entry/pti: don't switch PGD on tasks holding flag TIF_NOPTI
  2018-01-08 17:20   ` Dave Hansen
@ 2018-01-08 18:12     ` Willy Tarreau
  0 siblings, 0 replies; 53+ messages in thread
From: Willy Tarreau @ 2018-01-08 18:12 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, x86, tglx, gnomes, torvalds, Andy Lutomirski

On Mon, Jan 08, 2018 at 09:20:08AM -0800, Dave Hansen wrote:
> > @@ -214,6 +215,11 @@
> >  .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
> >  	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
> >  	mov	%cr3, \scratch_reg
> > +
> > +	/* if we're already on the kernel PGD, we don't switch */
> > +	testq $(PTI_SWITCH_PGTABLES_MASK), \scratch_reg
> > +	jz .Lend_\@
> > +
> >  	ADJUST_KERNEL_CR3 \scratch_reg
> >  	mov	\scratch_reg, %cr3
> >  .Lend_\@:
> 
> This is an optimization that we can do generally without your feature.
> Actually, it would be a welcome bit of benchmarking if you could see if
> just this hunk helps your workload.

Well, this alone provides no gains at all, as I can see when running
haproxy without the nopti wrapper. And it makes sense, as there is no
reason without this patch series for coming from userspace with CR3
pointing to the kernel's PGD.

> You touched on it in the description, but this is a *very* clever way to
> do what you need without needing to look at the task flag at
> user->kernel entry which also happens to be a place you don't have
> task_struct mapped.  It *greatly* simplifies what this would have to do
> otherwise.

In fact when I saw my kernel fail to boot when trying to read the task
flag, I quickly realized I needed something else to check before
entering the kernel ;-)

Willy

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI
  2018-01-08 17:54   ` Linus Torvalds
@ 2018-01-08 18:22     ` Willy Tarreau
  2018-01-08 20:49       ` Thomas Gleixner
  2018-01-08 20:35     ` Willy Tarreau
  1 sibling, 1 reply; 53+ messages in thread
From: Willy Tarreau @ 2018-01-08 18:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, the arch/x86 maintainers,
	Thomas Gleixner, One Thousand Gnomes

On Mon, Jan 08, 2018 at 09:54:05AM -0800, Linus Torvalds wrote:
> On Mon, Jan 8, 2018 at 8:12 AM, Willy Tarreau <w@1wt.eu> wrote:
> > This allows to report the current state of the PTI protection and to
> > enable or disable it for the current task.
> 
> So I really think that this needs to be done up-front to avoid a lot
> of complexity. And per mm.
> 
> If the process is already threaded (so the mm has multiple users),
> it's too late to start playing games with PTI.
> 
> In fact, maybe the whole thing needs to be controlled before "exec"
> happens, so that we have the knowledge as we build up the mm, rather
> than being "runtime" dynamic at all.

In fact I initially wanted to start with a prctl() flag that acts upon
exec because it looked simpler. But then I realized that this would
always require a wrapper and that it's not necessarily more convenient.
It even makes permission checks more complicated from an administration
perspective, and I'd hate to see such a wrapper ending up setuid...

> But in no case should you even try to handle the multi-threaded case -
> just error out for trying to change the PTI setting.

I totally agree here. Like Ingo says, there *may* be useful cases for
this but I'm not sure we're prepared to address a new class of bugs
caused by this yet. And now I can get rid of GET_NOPTI, it was just for
debugging.

> So make the thing per-mm, and then at task switch time as you switch
> mms, you set the bit in a percpu variable for testing at kernel entry.

I'll see how to do that, this is not yet 100% clear to me, I'm still
discovering (this code has immensely changed since last time I *really*
dug into it). So I suspect I'll have to set this variable in
__switch_to() based on this other MM flag.

I'll study this.

Thanks,
Willy

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-08 17:50       ` Ingo Molnar
@ 2018-01-08 18:25         ` Alan Cox
  2018-01-08 18:35           ` Ingo Molnar
  2018-01-08 18:35           ` Linus Torvalds
  2018-01-08 18:44         ` Dave Hansen
  1 sibling, 2 replies; 53+ messages in thread
From: Alan Cox @ 2018-01-08 18:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dave Hansen, Thomas Gleixner, Willy Tarreau, linux-kernel, x86, torvalds

> The simplest model is indeed child inheritance tree propagation - plus perhaps the 
> ability for a thread to change its *own* PTI status, which obviously doesn't 
> create any deep "process lookup" or cross-CPU complications.
> 
> ( Note that here I only mean "simple to implement" - we might decide to not offer
>   the ABI. )

I still think cgroups are the best model for this. In particular it
naturally fits things like containers, or network facing apps that fork
helpers.

Secondly when you are looking at barrier semantics between client/client
a cgroup is much more natural as a way to group processes together who
don't need to be protected from each other as they are trusting each
other. (Or we could just harcode this based upon ptraceability ?)

Alan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-08 18:25         ` Alan Cox
@ 2018-01-08 18:35           ` Ingo Molnar
  2018-01-08 18:35           ` Linus Torvalds
  1 sibling, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2018-01-08 18:35 UTC (permalink / raw)
  To: Alan Cox
  Cc: Dave Hansen, Thomas Gleixner, Willy Tarreau, linux-kernel, x86, torvalds


* Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote:

> > The simplest model is indeed child inheritance tree propagation - plus perhaps the 
> > ability for a thread to change its *own* PTI status, which obviously doesn't 
> > create any deep "process lookup" or cross-CPU complications.
> > 
> > ( Note that here I only mean "simple to implement" - we might decide to not offer
> >   the ABI. )
> 
> I still think cgroups are the best model for this. In particular it
> naturally fits things like containers, or network facing apps that fork
> helpers.

I think the suggested exec() time inheritance model would naturally also cover 
cgroups (without tying the ABI to cgroups) - as containers typically get inherited 
from a single binary. A bit like how various personality bits get propagated.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-08 18:25         ` Alan Cox
  2018-01-08 18:35           ` Ingo Molnar
@ 2018-01-08 18:35           ` Linus Torvalds
  1 sibling, 0 replies; 53+ messages in thread
From: Linus Torvalds @ 2018-01-08 18:35 UTC (permalink / raw)
  To: Alan Cox
  Cc: Ingo Molnar, Dave Hansen, Thomas Gleixner, Willy Tarreau,
	Linux Kernel Mailing List, the arch/x86 maintainers

On Mon, Jan 8, 2018 at 10:25 AM, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote:
>
> I still think cgroups are the best model for this. In particular it
> naturally fits things like containers, or network facing apps that fork
> helpers.
>
> Secondly when you are looking at barrier semantics between client/client
> a cgroup is much more natural as a way to group processes together who
> don't need to be protected from each other as they are trusting each
> other. (Or we could just harcode this based upon ptraceability ?)

I agree that cgroups would be fairly natural, but I do think we could
look at things like simply trusted users too ("running as root? Yeah,
we're not going to try to protect the kernel from you") and/or trusted
binaries.

But all of those things are likely things that can easily be
determined at execve() time.

                      Linus

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-08 17:50       ` Ingo Molnar
  2018-01-08 18:25         ` Alan Cox
@ 2018-01-08 18:44         ` Dave Hansen
  1 sibling, 0 replies; 53+ messages in thread
From: Dave Hansen @ 2018-01-08 18:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Willy Tarreau, linux-kernel, x86, gnomes, torvalds

On 01/08/2018 09:50 AM, Ingo Molnar wrote:
>> This does get immensely easier if we choose a mode at exec() (or fork()
>> even) and never change it.  The prctl() _could_ just be a flag to tell
>> what your children should do.
> Switching PTI on/off for a whole process would be nightmarish.

Yeah, totally.  I meant "future children" aka. things *after* fork().

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI
  2018-01-08 17:54   ` Linus Torvalds
  2018-01-08 18:22     ` Willy Tarreau
@ 2018-01-08 20:35     ` Willy Tarreau
  1 sibling, 0 replies; 53+ messages in thread
From: Willy Tarreau @ 2018-01-08 20:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, the arch/x86 maintainers,
	Thomas Gleixner, One Thousand Gnomes, Andy Lutomirski,
	Borislav Petkov, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Josh Poimboeuf, H. Peter Anvin

[ increased the CC list this time ]

On Mon, Jan 08, 2018 at 09:54:05AM -0800, Linus Torvalds wrote:
> On Mon, Jan 8, 2018 at 8:12 AM, Willy Tarreau <w@1wt.eu> wrote:
> > This allows to report the current state of the PTI protection and to
> > enable or disable it for the current task.
> 
> So I really think that this needs to be done up-front to avoid a lot
> of complexity. And per mm.
> 
> If the process is already threaded (so the mm has multiple users),
> it's too late to start playing games with PTI.
> 
> In fact, maybe the whole thing needs to be controlled before "exec"
> happens, so that we have the knowledge as we build up the mm, rather
> than being "runtime" dynamic at all.
> 
> But in no case should you even try to handle the multi-threaded case -
> just error out for trying to change the PTI setting.
> 
> So make the thing per-mm, and then at task switch time as you switch
> mms, you set the bit in a percpu variable for testing at kernel entry.

So I did something like this (will have to remerge the awful patches
and remove the printks before resending). In short, now here's what it
does :
 
  - added a new x86 flag : "mm->context.pti_disable", depends on
    CONFIG_PAGE_TABLE_ISOLATION.
  - the new prctl() also depends on this config setting.
  - prctl() refuses any change if mm->mm_users > 1
  - prctl() refuses to set nopti if !CAP_SYS_RAWIO, but clearing it is
    fine without (Ingo's idea)
  - __switch_to() sets a new "pti_disable" per-cpu variable to the copy
    of mm->context.pti_disable
  - entry code in SWITCH_TO_USER_CR3_NOSTACK now checks
    PER_CPU_VAR(pti_disable)
    
First tests show that it still works. One main difference I immediately
observed is that it stops at execve(). This means that it will not be
possible to implement a wrapper to enable the bypass, but on the other
hand it guarantees that any execve() even from a so called "trusted"
process doesn't accidently expose a victim program. So there are pros
and cons here.

I'm personally fine with both the wrapper and the code changes. But I'm
in the easiest situation, working with opensource code that I can easily
update to accommodate the changes. Other users might have a different
opinion here.

Another option could be to have a per-task (and really task here) flag
is only passed to execve() to mention that the per-mm pti_disable has
to be set in the new mm (and which would clear the task flag). But this
mechanism would always require a wrapper. Or we could have both.

I'll clean up my patches tomorrow morning and will post an update. Ideas
and objections welcome in the mean time ;-)

Cheers,
Willy

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI
  2018-01-08 18:22     ` Willy Tarreau
@ 2018-01-08 20:49       ` Thomas Gleixner
  2018-01-08 21:03         ` Willy Tarreau
  0 siblings, 1 reply; 53+ messages in thread
From: Thomas Gleixner @ 2018-01-08 20:49 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Linus Torvalds, Linux Kernel Mailing List,
	the arch/x86 maintainers, One Thousand Gnomes, Andy Lutomirski

On Mon, 8 Jan 2018, Willy Tarreau wrote:
> On Mon, Jan 08, 2018 at 09:54:05AM -0800, Linus Torvalds wrote:
> 
> > So make the thing per-mm, and then at task switch time as you switch
> > mms, you set the bit in a percpu variable for testing at kernel entry.
> 
> I'll see how to do that, this is not yet 100% clear to me, I'm still
> discovering (this code has immensely changed since last time I *really*
> dug into it). So I suspect I'll have to set this variable in
> __switch_to() based on this other MM flag.

The right thing is probably the cpu entry area because that's what you can
access before switching CR3 if PTI is enabled for the task

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI
  2018-01-08 20:49       ` Thomas Gleixner
@ 2018-01-08 21:03         ` Willy Tarreau
  0 siblings, 0 replies; 53+ messages in thread
From: Willy Tarreau @ 2018-01-08 21:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Linux Kernel Mailing List,
	the arch/x86 maintainers, One Thousand Gnomes, Andy Lutomirski

On Mon, Jan 08, 2018 at 09:49:23PM +0100, Thomas Gleixner wrote:
> On Mon, 8 Jan 2018, Willy Tarreau wrote:
> > On Mon, Jan 08, 2018 at 09:54:05AM -0800, Linus Torvalds wrote:
> > 
> > > So make the thing per-mm, and then at task switch time as you switch
> > > mms, you set the bit in a percpu variable for testing at kernel entry.
> > 
> > I'll see how to do that, this is not yet 100% clear to me, I'm still
> > discovering (this code has immensely changed since last time I *really*
> > dug into it). So I suspect I'll have to set this variable in
> > __switch_to() based on this other MM flag.
> 
> The right thing is probably the cpu entry area because that's what you can
> access before switching CR3 if PTI is enabled for the task

In fact I didn't need any such thing, because when coming from user
space I can simply check CR3.12 : if 0, we had PTI disabled for the
task, otherwise it's enabled.

I'll repost the cleaned up series ASAP, unfortunately I need to go now.

Thanks,
Willy

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 4/4] x86/entry/pti: don't switch PGD on tasks holding flag TIF_NOPTI
  2018-01-08 16:12 ` [PATCH RFC 4/4] x86/entry/pti: don't switch PGD on tasks holding flag TIF_NOPTI Willy Tarreau
  2018-01-08 17:11   ` Ingo Molnar
  2018-01-08 17:20   ` Dave Hansen
@ 2018-01-08 23:01   ` Andy Lutomirski
  2 siblings, 0 replies; 53+ messages in thread
From: Andy Lutomirski @ 2018-01-08 23:01 UTC (permalink / raw)
  To: Willy Tarreau, linux-kernel, x86; +Cc: tglx, gnomes, torvalds

On 01/08/2018 08:12 AM, Willy Tarreau wrote:
> If a task has the TIF_NOPTI flag set, it doesn't want to experience
> page table isolation. In this case, returns from kernel to user will
> not switch the CR3, leaving it to the kernel one which already maps
> both user and kernel pages. Upon entry in the kernel, we can't check
> this flag so we simply check if CR3 was pointing to the kernel's PGD,
> indicating an earlier absence of switch, and in this case we don't
> change it.
> 
> Thanks to these changes, haproxy running under KVM went back from
> 12400 conn/s to 21000 once loaded after calling prctl().
> 
> Signed-off-by: Willy Tarreau <w@1wt.eu>
> ---
>   arch/x86/entry/calling.h | 23 +++++++++++++++++++++++
>   1 file changed, 23 insertions(+)
> 
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index 45a63e0..054b8b7 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -1,5 +1,6 @@
>   /* SPDX-License-Identifier: GPL-2.0 */
>   #include <linux/jump_label.h>
> +#include <asm/thread_info.h>
>   #include <asm/unwind_hints.h>
>   #include <asm/cpufeatures.h>
>   #include <asm/page_types.h>
> @@ -214,6 +215,11 @@
>   .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
>   	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
>   	mov	%cr3, \scratch_reg
> +
> +	/* if we're already on the kernel PGD, we don't switch */
> +	testq $(PTI_SWITCH_PGTABLES_MASK), \scratch_reg
> +	jz .Lend_\@
> +
>   	ADJUST_KERNEL_CR3 \scratch_reg
>   	mov	\scratch_reg, %cr3
>   .Lend_\@:
> @@ -224,6 +230,12 @@
>   
>   .macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
>   	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
> +
> +	/* "NOPTI" taskflag avoids the switch */
> +	movq	PER_CPU_VAR(current_task), \scratch_reg
> +	btq	$TIF_NOPTI, TASK_TI_flags(\scratch_reg)
> +	jc	.Lend_\@
> +
>   	mov	%cr3, \scratch_reg
>   
>   	ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
> @@ -262,6 +274,13 @@
>   	ALTERNATIVE "jmp .Ldone_\@", "", X86_FEATURE_PTI
>   	movq	%cr3, \scratch_reg
>   	movq	\scratch_reg, \save_reg
> +
> +	/* if we're already on the kernel PGD, we don't switch,
> +	 * we just save the current cr3.
> +	 */
> +	testq $(PTI_SWITCH_PGTABLES_MASK), \scratch_reg
> +	jz .Ldone_\@
> +
>   	/*
>   	 * Is the "switch mask" all zero?  That means that both of
>   	 * these are zero:
> @@ -284,6 +303,10 @@
>   .macro RESTORE_CR3 scratch_reg:req save_reg:req
>   	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
>   
> +	/* if we saved a kernel context, we didn't switch so we don't switch */
> +	testq $(PTI_SWITCH_PGTABLES_MASK), \save_reg
> +	jz .Lend_\@
> +

I haven't looked that carefully, but this seems generally sane.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-08 17:03   ` Dave Hansen
  2018-01-08 17:17     ` Willy Tarreau
  2018-01-08 17:21     ` Ingo Molnar
@ 2018-01-08 23:05     ` Andy Lutomirski
  2018-01-08 23:09       ` Kees Cook
  2018-01-09  4:22       ` Willy Tarreau
  2 siblings, 2 replies; 53+ messages in thread
From: Andy Lutomirski @ 2018-01-08 23:05 UTC (permalink / raw)
  To: Dave Hansen, Willy Tarreau, linux-kernel, x86
  Cc: tglx, gnomes, torvalds, Kees Cook

On 01/08/2018 09:03 AM, Dave Hansen wrote:
> On 01/08/2018 08:12 AM, Willy Tarreau wrote:
>> Since we're going to keep running on the same PGD when returning to
>> userspace for certain performance-critical tasks, we'll need the user
>> pages to be executable. So this code disables the extra protection
>> that was added consisting in marking user pages _PAGE_NX so that this
>> pgd remains usable for userspace.
>>
>> Note: it isn't necessarily the best approach, but one way or another
>>        if we want to be able to return to userspace from the kernel,
>>        we'll have to have this executable anyway. Another approach
>>        might consist in using another pgd for userland+kernel but
>>        the current core really looks like an extra careful measure
>>        to catch early bugs if any.
> 
> I don't like this.
> 
> I think the prctl() should apply to an entire process, not to a thread.
> If it applies to a process, you can unpoison the PGD.  I even had code
> to do this in an earlier version of the (whole system) runtime PTI
> on/off stuff.
> 
> Why are you even posting half-baked hacks like this now?  Is there
> something super-pressing about this set that we need to lock in a new
> ABI now?
> 

I vote per-thread.

Anyway, we can easily sync the NX-clearing: just catch the spurious page 
fault and clear the bit.  Avoiding infinite loops will need a bit of 
thought, but it's surely doable.

Or we set a per-mm flag saying "no NX", then do synchronize_sched() or 
similar if we were the first to set it (or take the pagetable lock), 
then clear all the NX bits.  Again, needs some care, but doable.

FWIW, the NX trick quite nicely emulates SMEP on non-SMEP hardware, 
which is fantastic for Spectre resistance and general hardening. 
Turning it off totally defeats that, which hurts a bit.

Also, Kees should be CC'd here.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-08 23:05     ` Andy Lutomirski
@ 2018-01-08 23:09       ` Kees Cook
  2018-01-09  4:22       ` Willy Tarreau
  1 sibling, 0 replies; 53+ messages in thread
From: Kees Cook @ 2018-01-08 23:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Willy Tarreau, LKML, X86 ML, Thomas Gleixner,
	Alan Cox, Linus Torvalds

On Mon, Jan 8, 2018 at 3:05 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On 01/08/2018 09:03 AM, Dave Hansen wrote:
>>
>> On 01/08/2018 08:12 AM, Willy Tarreau wrote:
>>>
>>> Since we're going to keep running on the same PGD when returning to
>>> userspace for certain performance-critical tasks, we'll need the user
>>> pages to be executable. So this code disables the extra protection
>>> that was added consisting in marking user pages _PAGE_NX so that this
>>> pgd remains usable for userspace.
>>>
>>> Note: it isn't necessarily the best approach, but one way or another
>>>        if we want to be able to return to userspace from the kernel,
>>>        we'll have to have this executable anyway. Another approach
>>>        might consist in using another pgd for userland+kernel but
>>>        the current core really looks like an extra careful measure
>>>        to catch early bugs if any.
>>
>>
>> I don't like this.
>>
>> I think the prctl() should apply to an entire process, not to a thread.
>> If it applies to a process, you can unpoison the PGD.  I even had code
>> to do this in an earlier version of the (whole system) runtime PTI
>> on/off stuff.
>>
>> Why are you even posting half-baked hacks like this now?  Is there
>> something super-pressing about this set that we need to lock in a new
>> ABI now?
>>
>
> I vote per-thread.
>
> Anyway, we can easily sync the NX-clearing: just catch the spurious page
> fault and clear the bit.  Avoiding infinite loops will need a bit of
> thought, but it's surely doable.
>
> Or we set a per-mm flag saying "no NX", then do synchronize_sched() or
> similar if we were the first to set it (or take the pagetable lock), then
> clear all the NX bits.  Again, needs some care, but doable.
>
> FWIW, the NX trick quite nicely emulates SMEP on non-SMEP hardware, which is
> fantastic for Spectre resistance and general hardening. Turning it off
> totally defeats that, which hurts a bit.
>
> Also, Kees should be CC'd here.

Please please keep the NX. As mentioned, this gets us emulated SMEP
for "free" with PTI, which Linux has needed for years.

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-08 23:05     ` Andy Lutomirski
  2018-01-08 23:09       ` Kees Cook
@ 2018-01-09  4:22       ` Willy Tarreau
  1 sibling, 0 replies; 53+ messages in thread
From: Willy Tarreau @ 2018-01-09  4:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, linux-kernel, x86, tglx, gnomes, torvalds, Kees Cook

Hi Andy,

On Mon, Jan 08, 2018 at 03:05:48PM -0800, Andy Lutomirski wrote:
> On 01/08/2018 09:03 AM, Dave Hansen wrote:
> > On 01/08/2018 08:12 AM, Willy Tarreau wrote:
> I vote per-thread.

The per-mm approach that Linus suggested doesn't look bad either and
makes quite some sense.

> Anyway, we can easily sync the NX-clearing: just catch the spurious page
> fault and clear the bit.  Avoiding infinite loops will need a bit of
> thought, but it's surely doable.

That's an excellent idea, eventhough I have no idea how to implement it :-)

> Or we set a per-mm flag saying "no NX", then do synchronize_sched() or
> similar if we were the first to set it (or take the pagetable lock), then
> clear all the NX bits.  Again, needs some care, but doable.
> 
> FWIW, the NX trick quite nicely emulates SMEP on non-SMEP hardware, which is
> fantastic for Spectre resistance and general hardening.

Yes I figured exactly this when I faced this protection!

> Turning it off totally defeats that, which hurts a bit.

I agree, that's why I'd like it to be conditional. Probably that with
your idea of catching the page fault and the per-mm flag it would work
quite well, but before being able to do this I still have a lot to
explore :-/

> Also, Kees should be CC'd here.

Yes I've added him and you (and a few others) in CC of all forthcoming
patches. Sorry for not adding you initially, I simply wanted to share
a quick experiment and initiate a discussion.

Willy

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/4] Per-task PTI activation
  2018-01-08 16:12 [PATCH RFC 0/4] Per-task PTI activation Willy Tarreau
                   ` (4 preceding siblings ...)
  2018-01-08 16:59 ` [PATCH RFC 0/4] Per-task PTI activation Dave Hansen
@ 2018-01-09 15:31 ` Eric W. Biederman
  2018-01-09 16:02   ` Willy Tarreau
  5 siblings, 1 reply; 53+ messages in thread
From: Eric W. Biederman @ 2018-01-09 15:31 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: linux-kernel, x86, tglx, gnomes, torvalds

Willy Tarreau <w@1wt.eu> writes:

> Hi!
>
> I could experiment a bit with the possibility to enable/disable PTI per
> task. Please keep in mind that it's not my area of experitise at all, but
> doing so I could recover the initial performance without disabling PTI on
> the whole system.
>
> So what I did in this series consists in the following :
>   - addition of a new per-task TIF_NOPTI flag. Please note that I'm not
>     proud of the way I did it, as 32 flags were already taken. The flags
>     are declared as "long" so there are 32 more flags available on x86_64
>     but C and asm disagree on the type of 1<<32 so I had to declare the
>     hex value by hand... By the way I even suspect that _TIF_FSCHECK is
>     wrong once cast to a long, I think it causes sign extension into the
>     32 upper bits since it's supposed to be signed.
>
>   - addition of a set of arch_prctl() calls (ARCH_GET_NOPTI and
>     ARCH_SET_NOPTI), to check and change the activation of the
>     protection. The change requires CAP_SYS_RAWIO and can be done in
>     a wrapper (that's how I tested)
>
>   - the user PGD was marked with _PAGE_NX to prevent an accidental leak
>     of CR3 from not being detected. I obviously had to disable this since
>     in this case we do want such a user task to run without switching the
>     PGD. I think this could be performed per-task maybe. Another approach
>     might consist in dealing with 3 PGDs and using a different one for
>     unprotected tasks but that really starts to sound overkill.
>
>   - upon return to userspace, I check if the task's flags contain the
>     new TIF_NOPTI or not. If it does contain it, then we don't switch
>     the CR3.
>
>   - upon entry into the kernel from userspace, we can't access the task's
>     flags but we can already check if CR3 points to the kernel or user PGD,
>     and we refrain from switching if it's already the system one.
>
> By doing so I could recover the initial performance of haproxy in a VM,
> going from 12400 connections per second to 21000 once started with this
> trivial wrapper :
>
>   #include <asm/prctl.h>
>   #include <sys/prctl.h>
>   
>   #ifndef ARCH_SET_NOPTI
>   #define ARCH_SET_NOPTI 0x1022
>   #endif
>   
>   int main(int argc, char **argv)
>   {
>           arch_prctl(ARCH_SET_NOPTI, 1);
>           argv++;
>           return execvp(argv[0], argv);
>   }
>
> I have not yet run it on real hardware. Before trying to go a bit further
> I'd like to know if such an approach is acceptable or if I'm doing anything
> stupid and looking in the wrong direction.

Before this goes much farther I want to point something out.

When I have kpti protecting me it is the applications with that connect
to the network I worry about.  Until I get to a system with users that
don't trust each other local I don't have a reason to worry about these
attacks from local applications.

The dangerous scenario is someone exploting a buffer overflow, or
otherwise getting a network facing application to misbehave, and then
using these new attacks to assist in gaining privilege escalation.


Googling seems to indicate that there is about one issue a year found in
haproxy.  So this is not an unrealistic concern for the case you
mention.


So unless I am seeing things wrong this is a patchset designed to drop
your defensense on the most vulnerable applications.


Disably protection on the most vunerable applications is not behavior
I would encourage.  It seems better than disabling protection system
wide but only slightly.   I definitely don't think this is something we
want applications disabling themselves.

Certainly this is something that should look at no-new-privs and if
no-new-privs is set not allow disabling this protection.

Eric

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/4] Per-task PTI activation
  2018-01-09 15:31 ` Eric W. Biederman
@ 2018-01-09 16:02   ` Willy Tarreau
  2018-01-09 18:11     ` Zhi Wang
  2018-01-09 21:07     ` Eric W. Biederman
  0 siblings, 2 replies; 53+ messages in thread
From: Willy Tarreau @ 2018-01-09 16:02 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-kernel, x86, tglx, gnomes, torvalds

Hi Eric,

On Tue, Jan 09, 2018 at 09:31:27AM -0600, Eric W. Biederman wrote:
> The dangerous scenario is someone exploting a buffer overflow, or
> otherwise getting a network facing application to misbehave, and then
> using these new attacks to assist in gaining privilege escalation.

For most use cases sure. But for *some* use cases, if they can control
of the application, you've already lost everything you had. Private keys,
clear text traffic, etc. We're precisely talking about such applications
where the userspace is as much important as the kernel, and where there's
hardly anything left to lose once the application is cracked. However, a
significant performance drop on the application definitely is a problem,
first making it weaker when facing attacks, or even failing to deal with
traffic peaks.

> Googling seems to indicate that there is about one issue a year found in
> haproxy.  So this is not an unrealistic concern for the case you
> mention.

I agree. But in practice, we had two exploitable bugs, one in 2002 
(overflow in the logs), and one in 2014 requiring a purposely written
config which makes no pratical sense at all. Most other vulnerabilities
involve freezes, occasionally crashes, though that's even more rare.
And even with the two above, you just have one chance to try to exploit
it, if you get your pointer wrong, it dies and you have to wait for the
admin to restart it. In practice, seeing the process die is the worst
nightmare of admins as the service simply stops. I'm not saying we don't
want to defend them, we even chroot to an empty directory and drop
privileges to mitigate such a risk. But when the intruder is in the
process it's really too late.

> So unless I am seeing things wrong this is a patchset designed to drop
> your defensense on the most vulnerable applications.

In fact it can be seen very differently. By making it possible for exposed
but critical applications to share some risks with the rest of the system,
we also ensure they remain strong for their initial purpose and against
the most common types of attacks. And quite frankly we're not weakening
much given the risks already involved by the process itself.

What I'm describing represents a small category of processes in only
certain environments. Some database servers will have the same issue.
Imagine a Redis server for example, which normally is very fast and
easily saturates whatever network around it. Some DNS providers may
have the same problem when dealing with hundreds of thousands to
millions of UDP packets each second (not counting attacks).

All such services are critical in themselves, but the fact that we accept
to let them share the risks with the system doesn't mean they should be
running without the protections from the occasional operations guy just
allowed to connect there to verify if logs are full and to retrive stats.

> Disably protection on the most vunerable applications is not behavior
> I would encourage.

I'm not encouraging this behaviour either but right now the only option
for performance critical applications (even if they are vulnerable) is
to make the whole system vulnerable.

> It seems better than disabling protection system
> wide but only slightly.   I definitely don't think this is something we
> want applications disabling themselves.

In fact that's what I liked with the wrapper approach, except that it
had the downside of being harder to manage in terms of administration
and we'd risk to see it used everywhere by default. The arch_prctl()
approach ensures that only applications where this is relevant can do
it. In the case of haproxy, I can trivially add a config option like
"disable-page-isolation" to let the admin enable it on purpose.

But I suspect there might be some performance critical applications that
cannot be patched, and that's where the wrapper could still provide some
value.

> Certainly this is something that should look at no-new-privs and if
> no-new-privs is set not allow disabling this protection.

I don't know what is "no-new-privs" and couldn't find info on it
unfortunately. Do you have a link please ?

Thanks!
Willy

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/4] Per-task PTI activation
  2018-01-09 16:02   ` Willy Tarreau
@ 2018-01-09 18:11     ` Zhi Wang
  2018-01-09 21:07     ` Eric W. Biederman
  1 sibling, 0 replies; 53+ messages in thread
From: Zhi Wang @ 2018-01-09 18:11 UTC (permalink / raw)
  To: Willy Tarreau, Eric W. Biederman
  Cc: linux-kernel, x86, tglx, gnomes, torvalds

Is is possible to put per-task PTI control interface into cgroup or 
other interfaces?  Enabling/disabling per-task PTI should be a decision 
from the system administrator not the application itself.

On 2018/1/9 18:02, Willy Tarreau wrote:
> Hi Eric,
>
> On Tue, Jan 09, 2018 at 09:31:27AM -0600, Eric W. Biederman wrote:
>> The dangerous scenario is someone exploting a buffer overflow, or
>> otherwise getting a network facing application to misbehave, and then
>> using these new attacks to assist in gaining privilege escalation.
> For most use cases sure. But for *some* use cases, if they can control
> of the application, you've already lost everything you had. Private keys,
> clear text traffic, etc. We're precisely talking about such applications
> where the userspace is as much important as the kernel, and where there's
> hardly anything left to lose once the application is cracked. However, a
> significant performance drop on the application definitely is a problem,
> first making it weaker when facing attacks, or even failing to deal with
> traffic peaks.
>
>> Googling seems to indicate that there is about one issue a year found in
>> haproxy.  So this is not an unrealistic concern for the case you
>> mention.
> I agree. But in practice, we had two exploitable bugs, one in 2002
> (overflow in the logs), and one in 2014 requiring a purposely written
> config which makes no pratical sense at all. Most other vulnerabilities
> involve freezes, occasionally crashes, though that's even more rare.
> And even with the two above, you just have one chance to try to exploit
> it, if you get your pointer wrong, it dies and you have to wait for the
> admin to restart it. In practice, seeing the process die is the worst
> nightmare of admins as the service simply stops. I'm not saying we don't
> want to defend them, we even chroot to an empty directory and drop
> privileges to mitigate such a risk. But when the intruder is in the
> process it's really too late.
>
>> So unless I am seeing things wrong this is a patchset designed to drop
>> your defensense on the most vulnerable applications.
> In fact it can be seen very differently. By making it possible for exposed
> but critical applications to share some risks with the rest of the system,
> we also ensure they remain strong for their initial purpose and against
> the most common types of attacks. And quite frankly we're not weakening
> much given the risks already involved by the process itself.
>
> What I'm describing represents a small category of processes in only
> certain environments. Some database servers will have the same issue.
> Imagine a Redis server for example, which normally is very fast and
> easily saturates whatever network around it. Some DNS providers may
> have the same problem when dealing with hundreds of thousands to
> millions of UDP packets each second (not counting attacks).
>
> All such services are critical in themselves, but the fact that we accept
> to let them share the risks with the system doesn't mean they should be
> running without the protections from the occasional operations guy just
> allowed to connect there to verify if logs are full and to retrive stats.
>
>> Disably protection on the most vunerable applications is not behavior
>> I would encourage.
> I'm not encouraging this behaviour either but right now the only option
> for performance critical applications (even if they are vulnerable) is
> to make the whole system vulnerable.
>
>> It seems better than disabling protection system
>> wide but only slightly.   I definitely don't think this is something we
>> want applications disabling themselves.
> In fact that's what I liked with the wrapper approach, except that it
> had the downside of being harder to manage in terms of administration
> and we'd risk to see it used everywhere by default. The arch_prctl()
> approach ensures that only applications where this is relevant can do
> it. In the case of haproxy, I can trivially add a config option like
> "disable-page-isolation" to let the admin enable it on purpose.
>
> But I suspect there might be some performance critical applications that
> cannot be patched, and that's where the wrapper could still provide some
> value.
>
>> Certainly this is something that should look at no-new-privs and if
>> no-new-privs is set not allow disabling this protection.
> I don't know what is "no-new-privs" and couldn't find info on it
> unfortunately. Do you have a link please ?
>
> Thanks!
> Willy

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/4] Per-task PTI activation
  2018-01-09 16:02   ` Willy Tarreau
  2018-01-09 18:11     ` Zhi Wang
@ 2018-01-09 21:07     ` Eric W. Biederman
  2018-01-09 21:57       ` Willy Tarreau
  1 sibling, 1 reply; 53+ messages in thread
From: Eric W. Biederman @ 2018-01-09 21:07 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: linux-kernel, x86, tglx, gnomes, torvalds

Willy Tarreau <w@1wt.eu> writes:

> Hi Eric,
>
> On Tue, Jan 09, 2018 at 09:31:27AM -0600, Eric W. Biederman wrote:
>> The dangerous scenario is someone exploting a buffer overflow, or
>> otherwise getting a network facing application to misbehave, and then
>> using these new attacks to assist in gaining privilege escalation.
>
> For most use cases sure. But for *some* use cases, if they can control
> of the application, you've already lost everything you had. Private keys,
> clear text traffic, etc. We're precisely talking about such applications
> where the userspace is as much important as the kernel, and where there's
> hardly anything left to lose once the application is cracked. However, a
> significant performance drop on the application definitely is a problem,
> first making it weaker when facing attacks, or even failing to deal with
> traffic peaks.

>From reading the earlier emails it was not clear that all was lost if
they were compromomised.  In that case this makes plenty of sense.


>> Googling seems to indicate that there is about one issue a year found in
>> haproxy.  So this is not an unrealistic concern for the case you
>> mention.
>
> I agree. But in practice, we had two exploitable bugs, one in 2002 
> (overflow in the logs), and one in 2014 requiring a purposely written
> config which makes no pratical sense at all. Most other vulnerabilities
> involve freezes, occasionally crashes, though that's even more rare.
> And even with the two above, you just have one chance to try to exploit
> it, if you get your pointer wrong, it dies and you have to wait for the
> admin to restart it. In practice, seeing the process die is the worst
> nightmare of admins as the service simply stops. I'm not saying we don't
> want to defend them, we even chroot to an empty directory and drop
> privileges to mitigate such a risk. But when the intruder is in the
> process it's really too late.
>
>> So unless I am seeing things wrong this is a patchset designed to drop
>> your defensense on the most vulnerable applications.
>
> In fact it can be seen very differently. By making it possible for exposed
> but critical applications to share some risks with the rest of the system,
> we also ensure they remain strong for their initial purpose and against
> the most common types of attacks. And quite frankly we're not weakening
> much given the risks already involved by the process itself.
>
> What I'm describing represents a small category of processes in only
> certain environments. Some database servers will have the same issue.
> Imagine a Redis server for example, which normally is very fast and
> easily saturates whatever network around it. Some DNS providers may
> have the same problem when dealing with hundreds of thousands to
> millions of UDP packets each second (not counting attacks).
>
> All such services are critical in themselves, but the fact that we accept
> to let them share the risks with the system doesn't mean they should be
> running without the protections from the occasional operations guy just
> allowed to connect there to verify if logs are full and to retrive
> stats.

Reasonable.

>> Disably protection on the most vunerable applications is not behavior
>> I would encourage.
>
> I'm not encouraging this behaviour either but right now the only option
> for performance critical applications (even if they are vulnerable) is
> to make the whole system vulnerable.
>
>> It seems better than disabling protection system
>> wide but only slightly.   I definitely don't think this is something we
>> want applications disabling themselves.
>
> In fact that's what I liked with the wrapper approach, except that it
> had the downside of being harder to manage in terms of administration
> and we'd risk to see it used everywhere by default. The arch_prctl()
> approach ensures that only applications where this is relevant can do
> it. In the case of haproxy, I can trivially add a config option like
> "disable-page-isolation" to let the admin enable it on purpose.

How is that different from the option?

> But I suspect there might be some performance critical applications that
> cannot be patched, and that's where the wrapper could still provide some
> value.

I just don't want to encourage changning this option by default.  As a
lot of applications get installed in home servers or other places where
they are not performance critical.  At which point disabling the kpti
protection by default would be reducing the level of protection of
everything.

But ultimately I only brought this up so that people are thinking about
the other side of this.   About how it will affect not the high
performance servers single function but how it will affect the millions
of little servers that do many things all from a single machine.

Certainly I would not want this enabled in a container or a virtual
private server.  The capable(CAP_RAWIO) seems to handle that beautifully.

>> Certainly this is something that should look at no-new-privs and if
>> no-new-privs is set not allow disabling this protection.
>
> I don't know what is "no-new-privs" and couldn't find info on it
> unfortunately. Do you have a link please ?

Probably because I used dashes.  The no new privs flag is documented
in:
Documentation/userspace-api/no_new_privs.rst

It is a sandboxing flag that guarantees a process can not gain
privileges after it has been set.  You can search for PFA_NO_NEW_PRIVS
in sched.h if you want to see where it is defined.

Eric

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH RFC 0/4] Per-task PTI activation
  2018-01-09 21:07     ` Eric W. Biederman
@ 2018-01-09 21:57       ` Willy Tarreau
  0 siblings, 0 replies; 53+ messages in thread
From: Willy Tarreau @ 2018-01-09 21:57 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-kernel, x86, tglx, gnomes, torvalds

On Tue, Jan 09, 2018 at 03:07:07PM -0600, Eric W. Biederman wrote:
> > In fact that's what I liked with the wrapper approach, except that it
> > had the downside of being harder to manage in terms of administration
> > and we'd risk to see it used everywhere by default. The arch_prctl()
> > approach ensures that only applications where this is relevant can do
> > it. In the case of haproxy, I can trivially add a config option like
> > "disable-page-isolation" to let the admin enable it on purpose.
> 
> How is that different from the option?

Not sure what is different from the option, so let me just enumerate the
two approaches I'm seeing :
  - either a wrapper which work for the next execve() and not further.
    The benefit is that applications do not need to be recompiled. The
    problem is that it requires some changes to various places ranging
    from init scripts to API wrappers and whatever to place this wrapper.
    Also it can start to be perceived as the "wrapper that makes things
    fast" and some admins might start to routinely use it like we use
    taskset, and I'm not fond of this.

  - the option I mentionned would be a configuration setting enabling
    arch_prctl(). The option name should explicitly indicate that the
    admin wants to screw up his system's security. For example,
    "destroy-security-for-performance" is not the type of option you
    leave enabled in your config templates. It *will* require to
    recompile applications. I don't think it's a problem for the most
    performance sensitive ones, but I may be biased by my own
    experience.

  - or a combination of the two if some admins need to support stuff
    they can't rebuild (well LD_PRELOAD might be an option after all...)

> > But I suspect there might be some performance critical applications that
> > cannot be patched, and that's where the wrapper could still provide some
> > value.
> 
> I just don't want to encourage changning this option by default.

That's the same for me. Even the prctl name should be scary enough. Boris
suggested tainting the kernel and that's a very good idea, it will even
ensure that the setting is not used by default in applications because
admins won't want to see their systems tainted for no reason.

> As a
> lot of applications get installed in home servers or other places where
> they are not performance critical.  At which point disabling the kpti
> protection by default would be reducing the level of protection of
> everything.

Definitely. That's why I don't want to see the hard-coded prctl() either.

> But ultimately I only brought this up so that people are thinking about
> the other side of this.   About how it will affect not the high
> performance servers single function but how it will affect the millions
> of little servers that do many things all from a single machine.

This is my concern as well, which is why I'm seeking the most balanced
approach I can think of.

> Certainly I would not want this enabled in a container or a virtual
> private server.  The capable(CAP_RAWIO) seems to handle that beautifully.
> 
> >> Certainly this is something that should look at no-new-privs and if
> >> no-new-privs is set not allow disabling this protection.
> >
> > I don't know what is "no-new-privs" and couldn't find info on it
> > unfortunately. Do you have a link please ?
> 
> Probably because I used dashes.  The no new privs flag is documented
> in:
> Documentation/userspace-api/no_new_privs.rst

Ah stupid me, I should have tried underscores as well. Thank you.

> It is a sandboxing flag that guarantees a process can not gain
> privileges after it has been set.  You can search for PFA_NO_NEW_PRIVS
> in sched.h if you want to see where it is defined.

Oh that's very interesting, I wasn't aware of this! Definitely something
I need to set after dropping privileges in haproxy I guess ;-)

Thanks,
Willy

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2018-01-09 21:57 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-08 16:12 [PATCH RFC 0/4] Per-task PTI activation Willy Tarreau
2018-01-08 16:12 ` [PATCH RFC 1/4] x86/thread_info: add TIF_NOPTI to disable PTI per task Willy Tarreau
2018-01-08 16:57   ` Thomas Gleixner
2018-01-08 17:03     ` Willy Tarreau
2018-01-08 17:14       ` Ingo Molnar
2018-01-08 16:12 ` [PATCH RFC 2/4] x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to enable/disable PTI Willy Tarreau
2018-01-08 16:49   ` Peter Zijlstra
2018-01-08 16:56     ` Willy Tarreau
2018-01-08 17:02   ` Thomas Gleixner
2018-01-08 17:10     ` Willy Tarreau
2018-01-08 17:17     ` Ingo Molnar
2018-01-08 17:26       ` Thomas Gleixner
2018-01-08 17:46         ` Ingo Molnar
2018-01-08 17:05   ` Ingo Molnar
2018-01-08 17:19     ` Peter Zijlstra
2018-01-08 17:26       ` Ingo Molnar
2018-01-08 17:50   ` Borislav Petkov
2018-01-08 17:54   ` Linus Torvalds
2018-01-08 18:22     ` Willy Tarreau
2018-01-08 20:49       ` Thomas Gleixner
2018-01-08 21:03         ` Willy Tarreau
2018-01-08 20:35     ` Willy Tarreau
2018-01-08 16:12 ` [PATCH RFC 3/4] x86/pti: don't mark the user PGD with _PAGE_NX Willy Tarreau
2018-01-08 17:03   ` Dave Hansen
2018-01-08 17:17     ` Willy Tarreau
2018-01-08 17:23       ` Dave Hansen
2018-01-08 17:30         ` Peter Zijlstra
2018-01-08 17:49           ` Willy Tarreau
2018-01-08 17:21     ` Ingo Molnar
2018-01-08 23:05     ` Andy Lutomirski
2018-01-08 23:09       ` Kees Cook
2018-01-09  4:22       ` Willy Tarreau
2018-01-08 17:05   ` Thomas Gleixner
2018-01-08 17:28     ` Dave Hansen
2018-01-08 17:50       ` Ingo Molnar
2018-01-08 18:25         ` Alan Cox
2018-01-08 18:35           ` Ingo Molnar
2018-01-08 18:35           ` Linus Torvalds
2018-01-08 18:44         ` Dave Hansen
2018-01-08 16:12 ` [PATCH RFC 4/4] x86/entry/pti: don't switch PGD on tasks holding flag TIF_NOPTI Willy Tarreau
2018-01-08 17:11   ` Ingo Molnar
2018-01-08 17:20   ` Dave Hansen
2018-01-08 18:12     ` Willy Tarreau
2018-01-08 23:01   ` Andy Lutomirski
2018-01-08 16:59 ` [PATCH RFC 0/4] Per-task PTI activation Dave Hansen
2018-01-08 17:06   ` Willy Tarreau
2018-01-08 17:17     ` Dave Hansen
2018-01-08 17:13   ` Ingo Molnar
2018-01-09 15:31 ` Eric W. Biederman
2018-01-09 16:02   ` Willy Tarreau
2018-01-09 18:11     ` Zhi Wang
2018-01-09 21:07     ` Eric W. Biederman
2018-01-09 21:57       ` Willy Tarreau

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.