All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH v3 0/8] Per process PTI activation
@ 2018-01-10 19:28 Willy Tarreau
  2018-01-10 19:28 ` [RFC PATCH v3 1/8] x86/thread_info: add TIF_DISABLE_PTI_{NOW,NEXT} to disable PTI per task Willy Tarreau
                   ` (8 more replies)
  0 siblings, 9 replies; 19+ messages in thread
From: Willy Tarreau @ 2018-01-10 19:28 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Willy Tarreau, Andy Lutomirski, Borislav Petkov, Brian Gerst,
	Dave Hansen, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin, Kees Cook,
	Alexander Viro

This is the third version of the proposal.

Consecutive to the discussions, I went back to using thread_info flags
as they may be cheaper to check than the per-CPU variable (being hot in
the cache) and will later make it possible to allow specific threads to
re-enable protection if desired (not supported yet as I'm not totally
sure of all possible impacts yet). 

The prctl is now conditionned by :
  - a config option : CONFIG_PER_PROCESS_PTI
  - a sysctl : pti_adjust, which takes 3 values :
      -  0 (default) : changes to PTI are not permitted
      -  1           : changes to PTI are permitted
      - -1           : like zero but cannot be changed anymore

This ensures that users running untrusted code can disable the support
at build time, and that distros can leave it to the admin to decide to
enable it, or to block it until next reboot.

There are now two prctls :
  - ARCH_DISABLE_PTI_NOW to disable PTI for the current process. It
    checks that mm_users <= 1 before proceeding, and only acts if
    pti_adjust == 1. It's cleared on execve().

  - ARCH_DISABLE_PTI_NEXT to disable PTI after the next execve(). It
    doesn't change the current process' state and will only set
    ARCH_DISABLE_PTI_NOW after the next execve() and be cleared. It's
    made for wrappers. I lazily copied the same checks as the first one
    so it also checks that mm_users <= 1, though it doesn't really make
    sense, but since it's present only to create simple wrappers, it's
    unlikely that such a simple wrapper will be called with threads.
    I'm not seeing any particular risks in removing this test though.

The GET prctl was dropped as useless (was there just as a debugging aid).

What remains to be done :
  - _PAGE_NX is still commented out for now. I'll need some help here
    if we have to catch a page fault to deal with it. Ingo apparently
    suggested that probably it doesn't bring any value anymore on modern
    systems with SMEP.

  - I haven't yet added the other values for the system-wide boot options

  - nothing done on tainting yet

  - documentation

I now find the solution really convenient to use and reassuring at the
same time, being disabled by default and with the ability to disable it
forever at runtime. I think we really are on a good balance here.

I'm interested in feedback, to know if it's worth pursuing that direction.

I wrote this quick-n-dirty test program for it serving both as a wrapper
and as a benchmark tool to quickly tell if it works or not (performs 3
million write()). I tested all combinations of NOW and NEXT with the
various sysctl values and everything works as expected.

  #include <asm/prctl.h>
  #include <sys/prctl.h>
  #include <stdio.h>
  
  #ifndef ARCH_DISABLE_PTI_NOW
  #define ARCH_DISABLE_PTI_NOW 0x1021
  #endif
  
  #ifndef ARCH_DISABLE_PTI_NEXT
  #define ARCH_DISABLE_PTI_NEXT 0x1022
  #endif
  
  int main(int argc, char **argv)
  {
          if (argc < 2) {
                  printf("usage: nopti [0|1|2|3] [<cmd> ...]\n");
                  return 1;
          }
  
          if (argv[1][0] & 1)
                  if (arch_prctl(ARCH_DISABLE_PTI_NOW, 1) == -1)
                          printf("failed PTI_NOW\n");
  
          if (argv[1][0] & 2)
                  if (arch_prctl(ARCH_DISABLE_PTI_NEXT, 1) == -1)
                          printf("failed PTI_NEXT\n");
  
          argv += 2;
          argc -= 2;
  
          if (!argc) {
                  /* run a local loop */
                  long loops = 3000000;
  
                  while (loops--)
                          write(-1, "a", 1);
                  return 0;
          }
  
          return execvp(argv[0], argv);
  }


Tests gave me this in a PCID-enabled VM :

  # time ./nopti 3
  failed PTI_NOW
  failed PTI_NEXT
  
  real    0m0.924s    ==> PTI enabled
  user    0m0.295s
  sys     0m0.627s
  
  # echo 1 > /proc/sys/vm/pti_adjust 
  # time ./nopti 1
  
  real    0m0.220s    ==> PTI disabled
  user    0m0.104s
  sys     0m0.116s
  
  # time ./nopti 1 ./nopti 0
  
  real    0m0.918s    ==> PTI enabled in target process
  user    0m0.276s
  sys     0m0.640s
  
  # time ./nopti 2
  
  real    0m0.906s    ==> PTI enabled
  user    0m0.280s
  sys     0m0.625s
  
  # time ./nopti 2 ./nopti 0
  
  real    0m0.207s    ==> PTI disabled in target process
  user    0m0.076s
  sys     0m0.131s
  
  # time ./nopti 3
  
  real    0m0.216s
  user    0m0.068s
  sys     0m0.148s
  
  # su admin
  $ id
  uid=100(admin) gid=4(adm) groups=4(adm)
  $ ./nopti 3
  failed PTI_NOW
  failed PTI_NEXT
  
  # echo 0 > /proc/sys/vm/pti_adjust 
  # time ./nopti 1
  failed PTI_NOW
  
  real    0m0.875s  ==> PTI enabled
  user    0m0.308s
  sys     0m0.567s
  
  # echo -1 > /proc/sys/vm/pti_adjust
  # ./nopti 3
  failed PTI_NOW
  failed PTI_NEXT
  
  # echo 0 > /proc/sys/vm/pti_adjust 
  -su: echo: write error: Operation not permitted
  # echo 1 > /proc/sys/vm/pti_adjust 
  -su: echo: write error: Operation not permitted

Willy

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>


Willy Tarreau (8):
  x86/thread_info: add TIF_DISABLE_PTI_{NOW,NEXT} to disable PTI per
    task
  x86/pti: add new config option PER_PROCESS_PTI
  x86/pti: create the pti_adjust sysctl
  x86/arch_prctl: add ARCH_DISABLE_PTI_{NOW,NEXT} to enable/disable PTI
  exec: take care of disabling PTI upon execve()
  x86/pti: don't mark the user PGD with _PAGE_NX.
  x86/entry/pti: avoid setting CR3 when it's already correct
  x86/entry/pti: don't switch PGD when TIF_DISABLE_PTI_NOW is set

 arch/x86/entry/calling.h           | 27 +++++++++++++++++++++++++++
 arch/x86/include/asm/pti.h         |  5 +++++
 arch/x86/include/asm/thread_info.h | 13 +++++++++++++
 arch/x86/include/uapi/asm/prctl.h  |  3 +++
 arch/x86/kernel/process_64.c       | 30 ++++++++++++++++++++++++++++++
 arch/x86/mm/pti.c                  | 22 ++++++++++++++++++++++
 fs/exec.c                          | 10 ++++++++++
 kernel/sysctl.c                    | 12 ++++++++++++
 security/Kconfig                   | 12 ++++++++++++
 9 files changed, 134 insertions(+)

-- 
1.7.12.1

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH v3 1/8] x86/thread_info: add TIF_DISABLE_PTI_{NOW,NEXT} to disable PTI per task
  2018-01-10 19:28 [RFC PATCH v3 0/8] Per process PTI activation Willy Tarreau
@ 2018-01-10 19:28 ` Willy Tarreau
  2018-01-10 19:28 ` [RFC PATCH v3 2/8] x86/pti: add new config option PER_PROCESS_PTI Willy Tarreau
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Willy Tarreau @ 2018-01-10 19:28 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Willy Tarreau, Andy Lutomirski, Borislav Petkov, Brian Gerst,
	Dave Hansen, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin, Kees Cook

The first flag indicates that the current task will not use page table
isolation. The second indicates that page table isolation must be turned
off only after the next execve().

Signed-off-by: Willy Tarreau <w@1wt.eu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>

v3:
  - switched back to task flags
  - used two flags to avoid undesired propagation over execve()
  - more explicitly renamed the flags
---
 arch/x86/include/asm/thread_info.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 0022333..4f248b6 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -127,6 +127,19 @@ struct thread_info {
 #define _TIF_FSCHECK		(1 << TIF_FSCHECK)
 
 /*
+ * The following flags only exist on x86-64. Their equivalent mask will not be
+ * usable from assembly code due to the presence of '1UL' which doesn't parse
+ * there.
+ */
+#ifdef CONFIG_X86_64
+# define  TIF_DISABLE_PTI_NOW	32	/* disable PTI for this task */
+# define  TIF_DISABLE_PTI_NEXT	33	/* disable PTI after next execve() */
+
+# define _TIF_DISABLE_PTI_NOW	(1UL << TIF_DISABLE_PTI_NOW)
+# define _TIF_DISABLE_PTI_NEXT	(1UL << TIF_DISABLE_PTI_NEXT)
+#endif
+
+/*
  * work to do in syscall_trace_enter().  Also includes TIF_NOHZ for
  * enter_from_user_mode()
  */
-- 
1.7.12.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH v3 2/8] x86/pti: add new config option PER_PROCESS_PTI
  2018-01-10 19:28 [RFC PATCH v3 0/8] Per process PTI activation Willy Tarreau
  2018-01-10 19:28 ` [RFC PATCH v3 1/8] x86/thread_info: add TIF_DISABLE_PTI_{NOW,NEXT} to disable PTI per task Willy Tarreau
@ 2018-01-10 19:28 ` Willy Tarreau
  2018-01-10 19:28 ` [RFC PATCH v3 3/8] x86/pti: create the pti_adjust sysctl Willy Tarreau
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Willy Tarreau @ 2018-01-10 19:28 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Willy Tarreau, Andy Lutomirski, Borislav Petkov, Brian Gerst,
	Dave Hansen, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin, Kees Cook

This option will expose a sysctl allowing to adjust PTI per
process at run time.

Signed-off-by: Willy Tarreau <w@1wt.eu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
---
 security/Kconfig | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/security/Kconfig b/security/Kconfig
index 3d4debd..64adb48 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -65,6 +65,18 @@ config PAGE_TABLE_ISOLATION
 
 	  See Documentation/x86/pagetable-isolation.txt for more details.
 
+config PER_PROCESS_PTI
+	bool "Allow page table isolation to be adjusted per process"
+	default n
+	depends on PAGE_TABLE_ISOLATION
+	help
+	  This feature exposes a sysctl permitting administrators to
+	  specifically exempt certain critical tasks from the PTI
+	  protection at the risk of trading security for a marginal
+	  performance increase for I/O intensive applications.
+
+	  If you are unsure how to answer this question, answer N.
+
 config SECURITY_INFINIBAND
 	bool "Infiniband Security Hooks"
 	depends on SECURITY && INFINIBAND
-- 
1.7.12.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH v3 3/8] x86/pti: create the pti_adjust sysctl
  2018-01-10 19:28 [RFC PATCH v3 0/8] Per process PTI activation Willy Tarreau
  2018-01-10 19:28 ` [RFC PATCH v3 1/8] x86/thread_info: add TIF_DISABLE_PTI_{NOW,NEXT} to disable PTI per task Willy Tarreau
  2018-01-10 19:28 ` [RFC PATCH v3 2/8] x86/pti: add new config option PER_PROCESS_PTI Willy Tarreau
@ 2018-01-10 19:28 ` Willy Tarreau
  2018-01-10 19:28 ` [RFC PATCH v3 4/8] x86/arch_prctl: add ARCH_DISABLE_PTI_{NOW,NEXT} to enable/disable PTI Willy Tarreau
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Willy Tarreau @ 2018-01-10 19:28 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Willy Tarreau, Andy Lutomirski, Borislav Petkov, Brian Gerst,
	Dave Hansen, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin, Kees Cook

This sysctl supports a value from -1 to 1 which will let the administrator
decide whether or not to adjust the PTI behaviour per process.

Signed-off-by: Willy Tarreau <w@1wt.eu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/pti.h |  5 +++++
 arch/x86/mm/pti.c          | 19 +++++++++++++++++++
 kernel/sysctl.c            | 12 ++++++++++++
 3 files changed, 36 insertions(+)

diff --git a/arch/x86/include/asm/pti.h b/arch/x86/include/asm/pti.h
index 0b5ef05..cc8e0d0e 100644
--- a/arch/x86/include/asm/pti.h
+++ b/arch/x86/include/asm/pti.h
@@ -6,6 +6,11 @@
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
 extern void pti_init(void);
 extern void pti_check_boottime_disable(void);
+# ifdef CONFIG_PER_PROCESS_PTI
+extern int pti_adjust;
+int pti_adjust_sysctl_handler(struct ctl_table *table, int write,
+			      void __user *buffer, size_t *lenp, loff_t *ppos);
+# endif
 #else
 static inline void pti_check_boottime_disable(void) { }
 #endif
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 43d4a4a..8166686f 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -54,6 +54,10 @@
 #define __GFP_NOTRACK	0
 #endif
 
+# ifdef CONFIG_PER_PROCESS_PTI
+int pti_adjust;
+#endif
+
 static void __init pti_print_if_insecure(const char *reason)
 {
 	if (boot_cpu_has_bug(X86_BUG_CPU_MELTDOWN))
@@ -371,6 +375,21 @@ static void __init pti_clone_entry_text(void)
 		       _PAGE_RW | _PAGE_GLOBAL);
 }
 
+#ifdef CONFIG_PER_PROCESS_PTI
+/*
+ * sysctl handler for pti_adjust which decides whether or not to accept a
+ * change to the value. 0 and 1 may be interchanged, but -1 is definitive.
+ */
+int pti_adjust_sysctl_handler(struct ctl_table *table, int write,
+			      void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	if (write && (!capable(CAP_SYS_RAWIO) || pti_adjust == -1))
+		return -EPERM;
+
+	return proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+}
+#endif
+
 /*
  * Initialize kernel page table isolation
  */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 557d467..a37a5e1 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -34,6 +34,7 @@
 #include <linux/fs.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
+#include <linux/pti.h>
 #include <linux/kobject.h>
 #include <linux/net.h>
 #include <linux/sysrq.h>
@@ -1572,6 +1573,17 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
 		.proc_handler	= numa_zonelist_order_handler,
 	},
 #endif
+#ifdef CONFIG_PER_PROCESS_PTI
+	{
+		.procname	= "pti_adjust",
+		.data		= &pti_adjust,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= pti_adjust_sysctl_handler,
+		.extra1		= &neg_one,
+		.extra2		= &one,
+	},
+#endif
 #if (defined(CONFIG_X86_32) && !defined(CONFIG_UML))|| \
    (defined(CONFIG_SUPERH) && defined(CONFIG_VSYSCALL))
 	{
-- 
1.7.12.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH v3 4/8] x86/arch_prctl: add ARCH_DISABLE_PTI_{NOW,NEXT} to enable/disable PTI
  2018-01-10 19:28 [RFC PATCH v3 0/8] Per process PTI activation Willy Tarreau
                   ` (2 preceding siblings ...)
  2018-01-10 19:28 ` [RFC PATCH v3 3/8] x86/pti: create the pti_adjust sysctl Willy Tarreau
@ 2018-01-10 19:28 ` Willy Tarreau
  2018-01-10 19:28 ` [RFC PATCH v3 5/8] exec: take care of disabling PTI upon execve() Willy Tarreau
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Willy Tarreau @ 2018-01-10 19:28 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Willy Tarreau, Andy Lutomirski, Borislav Petkov, Brian Gerst,
	Dave Hansen, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin, Kees Cook

These two prctls adjust the current task flags allowing to disable page
table isolation respectively for the current process or for the one
resulting from the next execve().

Both settings depend on CONFIG_PER_PROCESS_PTI. It is not possible to
set the flags if the pti_adjust sysctl is lower than 1, nor if the task
isn't capable of CAP_SYS_RAWIO, though it is still possible to disable
them.

Setting the flags is not allowed anymore once the task has created new
threads, but it's still possible to disable them.

Signed-off-by: Willy Tarreau <w@1wt.eu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>

v3:
  - depend on CONFIG_PER_PROCESS_PTI
  - switch back to task flags
  - use one task flag for the immediate task (config-based setting) and
    one task flag for the task resulting from the next execve (wrapper-based
    setting)
  - check the pti_adjust sysctl

v2:
  - use {set,clear}_thread_flag() as recommended by Peter
  - use task->mm->context.pti_disable instead of task flag
  - check for mm_users == 1
  - check for CAP_SYS_RAWIO only when setting, not clearing
  - make the code depend on CONFIG_PAGE_TABLE_ISOLATION
---
 arch/x86/include/uapi/asm/prctl.h |  3 +++
 arch/x86/kernel/process_64.c      | 30 ++++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 5a6aac9..1564f98 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -10,6 +10,9 @@
 #define ARCH_GET_CPUID		0x1011
 #define ARCH_SET_CPUID		0x1012
 
+#define ARCH_DISABLE_PTI_NOW	0x1021
+#define ARCH_DISABLE_PTI_NEXT	0x1022
+
 #define ARCH_MAP_VDSO_X32	0x2001
 #define ARCH_MAP_VDSO_32	0x2002
 #define ARCH_MAP_VDSO_64	0x2003
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index c754662..b4de8aa 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -654,7 +654,37 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
 		ret = put_user(base, (unsigned long __user *)arg2);
 		break;
 	}
+#ifdef CONFIG_PER_PROCESS_PTI
+	case ARCH_DISABLE_PTI_NOW:
+		if (!task->mm || atomic_read(&task->mm->mm_users) > 1)
+			return -EPERM;
+
+		if (arg2 && (!capable(CAP_SYS_RAWIO) || pti_adjust < 1))
+			return -EPERM;
+
+		if (doit) {
+			if (arg2)
+				set_thread_flag(TIF_DISABLE_PTI_NOW);
+			else
+				clear_thread_flag(TIF_DISABLE_PTI_NOW);
+		}
+		break;
 
+	case ARCH_DISABLE_PTI_NEXT:
+		if (!task->mm || atomic_read(&task->mm->mm_users) > 1)
+			return -EPERM;
+
+		if (arg2 && (!capable(CAP_SYS_RAWIO) || pti_adjust < 1))
+			return -EPERM;
+
+		if (doit) {
+			if (arg2)
+				set_thread_flag(TIF_DISABLE_PTI_NEXT);
+			else
+				clear_thread_flag(TIF_DISABLE_PTI_NEXT);
+		}
+		break;
+#endif
 #ifdef CONFIG_CHECKPOINT_RESTORE
 # ifdef CONFIG_X86_X32_ABI
 	case ARCH_MAP_VDSO_X32:
-- 
1.7.12.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH v3 5/8] exec: take care of disabling PTI upon execve()
  2018-01-10 19:28 [RFC PATCH v3 0/8] Per process PTI activation Willy Tarreau
                   ` (3 preceding siblings ...)
  2018-01-10 19:28 ` [RFC PATCH v3 4/8] x86/arch_prctl: add ARCH_DISABLE_PTI_{NOW,NEXT} to enable/disable PTI Willy Tarreau
@ 2018-01-10 19:28 ` Willy Tarreau
  2018-01-10 19:28 ` [RFC PATCH v3 6/8] x86/pti: don't mark the user PGD with _PAGE_NX Willy Tarreau
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Willy Tarreau @ 2018-01-10 19:28 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Willy Tarreau, Andy Lutomirski, Borislav Petkov, Brian Gerst,
	Dave Hansen, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin, Kees Cook,
	Alexander Viro

Here's what we do here :
  - TIF_DISABLE_PTI_NOW is always cleared as we don't want an unprotected
    process to pass its lack of protection to any possible other program
    it could exec.

  - TIF_DISABLE_PTI_NEXT is copied into TIF_DISABLE_PTI_NOW and cleared,
    this is used by wrappers to disable PTI for a single exec call.

Thanks to this, PTI-aware programs can adjust TIF_DISABLE_PTI_NOW for
themselves, and a simple wrapper can be implemented by setting
TIF_DISABLE_PTI_NEXT to manage those unable to set TIF_DISABLE_PTI_NOW
themselves.

Signed-off-by: Willy Tarreau <w@1wt.eu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
---
 fs/exec.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index 7eb8d21..cf42ddc 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1814,6 +1814,16 @@ static int do_execveat_common(int fd, struct filename *filename,
 	putname(filename);
 	if (displaced)
 		put_files_struct(displaced);
+
+#ifdef CONFIG_PER_PROCESS_PTI
+	/*
+	 * TIF_DISABLE_PTI_NOW doesn't pass execve(). TIF_DISABLE_PTI_NEXT
+	 * turns into TIF_DISABLE_PTI_NOW and disappears.
+	 */
+	clear_thread_flag(TIF_DISABLE_PTI_NOW);
+	if (test_and_clear_thread_flag(TIF_DISABLE_PTI_NEXT))
+		set_thread_flag(TIF_DISABLE_PTI_NOW);
+#endif
 	return retval;
 
 out:
-- 
1.7.12.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH v3 6/8] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-10 19:28 [RFC PATCH v3 0/8] Per process PTI activation Willy Tarreau
                   ` (4 preceding siblings ...)
  2018-01-10 19:28 ` [RFC PATCH v3 5/8] exec: take care of disabling PTI upon execve() Willy Tarreau
@ 2018-01-10 19:28 ` Willy Tarreau
  2018-01-10 19:54   ` Linus Torvalds
  2018-01-10 20:20   ` Dave Hansen
  2018-01-10 19:28 ` [RFC PATCH v3 7/8] x86/entry/pti: avoid setting CR3 when it's already correct Willy Tarreau
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 19+ messages in thread
From: Willy Tarreau @ 2018-01-10 19:28 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Willy Tarreau, Andy Lutomirski, Borislav Petkov, Brian Gerst,
	Dave Hansen, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin, David Woodhouse,
	Kees Cook

Since we're going to keep running on the same PGD when returning to
userspace for certain performance-critical tasks, we'll need the user
pages to be executable. So this code disables the extra protection
that was added consisting in marking user pages _PAGE_NX so that this
pgd remains usable for userspace.

Note: it isn't necessarily the best approach, but one way or another
      if we want to be able to return to userspace from the kernel,
      we'll have to have this executable anyway. Another approach
      might consist in using another pgd for userland+kernel but
      the current core really looks like an extra careful measure
      to catch early bugs if any.

Note2: Andy's suggestion to instead dynamically disable NX upon
       page fault seems the most appealing.

Signed-off-by: Willy Tarreau <w@1wt.eu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Kees Cook <keescook@chromium.org>
---
 arch/x86/mm/pti.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 8166686f..181d56e 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -34,6 +34,7 @@
 #include <linux/init.h>
 #include <linux/spinlock.h>
 #include <linux/mm.h>
+#include <linux/sysctl.h>
 #include <linux/uaccess.h>
 
 #include <asm/cpufeature.h>
@@ -139,9 +140,11 @@ pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
 	 *  - we don't have NX support
 	 *  - we're clearing the PGD (i.e. the new pgd is not present).
 	 */
+#if 0
 	if ((pgd.pgd & (_PAGE_USER|_PAGE_PRESENT)) == (_PAGE_USER|_PAGE_PRESENT) &&
 	    (__supported_pte_mask & _PAGE_NX))
 		pgd.pgd |= _PAGE_NX;
+#endif
 
 	/* return the copy of the PGD we want the kernel to use: */
 	return pgd;
-- 
1.7.12.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH v3 7/8] x86/entry/pti: avoid setting CR3 when it's already correct
  2018-01-10 19:28 [RFC PATCH v3 0/8] Per process PTI activation Willy Tarreau
                   ` (5 preceding siblings ...)
  2018-01-10 19:28 ` [RFC PATCH v3 6/8] x86/pti: don't mark the user PGD with _PAGE_NX Willy Tarreau
@ 2018-01-10 19:28 ` Willy Tarreau
  2018-01-10 19:28 ` [RFC PATCH v3 8/8] x86/entry/pti: don't switch PGD when TIF_DISABLE_PTI_NOW is set Willy Tarreau
  2018-01-10 19:35 ` [RFC PATCH v3 0/8] Per process PTI activation Linus Torvalds
  8 siblings, 0 replies; 19+ messages in thread
From: Willy Tarreau @ 2018-01-10 19:28 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Willy Tarreau, Andy Lutomirski, Borislav Petkov, Brian Gerst,
	Dave Hansen, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin, Kees Cook

When entering the kernel with CR3 pointing to the kernel's PGD, there's
no need to set it again. This will avoid a TLB flush on syscalls for tasks
running with the kernel's PGD (see next patch).

Signed-off-by: Willy Tarreau <w@1wt.eu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>

v2:
  - updated comments according to Ingo's suggestions
  - split the code to keep only the CR3 changes here
---
 arch/x86/entry/calling.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 45a63e0..19c6790 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -214,6 +214,11 @@
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
 	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
 	mov	%cr3, \scratch_reg
+
+	/* If we're already on the kernel PGD, we don't switch */
+	testq $(PTI_SWITCH_PGTABLES_MASK), \scratch_reg
+	jz .Lend_\@
+
 	ADJUST_KERNEL_CR3 \scratch_reg
 	mov	\scratch_reg, %cr3
 .Lend_\@:
@@ -262,6 +267,14 @@
 	ALTERNATIVE "jmp .Ldone_\@", "", X86_FEATURE_PTI
 	movq	%cr3, \scratch_reg
 	movq	\scratch_reg, \save_reg
+
+	/*
+	 * If we're already on the kernel PGD, we don't switch,
+	 * we just save the current CR3.
+	 */
+	testq $(PTI_SWITCH_PGTABLES_MASK), \scratch_reg
+	jz .Ldone_\@
+
 	/*
 	 * Is the "switch mask" all zero?  That means that both of
 	 * these are zero:
@@ -284,6 +297,13 @@
 .macro RESTORE_CR3 scratch_reg:req save_reg:req
 	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
 
+	/*
+	 * If we saved a kernel context on entry, we didn't switch the CR3,
+	 * so we don't need to restore it on the way out either:
+	 */
+	testq $(PTI_SWITCH_PGTABLES_MASK), \save_reg
+	jz .Lend_\@
+
 	ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
 
 	/*
-- 
1.7.12.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH v3 8/8] x86/entry/pti: don't switch PGD when TIF_DISABLE_PTI_NOW is set
  2018-01-10 19:28 [RFC PATCH v3 0/8] Per process PTI activation Willy Tarreau
                   ` (6 preceding siblings ...)
  2018-01-10 19:28 ` [RFC PATCH v3 7/8] x86/entry/pti: avoid setting CR3 when it's already correct Willy Tarreau
@ 2018-01-10 19:28 ` Willy Tarreau
  2018-01-10 19:35 ` [RFC PATCH v3 0/8] Per process PTI activation Linus Torvalds
  8 siblings, 0 replies; 19+ messages in thread
From: Willy Tarreau @ 2018-01-10 19:28 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Willy Tarreau, Andy Lutomirski, Borislav Petkov, Brian Gerst,
	Dave Hansen, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin, Kees Cook

When a syscall returns to userspace with TIF_DISABLE_PTI_NOW set on
the task, it means it is configured to disable page table isolation
(PTI). In this case, returns from kernel to user will not switch the
CR3, leaving it to the kernel one which already maps both user and
kernel pages. This avoids a TLB flush, and saves another one on next
entry.

Thanks to these changes, haproxy running under KVM went back from
12700 conn/s (without PCID) or 19700 (with PCID) to 23100 once loaded
after calling prctl(), indicating that PTI has no measurable impact on
this workload.

Signed-off-by: Willy Tarreau <w@1wt.eu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>

v3:
  - switched back to using a task flag

v2:
  - use pti_disable instead of task flag
---
 arch/x86/entry/calling.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 19c6790..563478d 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #include <linux/jump_label.h>
+#include <asm/thread_info.h>
 #include <asm/unwind_hints.h>
 #include <asm/cpufeatures.h>
 #include <asm/page_types.h>
@@ -229,6 +230,12 @@
 
 .macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
 	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+
+	/* This task may be exempt from PTI */
+	movq    PER_CPU_VAR(current_task), \scratch_reg
+	btq     $TIF_DISABLE_PTI_NOW, TASK_TI_flags(\scratch_reg)
+	jc      .Lend_\@
+
 	mov	%cr3, \scratch_reg
 
 	ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
-- 
1.7.12.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v3 0/8] Per process PTI activation
  2018-01-10 19:28 [RFC PATCH v3 0/8] Per process PTI activation Willy Tarreau
                   ` (7 preceding siblings ...)
  2018-01-10 19:28 ` [RFC PATCH v3 8/8] x86/entry/pti: don't switch PGD when TIF_DISABLE_PTI_NOW is set Willy Tarreau
@ 2018-01-10 19:35 ` Linus Torvalds
  8 siblings, 0 replies; 19+ messages in thread
From: Linus Torvalds @ 2018-01-10 19:35 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Linux Kernel Mailing List, the arch/x86 maintainers,
	Andy Lutomirski, Borislav Petkov, Brian Gerst, Dave Hansen,
	Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf,
	H. Peter Anvin, Kees Cook, Alexander Viro

On Wed, Jan 10, 2018 at 11:28 AM, Willy Tarreau <w@1wt.eu> wrote:
>
> There are now two prctls :
>   - ARCH_DISABLE_PTI_NOW to disable PTI for the current process. It
>     checks that mm_users <= 1 before proceeding, and only acts if
>     pti_adjust == 1. It's cleared on execve().
>
>   - ARCH_DISABLE_PTI_NEXT to disable PTI after the next execve(). It
>     doesn't change the current process' state and will only set
>     ARCH_DISABLE_PTI_NOW after the next execve() and be cleared. It's
>     made for wrappers.

So I really hate these interfaces. It's fundamentally only good for
long-lived processes.

What if you trust a session, not a single process?


> What remains to be done :
>   - _PAGE_NX is still commented out for now. I'll need some help here
>     if we have to catch a page fault to deal with it. Ingo apparently
>     suggested that probably it doesn't bring any value anymore on modern
>     systems with SMEP.

I still think that the whole "go by thread" is fundamentally complete garbage.

I know Andy asked for it, and I know he's usually right, but he's on
drugs on this one.

Per-thread PTI does not make sense, and it is WRONG. It means, for
example, that you can't just say "we always just use one page table
for this mm", and not even populate the other one.

So per-thread PTI is garbage and must die. Either the VM is protected
or it is not. It's about the mm, not about the thread. Andy didn't
give any sane reasons why it should be per-thread.

Worst comes to worst, and we end up having somebody who really really
has a good reason for it and can articulate it and explain why they
want it, and we can then revisit things. But no. Not in some initial
implementation. That's crazy.

                   Linus

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v3 6/8] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-10 19:28 ` [RFC PATCH v3 6/8] x86/pti: don't mark the user PGD with _PAGE_NX Willy Tarreau
@ 2018-01-10 19:54   ` Linus Torvalds
  2018-01-10 19:59     ` Andy Lutomirski
  2018-01-10 20:20   ` Dave Hansen
  1 sibling, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2018-01-10 19:54 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Linux Kernel Mailing List, the arch/x86 maintainers,
	Andy Lutomirski, Borislav Petkov, Brian Gerst, Dave Hansen,
	Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf,
	H. Peter Anvin, David Woodhouse, Kees Cook

On Wed, Jan 10, 2018 at 11:28 AM, Willy Tarreau <w@1wt.eu> wrote:
> Since we're going to keep running on the same PGD when returning to
> userspace for certain performance-critical tasks, we'll need the user
> pages to be executable. So this code disables the extra protection
> that was added consisting in marking user pages _PAGE_NX so that this
> pgd remains usable for userspace.

Yeah, no. This is wrong.

Sure, SMEP gives the same thing in most cases, but not for older CPU's.

So NX is a really nice way to make sure that PTI really does protect
against user-space gadgets.

We don't break that, and we definitely don't break that just because
of some broken notion of "let's make page table isolation per-thread".

                  Linus

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v3 6/8] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-10 19:54   ` Linus Torvalds
@ 2018-01-10 19:59     ` Andy Lutomirski
  2018-01-10 20:28         ` Woodhouse, David
  0 siblings, 1 reply; 19+ messages in thread
From: Andy Lutomirski @ 2018-01-10 19:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Willy Tarreau, Linux Kernel Mailing List,
	the arch/x86 maintainers, Andy Lutomirski, Borislav Petkov,
	Brian Gerst, Dave Hansen, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin, David Woodhouse,
	Kees Cook

On Wed, Jan 10, 2018 at 11:54 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Jan 10, 2018 at 11:28 AM, Willy Tarreau <w@1wt.eu> wrote:
>> Since we're going to keep running on the same PGD when returning to
>> userspace for certain performance-critical tasks, we'll need the user
>> pages to be executable. So this code disables the extra protection
>> that was added consisting in marking user pages _PAGE_NX so that this
>> pgd remains usable for userspace.
>
> Yeah, no. This is wrong.
>
> Sure, SMEP gives the same thing in most cases, but not for older CPU's.
>
> So NX is a really nice way to make sure that PTI really does protect
> against user-space gadgets.
>
> We don't break that, and we definitely don't break that just because
> of some broken notion of "let's make page table isolation per-thread".
>

If we're going to have a thread without PTI off, that thread needs to
run with the same page tables for kernel and user, so it needs NX off
on the user part.  I don't see any way around it.

We could nix the entire concept of fine-grained PTI control, or we
could make it require SMEP, I suppose.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v3 6/8] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-10 19:28 ` [RFC PATCH v3 6/8] x86/pti: don't mark the user PGD with _PAGE_NX Willy Tarreau
  2018-01-10 19:54   ` Linus Torvalds
@ 2018-01-10 20:20   ` Dave Hansen
  2018-01-11  6:27     ` Willy Tarreau
  1 sibling, 1 reply; 19+ messages in thread
From: Dave Hansen @ 2018-01-10 20:20 UTC (permalink / raw)
  To: Willy Tarreau, linux-kernel, x86
  Cc: Andy Lutomirski, Borislav Petkov, Brian Gerst, Ingo Molnar,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf,
	H. Peter Anvin, David Woodhouse, Kees Cook

On 01/10/2018 11:28 AM, Willy Tarreau wrote:
> Since we're going to keep running on the same PGD when returning to
> userspace for certain performance-critical tasks, we'll need the user
> pages to be executable. So this code disables the extra protection
> that was added consisting in marking user pages _PAGE_NX so that this
> pgd remains usable for userspace.

If you are going to keep pushing this patch, or anything like it, the
least you can do is to describe the downsides.  Describe the SMEP-like
semantics that PTI gives you and describe how this shoots them in the
head for the entire process.

Also describe the reason PTI put this mechanism in place, and how this
shoots _that_ in the head for the entire process.

Granted, you have an RFC on this, but please, for the love of everything
that is good the world, please stop sending this patch set until you
have a halfway reasonable method of dealing with NX that doesn't involve
#ifdefs.  Please.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v3 6/8] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-10 19:59     ` Andy Lutomirski
@ 2018-01-10 20:28         ` Woodhouse, David
  0 siblings, 0 replies; 19+ messages in thread
From: Woodhouse, David @ 2018-01-10 20:28 UTC (permalink / raw)
  To: Andy Lutomirski, Linus Torvalds
  Cc: Willy Tarreau, Linux Kernel Mailing List,
	the arch/x86 maintainers, Borislav Petkov, Brian Gerst,
	Dave Hansen, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Josh Poimboeuf, H. Peter Anvin, Kees Cook

[-- Attachment #1: Type: text/plain, Size: 2168 bytes --]

On Wed, 2018-01-10 at 11:59 -0800, Andy Lutomirski wrote:
> On Wed, Jan 10, 2018 at 11:54 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > 
> > On Wed, Jan 10, 2018 at 11:28 AM, Willy Tarreau <w@1wt.eu> wrote:
> > > 
> > > Since we're going to keep running on the same PGD when returning to
> > > userspace for certain performance-critical tasks, we'll need the user
> > > pages to be executable. So this code disables the extra protection
> > > that was added consisting in marking user pages _PAGE_NX so that this
> > > pgd remains usable for userspace.
> > Yeah, no. This is wrong.
> > 
> > Sure, SMEP gives the same thing in most cases, but not for older CPU's.
> > 
> > So NX is a really nice way to make sure that PTI really does protect
> > against user-space gadgets.
> > 
> > We don't break that, and we definitely don't break that just because
> > of some broken notion of "let's make page table isolation per-thread".
> > 
> If we're going to have a thread without PTI off, that thread needs to
> run with the same page tables for kernel and user, so it needs NX off
> on the user part.  I don't see any way around it.
> 
> We could nix the entire concept of fine-grained PTI control, or we
> could make it require SMEP, I suppose.

We've been bashing out the precise requirements for RSB clearing (for
pre-SKL to avoid bogus entries) or stuffing (for SKL+ to avoid
underflow causing the BTB to be used).

It looks like we can avoid the RSB clearing on kernel entry if we have
SMEP. And PTI setting NX on userspace pages is equivalent to SMEP for
this purpose — so the RSB clearing basically ended up being AMD-only
(!SMEP && !PTI).

We also need to clear the RSB on vmexit, as documented. And if we
really want 100% support for retpoline on SKL+ instead of saying "use
IBRS if you're paranoid", then there are a few more cases where we need
to stuff it to avoid underflow (which is the same operation, but Arjan
insists we should differentiate the two, which is reasonable enough).

So... we'd really like to *not* lose the property that KPTI implies
SMEP-like NX of user space for the kernel.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5210 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v3 6/8] x86/pti: don't mark the user PGD with _PAGE_NX.
@ 2018-01-10 20:28         ` Woodhouse, David
  0 siblings, 0 replies; 19+ messages in thread
From: Woodhouse, David @ 2018-01-10 20:28 UTC (permalink / raw)
  To: torvalds, luto
  Cc: linux-kernel, mingo, peterz, keescook, tglx, dave.hansen,
	jpoimboe, x86, hpa, brgerst, bp, w


[-- Attachment #1.1: Type: text/plain, Size: 2168 bytes --]

On Wed, 2018-01-10 at 11:59 -0800, Andy Lutomirski wrote:
> On Wed, Jan 10, 2018 at 11:54 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > 
> > On Wed, Jan 10, 2018 at 11:28 AM, Willy Tarreau <w@1wt.eu> wrote:
> > > 
> > > Since we're going to keep running on the same PGD when returning to
> > > userspace for certain performance-critical tasks, we'll need the user
> > > pages to be executable. So this code disables the extra protection
> > > that was added consisting in marking user pages _PAGE_NX so that this
> > > pgd remains usable for userspace.
> > Yeah, no. This is wrong.
> > 
> > Sure, SMEP gives the same thing in most cases, but not for older CPU's.
> > 
> > So NX is a really nice way to make sure that PTI really does protect
> > against user-space gadgets.
> > 
> > We don't break that, and we definitely don't break that just because
> > of some broken notion of "let's make page table isolation per-thread".
> > 
> If we're going to have a thread without PTI off, that thread needs to
> run with the same page tables for kernel and user, so it needs NX off
> on the user part.  I don't see any way around it.
> 
> We could nix the entire concept of fine-grained PTI control, or we
> could make it require SMEP, I suppose.

We've been bashing out the precise requirements for RSB clearing (for
pre-SKL to avoid bogus entries) or stuffing (for SKL+ to avoid
underflow causing the BTB to be used).

It looks like we can avoid the RSB clearing on kernel entry if we have
SMEP. And PTI setting NX on userspace pages is equivalent to SMEP for
this purpose — so the RSB clearing basically ended up being AMD-only
(!SMEP && !PTI).

We also need to clear the RSB on vmexit, as documented. And if we
really want 100% support for retpoline on SKL+ instead of saying "use
IBRS if you're paranoid", then there are a few more cases where we need
to stuff it to avoid underflow (which is the same operation, but Arjan
insists we should differentiate the two, which is reasonable enough).

So... we'd really like to *not* lose the property that KPTI implies
SMEP-like NX of user space for the kernel.

[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5210 bytes --]

[-- Attachment #2.1: Type: text/plain, Size: 197 bytes --]




Amazon Web Services UK Limited. Registered in England and Wales with registration number 08650665 and which has its registered office at 60 Holborn Viaduct, London EC1A 2FD, United Kingdom.

[-- Attachment #2.2: Type: text/html, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v3 6/8] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-10 20:28         ` Woodhouse, David
  (?)
@ 2018-01-11  6:23         ` Willy Tarreau
  -1 siblings, 0 replies; 19+ messages in thread
From: Willy Tarreau @ 2018-01-11  6:23 UTC (permalink / raw)
  To: Woodhouse, David
  Cc: torvalds, luto, linux-kernel, mingo, peterz, keescook, tglx,
	dave.hansen, jpoimboe, x86, hpa, brgerst, bp

Hi David,

On Wed, Jan 10, 2018 at 08:28:27PM +0000, Woodhouse, David wrote:
> So... we'd really like to *not* lose the property that KPTI implies
> SMEP-like NX of user space for the kernel.

Don't worry, I find it nice as well and am not trying to kill it. As
mentionned in the "Note" section in the commit message, the current
#ifdef is temporary to make the whole thing work and I'm seeking good
ideas to do it only on unprotected processes. Andy proposed to continue
to do it inconditionally and to catch the page fault upon the first
return to user space and disable it. I like this approach but for now
I don't know how to do it. Another possibility would be that we disable
it when removing the protection on the mm.

Given that most of the discussion till now has been focused on how to
enable/disable the protection I'm leaving this part as-is for now. I'll
change the temporary commit message to make it clearer that it's broken
for now.

Cheers,
Willy

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v3 6/8] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-10 20:20   ` Dave Hansen
@ 2018-01-11  6:27     ` Willy Tarreau
  0 siblings, 0 replies; 19+ messages in thread
From: Willy Tarreau @ 2018-01-11  6:27 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, x86, Andy Lutomirski, Borislav Petkov, Brian Gerst,
	Ingo Molnar, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	Josh Poimboeuf, H. Peter Anvin, David Woodhouse, Kees Cook

On Wed, Jan 10, 2018 at 12:20:05PM -0800, Dave Hansen wrote:
> Granted, you have an RFC on this, but please, for the love of everything
> that is good the world, please stop sending this patch set until you
> have a halfway reasonable method of dealing with NX that doesn't involve
> #ifdefs.  Please.

Dave, people have been disagreeing for a while on two critical points :
  - per-task vs per-mm setting
  - stop or pass execve()

The discussion resulted from the patchset publication with some good
and bad points detected in each case (including your comment about the
conditional jump in the entry code based on CR3).

The method of dealing with NX will result from design choices, so for
me it should be handled last (but I'm find with changing the commit
message to mark it "BROKEN" to avoid confusion).

Thanks,
Willy

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v3 6/8] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-01-10 20:28         ` Woodhouse, David
  (?)
  (?)
@ 2018-02-23 17:58         ` Konrad Rzeszutek Wilk
  2018-02-23 19:30           ` Dave Hansen
  -1 siblings, 1 reply; 19+ messages in thread
From: Konrad Rzeszutek Wilk @ 2018-02-23 17:58 UTC (permalink / raw)
  To: Woodhouse, David
  Cc: Andy Lutomirski, Linus Torvalds, Willy Tarreau,
	Linux Kernel Mailing List, the arch/x86 maintainers,
	Borislav Petkov, Brian Gerst, Dave Hansen, Ingo Molnar,
	Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin,
	Kees Cook

On Wed, Jan 10, 2018 at 08:28:26PM +0000, Woodhouse, David wrote:
> On Wed, 2018-01-10 at 11:59 -0800, Andy Lutomirski wrote:
> > On Wed, Jan 10, 2018 at 11:54 AM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > > 
> > > On Wed, Jan 10, 2018 at 11:28 AM, Willy Tarreau <w@1wt.eu> wrote:
> > > > 
> > > > Since we're going to keep running on the same PGD when returning to
> > > > userspace for certain performance-critical tasks, we'll need the user
> > > > pages to be executable. So this code disables the extra protection
> > > > that was added consisting in marking user pages _PAGE_NX so that this
> > > > pgd remains usable for userspace.
> > > Yeah, no. This is wrong.
> > > 
> > > Sure, SMEP gives the same thing in most cases, but not for older CPU's.
> > > 
> > > So NX is a really nice way to make sure that PTI really does protect
> > > against user-space gadgets.
> > > 
> > > We don't break that, and we definitely don't break that just because
> > > of some broken notion of "let's make page table isolation per-thread".
> > > 
> > If we're going to have a thread without PTI off, that thread needs to
> > run with the same page tables for kernel and user, so it needs NX off
> > on the user part.  I don't see any way around it.
> > 
> > We could nix the entire concept of fine-grained PTI control, or we
> > could make it require SMEP, I suppose.
> 
> We've been bashing out the precise requirements for RSB clearing (for
> pre-SKL to avoid bogus entries) or stuffing (for SKL+ to avoid
> underflow causing the BTB to be used).
> 
> It looks like we can avoid the RSB clearing on kernel entry if we have
> SMEP. And PTI setting NX on userspace pages is equivalent to SMEP for
> this purpose — so the RSB clearing basically ended up being AMD-only
> (!SMEP && !PTI).
> 
> We also need to clear the RSB on vmexit, as documented. And if we
> really want 100% support for retpoline on SKL+ instead of saying "use
> IBRS if you're paranoid", then there are a few more cases where we need
> to stuff it to avoid underflow (which is the same operation, but Arjan
> insists we should differentiate the two, which is reasonable enough).

Are these cases documented somewhere along with what approaches are taken?

I do remember Thomas's email from Feb 4th titled:
Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation                                                          

which outlined a pretty nifty idea on this, but not sure where that has gone?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v3 6/8] x86/pti: don't mark the user PGD with _PAGE_NX.
  2018-02-23 17:58         ` Konrad Rzeszutek Wilk
@ 2018-02-23 19:30           ` Dave Hansen
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Hansen @ 2018-02-23 19:30 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Woodhouse, David
  Cc: Andy Lutomirski, Linus Torvalds, Willy Tarreau,
	Linux Kernel Mailing List, the arch/x86 maintainers,
	Borislav Petkov, Brian Gerst, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Josh Poimboeuf, H. Peter Anvin, Kees Cook

On 02/23/2018 09:58 AM, Konrad Rzeszutek Wilk wrote:
>> We also need to clear the RSB on vmexit, as documented. And if we
>> really want 100% support for retpoline on SKL+ instead of saying "use
>> IBRS if you're paranoid", then there are a few more cases where we need
>> to stuff it to avoid underflow (which is the same operation, but Arjan
>> insists we should differentiate the two, which is reasonable enough).
> Are these cases documented somewhere along with what approaches are taken?

Some of them are documented here: https://goo.gl/pXbvBE

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2018-02-23 19:31 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-10 19:28 [RFC PATCH v3 0/8] Per process PTI activation Willy Tarreau
2018-01-10 19:28 ` [RFC PATCH v3 1/8] x86/thread_info: add TIF_DISABLE_PTI_{NOW,NEXT} to disable PTI per task Willy Tarreau
2018-01-10 19:28 ` [RFC PATCH v3 2/8] x86/pti: add new config option PER_PROCESS_PTI Willy Tarreau
2018-01-10 19:28 ` [RFC PATCH v3 3/8] x86/pti: create the pti_adjust sysctl Willy Tarreau
2018-01-10 19:28 ` [RFC PATCH v3 4/8] x86/arch_prctl: add ARCH_DISABLE_PTI_{NOW,NEXT} to enable/disable PTI Willy Tarreau
2018-01-10 19:28 ` [RFC PATCH v3 5/8] exec: take care of disabling PTI upon execve() Willy Tarreau
2018-01-10 19:28 ` [RFC PATCH v3 6/8] x86/pti: don't mark the user PGD with _PAGE_NX Willy Tarreau
2018-01-10 19:54   ` Linus Torvalds
2018-01-10 19:59     ` Andy Lutomirski
2018-01-10 20:28       ` Woodhouse, David
2018-01-10 20:28         ` Woodhouse, David
2018-01-11  6:23         ` Willy Tarreau
2018-02-23 17:58         ` Konrad Rzeszutek Wilk
2018-02-23 19:30           ` Dave Hansen
2018-01-10 20:20   ` Dave Hansen
2018-01-11  6:27     ` Willy Tarreau
2018-01-10 19:28 ` [RFC PATCH v3 7/8] x86/entry/pti: avoid setting CR3 when it's already correct Willy Tarreau
2018-01-10 19:28 ` [RFC PATCH v3 8/8] x86/entry/pti: don't switch PGD when TIF_DISABLE_PTI_NOW is set Willy Tarreau
2018-01-10 19:35 ` [RFC PATCH v3 0/8] Per process PTI activation Linus Torvalds

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.