* [PATCH 00/26] [RFCv2] x86: Memory Protection Keys
@ 2015-09-16 17:49 ` Dave Hansen
  0 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm

MM reviewers, if you are going to look at one thing, please look
at patch 14, which adds a bunch of additional vma/pte permission
checks.  Everybody else, please take a look at the two syscall
alternatives, especially the non-x86 folk.

This is a second big, fat RFC.  This code is not runnable by
anyone outside of Intel unless they have some special hardware or
a fancy simulator.  If you are interested in running this for
real, please get in touch with me.  Hardware is available to
a very small but nonzero number of people.

Since the last posting, I have implemented almost all of the
"software enforcement" for protection keys.  Basically, in places
where we look at VMA or PTE permissions, we try to enforce
protection keys so that they act similarly to mprotect().  This
is the part of the approach that needs the most review, and it is
almost entirely contained in the "check VMAs and PTEs for
protection keys" patch.

I also implemented a new system call.  There are basically two
possibilities for plumbing protection keys out to userspace.
I've included *both* approaches here:
1. Create a new system call: mprotect_key().  It's mprotect(),
   plus a protection key.  The patches implementing this have
   [NEWSYSCALL] in the subject.  (A hypothetical call is
   sketched just after this list.)
2. Hijack some space in the PROT_* bits and pass a protection key
   in there.  That way, existing system calls like mmap(),
   mprotect(), etc... just work.  The patches implementing this
   have [HIJACKPROT] in the subject and must be applied without
   the [NEWSYSCALL] ones.
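
For illustration, a call under the new-syscall approach might
look like the sketch below.  The exact prototype and syscall
number are whatever the [NEWSYSCALL] patches wire up, so treat
the signature here as an assumption, not the final ABI:

	/*
	 * Hypothetical sketch: mprotect(), plus a protection key.
	 * Assumed prototype (not confirmed by this posting):
	 *   int mprotect_key(void *addr, size_t len, int prot, int pkey);
	 */
	int pkey = 1;	/* one of the 16 keys (0-15) */

	if (mprotect_key(addr, len, PROT_READ | PROT_WRITE, pkey) < 0)
		perror("mprotect_key");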

There is still work left to do here.  Current TODO:
 * Build on something other than x86
 * Do some more exhaustive x86 randconfig tests
 * Make sure DAX mappings work
 * Pound on the modified hot paths to make sure the changes
   have only a limited performance impact.

This set is also available here (with the new syscall):

	git://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-pkeys.git pkeys-v005

A version with the modification of the PROT_ syscalls is tagged
as 'pkeys-v005-protsyscalls'.

=== diffstat (new syscall version) ===

 Documentation/kernel-parameters.txt         |    3 
 Documentation/x86/protection-keys.txt       |   65 ++++++++++++++++++++
 arch/powerpc/include/asm/mman.h             |    5 -
 arch/x86/Kconfig                            |   15 ++++
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 
 arch/x86/include/asm/cpufeature.h           |   54 ++++++++++------
 arch/x86/include/asm/disabled-features.h    |   12 +++
 arch/x86/include/asm/fpu/types.h            |   17 +++++
 arch/x86/include/asm/fpu/xstate.h           |    4 -
 arch/x86/include/asm/mmu_context.h          |   66 ++++++++++++++++++++
 arch/x86/include/asm/pgtable.h              |   37 +++++++++++
 arch/x86/include/asm/pgtable_types.h        |   34 +++++++++-
 arch/x86/include/asm/required-features.h    |    4 +
 arch/x86/include/asm/special_insns.h        |   33 ++++++++++
 arch/x86/include/uapi/asm/mman.h            |   23 +++++++
 arch/x86/include/uapi/asm/processor-flags.h |    2 
 arch/x86/kernel/cpu/common.c                |   27 ++++++++
 arch/x86/kernel/fpu/xstate.c                |   10 ++-
 arch/x86/kernel/process_64.c                |    2 
 arch/x86/kernel/setup.c                     |    9 ++
 arch/x86/mm/fault.c                         |   89 ++++++++++++++++++++++++++--
 arch/x86/mm/gup.c                           |   37 ++++++-----
 drivers/char/agp/frontend.c                 |    2 
 drivers/staging/android/ashmem.c            |    3 
 fs/proc/task_mmu.c                          |    5 +
 include/asm-generic/mm_hooks.h              |   12 +++
 include/linux/mm.h                          |   15 ++++
 include/linux/mman.h                        |    6 -
 include/uapi/asm-generic/siginfo.h          |   11 +++
 mm/Kconfig                                  |   11 +++
 mm/gup.c                                    |   28 +++++++-
 mm/memory.c                                 |    8 +-
 mm/mmap.c                                   |    2 
 mm/mprotect.c                               |   20 +++++-
 35 files changed, 607 insertions(+), 66 deletions(-)

== FEATURE OVERVIEW ==

Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU
feature which will be found in future Intel CPUs.  The work here
was done with the aid of simulators.

Memory Protection Keys provides a mechanism for enforcing
page-based protections, but without requiring modification of the
page tables when an application changes protection domains.  It
works by dedicating 4 previously ignored bits in each page table
entry to a "protection key", giving 16 possible keys to each
page mapping.
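
As a sketch of the encoding (the bit positions come from the
"PTE bits for storing protection key" patch later in this
series, which places the key in PTE bits 59-62; the helper name
here is made up for illustration):

	/* Extract the 4-bit protection key from a raw 64-bit PTE. */
	static inline unsigned int pteval_to_pkey(unsigned long long pteval)
	{
		return (pteval >> 59) & 0xf;	/* 16 possible keys */
	}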

There is also a new user-accessible register (PKRU) with two
separate bits (Access Disable and Write Disable) for each key.
Being a CPU register, PKRU is inherently thread-local,
potentially giving each thread a different set of protections
from every other thread.
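
Concretely, key N's Access Disable bit sits at PKRU bit 2*N and
its Write Disable bit at bit 2*N+1.  A sketch (the macro names
are made up here, not taken from the patches):

	#define PKRU_AD_BIT(pkey)	(1u << (2 * (pkey)))
	#define PKRU_WD_BIT(pkey)	(1u << (2 * (pkey) + 1))

	/* e.g. allow reads but deny writes through key 1: */
	unsigned int pkru = PKRU_WD_BIT(1);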

There are two new instructions (RDPKRU/WRPKRU) for reading and
writing to the new register.  The feature is only available in
64-bit mode, even though there is theoretically space in the PAE
PTEs.  These permissions are enforced on data access only and
have no effect on instruction fetches.
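
For reference, here is a userspace sketch of the two
instructions (the series adds similar kernel wrappers in
special_insns.h).  Assemblers predating the feature lack the
mnemonics, so the opcode bytes are emitted directly; RDPKRU
requires ECX=0 and clobbers EDX, WRPKRU requires ECX=EDX=0:

	static inline unsigned int rdpkru(void)
	{
		unsigned int eax, edx;

		asm volatile(".byte 0x0f,0x01,0xee"	/* rdpkru */
			     : "=a" (eax), "=d" (edx)
			     : "c" (0));
		return eax;
	}

	static inline void wrpkru(unsigned int pkru)
	{
		asm volatile(".byte 0x0f,0x01,0xef"	/* wrpkru */
			     : : "a" (pkru), "c" (0), "d" (0));
	}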



* [PATCH 03/26] x86, pkeys: cpuid bit definition
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


There are two CPUID bits for protection keys.  One indicates
whether the CPU supports the feature, and the other will appear
set once the OS enables protection keys.  Specifically:

	Bit 04: OSPKE. If 1, OS has set CR4.PKE to enable
	Protection keys (and the RDPKRU/WRPKRU instructions)

The OS-enable bit exists because userspace cannot see CR4
contents, but it can see CPUID contents.
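
For example, userspace could detect the feature with something
like the sketch below (using GCC's <cpuid.h>; this is an
illustration, not part of the patch):

	#include <cpuid.h>

	/* Nonzero once the OS has set CR4.PKE (the OSPKE bit). */
	static int pkeys_usable(void)
	{
		unsigned int eax, ebx, ecx, edx;

		/* CPUID.(EAX=07H,ECX=0H):ECX, bit 3=PKU, bit 4=OSPKE */
		__cpuid_count(7, 0, eax, ebx, ecx, edx);
		return !!(ecx & (1u << 4));
	}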

X86_FEATURE_PKU is referred to as "PKU" in the hardware documentation:

	CPUID.(EAX=07H,ECX=0H):ECX.PKU [bit 3]

X86_FEATURE_OSPKE is "OSPKE":

	CPUID.(EAX=07H,ECX=0H):ECX.OSPKE [bit 4]

These are the first CPU features which need to look at the
ECX word in CPUID leaf 0x7, so this patch also includes
fetching that word into the cpuinfo->x86_capability[] array.

Add it to the disabled-features mask when its config option is
off.  Even though we are not using it here, we also extend the
REQUIRED_MASK_BIT_SET() macro to keep it mirroring the
DISABLED_MASK_BIT_SET() version.

This means that in almost all code, you should use:

	cpu_has(c, X86_FEATURE_PKU)

and *not* the CONFIG option.

---

 b/arch/x86/include/asm/cpufeature.h        |   54 +++++++++++++++++------------
 b/arch/x86/include/asm/disabled-features.h |   12 ++++++
 b/arch/x86/include/asm/required-features.h |    4 ++
 b/arch/x86/kernel/cpu/common.c             |    1 
 4 files changed, 50 insertions(+), 21 deletions(-)

diff -puN arch/x86/include/asm/cpufeature.h~pkeys-01-cpuid arch/x86/include/asm/cpufeature.h
--- a/arch/x86/include/asm/cpufeature.h~pkeys-01-cpuid	2015-09-16 10:48:12.424018599 -0700
+++ b/arch/x86/include/asm/cpufeature.h	2015-09-16 10:48:12.433019007 -0700
@@ -12,7 +12,7 @@
 #include <asm/disabled-features.h>
 #endif
 
-#define NCAPINTS	13	/* N 32-bit words worth of info */
+#define NCAPINTS	14	/* N 32-bit words worth of info */
 #define NBUGINTS	1	/* N 32-bit bug flags */
 
 /*
@@ -254,6 +254,10 @@
 /* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:1 (edx), word 12 */
 #define X86_FEATURE_CQM_OCCUP_LLC (12*32+ 0) /* LLC occupancy monitoring if 1 */
 
+/* Intel-defined CPU features, CPUID level 0x00000007:0 (ecx), word 13 */
+#define X86_FEATURE_PKU		(13*32+ 3) /* Protection Keys for Userspace */
+#define X86_FEATURE_OSPKE	(13*32+ 4) /* OS Protection Keys Enable */
+
 /*
  * BUG word(s)
  */
@@ -294,28 +298,36 @@ extern const char * const x86_bug_flags[
 	 test_bit(bit, (unsigned long *)((c)->x86_capability))
 
 #define REQUIRED_MASK_BIT_SET(bit)					\
-	 ( (((bit)>>5)==0 && (1UL<<((bit)&31) & REQUIRED_MASK0)) ||	\
-	   (((bit)>>5)==1 && (1UL<<((bit)&31) & REQUIRED_MASK1)) ||	\
-	   (((bit)>>5)==2 && (1UL<<((bit)&31) & REQUIRED_MASK2)) ||	\
-	   (((bit)>>5)==3 && (1UL<<((bit)&31) & REQUIRED_MASK3)) ||	\
-	   (((bit)>>5)==4 && (1UL<<((bit)&31) & REQUIRED_MASK4)) ||	\
-	   (((bit)>>5)==5 && (1UL<<((bit)&31) & REQUIRED_MASK5)) ||	\
-	   (((bit)>>5)==6 && (1UL<<((bit)&31) & REQUIRED_MASK6)) ||	\
-	   (((bit)>>5)==7 && (1UL<<((bit)&31) & REQUIRED_MASK7)) ||	\
-	   (((bit)>>5)==8 && (1UL<<((bit)&31) & REQUIRED_MASK8)) ||	\
-	   (((bit)>>5)==9 && (1UL<<((bit)&31) & REQUIRED_MASK9)) )
+	 ( (((bit)>>5)==0  && (1UL<<((bit)&31) & REQUIRED_MASK0 )) ||	\
+	   (((bit)>>5)==1  && (1UL<<((bit)&31) & REQUIRED_MASK1 )) ||	\
+	   (((bit)>>5)==2  && (1UL<<((bit)&31) & REQUIRED_MASK2 )) ||	\
+	   (((bit)>>5)==3  && (1UL<<((bit)&31) & REQUIRED_MASK3 )) ||	\
+	   (((bit)>>5)==4  && (1UL<<((bit)&31) & REQUIRED_MASK4 )) ||	\
+	   (((bit)>>5)==5  && (1UL<<((bit)&31) & REQUIRED_MASK5 )) ||	\
+	   (((bit)>>5)==6  && (1UL<<((bit)&31) & REQUIRED_MASK6 )) ||	\
+	   (((bit)>>5)==7  && (1UL<<((bit)&31) & REQUIRED_MASK7 )) ||	\
+	   (((bit)>>5)==8  && (1UL<<((bit)&31) & REQUIRED_MASK8 )) ||	\
+	   (((bit)>>5)==9  && (1UL<<((bit)&31) & REQUIRED_MASK9 )) ||	\
+	   (((bit)>>5)==10 && (1UL<<((bit)&31) & REQUIRED_MASK10)) ||	\
+	   (((bit)>>5)==11 && (1UL<<((bit)&31) & REQUIRED_MASK11)) ||	\
+	   (((bit)>>5)==12 && (1UL<<((bit)&31) & REQUIRED_MASK12)) ||	\
+	   (((bit)>>5)==13 && (1UL<<((bit)&31) & REQUIRED_MASK13)) )
 
 #define DISABLED_MASK_BIT_SET(bit)					\
-	 ( (((bit)>>5)==0 && (1UL<<((bit)&31) & DISABLED_MASK0)) ||	\
-	   (((bit)>>5)==1 && (1UL<<((bit)&31) & DISABLED_MASK1)) ||	\
-	   (((bit)>>5)==2 && (1UL<<((bit)&31) & DISABLED_MASK2)) ||	\
-	   (((bit)>>5)==3 && (1UL<<((bit)&31) & DISABLED_MASK3)) ||	\
-	   (((bit)>>5)==4 && (1UL<<((bit)&31) & DISABLED_MASK4)) ||	\
-	   (((bit)>>5)==5 && (1UL<<((bit)&31) & DISABLED_MASK5)) ||	\
-	   (((bit)>>5)==6 && (1UL<<((bit)&31) & DISABLED_MASK6)) ||	\
-	   (((bit)>>5)==7 && (1UL<<((bit)&31) & DISABLED_MASK7)) ||	\
-	   (((bit)>>5)==8 && (1UL<<((bit)&31) & DISABLED_MASK8)) ||	\
-	   (((bit)>>5)==9 && (1UL<<((bit)&31) & DISABLED_MASK9)) )
+	 ( (((bit)>>5)==0  && (1UL<<((bit)&31) & DISABLED_MASK0 )) ||	\
+	   (((bit)>>5)==1  && (1UL<<((bit)&31) & DISABLED_MASK1 )) ||	\
+	   (((bit)>>5)==2  && (1UL<<((bit)&31) & DISABLED_MASK2 )) ||	\
+	   (((bit)>>5)==3  && (1UL<<((bit)&31) & DISABLED_MASK3 )) ||	\
+	   (((bit)>>5)==4  && (1UL<<((bit)&31) & DISABLED_MASK4 )) ||	\
+	   (((bit)>>5)==5  && (1UL<<((bit)&31) & DISABLED_MASK5 )) ||	\
+	   (((bit)>>5)==6  && (1UL<<((bit)&31) & DISABLED_MASK6 )) ||	\
+	   (((bit)>>5)==7  && (1UL<<((bit)&31) & DISABLED_MASK7 )) ||	\
+	   (((bit)>>5)==8  && (1UL<<((bit)&31) & DISABLED_MASK8 )) ||	\
+	   (((bit)>>5)==9  && (1UL<<((bit)&31) & DISABLED_MASK9 )) ||	\
+	   (((bit)>>5)==10 && (1UL<<((bit)&31) & DISABLED_MASK10)) ||	\
+	   (((bit)>>5)==11 && (1UL<<((bit)&31) & DISABLED_MASK11)) ||	\
+	   (((bit)>>5)==12 && (1UL<<((bit)&31) & DISABLED_MASK12)) ||	\
+	   (((bit)>>5)==13 && (1UL<<((bit)&31) & DISABLED_MASK13)) )
 
 #define cpu_has(c, bit)							\
 	(__builtin_constant_p(bit) && REQUIRED_MASK_BIT_SET(bit) ? 1 :	\
diff -puN arch/x86/include/asm/disabled-features.h~pkeys-01-cpuid arch/x86/include/asm/disabled-features.h
--- a/arch/x86/include/asm/disabled-features.h~pkeys-01-cpuid	2015-09-16 10:48:12.426018689 -0700
+++ b/arch/x86/include/asm/disabled-features.h	2015-09-16 10:48:12.433019007 -0700
@@ -28,6 +28,14 @@
 # define DISABLE_CENTAUR_MCR	0
 #endif /* CONFIG_X86_64 */
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+# define DISABLE_PKU		(1<<(X86_FEATURE_PKU))
+# define DISABLE_OSPKE		(1<<(X86_FEATURE_OSPKE))
+#else
+# define DISABLE_PKU		0
+# define DISABLE_OSPKE		0
+#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -41,5 +49,9 @@
 #define DISABLED_MASK7	0
 #define DISABLED_MASK8	0
 #define DISABLED_MASK9	(DISABLE_MPX)
+#define DISABLED_MASK10	0
+#define DISABLED_MASK11	0
+#define DISABLED_MASK12	0
+#define DISABLED_MASK13	(DISABLE_PKU|DISABLE_OSPKE)
 
 #endif /* _ASM_X86_DISABLED_FEATURES_H */
diff -puN arch/x86/include/asm/required-features.h~pkeys-01-cpuid arch/x86/include/asm/required-features.h
--- a/arch/x86/include/asm/required-features.h~pkeys-01-cpuid	2015-09-16 10:48:12.428018780 -0700
+++ b/arch/x86/include/asm/required-features.h	2015-09-16 10:48:12.433019007 -0700
@@ -92,5 +92,9 @@
 #define REQUIRED_MASK7	0
 #define REQUIRED_MASK8	0
 #define REQUIRED_MASK9	0
+#define REQUIRED_MASK10	0
+#define REQUIRED_MASK11	0
+#define REQUIRED_MASK12	0
+#define REQUIRED_MASK13	0
 
 #endif /* _ASM_X86_REQUIRED_FEATURES_H */
diff -puN arch/x86/kernel/cpu/common.c~pkeys-01-cpuid arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~pkeys-01-cpuid	2015-09-16 10:48:12.429018825 -0700
+++ b/arch/x86/kernel/cpu/common.c	2015-09-16 10:48:12.434019052 -0700
@@ -619,6 +619,7 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
 		cpuid_count(0x00000007, 0, &eax, &ebx, &ecx, &edx);
 
 		c->x86_capability[9] = ebx;
+		c->x86_capability[13] = ecx;
 	}
 
 	/* Extended state features: level 0x0000000d */
_


* [PATCH 02/26] x86, pkeys: Add Kconfig option
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


I don't have a strong opinion on whether we need a Kconfig prompt
or not.  Protection Keys has relatively little code associated
with it, and it is not a heavyweight feature to keep enabled.
However, I can imagine that folks would still appreciate being
able to disable it.

We will hide the prompt for now.

---

 b/arch/x86/Kconfig |    4 ++++
 1 file changed, 4 insertions(+)

diff -puN arch/x86/Kconfig~pkeys-01-kconfig arch/x86/Kconfig
--- a/arch/x86/Kconfig~pkeys-01-kconfig	2015-09-16 10:48:12.006999694 -0700
+++ b/arch/x86/Kconfig	2015-09-16 10:48:12.010999875 -0700
@@ -1694,6 +1694,10 @@ config X86_INTEL_MPX
 
 	  If unsure, say N.
 
+config X86_INTEL_MEMORY_PROTECTION_KEYS
+	def_bool y
+	depends on CPU_SUP_INTEL && X86_64
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI
_


* [PATCH 01/26] x86, fpu: add placeholder for Processor Trace XSAVE state
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


There is an XSAVE state component for Intel Processor Trace.  But,
we do not use it and do not expect to ever use it.

We add a placeholder in the code for it so it is not a mystery and
also so we do not need an explicit enum initialization for Protection
Keys in a moment.

Why will we never use it?  According to Andi Kleen:

The XSAVE support assumes that there is a single buffer for each
thread. But perf generally doesn't work this way, it usually has
only a single perf event per CPU per user, and when tracing
multiple threads on that CPU it inherits perf event buffers between
different threads. So XSAVE per thread cannot handle this inheritance
case directly.

Using multiple XSAVE areas (another one per perf event) would defeat
some of the state caching that the CPUs do.


---

 b/arch/x86/include/asm/fpu/types.h |    1 +
 b/arch/x86/kernel/fpu/xstate.c     |    3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff -puN arch/x86/include/asm/fpu/types.h~pt-xstate-bit arch/x86/include/asm/fpu/types.h
--- a/arch/x86/include/asm/fpu/types.h~pt-xstate-bit	2015-09-16 10:48:11.570979927 -0700
+++ b/arch/x86/include/asm/fpu/types.h	2015-09-16 10:48:11.574980109 -0700
@@ -108,6 +108,7 @@ enum xfeature {
 	XFEATURE_OPMASK,
 	XFEATURE_ZMM_Hi256,
 	XFEATURE_Hi16_ZMM,
+	XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
 
 	XFEATURE_MAX,
 };
diff -puN arch/x86/kernel/fpu/xstate.c~pt-xstate-bit arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~pt-xstate-bit	2015-09-16 10:48:11.571979973 -0700
+++ b/arch/x86/kernel/fpu/xstate.c	2015-09-16 10:48:11.575980154 -0700
@@ -469,7 +469,8 @@ static void check_xstate_against_struct(
 	 * numbers.
 	 */
 	if ((nr < XFEATURE_YMM) ||
-	    (nr >= XFEATURE_MAX)) {
+	    (nr >= XFEATURE_MAX) ||
+	    (nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR)) {
 		WARN_ONCE(1, "no structure for xstate: %d\n", nr);
 		XSTATE_WARN_ON(1);
 	}
_


* [PATCH 04/26] x86, pku: define new CR4 bit
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


There is a new bit in CR4 for enabling protection keys.  We
will actually enable it later in the series.

---

 b/arch/x86/include/uapi/asm/processor-flags.h |    2 ++
 1 file changed, 2 insertions(+)

diff -puN arch/x86/include/uapi/asm/processor-flags.h~pkeys-02-cr4 arch/x86/include/uapi/asm/processor-flags.h
--- a/arch/x86/include/uapi/asm/processor-flags.h~pkeys-02-cr4	2015-09-16 10:48:12.921041130 -0700
+++ b/arch/x86/include/uapi/asm/processor-flags.h	2015-09-16 10:48:12.924041266 -0700
@@ -118,6 +118,8 @@
 #define X86_CR4_SMEP		_BITUL(X86_CR4_SMEP_BIT)
 #define X86_CR4_SMAP_BIT	21 /* enable SMAP support */
 #define X86_CR4_SMAP		_BITUL(X86_CR4_SMAP_BIT)
+#define X86_CR4_PKE_BIT		22 /* enable Protection Keys support */
+#define X86_CR4_PKE		_BITUL(X86_CR4_PKE_BIT)
 
 /*
  * x86-64 Task Priority Register, CR8
_


* [PATCH 07/26] x86, pkeys: new page fault error code bit: PF_PK
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


Note: "PK" is how the Intel SDM refers to this bit, so we also
use that nomenclature.

This only defines the bit, it does not plumb it anywhere to be
handled.
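
As a worked example: a user-mode write to a present page that a
protection key blocks would report error_code 0x27, i.e.
PF_PK (0x20) | PF_USER (0x04) | PF_WRITE (0x02) | PF_PROT (0x01),
the low two bits being the pre-existing present/write bits.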

---

 b/arch/x86/mm/fault.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff -puN arch/x86/mm/fault.c~pkeys-05-pfec arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-05-pfec	2015-09-16 10:48:14.219099976 -0700
+++ b/arch/x86/mm/fault.c	2015-09-16 10:48:14.222100112 -0700
@@ -33,6 +33,7 @@
  *   bit 2 ==	 0: kernel-mode access	1: user-mode access
  *   bit 3 ==				1: use of reserved bit detected
  *   bit 4 ==				1: fault was an instruction fetch
+ *   bit 5 ==				1: protection keys block access
  */
 enum x86_pf_error_code {
 
@@ -41,6 +42,7 @@ enum x86_pf_error_code {
 	PF_USER		=		1 << 2,
 	PF_RSVD		=		1 << 3,
 	PF_INSTR	=		1 << 4,
+	PF_PK		=		1 << 5,
 };
 
 /*
@@ -916,7 +918,10 @@ static int spurious_fault_check(unsigned
 
 	if ((error_code & PF_INSTR) && !pte_exec(*pte))
 		return 0;
-
+	/*
+	 * Note: We do not do lazy flushing on protection key
+	 * changes, so no spurious fault will ever set PF_PK.
+	 */
 	return 1;
 }
 
_


* [PATCH 06/26] x86, pkeys: PTE bits for storing protection key
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


Previous documentation has referred to these 4 bits as "ignored".
That means that software could have made use of them.  But, as
far as I know, the kernel never used them.

They are still ignored when protection keys are not enabled, so
they could theoretically still be used for software purposes.

We also implement "empty" versions so that code that references
them can be optimized away by the compiler when the config
option is not enabled.
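
To illustrate the optimize-away behavior (a sketch, not from the
patch; do_pkey_work() is a made-up stand-in): with the config
option off, all four defines are literal zeroes, so code like

	if (pte_flags(pte) & (_PAGE_PKEY_BIT0 | _PAGE_PKEY_BIT1 |
			      _PAGE_PKEY_BIT2 | _PAGE_PKEY_BIT3))
		do_pkey_work();

reduces to a test against a constant 0 and the compiler can
discard the branch, and do_pkey_work() with it, entirely.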

---

 b/arch/x86/include/asm/pgtable_types.h |   17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff -puN arch/x86/include/asm/pgtable_types.h~pkeys-04-ptebits arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~pkeys-04-ptebits	2015-09-16 10:48:13.805081207 -0700
+++ b/arch/x86/include/asm/pgtable_types.h	2015-09-16 10:48:13.809081388 -0700
@@ -25,7 +25,11 @@
 #define _PAGE_BIT_SPLITTING	_PAGE_BIT_SOFTW2 /* only valid on a PSE pmd */
 #define _PAGE_BIT_HIDDEN	_PAGE_BIT_SOFTW3 /* hidden by kmemcheck */
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
-#define _PAGE_BIT_NX           63       /* No execute: only valid after cpuid check */
+#define _PAGE_BIT_PKEY_BIT0	59       /* Protection Keys, bit 1/4 */
+#define _PAGE_BIT_PKEY_BIT1	60       /* Protection Keys, bit 2/4 */
+#define _PAGE_BIT_PKEY_BIT2	61       /* Protection Keys, bit 3/4 */
+#define _PAGE_BIT_PKEY_BIT3	62       /* Protection Keys, bit 4/4 */
+#define _PAGE_BIT_NX		63       /* No execute: only valid after cpuid check */
 
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
 /* - if the user mapped it with PROT_NONE; pte_present gives true */
@@ -47,6 +51,17 @@
 #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
 #define _PAGE_CPA_TEST	(_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
 #define _PAGE_SPLITTING	(_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+#define _PAGE_PKEY_BIT0	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT0)
+#define _PAGE_PKEY_BIT1	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT1)
+#define _PAGE_PKEY_BIT2	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT2)
+#define _PAGE_PKEY_BIT3	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT3)
+#else
+#define _PAGE_PKEY_BIT0	(_AT(pteval_t, 0))
+#define _PAGE_PKEY_BIT1	(_AT(pteval_t, 0))
+#define _PAGE_PKEY_BIT2	(_AT(pteval_t, 0))
+#define _PAGE_PKEY_BIT3	(_AT(pteval_t, 0))
+#endif
 #define __HAVE_ARCH_PTE_SPECIAL
 
 #ifdef CONFIG_KMEMCHECK
_


* [PATCH 05/26] x86, pkey: add PKRU xsave fields and data structure(s)
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


The protection keys register (PKRU) is saved and restored using
xsave.  Define the data structure that we will use to access it
inside the xsave buffer.

Note that we also have to widen the printk of the xsave feature
masks, since this is feature 0x200 and we previously printed
only two hex digits.
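
(For reference: XFEATURE_PKRU lands at enum value 9, so
XFEATURE_MASK_PKRU is 1 << 9 = 0x200, which takes three hex
digits to print; hence the "%02Lx" -> "%03Lx" change below.)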

---

 b/arch/x86/include/asm/fpu/types.h  |   16 ++++++++++++++++
 b/arch/x86/include/asm/fpu/xstate.h |    4 +++-
 b/arch/x86/kernel/fpu/xstate.c      |    7 ++++++-
 3 files changed, 25 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/fpu/types.h~pkeys-03-xsave arch/x86/include/asm/fpu/types.h
--- a/arch/x86/include/asm/fpu/types.h~pkeys-03-xsave	2015-09-16 10:48:13.337059990 -0700
+++ b/arch/x86/include/asm/fpu/types.h	2015-09-16 10:48:13.344060307 -0700
@@ -109,6 +109,7 @@ enum xfeature {
 	XFEATURE_ZMM_Hi256,
 	XFEATURE_Hi16_ZMM,
 	XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
+	XFEATURE_PKRU,
 
 	XFEATURE_MAX,
 };
@@ -121,6 +122,7 @@ enum xfeature {
 #define XFEATURE_MASK_OPMASK		(1 << XFEATURE_OPMASK)
 #define XFEATURE_MASK_ZMM_Hi256		(1 << XFEATURE_ZMM_Hi256)
 #define XFEATURE_MASK_Hi16_ZMM		(1 << XFEATURE_Hi16_ZMM)
+#define XFEATURE_MASK_PKRU		(1 << XFEATURE_PKRU)
 
 #define XFEATURE_MASK_FPSSE		(XFEATURE_MASK_FP | XFEATURE_MASK_SSE)
 #define XFEATURE_MASK_AVX512		(XFEATURE_MASK_OPMASK \
@@ -213,6 +215,20 @@ struct avx_512_hi16_state {
 	struct reg_512_bit		hi16_zmm[16];
 } __packed;
 
+/*
+ * State component 9: 32-bit PKRU register.
+ */
+struct pkru {
+	u32 pkru;
+} __packed;
+
+struct pkru_state {
+	union {
+		struct pkru		pkru;
+		u8			pad_to_8_bytes[8];
+	};
+} __packed;
+
 struct xstate_header {
 	u64				xfeatures;
 	u64				xcomp_bv;
diff -puN arch/x86/include/asm/fpu/xstate.h~pkeys-03-xsave arch/x86/include/asm/fpu/xstate.h
--- a/arch/x86/include/asm/fpu/xstate.h~pkeys-03-xsave	2015-09-16 10:48:13.339060081 -0700
+++ b/arch/x86/include/asm/fpu/xstate.h	2015-09-16 10:48:13.344060307 -0700
@@ -27,7 +27,9 @@
 				 XFEATURE_MASK_Hi16_ZMM)
 
 /* Supported features which require eager state saving */
-#define XFEATURE_MASK_EAGER	(XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR)
+#define XFEATURE_MASK_EAGER	(XFEATURE_MASK_BNDREGS | \
+				 XFEATURE_MASK_BNDCSR | \
+				 XFEATURE_MASK_PKRU)
 
 /* All currently supported features */
 #define XCNTXT_MASK	(XFEATURE_MASK_LAZY | XFEATURE_MASK_EAGER)
diff -puN arch/x86/kernel/fpu/xstate.c~pkeys-03-xsave arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~pkeys-03-xsave	2015-09-16 10:48:13.340060126 -0700
+++ b/arch/x86/kernel/fpu/xstate.c	2015-09-16 10:48:13.344060307 -0700
@@ -23,6 +23,8 @@ static const char *xfeature_names[] =
 	"AVX-512 opmask"		,
 	"AVX-512 Hi256"			,
 	"AVX-512 ZMM_Hi256"		,
+	"unknown xstate feature (8)"	,
+	"Protection Keys User registers",
 	"unknown xstate feature"	,
 };
 
@@ -52,6 +54,7 @@ void fpu__xstate_clear_all_cpu_caps(void
 	setup_clear_cpu_cap(X86_FEATURE_AVX512ER);
 	setup_clear_cpu_cap(X86_FEATURE_AVX512CD);
 	setup_clear_cpu_cap(X86_FEATURE_MPX);
+	setup_clear_cpu_cap(X86_FEATURE_PKU);
 }
 
 /*
@@ -230,7 +233,7 @@ static void __init print_xstate_feature(
 	const char *feature_name;
 
 	if (cpu_has_xfeatures(xstate_mask, &feature_name))
-		pr_info("x86/fpu: Supporting XSAVE feature 0x%02Lx: '%s'\n", xstate_mask, feature_name);
+		pr_info("x86/fpu: Supporting XSAVE feature 0x%03Lx: '%s'\n", xstate_mask, feature_name);
 }
 
 /*
@@ -246,6 +249,7 @@ static void __init print_xstate_features
 	print_xstate_feature(XFEATURE_MASK_OPMASK);
 	print_xstate_feature(XFEATURE_MASK_ZMM_Hi256);
 	print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
+	print_xstate_feature(XFEATURE_MASK_PKRU);
 }
 
 /*
@@ -462,6 +466,7 @@ static void check_xstate_against_struct(
 	XCHECK_SZ(sz, nr, XFEATURE_OPMASK,    struct avx_512_opmask_state);
 	XCHECK_SZ(sz, nr, XFEATURE_ZMM_Hi256, struct avx_512_zmm_uppers_state);
 	XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM,  struct avx_512_hi16_state);
+	XCHECK_SZ(sz, nr, XFEATURE_PKRU,      struct pkru_state);
 
 	/*
 	 * Make *SURE* to add any feature numbers in below if
_


* [PATCH 10/26] x86, pkeys: notify userspace about protection key faults
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


A protection key fault is very similar to any other access
error.  There must be a VMA, etc...  We even want to take
the same action (SIGSEGV) that we do with a normal access
fault.

However, we do need to let userspace know that something
is different.  We do this the same way we did with
SEGV_BNDERR for Memory Protection eXtensions (MPX):
define a new SEGV code: SEGV_PKUERR.

We also add a siginfo field, si_pkey, that reveals to
userspace which protection key was set on the PTE we
faulted on.  There is no other easy way for userspace
to figure this out.  They could parse smaps, but that
would be a bit cruel.
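
For illustration, a userspace consumer might look like the
sketch below.  This assumes the new field is exposed to
userspace as si->si_pkey; the exact libc plumbing is beyond
this series:

	#include <signal.h>
	#include <stdio.h>
	#include <string.h>

	static void segv_handler(int sig, siginfo_t *si, void *ctx)
	{
		/* fprintf() is not async-signal-safe; demo only. */
		if (si->si_code == SEGV_PKUERR)
			fprintf(stderr, "pkey %d blocked access at %p\n",
				si->si_pkey, si->si_addr);
	}

	static void install_handler(void)
	{
		struct sigaction sa;

		memset(&sa, 0, sizeof(sa));
		sa.sa_sigaction = segv_handler;
		sa.sa_flags = SA_SIGINFO;
		sigemptyset(&sa.sa_mask);
		sigaction(SIGSEGV, &sa, NULL);
	}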

---

 b/arch/x86/include/asm/mmu_context.h   |   15 ++++++++++
 b/arch/x86/include/asm/pgtable.h       |   10 ++++++
 b/arch/x86/include/asm/pgtable_types.h |    5 +++
 b/arch/x86/mm/fault.c                  |   49 ++++++++++++++++++++++++++++++++-
 b/include/linux/mm.h                   |    2 +
 b/include/uapi/asm-generic/siginfo.h   |   11 ++++++-
 b/mm/memory.c                          |    4 +-
 7 files changed, 92 insertions(+), 4 deletions(-)

diff -puN arch/x86/include/asm/mmu_context.h~pkeys-09-siginfo arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~pkeys-09-siginfo	2015-09-16 10:48:15.575161451 -0700
+++ b/arch/x86/include/asm/mmu_context.h	2015-09-16 10:48:15.589162086 -0700
@@ -243,4 +243,19 @@ static inline void arch_unmap(struct mm_
 		mpx_notify_unmap(mm, vma, start, end);
 }
 
+static inline u16 vma_pkey(struct vm_area_struct *vma)
+{
+	u16 pkey = 0;
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+	unsigned long f = vma->vm_flags;
+	pkey |= (!!(f & VM_HIGH_ARCH_0)) << 0;
+	pkey |= (!!(f & VM_HIGH_ARCH_1)) << 1;
+	pkey |= (!!(f & VM_HIGH_ARCH_2)) << 2;
+	pkey |= (!!(f & VM_HIGH_ARCH_3)) << 3;
+#endif
+
+	return pkey;
+}
+
+
 #endif /* _ASM_X86_MMU_CONTEXT_H */
diff -puN arch/x86/include/asm/pgtable.h~pkeys-09-siginfo arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h~pkeys-09-siginfo	2015-09-16 10:48:15.577161542 -0700
+++ b/arch/x86/include/asm/pgtable.h	2015-09-16 10:48:15.590162131 -0700
@@ -881,6 +881,16 @@ static inline pte_t pte_swp_clear_soft_d
 }
 #endif
 
+static inline u32 pte_pkey(pte_t pte)
+{
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+	/* ifdef to avoid doing 59-bit shift on 32-bit values */
+	return (pte_flags(pte) & _PAGE_PKEY_MASK) >> _PAGE_BIT_PKEY_BIT0;
+#else
+	return 0;
+#endif
+}
+
 #include <asm-generic/pgtable.h>
 #endif	/* __ASSEMBLY__ */
 
diff -puN arch/x86/include/asm/pgtable_types.h~pkeys-09-siginfo arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~pkeys-09-siginfo	2015-09-16 10:48:15.579161632 -0700
+++ b/arch/x86/include/asm/pgtable_types.h	2015-09-16 10:48:15.590162131 -0700
@@ -64,6 +64,11 @@
 #endif
 #define __HAVE_ARCH_PTE_SPECIAL
 
+#define _PAGE_PKEY_MASK (_PAGE_PKEY_BIT0 | \
+			 _PAGE_PKEY_BIT1 | \
+			 _PAGE_PKEY_BIT2 | \
+			 _PAGE_PKEY_BIT3)
+
 #ifdef CONFIG_KMEMCHECK
 #define _PAGE_HIDDEN	(_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN)
 #else
diff -puN arch/x86/mm/fault.c~pkeys-09-siginfo arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-09-siginfo	2015-09-16 10:48:15.580161678 -0700
+++ b/arch/x86/mm/fault.c	2015-09-16 10:48:15.591162177 -0700
@@ -15,12 +15,14 @@
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
 
+#include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
 #include <asm/pgalloc.h>		/* pgd_*(), ...			*/
 #include <asm/kmemcheck.h>		/* kmemcheck_*(), ...		*/
 #include <asm/fixmap.h>			/* VSYSCALL_ADDR		*/
 #include <asm/vsyscall.h>		/* emulate_vsyscall		*/
 #include <asm/vm86.h>			/* struct vm86			*/
+#include <asm/mmu_context.h>		/* vma_pkey()			*/
 
 #define CREATE_TRACE_POINTS
 #include <asm/trace/exceptions.h>
@@ -169,6 +171,45 @@ is_prefetch(struct pt_regs *regs, unsign
 	return prefetch;
 }
 
+static u16 fetch_pkey(unsigned long address, struct task_struct *tsk)
+{
+	u16 ret;
+	spinlock_t *ptl;
+	pte_t *ptep;
+	pte_t pte;
+	int follow_ret;
+
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return 0;
+
+	follow_ret = follow_pte(tsk->mm, address, &ptep, &ptl);
+	if (!follow_ret) {
+		/*
+		 * On a successful follow, make sure to
+		 * drop the lock.
+		 */
+		pte = *ptep;
+		pte_unmap_unlock(ptep, ptl);
+		ret = pte_pkey(pte);
+	} else {
+		/*
+		 * There is no PTE.  Go looking for the pkey in
+		 * the VMA.  If we did not find a pkey violation
+		 * from either the PTE or the VMA, then it must
+		 * have been a fault from the hardware.  Perhaps
+		 * the PTE got zapped before we got in here.
+		 */
+		struct vm_area_struct *vma = find_vma(tsk->mm, address);
+		if (vma) {
+			ret = vma_pkey(vma);
+		} else {
+			WARN_ONCE(1, "no PTE or VMA @ %lx\n", address);
+			ret = 0;
+		}
+	}
+	return ret;
+}
+
 static void
 force_sig_info_fault(int si_signo, int si_code, unsigned long address,
 		     struct task_struct *tsk, int fault)
@@ -186,6 +227,9 @@ force_sig_info_fault(int si_signo, int s
 		lsb = PAGE_SHIFT;
 	info.si_addr_lsb = lsb;
 
+	if (boot_cpu_has(X86_FEATURE_OSPKE) && si_code == SEGV_PKUERR)
+		info.si_pkey = fetch_pkey(address, tsk);
+
 	force_sig_info(si_signo, &info, tsk);
 }
 
@@ -842,7 +886,10 @@ static noinline void
 bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
 		      unsigned long address)
 {
-	__bad_area(regs, error_code, address, SEGV_ACCERR);
+	if (boot_cpu_has(X86_FEATURE_OSPKE) && (error_code & PF_PK))
+		__bad_area(regs, error_code, address, SEGV_PKUERR);
+	else
+		__bad_area(regs, error_code, address, SEGV_ACCERR);
 }
 
 static void
diff -puN include/linux/mm.h~pkeys-09-siginfo include/linux/mm.h
--- a/include/linux/mm.h~pkeys-09-siginfo	2015-09-16 10:48:15.582161768 -0700
+++ b/include/linux/mm.h	2015-09-16 10:48:15.591162177 -0700
@@ -1160,6 +1160,8 @@ void unmap_mapping_range(struct address_
 		loff_t const holebegin, loff_t const holelen, int even_cows);
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	unsigned long *pfn);
+int follow_pte(struct mm_struct *mm, unsigned long address,
+	pte_t **ptepp, spinlock_t **ptlp);
 int follow_phys(struct vm_area_struct *vma, unsigned long address,
 		unsigned int flags, unsigned long *prot, resource_size_t *phys);
 int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
diff -puN include/uapi/asm-generic/siginfo.h~pkeys-09-siginfo include/uapi/asm-generic/siginfo.h
--- a/include/uapi/asm-generic/siginfo.h~pkeys-09-siginfo	2015-09-16 10:48:15.584161859 -0700
+++ b/include/uapi/asm-generic/siginfo.h	2015-09-16 10:48:15.592162222 -0700
@@ -95,6 +95,13 @@ typedef struct siginfo {
 				void __user *_lower;
 				void __user *_upper;
 			} _addr_bnd;
+			int _pkey; /* FIXME: protection key value??
+				    * Do we really need this in here?
+				    * userspace can get the PKRU value in
+				    * the signal handler, but they do not
+				    * easily have access to the PKEY value
+				    * from the PTE.
+				    */
 		} _sigfault;
 
 		/* SIGPOLL */
@@ -137,6 +144,7 @@ typedef struct siginfo {
 #define si_addr_lsb	_sifields._sigfault._addr_lsb
 #define si_lower	_sifields._sigfault._addr_bnd._lower
 #define si_upper	_sifields._sigfault._addr_bnd._upper
+#define si_pkey		_sifields._sigfault._pkey
 #define si_band		_sifields._sigpoll._band
 #define si_fd		_sifields._sigpoll._fd
 #ifdef __ARCH_SIGSYS
@@ -206,7 +214,8 @@ typedef struct siginfo {
 #define SEGV_MAPERR	(__SI_FAULT|1)	/* address not mapped to object */
 #define SEGV_ACCERR	(__SI_FAULT|2)	/* invalid permissions for mapped object */
 #define SEGV_BNDERR	(__SI_FAULT|3)  /* failed address bound checks */
-#define NSIGSEGV	3
+#define SEGV_PKUERR	(__SI_FAULT|4)  /* failed protection key checks */
+#define NSIGSEGV	4
 
 /*
  * SIGBUS si_codes
diff -puN mm/memory.c~pkeys-09-siginfo mm/memory.c
--- a/mm/memory.c~pkeys-09-siginfo	2015-09-16 10:48:15.585161904 -0700
+++ b/mm/memory.c	2015-09-16 10:48:15.593162267 -0700
@@ -3548,8 +3548,8 @@ out:
 	return -EINVAL;
 }
 
-static inline int follow_pte(struct mm_struct *mm, unsigned long address,
-			     pte_t **ptepp, spinlock_t **ptlp)
+int follow_pte(struct mm_struct *mm, unsigned long address,
+		     pte_t **ptepp, spinlock_t **ptlp)
 {
 	int res;
 
_

^ permalink raw reply	[flat|nested] 172+ messages in thread

* [PATCH 08/26] x86, pkeys: store protection in high VMA flags
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


vma->vm_flags is an 'unsigned long', so it only has space for
32 flags on 32-bit architectures; the high 32 bits are unused
on 64-bit platforms.  We have steered away from using those
high VMA bits because they would be difficult to support on
32-bit.

Protection Keys are not available in 32-bit mode, so there is
no concern about supporting this feature in 32-bit mode or on
32-bit CPUs.

This patch carves out 4 bits from the high half of
vma->vm_flags and allows architectures to set a config option
to make them available.

Sparse complains about these constants unless we explicitly
call them "UL".

---

 b/arch/x86/Kconfig   |    1 +
 b/include/linux/mm.h |    7 +++++++
 b/mm/Kconfig         |    3 +++
 3 files changed, 11 insertions(+)

diff -puN arch/x86/Kconfig~pkeys-07-eat-high-vma-flags arch/x86/Kconfig
--- a/arch/x86/Kconfig~pkeys-07-eat-high-vma-flags	2015-09-16 10:48:14.638118972 -0700
+++ b/arch/x86/Kconfig	2015-09-16 10:48:14.646119334 -0700
@@ -152,6 +152,7 @@ config X86
 	select VIRT_TO_BUS
 	select X86_DEV_DMA_OPS			if X86_64
 	select X86_FEATURE_NAMES		if PROC_FS
+	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
 
 config INSTRUCTION_DECODER
 	def_bool y
diff -puN include/linux/mm.h~pkeys-07-eat-high-vma-flags include/linux/mm.h
--- a/include/linux/mm.h~pkeys-07-eat-high-vma-flags	2015-09-16 10:48:14.640119062 -0700
+++ b/include/linux/mm.h	2015-09-16 10:48:14.647119380 -0700
@@ -157,6 +157,13 @@ extern unsigned int kobjsize(const void
 #define VM_NOHUGEPAGE	0x40000000	/* MADV_NOHUGEPAGE marked this vma */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
 
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+#define VM_HIGH_ARCH_0  0x100000000UL	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_1  0x200000000UL	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_2  0x400000000UL	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_3  0x800000000UL	/* bit only usable on 64-bit architectures */
+#endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
+
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
 #elif defined(CONFIG_PPC)
diff -puN mm/Kconfig~pkeys-07-eat-high-vma-flags mm/Kconfig
--- a/mm/Kconfig~pkeys-07-eat-high-vma-flags	2015-09-16 10:48:14.642119153 -0700
+++ b/mm/Kconfig	2015-09-16 10:48:14.647119380 -0700
@@ -680,3 +680,6 @@ config ZONE_DEVICE
 
 config FRAME_VECTOR
 	bool
+
+config ARCH_USES_HIGH_VMA_FLAGS
+	bool
_

^ permalink raw reply	[flat|nested] 172+ messages in thread

* [PATCH 11/26] x86, pkeys: add functions for set/fetch PKRU
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


This adds the raw instructions to access PKRU as well as some
accessor functions that correctly handle the case where the
CPU does not support the instructions.  We don't use them here,
but we will use read_pkru() in the next patch.

I do not see an immediate use for write_pkru().  But, we put it
here for parity with its twin.
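
For reference, PKRU holds two bits per protection key: the even
bit access-disables the key, the odd bit write-disables it.
This is the layout that the __pkru_allows_read() and
__pkru_allows_write() helpers later in the series rely on.  A
minimal sketch of decoding it for a given pkey:

	/* sketch: decode the PKRU permission bits for one pkey */
	u32 pkru = read_pkru();
	int access_disabled = (pkru >> (pkey * 2)) & 1;
	int write_disabled  = (pkru >> (pkey * 2 + 1)) & 1;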

---

 b/arch/x86/include/asm/pgtable.h       |   15 +++++++++++++++
 b/arch/x86/include/asm/special_insns.h |   33 +++++++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+)

diff -puN arch/x86/include/asm/pgtable.h~pkeys-13-kernel-pkru-instructions arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h~pkeys-13-kernel-pkru-instructions	2015-09-16 10:48:16.151187564 -0700
+++ b/arch/x86/include/asm/pgtable.h	2015-09-16 10:48:16.156187791 -0700
@@ -881,6 +881,21 @@ static inline pte_t pte_swp_clear_soft_d
 }
 #endif
 
+
+static inline u32 read_pkru(void)
+{
+	if (boot_cpu_has(X86_FEATURE_OSPKE))
+		return __read_pkru();
+	return 0;
+}
+static inline void write_pkru(u32 pkru)
+{
+	if (boot_cpu_has(X86_FEATURE_OSPKE))
+		__write_pkru(pkru);
+	else
+		VM_WARN_ON_ONCE(pkru);
+}
+
 static inline u32 pte_pkey(pte_t pte)
 {
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
diff -puN arch/x86/include/asm/special_insns.h~pkeys-13-kernel-pkru-instructions arch/x86/include/asm/special_insns.h
--- a/arch/x86/include/asm/special_insns.h~pkeys-13-kernel-pkru-instructions	2015-09-16 10:48:16.152187610 -0700
+++ b/arch/x86/include/asm/special_insns.h	2015-09-16 10:48:16.156187791 -0700
@@ -98,6 +98,39 @@ static inline void native_write_cr8(unsi
 }
 #endif
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+static inline u32 __read_pkru(void)
+{
+        unsigned int eax, edx;
+        unsigned int ecx = 0;
+        unsigned int pkru;
+
+        asm volatile(".byte 0x0f,0x01,0xee\n\t"
+                     : "=a" (eax), "=d" (edx)
+                     : "c" (ecx));
+        pkru = eax;
+        return pkru;
+}
+
+static inline void __write_pkru(u32 pkru)
+{
+        unsigned int eax = pkru;
+        unsigned int ecx = 0;
+        unsigned int edx = 0;
+
+        asm volatile(".byte 0x0f,0x01,0xef\n\t"
+                     : : "a" (eax), "c" (ecx), "d" (edx));
+}
+#else
+static inline u32 __read_pkru(void)
+{
+	return 0;
+}
+static inline void __write_pkru(u32 pkru)
+{
+}
+#endif
+
 static inline void native_wbinvd(void)
 {
 	asm volatile("wbinvd": : :"memory");
_

^ permalink raw reply	[flat|nested] 172+ messages in thread

* [PATCH 09/26] x86, pkeys: arch-specific protection bits
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


Lots of things seem to do:

        vma->vm_page_prot = vm_get_page_prot(flags);

and the ptes get created right from things we pull out
of ->vm_page_prot.  So it is very convenient if we can
store the protection key in flags and vm_page_prot, just
like the existing permission bits (_PAGE_RW/PRESENT).  It
greatly reduces the amount of plumbing and arch-specific
hacking we have to do in generic code.

This also takes the new PROT_PKEY{0,1,2,3} flags and
turns *those* into VM_ flags for vma->vm_flags.

The protection key values are stored in 4 places:
	1. "prot" argument to system calls
	2. vma->vm_flags, filled from the mmap "prot"
	3. vma->vm_page_prot, filled from vma->vm_flags
	4. the PTE itself.

The pseudocode for these four steps is as follows:

	mmap(PROT_PKEY*)
	vma->vm_flags 	  = ... | arch_calc_vm_prot_bits(mmap_prot);
	vma->vm_page_prot = ... | arch_vm_get_page_prot(vma->vm_flags);
	pte = pfn | vma->vm_page_prot

Note that this provides a new definition for x86:

	arch_vm_get_page_prot()
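
As a worked (hypothetical) example using the [HIJACKPROT]
PROT_PKEY* flags, a protection key of 6 (binary 0110) flows
through the four representations like this:

	mmap(..., PROT_READ | PROT_PKEY1 | PROT_PKEY2, ...);
	/* 1. prot argument:  PROT_PKEY1 | PROT_PKEY2             */
	/* 2. vma->vm_flags:  VM_PKEY_BIT1 | VM_PKEY_BIT2         */
	/* 3. vm_page_prot:   _PAGE_PKEY_BIT1 | _PAGE_PKEY_BIT2   */
	/* 4. pte:            bits 60 and 61 set                  */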

---

 b/arch/x86/include/asm/pgtable_types.h |   12 ++++++++++--
 b/arch/x86/include/uapi/asm/mman.h     |   16 ++++++++++++++++
 b/include/linux/mm.h                   |    6 ++++++
 3 files changed, 32 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/pgtable_types.h~pkeys-08-store-pkey-in-vma arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~pkeys-08-store-pkey-in-vma	2015-09-16 10:48:15.105140143 -0700
+++ b/arch/x86/include/asm/pgtable_types.h	2015-09-16 10:48:15.112140461 -0700
@@ -111,7 +111,12 @@
 #define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED |	\
 			 _PAGE_DIRTY)
 
-/* Set of bits not changed in pte_modify */
+/*
+ * Set of bits not changed in pte_modify.  The pte's
+ * protection key is treated like _PAGE_RW, for
+ * instance, and is *not* included in this mask since
+ * pte_modify() does modify it.
+ */
 #define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
 			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
 			 _PAGE_SOFT_DIRTY)
@@ -227,7 +232,10 @@ enum page_cache_mode {
 /* PTE_PFN_MASK extracts the PFN from a (pte|pmd|pud|pgd)val_t */
 #define PTE_PFN_MASK		((pteval_t)PHYSICAL_PAGE_MASK)
 
-/* PTE_FLAGS_MASK extracts the flags from a (pte|pmd|pud|pgd)val_t */
+/*
+ *  PTE_FLAGS_MASK extracts the flags from a (pte|pmd|pud|pgd)val_t
+ *  This includes the protection key value.
+ */
 #define PTE_FLAGS_MASK		(~PTE_PFN_MASK)
 
 typedef struct pgprot { pgprotval_t pgprot; } pgprot_t;
diff -puN arch/x86/include/uapi/asm/mman.h~pkeys-08-store-pkey-in-vma arch/x86/include/uapi/asm/mman.h
--- a/arch/x86/include/uapi/asm/mman.h~pkeys-08-store-pkey-in-vma	2015-09-16 10:48:15.107140234 -0700
+++ b/arch/x86/include/uapi/asm/mman.h	2015-09-16 10:48:15.112140461 -0700
@@ -6,6 +6,22 @@
 #define MAP_HUGE_2MB    (21 << MAP_HUGE_SHIFT)
 #define MAP_HUGE_1GB    (30 << MAP_HUGE_SHIFT)
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+/*
+ * Take the 4 protection key bits out of the vma->vm_flags
+ * value and turn them in to the bits that we can put in
+ * to a pte.
+ *
+ * Only override these if Protection Keys are available
+ * (which is only on 64-bit).
+ */
+#define arch_vm_get_page_prot(vm_flags)	__pgprot(	\
+		((vm_flags) & VM_PKEY_BIT0 ? _PAGE_PKEY_BIT0 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT1 ? _PAGE_PKEY_BIT1 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT2 ? _PAGE_PKEY_BIT2 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))
+#endif
+
 #include <asm-generic/mman.h>
 
 #endif /* _ASM_X86_MMAN_H */
diff -puN include/linux/mm.h~pkeys-08-store-pkey-in-vma include/linux/mm.h
--- a/include/linux/mm.h~pkeys-08-store-pkey-in-vma	2015-09-16 10:48:15.109140325 -0700
+++ b/include/linux/mm.h	2015-09-16 10:48:15.113140506 -0700
@@ -166,6 +166,12 @@ extern unsigned int kobjsize(const void
 
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
+#if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)
+# define VM_PKEY_BIT0	VM_HIGH_ARCH_0	/* A protection key is a 4-bit value */
+# define VM_PKEY_BIT1	VM_HIGH_ARCH_1
+# define VM_PKEY_BIT2	VM_HIGH_ARCH_2
+# define VM_PKEY_BIT3	VM_HIGH_ARCH_3
+#endif
 #elif defined(CONFIG_PPC)
 # define VM_SAO		VM_ARCH_1	/* Strong Access Ordering (powerpc) */
 #elif defined(CONFIG_PARISC)
_

^ permalink raw reply	[flat|nested] 172+ messages in thread

* [PATCH 12/26] mm: factor out VMA fault permission checking
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


This code matches a fault condition up with the VMA and ensures
that the VMA allows the fault to be handled instead of just
erroring out.

We will be extending this in a moment to comprehend protection
keys.
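
A minimal sketch of the call pattern this enables (mirroring
the fixup_user_fault() conversion below):

	/*
	 * sketch: refuse a write fault against a VMA that cannot
	 * satisfy it, before calling in to handle_mm_fault()
	 */
	if (!vma_permits_fault(vma, FAULT_FLAG_WRITE))
		return -EFAULT;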

---

 b/mm/gup.c |   15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff -puN mm/gup.c~pkeys-10-pte-fault mm/gup.c
--- a/mm/gup.c~pkeys-10-pte-fault	2015-09-16 10:48:16.591207512 -0700
+++ b/mm/gup.c	2015-09-16 10:48:16.595207693 -0700
@@ -554,6 +554,17 @@ next_page:
 }
 EXPORT_SYMBOL(__get_user_pages);
 
+bool vma_permits_fault(struct vm_area_struct *vma, unsigned int fault_flags)
+{
+        vm_flags_t vm_flags =
+		(fault_flags & FAULT_FLAG_WRITE) ? VM_WRITE : VM_READ;
+
+	if (!(vm_flags & vma->vm_flags))
+		return false;
+
+	return true;
+}
+
 /*
  * fixup_user_fault() - manually resolve a user page fault
  * @tsk:	the task_struct to use for page fault accounting, or
@@ -585,15 +596,13 @@ int fixup_user_fault(struct task_struct
 		     unsigned long address, unsigned int fault_flags)
 {
 	struct vm_area_struct *vma;
-	vm_flags_t vm_flags;
 	int ret;
 
 	vma = find_extend_vma(mm, address);
 	if (!vma || address < vma->vm_start)
 		return -EFAULT;
 
-	vm_flags = (fault_flags & FAULT_FLAG_WRITE) ? VM_WRITE : VM_READ;
-	if (!(vm_flags & vma->vm_flags))
+	if (!vma_permits_fault(vma, fault_flags))
 		return -EFAULT;
 
 	ret = handle_mm_fault(mm, vma, address, fault_flags);
_

^ permalink raw reply	[flat|nested] 172+ messages in thread

* [PATCH 14/26] x86, pkeys: check VMAs and PTEs for protection keys
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


Today, for normal faults and page table walks, we check the VMA
and/or PTE to ensure that it is compatible with the action.  For
instance, if we get a write fault on a non-writeable VMA, we
SIGSEGV.

We try to do the same thing for protection keys.  Basically, we
try to make sure that if a user does this:

	mprotect(ptr, size, PROT_NONE);
	*ptr = foo;

they see the same effects with protection keys when they do this:

	mprotect(ptr, size, PROT_READ|PROT_WRITE);
	set_pkey(ptr, size, 4);
	wrpkru(0xffffff3f); // access disable pkey 4
	*ptr = foo;

The state for that checking is in the VMA, but we sometimes
have to do it against the page tables alone, such as in
get_user_pages_fast(), where we have no VMA.

We add two functions and expose them to generic code:

	arch_pte_access_permitted(pte, write)
	arch_vma_access_permitted(vma, write)

These are, of course, backed up in x86 arch code with checks
against the PTE or VMA's protection key.

But, there are also cases where we do not want to respect
protection keys.  When we ptrace(), for instance, we do not want
to apply the tracer's PKRU permissions to the PTEs from the
process being traced.
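
As a worked example of the wrpkru() value above (a sketch, not
code from this patch): PKRU holds two bits per key, with bit
2*k access-disabling key k and bit 2*k+1 write-disabling it.
Any value with bit 8 set therefore access-disables pkey 4, and
0xffffff3f is the value with every disable bit set except the
two belonging to pkey 3:

	/* sketch: constructing PKRU values */
	u32 disable_pkey4_only = 1 << (4 * 2);		/* 0x00000100 */
	u32 enable_only_pkey3  = ~(3 << (3 * 2));	/* 0xffffff3f */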

---

 b/arch/x86/include/asm/mmu_context.h |   51 ++++++++++++++++++++++++++++++++++-
 b/arch/x86/include/asm/pgtable.h     |   12 ++++++++
 b/arch/x86/mm/fault.c                |   25 +++++++++++++++--
 b/arch/x86/mm/gup.c                  |    3 ++
 b/include/asm-generic/mm_hooks.h     |   12 ++++++++
 b/mm/gup.c                           |   17 ++++++++++-
 b/mm/memory.c                        |    4 ++
 7 files changed, 118 insertions(+), 6 deletions(-)

diff -puN arch/x86/include/asm/mmu_context.h~pkeys-11-pte-fault arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~pkeys-11-pte-fault	2015-09-16 10:48:17.419245050 -0700
+++ b/arch/x86/include/asm/mmu_context.h	2015-09-16 10:48:17.433245685 -0700
@@ -258,4 +258,53 @@ static inline u16 vma_pkey(struct vm_are
 }
 
 
-#endif /* _ASM_X86_MMU_CONTEXT_H */
+static inline bool __pkru_allows_pkey(u16 pkey, bool write)
+{
+	u32 pkru = read_pkru();
+
+	if (!__pkru_allows_read(pkru, pkey))
+		return false;
+	if (write && !__pkru_allows_write(pkru, pkey))
+		return false;
+
+	return true;
+}
+
+/*
+ * We only want to enforce protection keys on the current process
+ * because we effectively have no access to PKRU for other
+ * processes or any way to tell *which* PKRU in a threaded
+ * process we could use.
+ *
+ * So do not enforce things if the VMA is not from the current
+ * mm, or if we are in a kernel thread.
+ */
+static inline bool vma_is_foreign(struct vm_area_struct *vma)
+{
+	if (!current->mm)
+		return true;
+	/*
+	 * Should PKRU be enforced on the access to this VMA?  If
+	 * the VMA is from another process, then PKRU has no
+	 * relevance and should not be enforced.
+	 */
+	if (current->mm != vma->vm_mm)
+		return true;
+
+	return false;
+}
+
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* allow access if the VMA is not one from this process */
+	if (vma_is_foreign(vma))
+		return true;
+	return __pkru_allows_pkey(vma_pkey(vma), write);
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	return __pkru_allows_pkey(pte_pkey(pte), write);
+}
+
+#endif /* _ASM_X86_MMU_CONTEXT_H */
diff -puN arch/x86/include/asm/pgtable.h~pkeys-11-pte-fault arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h~pkeys-11-pte-fault	2015-09-16 10:48:17.421245141 -0700
+++ b/arch/x86/include/asm/pgtable.h	2015-09-16 10:48:17.433245685 -0700
@@ -906,6 +906,18 @@ static inline u32 pte_pkey(pte_t pte)
 #endif
 }
 
+static inline bool __pkru_allows_read(u32 pkru, u16 pkey)
+{
+	int pkru_access_disable_bit = pkey * 2;
+	return !(pkru & (1 << pkru_access_disable_bit));
+}
+
+static inline bool __pkru_allows_write(u32 pkru, u16 pkey)
+{
+	int pkru_write_disable_bit = pkey * 2 + 1;
+	return !(pkru & (1 << pkru_write_disable_bit));
+}
+
 #include <asm-generic/pgtable.h>
 #endif	/* __ASSEMBLY__ */
 
diff -puN arch/x86/mm/fault.c~pkeys-11-pte-fault arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-11-pte-fault	2015-09-16 10:48:17.423245231 -0700
+++ b/arch/x86/mm/fault.c	2015-09-16 10:48:17.434245730 -0700
@@ -882,11 +882,21 @@ bad_area(struct pt_regs *regs, unsigned
 	__bad_area(regs, error_code, address, SEGV_MAPERR);
 }
 
+static inline bool bad_area_access_from_pkeys(unsigned long error_code,
+		struct vm_area_struct *vma)
+{
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return false;
+	if (error_code & PF_PK)
+		return true;
+	return false;
+}
+
 static noinline void
 bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
-		      unsigned long address)
+		      struct vm_area_struct *vma, unsigned long address)
 {
-	if (boot_cpu_has(X86_FEATURE_OSPKE) && (error_code & PF_PK))
+	if (bad_area_access_from_pkeys(error_code, vma))
 		__bad_area(regs, error_code, address, SEGV_PKUERR);
 	else
 		__bad_area(regs, error_code, address, SEGV_ACCERR);
@@ -1057,6 +1067,15 @@ int show_unhandled_signals = 1;
 static inline int
 access_error(unsigned long error_code, struct vm_area_struct *vma)
 {
+	/*
+	 * Read or write was blocked by protection keys.  We do
+	 * this check before any others because we do not want
+	 * to, for instance, confuse a protection-key-denied
+	 * write with one for which we should do a COW.
+	 */
+	if (error_code & PF_PK)
+		return 1;
+
 	if (error_code & PF_WRITE) {
 		/* write, present and write, not present: */
 		if (unlikely(!(vma->vm_flags & VM_WRITE)))
@@ -1277,7 +1296,7 @@ retry:
 	 */
 good_area:
 	if (unlikely(access_error(error_code, vma))) {
-		bad_area_access_error(regs, error_code, address);
+		bad_area_access_error(regs, error_code, vma, address);
 		return;
 	}
 
diff -puN arch/x86/mm/gup.c~pkeys-11-pte-fault arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c~pkeys-11-pte-fault	2015-09-16 10:48:17.424245277 -0700
+++ b/arch/x86/mm/gup.c	2015-09-16 10:48:17.434245730 -0700
@@ -10,6 +10,7 @@
 #include <linux/highmem.h>
 #include <linux/swap.h>
 
+#include <asm/mmu_context.h>
 #include <asm/pgtable.h>
 
 static inline pte_t gup_get_pte(pte_t *ptep)
@@ -73,6 +74,8 @@ static inline int pte_allows_gup(pte_t p
 		return 0;
 	if (write && !pte_write(pte))
 		return 0;
+	if (!arch_pte_access_permitted(pte, write))
+		return 0;
 	return 1;
 }
 
diff -puN include/asm-generic/mm_hooks.h~pkeys-11-pte-fault include/asm-generic/mm_hooks.h
--- a/include/asm-generic/mm_hooks.h~pkeys-11-pte-fault	2015-09-16 10:48:17.426245367 -0700
+++ b/include/asm-generic/mm_hooks.h	2015-09-16 10:48:17.435245775 -0700
@@ -26,4 +26,16 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
+
 #endif	/* _ASM_GENERIC_MM_HOOKS_H */
diff -puN mm/gup.c~pkeys-11-pte-fault mm/gup.c
--- a/mm/gup.c~pkeys-11-pte-fault	2015-09-16 10:48:17.428245458 -0700
+++ b/mm/gup.c	2015-09-16 10:48:17.435245775 -0700
@@ -13,6 +13,7 @@
 #include <linux/rwsem.h>
 #include <linux/hugetlb.h>
 
+#include <asm/mmu_context.h>
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
 
@@ -388,6 +389,8 @@ static int check_vma_flags(struct vm_are
 		if (!(vm_flags & VM_MAYREAD))
 			return -EFAULT;
 	}
+	if (!arch_vma_access_permitted(vma, (gup_flags & FOLL_WRITE)))
+		return -EFAULT;
 	return 0;
 }
 
@@ -556,12 +559,19 @@ EXPORT_SYMBOL(__get_user_pages);
 
 bool vma_permits_fault(struct vm_area_struct *vma, unsigned int fault_flags)
 {
-        vm_flags_t vm_flags =
-		(fault_flags & FAULT_FLAG_WRITE) ? VM_WRITE : VM_READ;
+	int write = (fault_flags & FAULT_FLAG_WRITE);
+	vm_flags_t vm_flags = write ? VM_WRITE : VM_READ;
 
 	if (!(vm_flags & vma->vm_flags))
 		return false;
 
+	/*
+	 * The architecture might have a hardware protection
+	 * mechanism other than read/write that can deny access
+	 */
+	if (!arch_vma_access_permitted(vma, write))
+		return false;
+
 	return true;
 }
 
@@ -1079,6 +1089,9 @@ static int gup_pte_range(pmd_t pmd, unsi
 			pte_protnone(pte) || (write && !pte_write(pte)))
 			goto pte_unmap;
 
+		if (!arch_pte_access_permitted(pte, write))
+			goto pte_unmap;
+
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
diff -puN mm/memory.c~pkeys-11-pte-fault mm/memory.c
--- a/mm/memory.c~pkeys-11-pte-fault	2015-09-16 10:48:17.429245503 -0700
+++ b/mm/memory.c	2015-09-16 10:48:17.437245866 -0700
@@ -64,6 +64,7 @@
 #include <linux/userfaultfd_k.h>
 
 #include <asm/io.h>
+#include <asm/mmu_context.h>
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
 #include <asm/tlb.h>
@@ -3342,6 +3343,9 @@ static int __handle_mm_fault(struct mm_s
 	pmd_t *pmd;
 	pte_t *pte;
 
+	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE))
+		return VM_FAULT_SIGSEGV;
+
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return hugetlb_fault(mm, vma, address, flags);
 
_

^ permalink raw reply	[flat|nested] 172+ messages in thread

* [PATCH 13/26] mm: simplify get_user_pages() PTE bit handling
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


The current get_user_pages() code is a wee bit more complicated
than it needs to be for pte bit checking.  It establishes a mask
of required pte _PAGE_* bits and ensures that the pte it goes
after has all of those bits set.

We still have to check the raw bits for _PAGE_PRESENT, since
pte_present() is also true for _PAGE_PROTNONE, and for
_PAGE_USER, since we have no accessor for it.

But for the write check we might as well just use pte_write(),
since we have it, and let the compiler work its magic on
optimizing it.

This also consolidates the three identical copies of this code.
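
For clarity, an editorial sketch (not from the patch) of the
equivalence for the write case, given that pte_write() tests
_PAGE_RW:

	/* old form: all three bits must be set */
	unsigned long mask = _PAGE_PRESENT | _PAGE_USER | _PAGE_RW;
	int ok_old = (pte_flags(pte) & mask) == mask;

	/* new form: required bits present, plus pte_write() */
	int ok_new = pte_allows_gup(pte, 1);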

---

 b/arch/x86/mm/gup.c |   34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff -puN arch/x86/mm/gup.c~pkeys-16-gup-swizzle arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c~pkeys-16-gup-swizzle	2015-09-16 10:48:17.002226145 -0700
+++ b/arch/x86/mm/gup.c	2015-09-16 10:48:17.006226326 -0700
@@ -63,6 +63,19 @@ retry:
 #endif
 }
 
+static inline int pte_allows_gup(pte_t pte, int write)
+{
+	/*
+	 * pte_present() is also true for _PAGE_PROTNONE ptes with
+	 * _PAGE_PRESENT clear, so we can not use it here.
+	 */
+	if ((pte_flags(pte) & (_PAGE_PRESENT|_PAGE_USER)) != (_PAGE_PRESENT|_PAGE_USER))
+		return 0;
+	if (write && !pte_write(pte))
+		return 0;
+	return 1;
+}
+
 /*
  * The performance critical leaf functions are made noinline otherwise gcc
  * inlines everything into a single function which results in too much
@@ -71,13 +84,8 @@ retry:
 static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
-	unsigned long mask;
 	pte_t *ptep;
 
-	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
-
 	ptep = pte_offset_map(&pmd, addr);
 	do {
 		pte_t pte = gup_get_pte(ptep);
@@ -88,8 +96,8 @@ static noinline int gup_pte_range(pmd_t
 			pte_unmap(ptep);
 			return 0;
 		}
-
-		if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
+		if (!pte_allows_gup(pte, write) ||
+		    pte_special(pte)) {
 			pte_unmap(ptep);
 			return 0;
 		}
@@ -117,15 +125,11 @@ static inline void get_head_page_multipl
 static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
-	unsigned long mask;
 	pte_t pte = *(pte_t *)&pmd;
 	struct page *head, *page;
 	int refs;
 
-	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
-	if ((pte_flags(pte) & mask) != mask)
+	if (!pte_allows_gup(pte, write))
 		return 0;
 	/* hugepages are never "special" */
 	VM_BUG_ON(pte_flags(pte) & _PAGE_SPECIAL);
@@ -194,15 +198,11 @@ static int gup_pmd_range(pud_t pud, unsi
 static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
-	unsigned long mask;
 	pte_t pte = *(pte_t *)&pud;
 	struct page *head, *page;
 	int refs;
 
-	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
-	if ((pte_flags(pte) & mask) != mask)
+	if (!pte_allows_gup(pte, write))
 		return 0;
 	/* hugepages are never "special" */
 	VM_BUG_ON(pte_flags(pte) & _PAGE_SPECIAL);
_

^ permalink raw reply	[flat|nested] 172+ messages in thread

* [PATCH 13/26] mm: simplify get_user_pages() PTE bit handling
@ 2015-09-16 17:49   ` Dave Hansen
  0 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


The current get_user_pages() code is a wee bit more complicated
than it needs to be for pte bit checking.  Currently, it establishes
a mask of required pte _PAGE_* bits and ensures that the pte it
goes after has all those bits.

We need to use the bits for our _PAGE_PRESENT check since
pte_present() is also true for _PAGE_PROTNONE, and we have no
accessor for _PAGE_USER, so need it there as well.

But we might as well just use pte_write() since we have it and
let the compiler work its magic on optimizing it.

This also consolidates the three identical copies of this code.

---

 b/arch/x86/mm/gup.c |   34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff -puN arch/x86/mm/gup.c~pkeys-16-gup-swizzle arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c~pkeys-16-gup-swizzle	2015-09-16 10:48:17.002226145 -0700
+++ b/arch/x86/mm/gup.c	2015-09-16 10:48:17.006226326 -0700
@@ -63,6 +63,19 @@ retry:
 #endif
 }
 
+static inline int pte_allows_gup(pte_t pte, int write)
+{
+	/*
+	 * Note that pte_present() is true for !_PAGE_PRESENT
+	 * but _PAGE_PROTNONE, so we can not use it here.
+	 */
+	if (!(pte_flags(pte) & (_PAGE_PRESENT|_PAGE_USER)))
+		return 0;
+	if (write && !pte_write(pte))
+		return 0;
+	return 1;
+}
+
 /*
  * The performance critical leaf functions are made noinline otherwise gcc
  * inlines everything into a single function which results in too much
@@ -71,13 +84,8 @@ retry:
 static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
-	unsigned long mask;
 	pte_t *ptep;
 
-	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
-
 	ptep = pte_offset_map(&pmd, addr);
 	do {
 		pte_t pte = gup_get_pte(ptep);
@@ -88,8 +96,8 @@ static noinline int gup_pte_range(pmd_t
 			pte_unmap(ptep);
 			return 0;
 		}
-
-		if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
+		if (!pte_allows_gup(pte, write) ||
+		    pte_special(pte)) {
 			pte_unmap(ptep);
 			return 0;
 		}
@@ -117,15 +125,11 @@ static inline void get_head_page_multipl
 static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
-	unsigned long mask;
 	pte_t pte = *(pte_t *)&pmd;
 	struct page *head, *page;
 	int refs;
 
-	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
-	if ((pte_flags(pte) & mask) != mask)
+	if (!pte_allows_gup(pte, write))
 		return 0;
 	/* hugepages are never "special" */
 	VM_BUG_ON(pte_flags(pte) & _PAGE_SPECIAL);
@@ -194,15 +198,11 @@ static int gup_pmd_range(pud_t pud, unsi
 static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
-	unsigned long mask;
 	pte_t pte = *(pte_t *)&pud;
 	struct page *head, *page;
 	int refs;
 
-	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
-	if ((pte_flags(pte) & mask) != mask)
+	if (!pte_allows_gup(pte, write))
 		return 0;
 	/* hugepages are never "special" */
 	VM_BUG_ON(pte_flags(pte) & _PAGE_SPECIAL);
_

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 172+ messages in thread

* [PATCH 14/26] x86, pkeys: check VMAs and PTEs for protection keys
@ 2015-09-16 17:49   ` Dave Hansen
  0 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


Today, for normal faults and page table walks, we check the VMA
and/or PTE to ensure that it is compatible with the action.  For
instance, if we get a write fault on a non-writeable VMA, we
SIGSEGV.

We try to do the same thing for protection keys.  Basically, we
try to make sure that if a user does this:

	mprotect(ptr, size, PROT_NONE);
	*ptr = foo;

they see the same effects with protection keys when they do this:

	mprotect(ptr, size, PROT_READ|PROT_WRITE);
	set_pkey(ptr, size, 4);
	wrpkru(0xffffff3f); // access disable pkey 4
	*ptr = foo;

The state to do that checking is in the VMA, but we also
sometimes have to do it on the page tables only, like when doing
a get_user_pages_fast() where we have no VMA.

We add two functions and expose them to generic code:

	arch_pte_access_permitted(pte, write)
	arch_vma_access_permitted(vma, write)

These are, of course, backed up in x86 arch code with checks
against the PTE or VMA's protection key.

But, there are also cases where we do not want to respect
protection keys.  When we ptrace(), for instance, we do not want
to apply the tracer's PKRU permissions to the PTEs from the
process being traced.

---

 b/arch/x86/include/asm/mmu_context.h |   51 ++++++++++++++++++++++++++++++++++-
 b/arch/x86/include/asm/pgtable.h     |   12 ++++++++
 b/arch/x86/mm/fault.c                |   25 +++++++++++++++--
 b/arch/x86/mm/gup.c                  |    3 ++
 b/include/asm-generic/mm_hooks.h     |   12 ++++++++
 b/mm/gup.c                           |   17 ++++++++++-
 b/mm/memory.c                        |    4 ++
 7 files changed, 118 insertions(+), 6 deletions(-)

diff -puN arch/x86/include/asm/mmu_context.h~pkeys-11-pte-fault arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~pkeys-11-pte-fault	2015-09-16 10:48:17.419245050 -0700
+++ b/arch/x86/include/asm/mmu_context.h	2015-09-16 10:48:17.433245685 -0700
@@ -258,4 +258,53 @@ static inline u16 vma_pkey(struct vm_are
 }
 
 
-#endif /* _ASM_X86_MMU_CONTEXT_H */
+static inline bool __pkru_allows_pkey(u16 pkey, bool write)
+{
+	u32 pkru = read_pkru();
+
+	if (!__pkru_allows_read(pkru, pkey))
+		return false;
+	if (write && !__pkru_allows_write(pkru, pkey))
+		return false;
+
+	return true;
+}
+
+/*
+ * We only want to enforce protection keys on the current process
+ * because we effectively have no access to PKRU for other
+ * processes or any way to tell *which* PKRU in a threaded
+ * process we could use.
+ *
+ * So do not enforce things if the VMA is not from the current
+ * mm, or if we are in a kernel thread.
+ */
+static inline bool vma_is_foreign(struct vm_area_struct *vma)
+{
+	if (!current->mm)
+		return true;
+	/*
+	 * Should PKRU be enforced on the access to this VMA?  If
+	 * the VMA is from another process, then PKRU has no
+	 * relevance and should not be enforced.
+	 */
+	if (current->mm != vma->vm_mm)
+		return true;
+
+	return false;
+}
+
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* allow access if the VMA is not one from this process */
+	if (vma_is_foreign(vma))
+		return true;
+	return __pkru_allows_pkey(vma_pkey(vma), write);
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	return __pkru_allows_pkey(pte_pkey(pte), write);
+}
+
+#endif /* _ASM_X86_MMU_CONTEXT_H */
diff -puN arch/x86/include/asm/pgtable.h~pkeys-11-pte-fault arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h~pkeys-11-pte-fault	2015-09-16 10:48:17.421245141 -0700
+++ b/arch/x86/include/asm/pgtable.h	2015-09-16 10:48:17.433245685 -0700
@@ -906,6 +906,18 @@ static inline u32 pte_pkey(pte_t pte)
 #endif
 }
 
+static inline bool __pkru_allows_read(u32 pkru, u16 pkey)
+{
+	int pkru_access_disable_bit = pkey * 2;
+	return !(pkru & (1 << pkru_access_disable_bit));
+}
+
+static inline bool __pkru_allows_write(u32 pkru, u16 pkey)
+{
+	int pkru_write_disable_bit = pkey * 2 + 1;
+	return !(pkru & (1 << pkru_write_disable_bit));
+}
+
 #include <asm-generic/pgtable.h>
 #endif	/* __ASSEMBLY__ */
 
diff -puN arch/x86/mm/fault.c~pkeys-11-pte-fault arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-11-pte-fault	2015-09-16 10:48:17.423245231 -0700
+++ b/arch/x86/mm/fault.c	2015-09-16 10:48:17.434245730 -0700
@@ -882,11 +882,21 @@ bad_area(struct pt_regs *regs, unsigned
 	__bad_area(regs, error_code, address, SEGV_MAPERR);
 }
 
+static inline bool bad_area_access_from_pkeys(unsigned long error_code,
+		struct vm_area_struct *vma)
+{
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return false;
+	if (error_code & PF_PK)
+		return true;
+	return false;
+}
+
 static noinline void
 bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
-		      unsigned long address)
+		      struct vm_area_struct *vma, unsigned long address)
 {
-	if (boot_cpu_has(X86_FEATURE_OSPKE) && (error_code & PF_PK))
+	if (bad_area_access_from_pkeys(error_code, vma))
 		__bad_area(regs, error_code, address, SEGV_PKUERR);
 	else
 		__bad_area(regs, error_code, address, SEGV_ACCERR);
@@ -1057,6 +1067,15 @@ int show_unhandled_signals = 1;
 static inline int
 access_error(unsigned long error_code, struct vm_area_struct *vma)
 {
+	/*
+	 * Access or read was blocked by protection keys. We do
+	 * this check before any others because we do not want
+	 * to, for instance, confuse a protection-key-denied
+	 * write with one for which we should do a COW.
+	 */
+	if (error_code & PF_PK)
+		return 1;
+
 	if (error_code & PF_WRITE) {
 		/* write, present and write, not present: */
 		if (unlikely(!(vma->vm_flags & VM_WRITE)))
@@ -1277,7 +1296,7 @@ retry:
 	 */
 good_area:
 	if (unlikely(access_error(error_code, vma))) {
-		bad_area_access_error(regs, error_code, address);
+		bad_area_access_error(regs, error_code, vma, address);
 		return;
 	}
 
diff -puN arch/x86/mm/gup.c~pkeys-11-pte-fault arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c~pkeys-11-pte-fault	2015-09-16 10:48:17.424245277 -0700
+++ b/arch/x86/mm/gup.c	2015-09-16 10:48:17.434245730 -0700
@@ -10,6 +10,7 @@
 #include <linux/highmem.h>
 #include <linux/swap.h>
 
+#include <asm/mmu_context.h>
 #include <asm/pgtable.h>
 
 static inline pte_t gup_get_pte(pte_t *ptep)
@@ -73,6 +74,8 @@ static inline int pte_allows_gup(pte_t p
 		return 0;
 	if (write && !pte_write(pte))
 		return 0;
+	if (!arch_pte_access_permitted(pte, write))
+		return 0;
 	return 1;
 }
 
diff -puN include/asm-generic/mm_hooks.h~pkeys-11-pte-fault include/asm-generic/mm_hooks.h
--- a/include/asm-generic/mm_hooks.h~pkeys-11-pte-fault	2015-09-16 10:48:17.426245367 -0700
+++ b/include/asm-generic/mm_hooks.h	2015-09-16 10:48:17.435245775 -0700
@@ -26,4 +26,16 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
+
 #endif	/* _ASM_GENERIC_MM_HOOKS_H */
diff -puN mm/gup.c~pkeys-11-pte-fault mm/gup.c
--- a/mm/gup.c~pkeys-11-pte-fault	2015-09-16 10:48:17.428245458 -0700
+++ b/mm/gup.c	2015-09-16 10:48:17.435245775 -0700
@@ -13,6 +13,7 @@
 #include <linux/rwsem.h>
 #include <linux/hugetlb.h>
 
+#include <asm/mmu_context.h>
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
 
@@ -388,6 +389,8 @@ static int check_vma_flags(struct vm_are
 		if (!(vm_flags & VM_MAYREAD))
 			return -EFAULT;
 	}
+	if (!arch_vma_access_permitted(vma, (gup_flags & FOLL_WRITE)))
+		return -EFAULT;
 	return 0;
 }
 
@@ -556,12 +559,19 @@ EXPORT_SYMBOL(__get_user_pages);
 
 bool vma_permits_fault(struct vm_area_struct *vma, unsigned int fault_flags)
 {
-        vm_flags_t vm_flags =
-		(fault_flags & FAULT_FLAG_WRITE) ? VM_WRITE : VM_READ;
+	int write = (fault_flags & FAULT_FLAG_WRITE);
+	vm_flags_t vm_flags = write ? VM_WRITE : VM_READ;
 
 	if (!(vm_flags & vma->vm_flags))
 		return false;
 
+	/*
+	 * The architecture might have a hardware protection
+	 * mechanism other than read/write that can deny access
+	 */
+	if (!arch_vma_access_permitted(vma, write))
+		return false;
+
 	return true;
 }
 
@@ -1079,6 +1089,9 @@ static int gup_pte_range(pmd_t pmd, unsi
 			pte_protnone(pte) || (write && !pte_write(pte)))
 			goto pte_unmap;
 
+		if (!arch_pte_access_permitted(pte, write))
+			goto pte_unmap;
+
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
diff -puN mm/memory.c~pkeys-11-pte-fault mm/memory.c
--- a/mm/memory.c~pkeys-11-pte-fault	2015-09-16 10:48:17.429245503 -0700
+++ b/mm/memory.c	2015-09-16 10:48:17.437245866 -0700
@@ -64,6 +64,7 @@
 #include <linux/userfaultfd_k.h>
 
 #include <asm/io.h>
+#include <asm/mmu_context.h>
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
 #include <asm/tlb.h>
@@ -3342,6 +3343,9 @@ static int __handle_mm_fault(struct mm_s
 	pmd_t *pmd;
 	pte_t *pte;
 
+	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE))
+		return VM_FAULT_SIGSEGV;
+
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return hugetlb_fault(mm, vma, address, flags);
 
_


^ permalink raw reply	[flat|nested] 172+ messages in thread

* [PATCH 12/26] mm: factor out VMA fault permission checking
@ 2015-09-16 17:49   ` Dave Hansen
  0 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


This factors out the code that matches a fault condition
against the VMA and ensures that the VMA actually allows the
fault to be handled, instead of just erroring out.

We will be extending this in a moment to comprehend protection
keys.
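
For context, the pkeys patch in this series ("check VMAs and
PTEs for protection keys") extends this helper roughly as
sketched below, which is exactly why the check gets a single
home here:

	bool vma_permits_fault(struct vm_area_struct *vma,
			       unsigned int fault_flags)
	{
		int write = (fault_flags & FAULT_FLAG_WRITE);
		vm_flags_t vm_flags = write ? VM_WRITE : VM_READ;

		if (!(vm_flags & vma->vm_flags))
			return false;
		/* an arch hook (pkeys on x86) can also deny the access */
		if (!arch_vma_access_permitted(vma, write))
			return false;
		return true;
	}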

---

 b/mm/gup.c |   15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff -puN mm/gup.c~pkeys-10-pte-fault mm/gup.c
--- a/mm/gup.c~pkeys-10-pte-fault	2015-09-16 10:48:16.591207512 -0700
+++ b/mm/gup.c	2015-09-16 10:48:16.595207693 -0700
@@ -554,6 +554,17 @@ next_page:
 }
 EXPORT_SYMBOL(__get_user_pages);
 
+bool vma_permits_fault(struct vm_area_struct *vma, unsigned int fault_flags)
+{
+        vm_flags_t vm_flags =
+		(fault_flags & FAULT_FLAG_WRITE) ? VM_WRITE : VM_READ;
+
+	if (!(vm_flags & vma->vm_flags))
+		return false;
+
+	return true;
+}
+
 /*
  * fixup_user_fault() - manually resolve a user page fault
  * @tsk:	the task_struct to use for page fault accounting, or
@@ -585,15 +596,13 @@ int fixup_user_fault(struct task_struct
 		     unsigned long address, unsigned int fault_flags)
 {
 	struct vm_area_struct *vma;
-	vm_flags_t vm_flags;
 	int ret;
 
 	vma = find_extend_vma(mm, address);
 	if (!vma || address < vma->vm_start)
 		return -EFAULT;
 
-	vm_flags = (fault_flags & FAULT_FLAG_WRITE) ? VM_WRITE : VM_READ;
-	if (!(vm_flags & vma->vm_flags))
+	if (!vma_permits_fault(vma, fault_flags))
 		return -EFAULT;
 
 	ret = handle_mm_fault(mm, vma, address, fault_flags);
_


^ permalink raw reply	[flat|nested] 172+ messages in thread

* [PATCH 16/26] x86, pkeys: dump PKRU with other kernel registers
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


I'm a bit ambivalent about whether this is needed or not.

Protection Keys never affect kernel mappings, but they can
affect whether the kernel will fault when it touches a user
mapping.  The kernel, however, doesn't touch user mappings
without some careful choreography, and these accesses don't
generally result in oopses.

Should we dump out PKRU like this in our oopses?

---

 b/arch/x86/kernel/process_64.c |    2 ++
 1 file changed, 2 insertions(+)

diff -puN arch/x86/kernel/process_64.c~pkeys-30-kernel-error-dumps arch/x86/kernel/process_64.c
--- a/arch/x86/kernel/process_64.c~pkeys-30-kernel-error-dumps	2015-09-16 10:48:18.424290612 -0700
+++ b/arch/x86/kernel/process_64.c	2015-09-16 10:48:18.427290748 -0700
@@ -116,6 +116,8 @@ void __show_regs(struct pt_regs *regs, i
 	printk(KERN_DEFAULT "DR0: %016lx DR1: %016lx DR2: %016lx\n", d0, d1, d2);
 	printk(KERN_DEFAULT "DR3: %016lx DR6: %016lx DR7: %016lx\n", d3, d6, d7);
 
+	if (boot_cpu_has(X86_FEATURE_OSPKE))
+		printk(KERN_DEFAULT "PKRU: %08x\n", read_pkru());
 }
 
 void release_thread(struct task_struct *dead_task)
_

^ permalink raw reply	[flat|nested] 172+ messages in thread

* [PATCH 17/26] x86, pkeys: dump PTE pkey in /proc/pid/smaps
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


The protection key can now be just as important as read/write
permissions on a VMA.  We need some debug mechanism to help
figure out if it is in play.  smaps seems like a logical
place to expose it.

arch/x86/kernel/setup.c is a bit of a weirdo place to put
this code, but it already had seq_file.h and there was no
clearly better existing place to put it.

We also use no #ifdef.  If protection keys are .config'd out,
the OSPKE check compiles away and we get the same behavior as
the weak generic function.
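
On OSPKE hardware, each smaps entry would then gain a line like
this (sketch output; the key value is illustrative, the field
width follows the seq_printf() below):

	ProtectionKey:         4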

---

 b/arch/x86/kernel/setup.c |    9 +++++++++
 b/fs/proc/task_mmu.c      |    5 +++++
 2 files changed, 14 insertions(+)

diff -puN arch/x86/kernel/setup.c~pkeys-40-smaps arch/x86/kernel/setup.c
--- a/arch/x86/kernel/setup.c~pkeys-40-smaps	2015-09-16 10:48:18.838309381 -0700
+++ b/arch/x86/kernel/setup.c	2015-09-16 10:48:18.844309653 -0700
@@ -111,6 +111,7 @@
 #include <asm/mce.h>
 #include <asm/alternative.h>
 #include <asm/prom.h>
+#include <asm/special_insns.h>
 
 /*
  * max_low_pfn_mapped: highest direct mapped pfn under 4GB
@@ -1264,3 +1265,11 @@ static int __init register_kernel_offset
 	return 0;
 }
 __initcall(register_kernel_offset_dumper);
+
+void arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
+{
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return;
+
+	seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
+}
diff -puN fs/proc/task_mmu.c~pkeys-40-smaps fs/proc/task_mmu.c
--- a/fs/proc/task_mmu.c~pkeys-40-smaps	2015-09-16 10:48:18.840309472 -0700
+++ b/fs/proc/task_mmu.c	2015-09-16 10:48:18.844309653 -0700
@@ -625,6 +625,10 @@ static void show_smap_vma_flags(struct s
 	seq_putc(m, '\n');
 }
 
+void __weak arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
+{
+}
+
 static int show_smap(struct seq_file *m, void *v, int is_pid)
 {
 	struct vm_area_struct *vma = v;
@@ -674,6 +678,7 @@ static int show_smap(struct seq_file *m,
 		   (vma->vm_flags & VM_LOCKED) ?
 			(unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);
 
+	arch_show_smap(m, vma);
 	show_smap_vma_flags(m, vma);
 	m_cache_vma(m, vma);
 	return 0;
_

^ permalink raw reply	[flat|nested] 172+ messages in thread

* [PATCH 15/26] x86, pkeys: optimize fault handling in access_error()
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


We might not strictly have to make modifications to
access_error() to check the VMA here.

If we do not, we will do this:
1. app sets VMA pkey to K
2. app touches a !present page
3. do_page_fault(), allocates and maps page, sets pte.pkey=K
4. return to userspace
5. touch instruction reexecutes, but triggers PF_PK
6. do PKEY signal

What happens with this patch applied:
1. app sets VMA pkey to K
2. app touches a !present page
3. do_page_fault() notices that K is inaccessible
4. do PKEY signal

We basically skip the fault that does an allocation.

So what this lets us do is protect areas from even being
*populated* unless they are accessible according to protection
keys.  That seems handy to me and makes protection keys work
more like an mprotect()'d mapping.
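
A sketch of what that buys, reusing the hypothetical
pkey_deny_access() helper and the mprotect_key() call introduced
later in this series:

	pkey_deny_access(4);		/* hypothetical helper */
	ptr = mmap(NULL, 4096, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	mprotect_key(ptr, 4096, PROT_READ|PROT_WRITE, 4);
	*ptr = 1;	/* now signals immediately; no page is ever allocated */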

---

 b/arch/x86/mm/fault.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff -puN arch/x86/mm/fault.c~pkeys-15-access_error arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-15-access_error	2015-09-16 10:48:18.012271934 -0700
+++ b/arch/x86/mm/fault.c	2015-09-16 10:48:18.016272115 -0700
@@ -889,6 +889,9 @@ static inline bool bad_area_access_from_
 		return false;
 	if (error_code & PF_PK)
 		return true;
+	/* this checks protection keys on the VMA: */
+	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE)))
+		return true;
 	return false;
 }
 
@@ -1075,6 +1078,13 @@ access_error(unsigned long error_code, s
 	 */
 	if (error_code & PF_PK)
 		return 1;
+	/*
+	 * Make sure to check the VMA so that we do not perform
+	 * faults just to hit a PF_PK as soon as we fill in a
+	 * page.
+	 */
+	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE)))
+		return 1;
 
 	if (error_code & PF_WRITE) {
 		/* write, present and write, not present: */
_

^ permalink raw reply	[flat|nested] 172+ messages in thread

* [PATCH 19/26] [NEWSYSCALL] mm, multi-arch: pass a protection key in to calc_vm_flag_bits()
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


This plumbs a protection key through calc_vm_flag_bits().
We could have done this in calc_vm_prot_bits(), but I did not
feel super strongly about which way to go.  It was pretty
arbitrary which one to use.
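
A sketch of what the extra argument lets callers express; the
key-bit-to-VM_PKEY_BIT mapping shown is the one the x86 wiring
patch later in this series defines:

	/* tag a read-write mapping with pkey 5 (binary 0101) */
	unsigned long vm_flags = calc_vm_prot_bits(PROT_READ|PROT_WRITE, 5);
	/* on x86: VM_READ | VM_WRITE | VM_PKEY_BIT0 | VM_PKEY_BIT2 */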

---

 b/arch/powerpc/include/asm/mman.h  |    5 +++--
 b/drivers/char/agp/frontend.c      |    2 +-
 b/drivers/staging/android/ashmem.c |    7 ++++---
 b/include/linux/mman.h             |    6 +++---
 b/mm/mmap.c                        |    2 +-
 b/mm/mprotect.c                    |    2 +-
 b/mm/nommu.c                       |    2 +-
 7 files changed, 14 insertions(+), 12 deletions(-)

diff -puN arch/powerpc/include/asm/mman.h~pkeys-84-calc_vm_prot_bits arch/powerpc/include/asm/mman.h
--- a/arch/powerpc/include/asm/mman.h~pkeys-84-calc_vm_prot_bits	2015-09-16 10:48:19.704348642 -0700
+++ b/arch/powerpc/include/asm/mman.h	2015-09-16 10:48:19.717349232 -0700
@@ -18,11 +18,12 @@
  * This file is included by linux/mman.h, so we can't use cacl_vm_prot_bits()
  * here.  How important is the optimization?
  */
-static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot)
+static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
+		unsigned long pkey)
 {
 	return (prot & PROT_SAO) ? VM_SAO : 0;
 }
-#define arch_calc_vm_prot_bits(prot) arch_calc_vm_prot_bits(prot)
+#define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
 
 static inline pgprot_t arch_vm_get_page_prot(unsigned long vm_flags)
 {
diff -puN drivers/char/agp/frontend.c~pkeys-84-calc_vm_prot_bits drivers/char/agp/frontend.c
--- a/drivers/char/agp/frontend.c~pkeys-84-calc_vm_prot_bits	2015-09-16 10:48:19.706348733 -0700
+++ b/drivers/char/agp/frontend.c	2015-09-16 10:48:19.718349277 -0700
@@ -156,7 +156,7 @@ static pgprot_t agp_convert_mmap_flags(i
 {
 	unsigned long prot_bits;
 
-	prot_bits = calc_vm_prot_bits(prot) | VM_SHARED;
+	prot_bits = calc_vm_prot_bits(prot, 0) | VM_SHARED;
 	return vm_get_page_prot(prot_bits);
 }
 
diff -puN drivers/staging/android/ashmem.c~pkeys-84-calc_vm_prot_bits drivers/staging/android/ashmem.c
--- a/drivers/staging/android/ashmem.c~pkeys-84-calc_vm_prot_bits	2015-09-16 10:48:19.707348778 -0700
+++ b/drivers/staging/android/ashmem.c	2015-09-16 10:48:19.718349277 -0700
@@ -351,7 +351,8 @@ out:
 	return ret;
 }
 
-static inline vm_flags_t calc_vm_may_flags(unsigned long prot)
+static inline vm_flags_t calc_vm_may_flags(unsigned long prot,
+		unsigned long pkey)
 {
 	return _calc_vm_trans(prot, PROT_READ,  VM_MAYREAD) |
 	       _calc_vm_trans(prot, PROT_WRITE, VM_MAYWRITE) |
@@ -372,8 +373,8 @@ static int ashmem_mmap(struct file *file
 	}
 
 	/* requested protection bits must match our allowed protection mask */
-	if (unlikely((vma->vm_flags & ~calc_vm_prot_bits(asma->prot_mask)) &
-		     calc_vm_prot_bits(PROT_MASK))) {
+	if (unlikely((vma->vm_flags & ~calc_vm_prot_bits(asma->prot_mask, 0)) &
+		     calc_vm_prot_bits(PROT_MASK, 0))) {
 		ret = -EPERM;
 		goto out;
 	}
diff -puN include/linux/mman.h~pkeys-84-calc_vm_prot_bits include/linux/mman.h
--- a/include/linux/mman.h~pkeys-84-calc_vm_prot_bits	2015-09-16 10:48:19.709348869 -0700
+++ b/include/linux/mman.h	2015-09-16 10:48:19.719349322 -0700
@@ -35,7 +35,7 @@ static inline void vm_unacct_memory(long
  */
 
 #ifndef arch_calc_vm_prot_bits
-#define arch_calc_vm_prot_bits(prot) 0
+#define arch_calc_vm_prot_bits(prot, pkey) 0
 #endif
 
 #ifndef arch_vm_get_page_prot
@@ -70,12 +70,12 @@ static inline int arch_validate_prot(uns
  * Combine the mmap "prot" argument into "vm_flags" used internally.
  */
 static inline unsigned long
-calc_vm_prot_bits(unsigned long prot)
+calc_vm_prot_bits(unsigned long prot, unsigned long pkey)
 {
 	return _calc_vm_trans(prot, PROT_READ,  VM_READ ) |
 	       _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) |
 	       _calc_vm_trans(prot, PROT_EXEC,  VM_EXEC) |
-	       arch_calc_vm_prot_bits(prot);
+	       arch_calc_vm_prot_bits(prot, pkey);
 }
 
 /*
diff -puN mm/mmap.c~pkeys-84-calc_vm_prot_bits mm/mmap.c
--- a/mm/mmap.c~pkeys-84-calc_vm_prot_bits	2015-09-16 10:48:19.711348959 -0700
+++ b/mm/mmap.c	2015-09-16 10:48:19.720349367 -0700
@@ -1311,7 +1311,7 @@ unsigned long do_mmap(struct file *file,
 	 * to. we assume access permissions have been handled by the open
 	 * of the memory object, so we don't do any here.
 	 */
-	vm_flags |= calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
+	vm_flags |= calc_vm_prot_bits(prot, 0) | calc_vm_flag_bits(flags) |
 			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
 
 	if (flags & MAP_LOCKED)
diff -puN mm/mprotect.c~pkeys-84-calc_vm_prot_bits mm/mprotect.c
--- a/mm/mprotect.c~pkeys-84-calc_vm_prot_bits	2015-09-16 10:48:19.712349005 -0700
+++ b/mm/mprotect.c	2015-09-16 10:48:19.720349367 -0700
@@ -373,7 +373,7 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
 	if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
 		prot |= PROT_EXEC;
 
-	vm_flags = calc_vm_prot_bits(prot);
+	vm_flags = calc_vm_prot_bits(prot, 0);
 
 	down_write(&current->mm->mmap_sem);
 
diff -puN mm/nommu.c~pkeys-84-calc_vm_prot_bits mm/nommu.c
--- a/mm/nommu.c~pkeys-84-calc_vm_prot_bits	2015-09-16 10:48:19.714349096 -0700
+++ b/mm/nommu.c	2015-09-16 10:48:19.721349413 -0700
@@ -1084,7 +1084,7 @@ static unsigned long determine_vm_flags(
 {
 	unsigned long vm_flags;
 
-	vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags);
+	vm_flags = calc_vm_prot_bits(prot, 0) | calc_vm_flag_bits(flags);
 	/* vm_flags |= mm->def_flags; */
 
 	if (!(capabilities & NOMMU_MAP_DIRECT)) {
_

^ permalink raw reply	[flat|nested] 172+ messages in thread

* [PATCH 20/26] [NEWSYSCALL] mm: implement new mprotect_pkey() system call
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


mprotect_pkey() is just like mprotect, except it also takes a
protection key as an argument.  On systems that do not support
protection keys, it still works, but requires that key=0.
Otherwise it does exactly what mprotect does.

I expect it to get used like this, if you want to guarantee that
any mapping you create can *never* be accessed without the right
protection keys set up.

	pkey_deny_access(11); // random pkey
	int real_prot = PROT_READ|PROT_WRITE;
	ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	ret = mprotect_pkey(ptr, PAGE_SIZE, real_prot, 11);

This way, there is *no* window where the mapping is accessible
since it was always either PROT_NONE or had a protection key set.
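
There is no libc wrapper for a brand-new system call, so a test
program would invoke it directly.  A minimal sketch, assuming the
x86_64 syscall number (394) from the wiring patch later in this
series:

	#include <sys/syscall.h>
	#include <unistd.h>

	#ifndef __NR_mprotect_key
	#define __NR_mprotect_key 394	/* x86_64, per the wiring patch */
	#endif

	static int mprotect_key(void *ptr, size_t len,
				int prot, unsigned long key)
	{
		return syscall(__NR_mprotect_key, ptr, len, prot, key);
	}

On kernels where CONFIG_NR_PROTECTION_KEYS is 1, any key other
than 0 fails with -EINVAL, so callers can detect that and fall
back to plain mprotect() behavior.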

---

 b/mm/Kconfig    |    7 +++++++
 b/mm/mprotect.c |   20 +++++++++++++++++---
 2 files changed, 24 insertions(+), 3 deletions(-)

diff -puN mm/Kconfig~pkeys-85-mprotect_pkey mm/Kconfig
--- a/mm/Kconfig~pkeys-85-mprotect_pkey	2015-09-16 10:48:20.270374302 -0700
+++ b/mm/Kconfig	2015-09-16 10:48:20.275374529 -0700
@@ -683,3 +683,10 @@ config FRAME_VECTOR
 
 config ARCH_USES_HIGH_VMA_FLAGS
 	bool
+
+config NR_PROTECTION_KEYS
+	int
+	# Everything supports a _single_ key, so allow folks to
+	# at least call APIs that take keys, but require that the
+	# key be 0.
+	default 1
diff -puN mm/mprotect.c~pkeys-85-mprotect_pkey mm/mprotect.c
--- a/mm/mprotect.c~pkeys-85-mprotect_pkey	2015-09-16 10:48:20.272374393 -0700
+++ b/mm/mprotect.c	2015-09-16 10:48:20.276374574 -0700
@@ -344,8 +344,8 @@ fail:
 	return error;
 }
 
-SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
-		unsigned long, prot)
+static int do_mprotect_key(unsigned long start, size_t len,
+		unsigned long prot, unsigned long key)
 {
 	unsigned long vm_flags, nstart, end, tmp, reqprot;
 	struct vm_area_struct *vma, *prev;
@@ -365,6 +365,8 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
 		return -ENOMEM;
 	if (!arch_validate_prot(prot))
 		return -EINVAL;
+	if (key >= CONFIG_NR_PROTECTION_KEYS)
+		return -EINVAL;
 
 	reqprot = prot;
 	/*
@@ -373,7 +375,7 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
 	if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
 		prot |= PROT_EXEC;
 
-	vm_flags = calc_vm_prot_bits(prot, 0);
+	vm_flags = calc_vm_prot_bits(prot, key);
 
 	down_write(&current->mm->mmap_sem);
 
@@ -443,3 +445,15 @@ out:
 	up_write(&current->mm->mmap_sem);
 	return error;
 }
+
+SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
+		unsigned long, prot)
+{
+	return do_mprotect_key(start, len, prot, 0);
+}
+
+SYSCALL_DEFINE4(mprotect_key, unsigned long, start, size_t, len,
+		unsigned long, prot, unsigned long, key)
+{
+	return do_mprotect_key(start, len, prot, key);
+}
_

^ permalink raw reply	[flat|nested] 172+ messages in thread

* [PATCH 18/26] x86, pkeys: add Kconfig prompt to existing config option
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


I don't have a strong opinion on whether we need this or not.
Protection Keys has relatively little code associated with it,
and it is not a heavyweight feature to keep enabled.  However,
I can imagine that folks would still appreciate being able to
disable it.

Here's the option if folks want it.
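
For anyone who does want it off, the prompt makes this a normal
.config decision:

	# CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS is not set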

---

 b/arch/x86/Kconfig |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff -puN arch/x86/Kconfig~pkeys-40-kconfig-prompt arch/x86/Kconfig
--- a/arch/x86/Kconfig~pkeys-40-kconfig-prompt	2015-09-16 10:48:19.287329737 -0700
+++ b/arch/x86/Kconfig	2015-09-16 10:48:19.291329919 -0700
@@ -1696,8 +1696,18 @@ config X86_INTEL_MPX
 	  If unsure, say N.
 
 config X86_INTEL_MEMORY_PROTECTION_KEYS
+	prompt "Intel Memory Protection Keys"
 	def_bool y
+	# Note: only available in 64-bit mode
 	depends on CPU_SUP_INTEL && X86_64
+	---help---
+	  Memory Protection Keys provides a mechanism for enforcing
+	  page-based protections, but without requiring modification of the
+	  page tables when an application changes protection domains.
+
+	  For details, see Documentation/x86/protection-keys.txt
+
+	  If unsure, say y.
 
 config EFI
 	bool "EFI runtime service support"
_

^ permalink raw reply	[flat|nested] 172+ messages in thread


* [PATCH 22/26] [HIJACKPROT] mm: Pass the 4-bit protection key in via PROT_ bits to syscalls
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


If a system call takes a PROT_{NONE,EXEC,WRITE,...} argument,
this adds support for passing a protection key in via that
argument:

	mmap()
	mprotect()
	drivers/char/agp/frontend.c's ioctl(AGPIOC_RESERVE)

This does not include direct support for shmat() since it uses
a different set of permission bits.  You can use mprotect()
after the attach to assign an attached SHM segment a protection
key.
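
Under this ABI the key rides along in the normal prot argument.
A sketch using the PROT_PKEY* bits from the hunk below (key 4 is
binary 0100, i.e. just PROT_PKEY2):

	/* map anonymous memory and tag it with pkey 4 in a single call */
	ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_PKEY2,
		   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);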

---

 b/arch/x86/include/uapi/asm/mman.h       |    6 ++++++
 b/include/uapi/asm-generic/mman-common.h |    4 ++++
 2 files changed, 10 insertions(+)

diff -puN arch/x86/include/uapi/asm/mman.h~pkeys-80-user-abi-bits arch/x86/include/uapi/asm/mman.h
--- a/arch/x86/include/uapi/asm/mman.h~pkeys-80-user-abi-bits	2015-09-16 09:45:54.123412488 -0700
+++ b/arch/x86/include/uapi/asm/mman.h	2015-09-16 09:45:54.129412761 -0700
@@ -20,6 +20,12 @@
 		((vm_flags) & VM_PKEY_BIT1 ? _PAGE_PKEY_BIT1 : 0) |	\
 		((vm_flags) & VM_PKEY_BIT2 ? _PAGE_PKEY_BIT2 : 0) |	\
 		((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))
+
+#define arch_calc_vm_prot_bits(prot) (	\
+		((prot) & PROT_PKEY0 ? VM_PKEY_BIT0 : 0) |	\
+		((prot) & PROT_PKEY1 ? VM_PKEY_BIT1 : 0) |	\
+		((prot) & PROT_PKEY2 ? VM_PKEY_BIT2 : 0) |	\
+		((prot) & PROT_PKEY3 ? VM_PKEY_BIT3 : 0))
 #endif
 
 #include <asm-generic/mman.h>
diff -puN include/uapi/asm-generic/mman-common.h~pkeys-80-user-abi-bits include/uapi/asm-generic/mman-common.h
--- a/include/uapi/asm-generic/mman-common.h~pkeys-80-user-abi-bits	2015-09-16 09:45:54.125412579 -0700
+++ b/include/uapi/asm-generic/mman-common.h	2015-09-16 09:45:54.128412715 -0700
@@ -10,6 +10,10 @@
 #define PROT_WRITE	0x2		/* page can be written */
 #define PROT_EXEC	0x4		/* page can be executed */
 #define PROT_SEM	0x8		/* page may be used for atomic ops */
+#define PROT_PKEY0	0x10		/* protection key value (bit 0) */
+#define PROT_PKEY1	0x20		/* protection key value (bit 1) */
+#define PROT_PKEY2	0x40		/* protection key value (bit 2) */
+#define PROT_PKEY3	0x80		/* protection key value (bit 3) */
 #define PROT_NONE	0x0		/* page can not be accessed */
 #define PROT_GROWSDOWN	0x01000000	/* mprotect flag: extend change to start of growsdown vma */
 #define PROT_GROWSUP	0x02000000	/* mprotect flag: extend change to end of growsup vma */
_

^ permalink raw reply	[flat|nested] 172+ messages in thread

* [PATCH 21/26] [NEWSYSCALL] x86: wire up mprotect_key() system call
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


This is all that we need to get the new system call itself
working on x86.

---

 b/arch/x86/entry/syscalls/syscall_32.tbl |    1 +
 b/arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 b/arch/x86/include/uapi/asm/mman.h       |    7 +++++++
 b/mm/Kconfig                             |    1 +
 4 files changed, 10 insertions(+)

diff -puN arch/x86/entry/syscalls/syscall_32.tbl~pkeys-16-x86-mprotect_key arch/x86/entry/syscalls/syscall_32.tbl
--- a/arch/x86/entry/syscalls/syscall_32.tbl~pkeys-16-x86-mprotect_key	2015-09-16 10:48:20.711394295 -0700
+++ b/arch/x86/entry/syscalls/syscall_32.tbl	2015-09-16 10:48:20.719394658 -0700
@@ -382,3 +382,4 @@
 373	i386	shutdown		sys_shutdown
 374	i386	userfaultfd		sys_userfaultfd
 375	i386	membarrier		sys_membarrier
+394	i386	mprotect_key		sys_mprotect_key
diff -puN arch/x86/entry/syscalls/syscall_64.tbl~pkeys-16-x86-mprotect_key arch/x86/entry/syscalls/syscall_64.tbl
--- a/arch/x86/entry/syscalls/syscall_64.tbl~pkeys-16-x86-mprotect_key	2015-09-16 10:48:20.712394341 -0700
+++ b/arch/x86/entry/syscalls/syscall_64.tbl	2015-09-16 10:48:20.719394658 -0700
@@ -331,6 +331,7 @@
 322	64	execveat		stub_execveat
 323	common	userfaultfd		sys_userfaultfd
 324	common	membarrier		sys_membarrier
+394	common	mprotect_key		sys_mprotect_key
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff -puN arch/x86/include/uapi/asm/mman.h~pkeys-16-x86-mprotect_key arch/x86/include/uapi/asm/mman.h
--- a/arch/x86/include/uapi/asm/mman.h~pkeys-16-x86-mprotect_key	2015-09-16 10:48:20.714394431 -0700
+++ b/arch/x86/include/uapi/asm/mman.h	2015-09-16 10:48:20.720394703 -0700
@@ -20,6 +20,13 @@
 		((vm_flags) & VM_PKEY_BIT1 ? _PAGE_PKEY_BIT1 : 0) |	\
 		((vm_flags) & VM_PKEY_BIT2 ? _PAGE_PKEY_BIT2 : 0) |	\
 		((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))
+
+#define arch_calc_vm_prot_bits(prot, key) ( 		\
+		((key) & 0x1 ? VM_PKEY_BIT0 : 0) |      \
+		((key) & 0x2 ? VM_PKEY_BIT1 : 0) |      \
+		((key) & 0x4 ? VM_PKEY_BIT2 : 0) |      \
+		((key) & 0x8 ? VM_PKEY_BIT3 : 0))
+
 #endif
 
 #include <asm-generic/mman.h>
diff -puN mm/Kconfig~pkeys-16-x86-mprotect_key mm/Kconfig
--- a/mm/Kconfig~pkeys-16-x86-mprotect_key	2015-09-16 10:48:20.716394522 -0700
+++ b/mm/Kconfig	2015-09-16 10:48:20.720394703 -0700
@@ -689,4 +689,5 @@ config NR_PROTECTION_KEYS
 	# Everything supports a _single_ key, so allow folks to
 	# at least call APIs that take keys, but require that the
 	# key be 0.
+	default 16 if X86_INTEL_MEMORY_PROTECTION_KEYS
 	default 1
_

^ permalink raw reply	[flat|nested] 172+ messages in thread


* [PATCH 23/26] [HIJACKPROT] x86, pkeys: add x86 version of arch_validate_prot()
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


This allows more than just the traditional PROT_* flags to
be passed in to mprotect(), etc... on x86.
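
The visible effect is on argument checking.  A sketch, assuming
the PROT_PKEY* bits from the earlier [HIJACKPROT] patches:

	mprotect(ptr, 4096, PROT_READ | PROT_PKEY1);	/* key 2: now accepted */
	mprotect(ptr, 4096, PROT_READ | 0x100);		/* still rejected, -EINVAL */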

---

 b/arch/x86/include/uapi/asm/mman.h |   18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/uapi/asm/mman.h~pkeys-81-arch_validate_prot arch/x86/include/uapi/asm/mman.h
--- a/arch/x86/include/uapi/asm/mman.h~pkeys-81-arch_validate_prot	2015-09-16 09:45:54.564432490 -0700
+++ b/arch/x86/include/uapi/asm/mman.h	2015-09-16 09:45:54.567432626 -0700
@@ -6,6 +6,8 @@
 #define MAP_HUGE_2MB    (21 << MAP_HUGE_SHIFT)
 #define MAP_HUGE_1GB    (30 << MAP_HUGE_SHIFT)
 
+#include <asm-generic/mman.h>
+
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 /*
  * Take the 4 protection key bits out of the vma->vm_flags
@@ -26,8 +28,20 @@
 		((prot) & PROT_PKEY1 ? VM_PKEY_BIT1 : 0) |	\
 		((prot) & PROT_PKEY2 ? VM_PKEY_BIT2 : 0) |	\
 		((prot) & PROT_PKEY3 ? VM_PKEY_BIT3 : 0))
-#endif
 
-#include <asm-generic/mman.h>
+#ifndef arch_validate_prot
+/*
+ * This is called from mprotect().  PROT_GROWSDOWN and PROT_GROWSUP have
+ * already been masked out.
+ *
+ * Returns true if the prot flags are valid
+ */
+#define arch_validate_prot(prot) (\
+	(prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM |	\
+	 PROT_PKEY0 | PROT_PKEY1 | PROT_PKEY2 | PROT_PKEY3)) == 0)	\
+
+#endif /* arch_validate_prot */
+
+#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
 
 #endif /* _ASM_X86_MMAN_H */
_

^ permalink raw reply	[flat|nested] 172+ messages in thread

* [PATCH 25/26] x86, pkeys: actually enable Memory Protection Keys in CPU
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


This sets the bit in 'cr4' to actually enable the protection
keys feature.  We also add a boot-time parameter, "nopku",
to disable the feature.

Setting X86_CR4_PKE will cause the X86_FEATURE_OSPKE cpuid
bit to appear set.  At this point in boot, identify_cpu()
has already run the actual CPUID instructions and populated
the "cpu features" structures, so we re-run get_cpu_cap()
to make sure this CPU's copy picks up the updated value.

We *could* simply re-populate the 11th word of the cpuid
data, but re-reading all the leaves is probably quick enough.

Also note that with the cpu_has() check and X86_FEATURE_PKU
present in disabled-features.h, we do not need an #ifdef
for setup_pku().

---

 b/Documentation/kernel-parameters.txt |    3 +++
 b/arch/x86/kernel/cpu/common.c        |   26 ++++++++++++++++++++++++++
 2 files changed, 29 insertions(+)

diff -puN arch/x86/kernel/cpu/common.c~pkeys-50-should-be-last-patch arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~pkeys-50-should-be-last-patch	2015-09-16 09:45:55.420471313 -0700
+++ b/arch/x86/kernel/cpu/common.c	2015-09-16 09:45:55.426471585 -0700
@@ -289,6 +289,31 @@ static __always_inline void setup_smap(s
 }
 
 /*
+ * Protection Keys are not available in 32-bit mode.
+ */
+static __always_inline void setup_pku(struct cpuinfo_x86 *c)
+{
+	if (!cpu_has(c, X86_FEATURE_PKU))
+		return;
+
+	cr4_set_bits(X86_CR4_PKE);
+	/*
+	 * Setting X86_CR4_PKE will cause the X86_FEATURE_OSPKE
+	 * cpuid bit to be set.  We need to ensure that we
+	 * update that bit in this CPU's "cpu_info".
+	 */
+	get_cpu_cap(c);
+}
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+static __init int setup_disable_pku(char *arg)
+{
+	setup_clear_cpu_cap(X86_FEATURE_PKU);
+	return 1;
+}
+__setup("nopku", setup_disable_pku);
+#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
+
+/*
  * Some CPU features depend on higher CPUID levels, which may not always
  * be available due to CPUID level capping or broken virtualization
  * software.  Add those features to this table to auto-disable them.
@@ -947,6 +972,7 @@ static void identify_cpu(struct cpuinfo_
 	init_hypervisor(c);
 	x86_init_rdrand(c);
 	x86_init_cache_qos(c);
+	setup_pku(c);
 
 	/*
 	 * Clear/Set all flags overriden by options, need do it
diff -puN Documentation/kernel-parameters.txt~pkeys-50-should-be-last-patch Documentation/kernel-parameters.txt
--- a/Documentation/kernel-parameters.txt~pkeys-50-should-be-last-patch	2015-09-16 09:45:55.422471404 -0700
+++ b/Documentation/kernel-parameters.txt	2015-09-16 09:45:55.427471630 -0700
@@ -955,6 +955,9 @@ bytes respectively. Such letter suffixes
 			See Documentation/x86/intel_mpx.txt for more
 			information about the feature.
 
+	nopku		[X86] Disable Memory Protection Keys CPU feature found
+			in some Intel CPUs.
+
 	eagerfpu=	[X86]
 			on	enable eager fpu restore
 			off	disable eager fpu restore
_

^ permalink raw reply	[flat|nested] 172+ messages in thread

* [PATCH 24/26] [HIJACKPROT] x86, pkeys: mask off pkeys bits in mprotect()
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm


This is a quick hack that puts very x86-specific bits into
mprotect.c.  I will fix it up properly if we decide to go
forward with the PROT_* scheme as the user ABI for setting
protection keys.
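
One user-visible consequence of this scheme, sketched with the
hypothetical PROT_PKEY() helper from the previous patch's example:
the pkey bits in a VMA are taken from the prot argument on every
call, so a plain mprotect() resets a previously assigned key to 0:

	mprotect(ptr, size, PROT_READ);			/* key reverts to 0 */
	mprotect(ptr, size, PROT_READ | PROT_PKEY(5));	/* key stays 5 */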

---

 b/arch/x86/include/uapi/asm/mman.h |    9 +++++----
 b/mm/mprotect.c                    |   13 ++++++++++++-
 2 files changed, 17 insertions(+), 5 deletions(-)

diff -puN arch/x86/include/uapi/asm/mman.h~pkeys-82-mprotect-flag-copy arch/x86/include/uapi/asm/mman.h
--- a/arch/x86/include/uapi/asm/mman.h~pkeys-82-mprotect-flag-copy	2015-09-16 09:45:54.977451221 -0700
+++ b/arch/x86/include/uapi/asm/mman.h	2015-09-16 09:45:54.982451448 -0700
@@ -24,10 +24,11 @@
 		((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))
 
 #define arch_calc_vm_prot_bits(prot) (	\
-		((prot) & PROT_PKEY0 ? VM_PKEY_BIT0 : 0) |	\
-		((prot) & PROT_PKEY1 ? VM_PKEY_BIT1 : 0) |	\
-		((prot) & PROT_PKEY2 ? VM_PKEY_BIT2 : 0) |	\
-		((prot) & PROT_PKEY3 ? VM_PKEY_BIT3 : 0))
+		(!boot_cpu_has(X86_FEATURE_OSPKE) ? 0 :			\
+			((prot) & PROT_PKEY0 ? VM_PKEY_BIT0 : 0) |	\
+			((prot) & PROT_PKEY1 ? VM_PKEY_BIT1 : 0) |	\
+			((prot) & PROT_PKEY2 ? VM_PKEY_BIT2 : 0) |	\
+			((prot) & PROT_PKEY3 ? VM_PKEY_BIT3 : 0)))
 
 #ifndef arch_validate_prot
 /*
diff -puN mm/mprotect.c~pkeys-82-mprotect-flag-copy mm/mprotect.c
--- a/mm/mprotect.c~pkeys-82-mprotect-flag-copy	2015-09-16 09:45:54.978451266 -0700
+++ b/mm/mprotect.c	2015-09-16 09:45:54.982451448 -0700
@@ -344,6 +344,15 @@ fail:
 	return error;
 }
 
+static unsigned long vm_flags_unaffected_by_mprotect(unsigned long vm_flags)
+{
+	unsigned long mask_off = VM_READ | VM_WRITE | VM_EXEC;
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+	mask_off |= VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | VM_PKEY_BIT3;
+#endif
+	return vm_flags & ~mask_off;
+}
+
 SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
 		unsigned long, prot)
 {
@@ -407,8 +416,10 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
 
 		/* Here we know that vma->vm_start <= nstart < vma->vm_end. */
 
+		/* Set the vm_flags from the PROT_* bits passed to mprotect */
 		newflags = vm_flags;
-		newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
+		/* Copy over all other VMA flags unaffected by mprotect */
+		newflags |= vm_flags_unaffected_by_mprotect(vma->vm_flags);
 
 		/* newflags >> 4 shift VM_MAY% in place of VM_% */
 		if ((newflags & ~(newflags >> 4)) & (VM_READ | VM_WRITE | VM_EXEC)) {
_

^ permalink raw reply	[flat|nested] 172+ messages in thread

* [PATCH 26/26] x86, pkeys: Documentation
  2015-09-16 17:49 ` Dave Hansen
@ 2015-09-16 17:49   ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:49 UTC (permalink / raw)
  To: dave; +Cc: x86, linux-kernel, linux-mm



---

 b/Documentation/x86/protection-keys.txt |   65 ++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff -puN /dev/null Documentation/x86/protection-keys.txt
--- /dev/null	2015-07-13 14:24:11.435656502 -0700
+++ b/Documentation/x86/protection-keys.txt	2015-09-16 09:45:55.874491904 -0700
@@ -0,0 +1,65 @@
+Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature
+which will be found on future Intel CPUs.
+
+Memory Protection Keys provides a mechanism for enforcing page-based
+protections, but without requiring modification of the page tables
+when an application changes protection domains.  It works by
+dedicating 4 previously ignored bits in each page table entry to a
+"protection key", giving 16 possible keys.
+
+There is also a new user-accessible register (PKRU) with two separate
+bits (Access Disable and Write Disable) for each key.  Being a CPU
+register, PKRU is inherently thread-local, potentially giving each
+thread a different set of protections from every other thread.
+
+There are two new instructions (RDPKRU/WRPKRU) for reading and writing
+to the new register.  The feature is only available in 64-bit mode,
+even though there is theoretically space in the PAE PTEs.  These
+permissions are enforced on data access only and have no effect on
+instruction fetches.
+
+The kernel attempts to make protection keys consistent with the
+behavior of a plain mprotect().  For instance, if you do this:
+
+	mprotect(ptr, size, PROT_NONE);
+	something(ptr);
+
+you can expect the same effects with protection keys when doing this:
+
+	mprotect(ptr, size, PROT_READ|PROT_WRITE);
+	set_pkey(ptr, size, 4);
+	wrpkru(0xffffff3f); // access disable pkey 4
+	something(ptr);
+
+That should be true whether something() is a direct access to 'ptr'
+like:
+
+	*ptr = foo;
+
+or when the kernel does the access on the application's behalf like
+with a read():
+
+	read(fd, ptr, 1);
+
+The kernel will send a SIGSEGV in both cases, but si_code will be set
+to SEGV_PKUERR when protection keys are violated, versus SEGV_ACCERR
+when the plain mprotect() permissions are violated.
+
+=========
+
+Changes in v005:
+ * completed "software enforcement of PKEYs"
+ * fixed a ton of bugs
+
+Changes in v004:
+ * bunch of code updates including working signal handling
+
+Changes in v003:
+ * update to new FPU code, and add a bunch of XSAVE patches
+   to the beginning
+
+Changes in v002:
+
+ * make mprotect() actually work
+
+
_
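
As a companion to the document above, a minimal user-space sketch of
the two new instructions, emitted as raw bytes in case the assembler
does not know them yet.  The PKRU layout assumed here is Access
Disable at bit 2*pkey and Write Disable at bit 2*pkey+1:

	static inline unsigned int rdpkru(void)
	{
		unsigned int eax, edx;

		/* RDPKRU (0f 01 ee): ECX must be zero */
		asm volatile(".byte 0x0f,0x01,0xee"
			     : "=a" (eax), "=d" (edx)
			     : "c" (0));
		return eax;
	}

	static inline void wrpkru(unsigned int pkru)
	{
		/* WRPKRU (0f 01 ef): ECX and EDX must be zero */
		asm volatile(".byte 0x0f,0x01,0xef"
			     : : "a" (pkru), "c" (0), "d" (0));
	}

	/* Set the Access Disable bit for one key, e.g. pkey 4 */
	static inline void pkey_access_disable(int pkey)
	{
		wrpkru(rdpkru() | (1u << (2 * pkey)));
	}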

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Fwd: [PATCH 00/26] [RFCv2] x86: Memory Protection Keys
  2015-09-16 17:49 ` Dave Hansen
                   ` (26 preceding siblings ...)
@ 2015-09-16 17:51 ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-16 17:51 UTC (permalink / raw)
  To: linux-arch

I'm sending this along to linux-arch as an FYI.

If your architecture has "protection keys" or "storage keys" or some
similar mechanism, I'd appreciate a look through these patches,
especially the syscalls.

-------- Forwarded Message --------
Subject: [PATCH 00/26] [RFCv2] x86: Memory Protection Keys
Date: Wed, 16 Sep 2015 10:49:03 -0700
From: Dave Hansen <dave@sr71.net>
To: dave@sr71.net
CC: x86@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org

MM reviewers, if you are going to look at one thing, please look
at patch 14 which adds a bunch of additional vma/pte permission
checks.  Everybody else, please take a look at the two syscall
alternatives, especially the non-x86 folk.

This is a second big, fat RFC.  This code is not runnable to
anyone outside of Intel unless they have some special hardware or
a fancy simulator.  If you are interested in running this for
real, please get in touch with me.  Hardware is available to
a very small but nonzero number of people.

Since the last posting, I have implemented almost all of the
"software enforcement" for protection keys.  Basically, in places
where we look at VMA or PTE permissions, we try to enforce
protection keys to make it act similarly to mprotect().  This is
the part of the approach that really needs the most review and is
almost entirely contained in the "check VMAs and PTEs for
protection keys".

I also implemented a new system call.  There are basically two
possibilities for plumbing protection keys out to userspace.
I've included *both* approaches here:
1. Create a new system call: mprotect_key().  It's mprotect(),
   plus a protection key.  The patches implementing this have
   [NEWSYSCALL] in the subject.
2. Hijack some space in the PROT_* bits and pass a protection key
   in there.  That way, existing system calls like mmap(),
   mprotect(), etc... just work.  The patches implementing this
   have [HIJACKPROT] in the subject and must be applied without
   the [NEWSYSCALL] ones.

There is still work left to do here.  Current TODO:
 * Build on something other than x86
 * Do some more exhaustive x86 randconfig tests
 * Make sure DAX mappings work
 * Pound on some of the modified paths to ensure limited
   performance impact from modifications to hot paths.

This set is also available here (with the new syscall):

	git://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-pkeys.git pkeys-v005

A version with the modification of the PROT_ syscalls is tagged
as 'pkeys-v005-protsyscalls'.

=== diffstat (new syscall version) ===

 Documentation/kernel-parameters.txt         |    3
 Documentation/x86/protection-keys.txt       |   65 ++++++++++++++++++++
 arch/powerpc/include/asm/mman.h             |    5 -
 arch/x86/Kconfig                            |   15 ++++
 arch/x86/entry/syscalls/syscall_32.tbl      |    1
 arch/x86/entry/syscalls/syscall_64.tbl      |    1
 arch/x86/include/asm/cpufeature.h           |   54 ++++++++++------
 arch/x86/include/asm/disabled-features.h    |   12 +++
 arch/x86/include/asm/fpu/types.h            |   17 +++++
 arch/x86/include/asm/fpu/xstate.h           |    4 -
 arch/x86/include/asm/mmu_context.h          |   66 ++++++++++++++++++++
 arch/x86/include/asm/pgtable.h              |   37 +++++++++++
 arch/x86/include/asm/pgtable_types.h        |   34 +++++++++-
 arch/x86/include/asm/required-features.h    |    4 +
 arch/x86/include/asm/special_insns.h        |   33 ++++++++++
 arch/x86/include/uapi/asm/mman.h            |   23 +++++++
 arch/x86/include/uapi/asm/processor-flags.h |    2
 arch/x86/kernel/cpu/common.c                |   27 ++++++++
 arch/x86/kernel/fpu/xstate.c                |   10 ++-
 arch/x86/kernel/process_64.c                |    2
 arch/x86/kernel/setup.c                     |    9 ++
 arch/x86/mm/fault.c                         |   89 ++++++++++++++++++++++++++--
 arch/x86/mm/gup.c                           |   37 ++++++-----
 drivers/char/agp/frontend.c                 |    2
 drivers/staging/android/ashmem.c            |    3
 fs/proc/task_mmu.c                          |    5 +
 include/asm-generic/mm_hooks.h              |   12 +++
 include/linux/mm.h                          |   15 ++++
 include/linux/mman.h                        |    6 -
 include/uapi/asm-generic/siginfo.h          |   11 +++
 mm/Kconfig                                  |   11 +++
 mm/gup.c                                    |   28 +++++++-
 mm/memory.c                                 |    8 +-
 mm/mmap.c                                   |    2
 mm/mprotect.c                               |   20 +++++-
 35 files changed, 607 insertions(+), 66 deletions(-)

== FEATURE OVERVIEW ==

Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU
feature which will be found in future Intel CPUs.  The work here
was done with the aid of simulators.

Memory Protection Keys provides a mechanism for enforcing
page-based protections, but without requiring modification of the
page tables when an application changes protection domains.  It
works by dedicating 4 previously ignored bits in each page table
entry to a "protection key", giving 16 possible keys to
each page mapping.

There is also a new user-accessible register (PKRU) with two
separate bits (Access Disable and Write Disable) for each key.
Being a CPU register, PKRU is inherently thread-local,
potentially giving each thread a different set of protections
from every other thread.

There are two new instructions (RDPKRU/WRPKRU) for reading and
writing to the new register.  The feature is only available in
64-bit mode, even though there is theoretically space in the PAE
PTEs.  These permissions are enforced on data access only and
have no effect on instruction fetches.

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-09-16 17:49   ` Dave Hansen
@ 2015-09-20  8:55     ` Ingo Molnar
  -1 siblings, 0 replies; 172+ messages in thread
From: Ingo Molnar @ 2015-09-20  8:55 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Andy Lutomirski, Borislav Petkov


* Dave Hansen <dave@sr71.net> wrote:

> +Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature
> +which will be found on future Intel CPUs.
> +
> +Memory Protection Keys provides a mechanism for enforcing page-based
> +protections, but without requiring modification of the page tables
> +when an application changes protection domains.  It works by
> +dedicating 4 previously ignored bits in each page table entry to a
> +"protection key", giving 16 possible keys.

Wondering how user-space is supposed to discover the number of protection keys:
is that CPUID-leaf based, or hardcoded from the CPU feature bit?

> +There is also a new user-accessible register (PKRU) with two separate
> +bits (Access Disable and Write Disable) for each key.  Being a CPU
> +register, PKRU is inherently thread-local, potentially giving each
> +thread a different set of protections from every other thread.
> +
> +There are two new instructions (RDPKRU/WRPKRU) for reading and writing
> +to the new register.  The feature is only available in 64-bit mode,
> +even though there is theoretically space in the PAE PTEs.  These
> +permissions are enforced on data access only and have no effect on
> +instruction fetches.

Another question, related to enumeration as well: I'm wondering whether there's 
any way for the kernel to allocate a bit or two for its own purposes - such as 
protecting crypto keys? Or is the facility fundamentally intended for user-space 
use only?

Just a quick example: let's assume the kernel has an information leak hole, a way 
to read any kernel address and pass the contents to an attacker.  Let's also assume 
that the main crypto-keys of the kernel are protected by protection-keys. The code 
exposing the information leak will very likely have protection-key protected areas 
masked out, so the scope of the information leak is mitigated to a certain degree, 
the crypto keys are not readable.

Similarly, the pmem (persistent memory) driver could employ protection keys to 
keep terabytes of data 'masked out' most of the time - protecting data from kernel 
space memory corruption bugs.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-09-20  8:55     ` Ingo Molnar
@ 2015-09-21  4:34       ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-21  4:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Andy Lutomirski, Borislav Petkov

On 09/20/2015 01:55 AM, Ingo Molnar wrote:
> * Dave Hansen <dave@sr71.net> wrote:
>> +Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature
>> +which will be found on future Intel CPUs.
>> +
>> +Memory Protection Keys provides a mechanism for enforcing page-based
>> +protections, but without requiring modification of the page tables
>> +when an application changes protection domains.  It works by
>> +dedicating 4 previously ignored bits in each page table entry to a
>> +"protection key", giving 16 possible keys.
> 
> Wondering how user-space is supposed to discover the number of protection keys,
> is that CPUID leaf based, or hardcoded on the CPU feature bit?

The 16 keys are essentially hard-coded from the cpuid bit.
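
A minimal sketch of that discovery from user space, assuming GCC's
<cpuid.h>: CPUID leaf 7, subleaf 0 reports PKU in ECX bit 3 and OSPKE
in ECX bit 4, and the count of 16 keys then follows from the 4 PTE
bits rather than from any enumerated field:

	#include <cpuid.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		if (__get_cpuid_max(0, 0) < 7)
			return 1;
		__cpuid_count(7, 0, eax, ebx, ecx, edx);
		printf("PKU:   %u\n", (ecx >> 3) & 1); /* CPU has pkeys  */
		printf("OSPKE: %u\n", (ecx >> 4) & 1); /* kernel set PKE */
		return 0;
	}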

>> +There is also a new user-accessible register (PKRU) with two separate
>> +bits (Access Disable and Write Disable) for each key.  Being a CPU
>> +register, PKRU is inherently thread-local, potentially giving each
>> +thread a different set of protections from every other thread.
>> +
>> +There are two new instructions (RDPKRU/WRPKRU) for reading and writing
>> +to the new register.  The feature is only available in 64-bit mode,
>> +even though there is theoretically space in the PAE PTEs.  These
>> +permissions are enforced on data access only and have no effect on
>> +instruction fetches.
> 
> Another question, related to enumeration as well: I'm wondering whether there's 
> any way for the kernel to allocate a bit or two for its own purposes - such as 
> protecting crypto keys? Or is the facility fundamentally intended for user-space 
> use only?

No, that's not possible with the current setup.

Userspace has complete control over the contents of the PKRU register
with unprivileged instructions.  So the kernel can not practically
protect any of its own data with this.

> Similarly, the pmem (persistent memory) driver could employ protection keys to 
> keep terabytes of data 'masked out' most of the time - protecting data from kernel 
> space memory corruption bugs.

I wish we could do this, but we can not with the current implementation.


^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 05/26] x86, pkey: add PKRU xsave fields and data structure(s)
  2015-09-16 17:49   ` Dave Hansen
@ 2015-09-22 19:53     ` Thomas Gleixner
  -1 siblings, 0 replies; 172+ messages in thread
From: Thomas Gleixner @ 2015-09-22 19:53 UTC (permalink / raw)
  To: Dave Hansen; +Cc: x86, linux-kernel, linux-mm

On Wed, 16 Sep 2015, Dave Hansen wrote:
> --- a/arch/x86/kernel/fpu/xstate.c~pkeys-03-xsave	2015-09-16 10:48:13.340060126 -0700
> +++ b/arch/x86/kernel/fpu/xstate.c	2015-09-16 10:48:13.344060307 -0700
> @@ -23,6 +23,8 @@ static const char *xfeature_names[] =
>  	"AVX-512 opmask"		,
>  	"AVX-512 Hi256"			,
>  	"AVX-512 ZMM_Hi256"		,
> +	"unknown xstate feature (8)"	,

It's not unknown. It's PT, right?

> +	"Protection Keys User registers",
>  	"unknown xstate feature"	,
>  };

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 05/26] x86, pkey: add PKRU xsave fields and data structure(s)
  2015-09-22 19:53     ` Thomas Gleixner
@ 2015-09-22 19:58       ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-22 19:58 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: x86, linux-kernel, linux-mm

On 09/22/2015 12:53 PM, Thomas Gleixner wrote:
> On Wed, 16 Sep 2015, Dave Hansen wrote:
>> --- a/arch/x86/kernel/fpu/xstate.c~pkeys-03-xsave	2015-09-16 10:48:13.340060126 -0700
>> +++ b/arch/x86/kernel/fpu/xstate.c	2015-09-16 10:48:13.344060307 -0700
>> @@ -23,6 +23,8 @@ static const char *xfeature_names[] =
>>  	"AVX-512 opmask"		,
>>  	"AVX-512 Hi256"			,
>>  	"AVX-512 ZMM_Hi256"		,
>> +	"unknown xstate feature (8)"	,
> 
> It's not unknown. It's PT, right?

Yes, it's the Processor Trace state.

I'll give it a real name and also a comment about it being unused.

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 10/26] x86, pkeys: notify userspace about protection key faults
  2015-09-16 17:49   ` Dave Hansen
@ 2015-09-22 20:03     ` Thomas Gleixner
  -1 siblings, 0 replies; 172+ messages in thread
From: Thomas Gleixner @ 2015-09-22 20:03 UTC (permalink / raw)
  To: Dave Hansen; +Cc: x86, linux-kernel, linux-mm

On Wed, 16 Sep 2015, Dave Hansen wrote:
>  
> +static inline u16 vma_pkey(struct vm_area_struct *vma)
> +{
> +	u16 pkey = 0;
> +#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
> +	unsigned long f = vma->vm_flags;
> +	pkey |= (!!(f & VM_HIGH_ARCH_0)) << 0;
> +	pkey |= (!!(f & VM_HIGH_ARCH_1)) << 1;
> +	pkey |= (!!(f & VM_HIGH_ARCH_2)) << 2;
> +	pkey |= (!!(f & VM_HIGH_ARCH_3)) << 3;

Eew. What's wrong with:

     pkey = (vma->vm_flags & VM_PKEY_MASK) >> VM_PKEY_SHIFT;

???

> +static u16 fetch_pkey(unsigned long address, struct task_struct *tsk)

So here we get a u16 and assign it to si_pkey

> +	if (boot_cpu_has(X86_FEATURE_OSPKE) && si_code == SEGV_PKUERR)
> +		info.si_pkey = fetch_pkey(address, tsk);

which is int.

> +			int _pkey; /* FIXME: protection key value??

Inconsistent at least.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 11/26] x86, pkeys: add functions for set/fetch PKRU
  2015-09-16 17:49   ` Dave Hansen
@ 2015-09-22 20:05     ` Thomas Gleixner
  -1 siblings, 0 replies; 172+ messages in thread
From: Thomas Gleixner @ 2015-09-22 20:05 UTC (permalink / raw)
  To: Dave Hansen; +Cc: x86, linux-kernel, linux-mm

On Wed, 16 Sep 2015, Dave Hansen wrote:

> 
> This adds the raw instructions to access PKRU as well as some
> accessor functions that correctly handle when the CPU does
> not support the instruction.  We don't use them here, but
> we will use read_pkru() in the next patch.
> 
> I do not see an immediate use for write_pkru().  But, we put it
> here for parity with its twin.

So that read_pkru() doesn't feel so lonely? I can't follow that logic.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 10/26] x86, pkeys: notify userspace about protection key faults
  2015-09-22 20:03     ` Thomas Gleixner
@ 2015-09-22 20:21       ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-22 20:21 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: x86, linux-kernel, linux-mm

On 09/22/2015 01:03 PM, Thomas Gleixner wrote:
> On Wed, 16 Sep 2015, Dave Hansen wrote:
>>  
>> +static inline u16 vma_pkey(struct vm_area_struct *vma)
>> +{
>> +	u16 pkey = 0;
>> +#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
>> +	unsigned long f = vma->vm_flags;
>> +	pkey |= (!!(f & VM_HIGH_ARCH_0)) << 0;
>> +	pkey |= (!!(f & VM_HIGH_ARCH_1)) << 1;
>> +	pkey |= (!!(f & VM_HIGH_ARCH_2)) << 2;
>> +	pkey |= (!!(f & VM_HIGH_ARCH_3)) << 3;
> 
> Eew. What's wrong with:
> 
>      pkey = (vma->vm_flags & VM_PKEY_MASK) >> VM_PKEY_SHIFT;

I didn't do that only because we don't have any other need for
VM_PKEY_MASK or VM_PKEY_SHIFT.  We could do:

#define VM_PKEY_MASK (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2...)

static inline u16 vma_pkey(struct vm_area_struct *vma)
{
	int vm_pkey_shift = __ffs(VM_PKEY_MASK);
	return (vma->vm_flags & VM_PKEY_MASK) >> vm_pkey_shift;
}

That's probably the same number of lines of code in the end.  The
compiler _probably_ ends up doing the same thing either way.

>> +static u16 fetch_pkey(unsigned long address, struct task_struct *tsk)
> 
> So here we get a u16 and assign it to si_pkey
> 
>> +	if (boot_cpu_has(X86_FEATURE_OSPKE) && si_code == SEGV_PKUERR)
>> +		info.si_pkey = fetch_pkey(address, tsk);
> 
> which is int.
> 
>> +			int _pkey; /* FIXME: protection key value??
> 
> Inconsistent at least.

So I defined all the kernel-internal types as u16 since I *know* the
size of the hardware.

The user-exposed ones should probably be a bit more generic.  I did just
realize that this is an int and my proposed syscall is a long.  That I
definitely need to make consistent.

Does anybody care whether it's an int or a long?

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 11/26] x86, pkeys: add functions for set/fetch PKRU
  2015-09-22 20:05     ` Thomas Gleixner
@ 2015-09-22 20:22       ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-22 20:22 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: x86, linux-kernel, linux-mm

On 09/22/2015 01:05 PM, Thomas Gleixner wrote:
> On Wed, 16 Sep 2015, Dave Hansen wrote:
>> This adds the raw instructions to access PKRU as well as some
>> accessor functions that correctly handle when the CPU does
>> not support the instruction.  We don't use them here, but
>> we will use read_pkru() in the next patch.
>>
>> I do not see an immediate use for write_pkru().  But, we put it
>> here for partity with its twin.
> 
> So that read_pkru() doesn't feel so lonely? I can't follow that logic.

I was actually using it in a few places, but it fell out of later
versions of the patch.  I'm happy to kill it.

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 10/26] x86, pkeys: notify userspace about protection key faults
  2015-09-22 20:21       ` Dave Hansen
@ 2015-09-22 20:27         ` Thomas Gleixner
  -1 siblings, 0 replies; 172+ messages in thread
From: Thomas Gleixner @ 2015-09-22 20:27 UTC (permalink / raw)
  To: Dave Hansen; +Cc: x86, linux-kernel, linux-mm

On Tue, 22 Sep 2015, Dave Hansen wrote:
> On 09/22/2015 01:03 PM, Thomas Gleixner wrote:
> > On Wed, 16 Sep 2015, Dave Hansen wrote:
> >>  
> >> +static inline u16 vma_pkey(struct vm_area_struct *vma)
> >> +{
> >> +	u16 pkey = 0;
> >> +#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
> >> +	unsigned long f = vma->vm_flags;
> >> +	pkey |= (!!(f & VM_HIGH_ARCH_0)) << 0;
> >> +	pkey |= (!!(f & VM_HIGH_ARCH_1)) << 1;
> >> +	pkey |= (!!(f & VM_HIGH_ARCH_2)) << 2;
> >> +	pkey |= (!!(f & VM_HIGH_ARCH_3)) << 3;
> > 
> > Eew. What's wrong with:
> > 
> >      pkey = (vma->vm_flags & VM_PKEY_MASK) >> VM_PKEY_SHIFT;
> 
> I didn't do that only because we don't have any other need for
> VM_PKEY_MASK or VM_PKEY_SHIFT.  We could do:
> 
> #define VM_PKEY_MASK (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2...)
> 
> static inline u16 vma_pkey(struct vm_area_struct *vma)
> {
> 	int vm_pkey_shift = __ffs(VM_PKEY_MASK)
> 	return (vma->vm_flags & VM_PKEY_MASK) >> vm_pkey_shift;
> }
> 
> That's probably the same number of lines of code in the end.  The
> compiler _probably_ ends up doing the same thing either way.
> 
> >> +static u16 fetch_pkey(unsigned long address, struct task_struct *tsk)
> > 
> > So here we get a u16 and assign it to si_pkey
> > 
> >> +	if (boot_cpu_has(X86_FEATURE_OSPKE) && si_code == SEGV_PKUERR)
> >> +		info.si_pkey = fetch_pkey(address, tsk);
> > 
> > which is int.
> > 
> >> +			int _pkey; /* FIXME: protection key value??
> > 
> > Inconsistent at least.
> 
> So I defined all the kernel-internal types as u16 since I *know* the
> size of the hardware.
> 
> The user-exposed ones should probably be a bit more generic.  I did just
> realize that this is an int and my proposed syscall is a long.  That I
> definitely need to make consistent.
> 
> Does anybody care whether it's an int or a long?

long is frowned upon due to 32/64bit. Even if that key stuff is only
available on 64bit for now ....

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 10/26] x86, pkeys: notify userspace about protection key faults
  2015-09-22 20:27         ` Thomas Gleixner
@ 2015-09-22 20:29           ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-22 20:29 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: x86, linux-kernel, linux-mm

On 09/22/2015 01:27 PM, Thomas Gleixner wrote:
>> > 
>> > So I defined all the kernel-internal types as u16 since I *know* the
>> > size of the hardware.
>> > 
>> > The user-exposed ones should probably be a bit more generic.  I did just
>> > realize that this is an int and my proposed syscall is a long.  That I
>> > definitely need to make consistent.
>> > 
>> > Does anybody care whether it's an int or a long?
> long is frowned upon due to 32/64bit. Even if that key stuff is only
> available on 64bit for now ....

Well, it can be used by 32-bit apps on 64-bit kernels.

Ahh, that's why we don't see any longs in the siginfo.  So does that
mean 'int' is still our best bet in siginfo?



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 10/26] x86, pkeys: notify userspace about protection key faults
  2015-09-22 20:29           ` Dave Hansen
@ 2015-09-23  8:05             ` Ingo Molnar
  -1 siblings, 0 replies; 172+ messages in thread
From: Ingo Molnar @ 2015-09-23  8:05 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Thomas Gleixner, x86, linux-kernel, linux-mm, Peter Zijlstra


* Dave Hansen <dave@sr71.net> wrote:

> On 09/22/2015 01:27 PM, Thomas Gleixner wrote:
> >> > 
> >> > So I defined all the kernel-internal types as u16 since I *know* the
> >> > size of the hardware.
> >> > 
> >> > The user-exposed ones should probably be a bit more generic.  I did just
> >> > realize that this is an int and my proposed syscall is a long.  That I
> >> > definitely need to make consistent.
> >> > 
> >> > Does anybody care whether it's an int or a long?
> > long is frowned upon due to 32/64bit. Even if that key stuff is only
> > available on 64bit for now ....
> 
> Well, it can be used by 32-bit apps on 64-bit kernels.
> 
> Ahh, that's why we don't see any longs in the siginfo.  So does that
> mean 'int' is still our best bet in siginfo?

Use {s|u}{8|16|32|64} integer types in ABI-relevant interfaces, please; they are
our most unambiguous and constant types.

Here that would mean s32 or u32?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 172+ messages in thread
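
To make the fixed-width argument concrete, here is a minimal userspace sketch.
The struct is hypothetical (it is not the kernel's siginfo layout); it only
demonstrates why a 'long' field changes the user-visible record between 32-bit
and 64-bit builds while a fixed-width field does not:

	/*
	 * Hypothetical fault record, NOT the real siginfo layout.  A
	 * 'long' member is 4 bytes on a 32-bit build and 8 bytes on a
	 * 64-bit one; uint32_t is 4 bytes on every build.
	 */
	#include <stdint.h>
	#include <stdio.h>

	struct fault_record {
		uint64_t addr;	/* widened so both ABIs agree */
		uint32_t pkey;	/* same size on every arch */
	};

	int main(void)
	{
		_Static_assert(sizeof(uint32_t) == 4, "u32 is 4 bytes");
		printf("fault_record: %zu bytes on this build\n",
		       sizeof(struct fault_record));
		printf("'long': %zu bytes here, hence ABI-unstable\n",
		       sizeof(long));
		return 0;
	}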

* Re: [PATCH 10/26] x86, pkeys: notify userspace about protection key faults
  2015-09-16 17:49   ` Dave Hansen
@ 2015-09-24  9:23     ` Ingo Molnar
  -1 siblings, 0 replies; 172+ messages in thread
From: Ingo Molnar @ 2015-09-24  9:23 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Thomas Gleixner


* Dave Hansen <dave@sr71.net> wrote:

> A protection key fault is very similar to any other access
> error.  There must be a VMA, etc...  We even want to take
> the same action (SIGSEGV) that we do with a normal access
> fault.
> 
> However, we do need to let userspace know that something
> is different.  We do this the same way we did with
> SEGV_BNDERR with Memory Protection eXtensions (MPX):
> define a new SEGV code: SEGV_PKUERR.
> 
> We also add a siginfo field: si_pkey that reveals to
> userspace which protection key was set on the PTE that
> we faulted on.  There is no other easy way for
> userspace to figure this out.  They could parse smaps
> but that would be a bit cruel.

> diff -puN arch/x86/mm/fault.c~pkeys-09-siginfo arch/x86/mm/fault.c
> --- a/arch/x86/mm/fault.c~pkeys-09-siginfo	2015-09-16 10:48:15.580161678 -0700
> +++ b/arch/x86/mm/fault.c	2015-09-16 10:48:15.591162177 -0700
> @@ -15,12 +15,14 @@
>  #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
>  #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
>  
> +#include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
>  #include <asm/traps.h>			/* dotraplinkage, ...		*/
>  #include <asm/pgalloc.h>		/* pgd_*(), ...			*/
>  #include <asm/kmemcheck.h>		/* kmemcheck_*(), ...		*/
>  #include <asm/fixmap.h>			/* VSYSCALL_ADDR		*/
>  #include <asm/vsyscall.h>		/* emulate_vsyscall		*/
>  #include <asm/vm86.h>			/* struct vm86			*/
> +#include <asm/mmu_context.h>		/* vma_pkey()			*/
>  
>  #define CREATE_TRACE_POINTS
>  #include <asm/trace/exceptions.h>
> @@ -169,6 +171,45 @@ is_prefetch(struct pt_regs *regs, unsign
>  	return prefetch;
>  }
>  
> +static u16 fetch_pkey(unsigned long address, struct task_struct *tsk)
> +{
> +	u16 ret;
> +	spinlock_t *ptl;
> +	pte_t *ptep;
> +	pte_t pte;
> +	int follow_ret;
> +
> +	if (!boot_cpu_has(X86_FEATURE_OSPKE))
> +		return 0;
> +
> +	follow_ret = follow_pte(tsk->mm, address, &ptep, &ptl);
> +	if (!follow_ret) {
> +		/*
> +		 * On a successful follow, make sure to
> +		 * drop the lock.
> +		 */
> +		pte = *ptep;
> +		pte_unmap_unlock(ptep, ptl);
> +		ret = pte_pkey(pte);
> +	} else {
> +		/*
> +		 * There is no PTE.  Go looking for the pkey in
> +		 * the VMA.  If we did not find a pkey violation
> +		 * from either the PTE or the VMA, then it must
> +		 * have been a fault from the hardware.  Perhaps
> +		 * the PTE got zapped before we got in here.
> +		 */
> +		struct vm_area_struct *vma = find_vma(tsk->mm, address);
> +		if (vma) {
> +			ret = vma_pkey(vma);
> +		} else {
> +			WARN_ONCE(1, "no PTE or VMA @ %lx\n", address);
> +			ret = 0;
> +		}
> +	}
> +	return ret;

Yeah, so I have three observations:

1)

I don't think this warning is entirely right, because this is a fundamentally racy 
op.

fetch_pkey(), called by force_sig_info_fault(), can be called while not holding 
the vma - and if we race with any other thread of the mm, the vma might be gone 
already.

So any threaded app using pkeys and vmas in parallel could trigger that WARN_ONCE().

2)

And note that this is a somewhat new scenario: in regular page faults, 
'error_code' always carries a then-valid cause of the page fault with itself. So 
we can put that into the siginfo and can be sure that it's the reason for the 
fault.

With the above pkey code, we fetch the pte separately from the fault, and without 
synchronizing with the fault - and we cannot do that, nor do we want to.

So I think this code should just accept the fact that races may happen. Perhaps 
warn if we get here with only a single mm user. (but even that would be a bit racy 
as we don't serialize against exit())

3)

For user-space that somehow wants to handle pkeys dynamically and drive them via 
faults, this seems somewhat inefficient: we already do a find_vma() in the primary 
fault lookup - and with the typical pkey usecase it will find a vma, just with the 
wrong access permissions. But when we generate the siginfo here, why do we do a 
find_vma() again? Why not pass the vma to the siginfo generating function?

> --- a/include/uapi/asm-generic/siginfo.h~pkeys-09-siginfo	2015-09-16 10:48:15.584161859 -0700
> +++ b/include/uapi/asm-generic/siginfo.h	2015-09-16 10:48:15.592162222 -0700
> @@ -95,6 +95,13 @@ typedef struct siginfo {
>  				void __user *_lower;
>  				void __user *_upper;
>  			} _addr_bnd;
> +			int _pkey; /* FIXME: protection key value??
> +				    * Do we really need this in here?
> +				    * userspace can get the PKRU value in
> +				    * the signal handler, but they do not
> +				    * easily have access to the PKEY value
> +				    * from the PTE.
> +				    */
>  		} _sigfault;

A couple of comments:

1)

Please use our ABI types - this one should be 'u32' I think.

We could use 'u8' as well here, and mark another 3 bytes next to it as reserved 
for future flags. Right now protection keys use 4 bits, but do you really think 
they'll ever grow beyond 8 bits? PTE bits are a scarce resource in general.

2)

To answer your question in the comment: it looks useful to have some sort of 
'extended page fault error code' information here, which shows why the page fault 
happened. With the regular error_code it's easy - with protection keys there's 16 
separate keys possible and user-space might not know the actual key value in the 
pte.

3)

Please add suitable self-tests to tools/testing/selftests/x86/ that document
the preferred usage of pkeys, demonstrate all implemented aspects of the new
ABI, provoke a fault and print the resulting siginfo, etc.

> @@ -206,7 +214,8 @@ typedef struct siginfo {
>  #define SEGV_MAPERR	(__SI_FAULT|1)	/* address not mapped to object */
>  #define SEGV_ACCERR	(__SI_FAULT|2)	/* invalid permissions for mapped object */
>  #define SEGV_BNDERR	(__SI_FAULT|3)  /* failed address bound checks */
> -#define NSIGSEGV	3
> +#define SEGV_PKUERR	(__SI_FAULT|4)  /* failed address bound checks */
> +#define NSIGSEGV	4

You copy & pasted the MPX comment here, it should read something like:

   #define SEGV_PKUERR	(__SI_FAULT|4)  /* failed protection keys checks */

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 172+ messages in thread
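
As a rough sketch of the selftest shape being requested here: the program
below installs a SIGSEGV handler and prints the siginfo it receives.  It
provokes a plain SEGV_ACCERR via mprotect(); under the ABI proposed in this
series, a protection-key violation would instead arrive with si_code ==
SEGV_PKUERR plus the new si_pkey field (both still under discussion, so
neither is referenced here):

	#define _GNU_SOURCE
	#include <signal.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	static void handler(int sig, siginfo_t *si, void *ctx)
	{
		(void)ctx;
		/* printf() is not async-signal-safe; fine for a test. */
		printf("signal %d, si_code %d, fault address %p\n",
		       sig, si->si_code, si->si_addr);
		_exit(0);
	}

	int main(void)
	{
		long pagesz = sysconf(_SC_PAGESIZE);
		struct sigaction sa;
		char *p;

		memset(&sa, 0, sizeof(sa));
		sa.sa_sigaction = handler;
		sa.sa_flags = SA_SIGINFO;
		sigaction(SIGSEGV, &sa, NULL);

		p = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		mprotect(p, pagesz, PROT_NONE);	/* revoke all access */
		p[0] = 1;			/* faults into the handler */
		return 1;
	}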

* Re: [PATCH 10/26] x86, pkeys: notify userspace about protection key faults
  2015-09-24  9:23     ` Ingo Molnar
@ 2015-09-24  9:30       ` Ingo Molnar
  -1 siblings, 0 replies; 172+ messages in thread
From: Ingo Molnar @ 2015-09-24  9:30 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Thomas Gleixner


* Ingo Molnar <mingo@kernel.org> wrote:

> > --- a/include/uapi/asm-generic/siginfo.h~pkeys-09-siginfo	2015-09-16 10:48:15.584161859 -0700
> > +++ b/include/uapi/asm-generic/siginfo.h	2015-09-16 10:48:15.592162222 -0700
> > @@ -95,6 +95,13 @@ typedef struct siginfo {
> >  				void __user *_lower;
> >  				void __user *_upper;
> >  			} _addr_bnd;
> > +			int _pkey; /* FIXME: protection key value??
> > +				    * Do we really need this in here?
> > +				    * userspace can get the PKRU value in
> > +				    * the signal handler, but they do not
> > +				    * easily have access to the PKEY value
> > +				    * from the PTE.
> > +				    */
> >  		} _sigfault;
> 
> A couple of comments:
> 
> 1)
> 
> Please use our ABI types - this one should be 'u32' I think.
> 
> We could use 'u8' as well here, and mark another 3 bytes next to it as reserved 
> for future flags. Right now protection keys use 4 bits, but do you really think 
> they'll ever grow beyond 8 bits? PTE bits are a scarce resource in general.
> 
> 2)
> 
> To answer your question in the comment: it looks useful to have some sort of 
> 'extended page fault error code' information here, which shows why the page fault 
> happened. With the regular error_code it's easy - with protection keys there's 16 
> separate keys possible and user-space might not know the actual key value in the 
> pte.

Btw., alternatively we could also say that user-space should know what protection 
key it used when it created the mapping - there's no need to recover it for every 
page fault.

OTOH, as long as we don't do a separate find_vma(), it looks cheap enough to look 
up the pkey value of that address and give it to user-space in the signal frame.

Btw., how does pkey support interact with hugepages?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-09-21  4:34       ` Dave Hansen
@ 2015-09-24  9:49         ` Ingo Molnar
  -1 siblings, 0 replies; 172+ messages in thread
From: Ingo Molnar @ 2015-09-24  9:49 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Kees Cook


* Dave Hansen <dave@sr71.net> wrote:

> > Another question, related to enumeration as well: I'm wondering whether 
> > there's any way for the kernel to allocate a bit or two for its own purposes - 
> > such as protecting crypto keys? Or is the facility fundamentally intended for 
> > user-space use only?
> 
> No, that's not possible with the current setup.

Ok, then another question, have you considered the following usecase:

AFAICS pkeys only affect data loads and stores. Instruction fetches are notably 
absent from the documentation. Can you clarify that instructions can be fetched 
and executed from PTE_READ but pkeys-all-access-disabled pages?

If yes then this could be a significant security feature / usecase for pkeys: 
executable sections of shared libraries and binaries could be mapped with pkey 
access disabled. If I read the Intel documentation correctly then that should be 
possible.

The advantage of doing that: a known attack method for circumventing ASLR
(or for scouting out an unknown binary) is to use a (user-space) information
leak to read the address space of a server process - and to use that to figure
out the actual code present at that address.

The code signature can then be used to identify the precise layout of the
binary, and/or to create ROP gadgets - to escalate permissions using an
otherwise unexploitable buffer overflow.

I.e. AFAICS pkeys could be used to create true '--x' permissions for executable 
(user-space) pages.

But I might be reading it wrong ...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 172+ messages in thread
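
A small sketch of the execute-only usecase discussed above, assuming hardware
with OSPKE and a kernel that backs PROT_EXEC-only mappings with a dedicated
protection key (later mainline kernels grew exactly this behavior).  On
non-pkey x86, PROT_EXEC implies readability and the final load succeeds
instead of faulting:

	#include <signal.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	static void on_segv(int sig)
	{
		(void)sig;
		puts("data read faulted: mapping is truly execute-only");
		_exit(0);
	}

	int main(void)
	{
		long pagesz = sysconf(_SC_PAGESIZE);
		unsigned char *p = mmap(NULL, pagesz,
					PROT_READ | PROT_WRITE,
					MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		p[0] = 0xc3;			/* x86 'ret' instruction */
		mprotect(p, pagesz, PROT_EXEC);	/* request --x */

		((void (*)(void))p)();		/* instruction fetch: fine */
		puts("call through the --x mapping returned");

		signal(SIGSEGV, on_segv);
		printf("read succeeded (0x%02x): no pkey-backed --x here\n",
		       p[0]);
		return 0;
	}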

* Re: [PATCH 10/26] x86, pkeys: notify userspace about protection key faults
  2015-09-24  9:23     ` Ingo Molnar
@ 2015-09-24 17:15       ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-24 17:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Thomas Gleixner, borntraeger

Christian, can you tell us how big s390's storage protection keys are?
See the discussion below about siginfo...

On 09/24/2015 02:23 AM, Ingo Molnar wrote:
>> +static u16 fetch_pkey(unsigned long address, struct task_struct *tsk)
>> +{
...
>> +		struct vm_area_struct *vma = find_vma(tsk->mm, address);
>> +		if (vma) {
>> +			ret = vma_pkey(vma);
>> +		} else {
>> +			WARN_ONCE(1, "no PTE or VMA @ %lx\n", address);
>> +			ret = 0;
>> +		}
>> +	}
>> +	return ret;
> 
> Yeah, so I have three observations:
> 
> 1)
> 
> I don't think this warning is entirely right, because this is a fundamentally racy 
> op.
> 
> fetch_pkey(), called by force_sig_info_fault(), can be called while not holding 
> the vma - and if we race with any other thread of the mm, the vma might be gone 
> already.
> 
> So any threaded app using pkeys and vmas in parallel could trigger that WARN_ONCE().

Agreed.  I'll remove the warning.

> 2)
> 
> And note that this is a somewhat new scenario: in regular page faults, 
> 'error_code' always carries a then-valid cause of the page fault with itself. So 
> we can put that into the siginfo and can be sure that it's the reason for the 
> fault.
> 
> With the above pkey code, we fetch the pte separately from the fault, and without 
> synchronizing with the fault - and we cannot do that, nor do we want to.
> 
> So I think this code should just accept the fact that races may happen. Perhaps 
> warn if we get here with only a single mm user. (but even that would be a bit racy 
> as we don't serialize against exit())

Good point.

> 3)
> 
> For user-space that somehow wants to handle pkeys dynamically and drive them via 
> faults, this seems somewhat inefficient: we already do a find_vma() in the primary 
> fault lookup - and with the typical pkey usecase it will find a vma, just with the 
> wrong access permissions. But when we generate the siginfo here, why do we do a 
> find_vma() again? Why not pass the vma to the siginfo generating function?

My assumption was that the signal generation case was pretty slow.
find_vma() is almost guaranteed to hit the vmacache, and we already hold
mmap_sem, so the cost is pretty tiny.

I'm happy to change it if you're really concerned, but I didn't think it
would be worth the trouble of plumbing it down.

>> --- a/include/uapi/asm-generic/siginfo.h~pkeys-09-siginfo	2015-09-16 10:48:15.584161859 -0700
>> +++ b/include/uapi/asm-generic/siginfo.h	2015-09-16 10:48:15.592162222 -0700
>> @@ -95,6 +95,13 @@ typedef struct siginfo {
>>  				void __user *_lower;
>>  				void __user *_upper;
>>  			} _addr_bnd;
>> +			int _pkey; /* FIXME: protection key value??
>> +				    * Do we really need this in here?
>> +				    * userspace can get the PKRU value in
>> +				    * the signal handler, but they do not
>> +				    * easily have access to the PKEY value
>> +				    * from the PTE.
>> +				    */
>>  		} _sigfault;
> 
> A couple of comments:
> 
> 1)
> 
> Please use our ABI types - this one should be 'u32' I think.
> 
> We could use 'u8' as well here, and mark another 3 bytes next to it as reserved 
> for future flags. Right now protection keys use 4 bits, but do you really think 
> they'll ever grow beyond 8 bits? PTE bits are a scarce resource in general.

I don't expect them to get bigger, at least with anything resembling the
current architecture.  Agreed about the scarcity of PTE bits.

siginfo.h is shared everywhere, so I'd ideally like to put a type in
there that all the other architectures can use.

> 3)
> 
> Please add suitable self-tests to tools/testing/selftests/x86/ that document
> the preferred usage of pkeys, demonstrate all implemented aspects of the new
> ABI, provoke a fault and print the resulting siginfo, etc.
> 
>> @@ -206,7 +214,8 @@ typedef struct siginfo {
>>  #define SEGV_MAPERR	(__SI_FAULT|1)	/* address not mapped to object */
>>  #define SEGV_ACCERR	(__SI_FAULT|2)	/* invalid permissions for mapped object */
>>  #define SEGV_BNDERR	(__SI_FAULT|3)  /* failed address bound checks */
>> -#define NSIGSEGV	3
>> +#define SEGV_PKUERR	(__SI_FAULT|4)  /* failed address bound checks */
>> +#define NSIGSEGV	4
> 
> You copy & pasted the MPX comment here, it should read something like:
> 
>    #define SEGV_PKUERR	(__SI_FAULT|4)  /* failed protection keys checks */

Whoops.  Will fix.

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 10/26] x86, pkeys: notify userspace about protection key faults
  2015-09-24  9:30       ` Ingo Molnar
@ 2015-09-24 17:41         ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-24 17:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Thomas Gleixner

On 09/24/2015 02:30 AM, Ingo Molnar wrote:
>> To answer your question in the comment: it looks useful to have some sort of 
>> 'extended page fault error code' information here, which shows why the page fault 
>> happened. With the regular error_code it's easy - with protection keys there's 16 
>> separate keys possible and user-space might not know the actual key value in the 
>> pte.
> 
> Btw., alternatively we could also say that user-space should know what protection 
> key it used when it created the mapping - there's no need to recover it for every 
> page fault.

That's true.  We don't, for instance, tell userspace whether it was a
write that caused a fault.

But, other than smaps we don't have *any* way to tell userspace what
protection key a page has.  I think some mechanism is going to be
required for this to be reasonably debuggable.

> OTOH, as long as we don't do a separate find_vma(), it looks cheap enough to look 
> up the pkey value of that address and give it to user-space in the signal frame.

I still think that find_vma() in this case is pretty darn cheap,
definitely if you compare it to the cost of the entire fault path.

> Btw., how does pkey support interact with hugepages?

Surprisingly little.  I've made sure that everything works with huge
pages and that the (huge) PTEs and VMAs get set up correctly, but I'm
not sure I had to touch the huge page code at all.  I have test code to
ensure that it works the same as with small pages, but everything worked
pretty naturally.

^ permalink raw reply	[flat|nested] 172+ messages in thread
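
For reference, the smaps route mentioned above looks like this from userspace;
the series adds a per-mapping "ProtectionKey:" line on pkeys-capable kernels
(field name as introduced by this patchset):

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char line[256];
		FILE *f = fopen("/proc/self/smaps", "r");

		if (!f)
			return 1;
		/* Print the pkey of every mapping in this process. */
		while (fgets(line, sizeof(line), f))
			if (!strncmp(line, "ProtectionKey:", 14))
				fputs(line, stdout);
		fclose(f);
		return 0;
	}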

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-09-24  9:49         ` Ingo Molnar
@ 2015-09-24 19:10           ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-24 19:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Kees Cook

On 09/24/2015 02:49 AM, Ingo Molnar wrote:
> * Dave Hansen <dave@sr71.net> wrote:
>>> Another question, related to enumeration as well: I'm wondering whether 
>>> there's any way for the kernel to allocate a bit or two for its own purposes - 
>>> such as protecting crypto keys? Or is the facility fundamentally intended for 
>>> user-space use only?
>>
>> No, that's not possible with the current setup.
> 
> Ok, then another question, have you considered the following usecase:
> 
> AFAICS pkeys only affect data loads and stores. Instruction fetches are notably 
> absent from the documentation. Can you clarify that instructions can be fetched 
> and executed from PTE_READ but pkeys-all-access-disabled pags?

That is my understanding.  I don't have a test for it, but I'll go make one.

> If yes then this could be a significant security feature / usecase for pkeys: 
> executable sections of shared libraries and binaries could be mapped with pkey 
> access disabled. If I read the Intel documentation correctly then that should be 
> possible.

Agreed.  I've even heard from some researchers who are interested in this:

https://www.infsec.cs.uni-saarland.de/wp-content/uploads/sites/2/2014/10/nuernberger2014ccs_disclosure.pdf

> I.e. AFAICS pkeys could be used to create true '--x' permissions for executable 
> (user-space) pages.

Just remember that all of the protections are dependent on the contents
of PKRU.  If an attacker controls the Access-Disable bit in PKRU for the
executable-only region, you're sunk.

But, that either requires being able to construct and execute arbitrary
code *or* call existing code that sets PKRU to the desired values.
Which, I guess, gets harder to do if all of the wrpkru's are *in*
the execute-only area.


^ permalink raw reply	[flat|nested] 172+ messages in thread
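
For context on why control of PKRU is the whole game: PKRU is written directly
from userspace with the unprivileged WRPKRU instruction.  A sketch, with the
opcodes hand-encoded because assemblers of this era lack the mnemonics; it
raises SIGILL on hardware without OSPKE:

	#include <stdio.h>

	static inline void wrpkru(unsigned int pkru)
	{
		/* WRPKRU (0f 01 ef): new PKRU in EAX, ECX = EDX = 0. */
		asm volatile(".byte 0x0f,0x01,0xef"
			     : : "a" (pkru), "c" (0), "d" (0) : "memory");
	}

	static inline unsigned int rdpkru(void)
	{
		unsigned int pkru, edx;

		/* RDPKRU (0f 01 ee): PKRU returned in EAX, ECX must be 0. */
		asm volatile(".byte 0x0f,0x01,0xee"
			     : "=a" (pkru), "=d" (edx) : "c" (0));
		(void)edx;
		return pkru;
	}

	int main(void)
	{
		/* Bit 2N is Access-Disable for pkey N; set it for pkey 1. */
		wrpkru(1u << (2 * 1));
		printf("PKRU is now 0x%x\n", rdpkru());
		return 0;
	}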

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-09-24 19:10           ` Dave Hansen
@ 2015-09-24 19:17             ` Andy Lutomirski
  -1 siblings, 0 replies; 172+ messages in thread
From: Andy Lutomirski @ 2015-09-24 19:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ingo Molnar, X86 ML, linux-kernel, linux-mm, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Andy Lutomirski, Borislav Petkov,
	Kees Cook

On Thu, Sep 24, 2015 at 12:10 PM, Dave Hansen <dave@sr71.net> wrote:
> On 09/24/2015 02:49 AM, Ingo Molnar wrote:
>> * Dave Hansen <dave@sr71.net> wrote:
>>>> Another question, related to enumeration as well: I'm wondering whether
>>>> there's any way for the kernel to allocate a bit or two for its own purposes -
>>>> such as protecting crypto keys? Or is the facility fundamentally intended for
>>>> user-space use only?
>>>
>>> No, that's not possible with the current setup.
>>
>> Ok, then another question, have you considered the following usecase:
>>
>> AFAICS pkeys only affect data loads and stores. Instruction fetches are notably
>> absent from the documentation. Can you clarify that instructions can be fetched
>> and executed from PTE_READ but pkeys-all-access-disabled pags?
>
> That is my understanding.  I don't have a test for it, but I'll go make one.
>
>> If yes then this could be a significant security feature / usecase for pkeys:
>> executable sections of shared libraries and binaries could be mapped with pkey
>> access disabled. If I read the Intel documentation correctly then that should be
>> possible.
>
> Agreed.  I've even heard from some researchers who are interested in this:
>
> https://www.infsec.cs.uni-saarland.de/wp-content/uploads/sites/2/2014/10/nuernberger2014ccs_disclosure.pdf
>
>> I.e. AFAICS pkeys could be used to create true '--x' permissions for executable
>> (user-space) pages.
>
> Just remember that all of the protections are dependent on the contents
> of PKRU.  If an attacker controls the Access-Disable bit in PKRU for the
> executable-only region, you're sunk.
>
> But, that either requires being able to construct and execute arbitrary
> code *or* call existing code that sets PKRU to the desired values.
> Which, I guess, gets harder to do if all of the wrpkru's are *in*
> the execute-only area.
>

This may mean that we want to have a way for binaries to indicate that
they want their --x segments to be loaded with a particular protection
key.  The right way to do that might be using an ELF note, and I also
want to use ELF notes to allow turning off vsyscalls, so maybe it's
time to write an ELF note parser in the kernel.

--Andy

^ permalink raw reply	[flat|nested] 172+ messages in thread
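
To make the ELF-note idea concrete, a binary could carry such a marker as in
the hypothetical sketch below.  The section name, note type, and payload are
invented for illustration (no kernel parses them), and the syntax is GNU-as
specific:

	/* Emit a real SHT_NOTE section: namesz, descsz, type, then the
	 * NUL-terminated name padded to 4 bytes, then the descriptor. */
	__asm__(
		".pushsection \".note.pkey\", \"a\", @note\n"
		"	.balign 4\n"
		"	.long 5\n"		/* namesz: strlen("PKEY")+1 */
		"	.long 4\n"		/* descsz: one 32-bit word */
		"	.long 1\n"		/* type: invented value */
		"	.asciz \"PKEY\"\n"	/* name, padded below */
		"	.balign 4\n"
		"	.long 1\n"		/* desc: requested pkey */
		".popsection\n"
	);

	int main(void)
	{
		/* 'readelf -n ./a.out' lists the .note.pkey contents. */
		return 0;
	}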

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-09-24 19:10           ` Dave Hansen
@ 2015-09-25  6:15             ` Ingo Molnar
  -1 siblings, 0 replies; 172+ messages in thread
From: Ingo Molnar @ 2015-09-25  6:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Kees Cook


* Dave Hansen <dave@sr71.net> wrote:

> > I.e. AFAICS pkeys could be used to create true '--x' permissions for executable 
> > (user-space) pages.
> 
> Just remember that all of the protections are dependent on the contents of PKRU.  
> If an attacker controls the Access-Disable bit in PKRU for the executable-only 
> region, you're sunk.

The same is true if the attacker can execute mprotect() calls.

> But, that either requires being able to construct and execute arbitrary code 
> *or* call existing code that sets PKRU to the desired values. Which, I guess, 
> gets harder to do if all of the wrpkru's are *in* the execute-only area.

Exactly. True --x executable regions makes it harder to 'upgrade' limited attacks.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 10/26] x86, pkeys: notify userspace about protection key faults
  2015-09-24 17:41         ` Dave Hansen
@ 2015-09-25  7:11           ` Ingo Molnar
  -1 siblings, 0 replies; 172+ messages in thread
From: Ingo Molnar @ 2015-09-25  7:11 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Thomas Gleixner


* Dave Hansen <dave@sr71.net> wrote:

> On 09/24/2015 02:30 AM, Ingo Molnar wrote:
> >> To answer your question in the comment: it looks useful to have some sort of 
> >> 'extended page fault error code' information here, which shows why the page fault 
> >> happened. With the regular error_code it's easy - with protection keys there's 16 
> >> separate keys possible and user-space might not know the actual key value in the 
> >> pte.
> > 
> > Btw., alternatively we could also say that user-space should know what protection 
> > key it used when it created the mapping - there's no need to recover it for every 
> > page fault.
> 
> That's true.  We don't, for instance, tell userspace whether it was a
> write that caused a fault.

I think we do put it into the signal frame, see setup_sigcontext():

                put_user_ex(current->thread.error_code, &sc->err);

and 'error_code & PF_WRITE' tells us whether it's a write fault.

And I'm pretty sure applications like Valgrind rely on this.

> But, other than smaps we don't have *any* way to tell userspace what protection 
> key a page has.  I think some mechanism is going to be required for this to be 
> reasonably debuggable.

I think it's a conceptual extension of sigcontext::err and we need it for similar 
reasons.

> > OTOH, as long as we don't do a separate find_vma(), it looks cheap enough to 
> > look up the pkey value of that address and give it to user-space in the signal 
> > frame.
> 
> I still think that find_vma() in this case is pretty darn cheap, definitely if 
> you compare it to the cost of the entire fault path.

So where's the problem? We have already looked up the vma and know whether there's 
any vma there or not. Why not pass in that pointer and be done with it? Why 
complicate the code by looking up a second time (and exposing us to various 
races)?

> > Btw., how does pkey support interact with hugepages?
> 
> Surprisingly little.  I've made sure that everything works with huge pages and 
> that the (huge) PTEs and VMAs get set up correctly, but I'm not sure I had to 
> touch the huge page code at all.  I have test code to ensure that it works the 
> same as with small pages, but everything worked pretty naturally.

Yeah, so the reason I'm asking about expectations is that this code:

+       follow_ret = follow_pte(tsk->mm, address, &ptep, &ptl);
+       if (!follow_ret) {
+               /*
+                * On a successful follow, make sure to
+                * drop the lock.
+                */
+               pte = *ptep;
+               pte_unmap_unlock(ptep, ptl);
+               ret = pte_pkey(pte);

is visibly hugepage-unsafe: if a vma is hugepage mapped, there are no ptes, only 
pmds - and the protection key index lives in the pmd. We don't seem to recover 
that information properly.

In any case, please put those hugepage tests into tools/testing/selftests/x86/ as 
well, as part of the pkey series.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 172+ messages in thread
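
The sigcontext::err plumbing referred to above is already visible from
userspace today.  A sketch for x86-64 (REG_ERR and the error-code bit layout,
write = bit 1, are x86-specific; glibc exposes REG_ERR under _GNU_SOURCE):

	#define _GNU_SOURCE
	#include <signal.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <ucontext.h>
	#include <unistd.h>

	#define X86_PF_WRITE 0x2	/* bit 1 of the fault error code */

	static void handler(int sig, siginfo_t *si, void *ctx)
	{
		ucontext_t *uc = ctx;
		unsigned long err = uc->uc_mcontext.gregs[REG_ERR];

		(void)sig;
		printf("fault at %p was a %s\n", si->si_addr,
		       (err & X86_PF_WRITE) ? "write" : "read");
		_exit(0);
	}

	int main(void)
	{
		long pagesz = sysconf(_SC_PAGESIZE);
		struct sigaction sa;
		char *p = mmap(NULL, pagesz, PROT_READ,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		memset(&sa, 0, sizeof(sa));
		sa.sa_sigaction = handler;
		sa.sa_flags = SA_SIGINFO;
		sigaction(SIGSEGV, &sa, NULL);

		*p = 1;		/* write to a read-only page: PF_WRITE set */
		return 1;
	}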

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-09-24 19:17             ` Andy Lutomirski
@ 2015-09-25  7:16               ` Ingo Molnar
  -1 siblings, 0 replies; 172+ messages in thread
From: Ingo Molnar @ 2015-09-25  7:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, X86 ML, linux-kernel, linux-mm, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Andy Lutomirski, Borislav Petkov,
	Kees Cook


* Andy Lutomirski <luto@amacapital.net> wrote:

> This may mean that we want to have a way for binaries to indicate that they want 
> their --x segments to be loaded with a particular protection key.  The right way 
> to do that might be using an ELF note, and I also want to use ELF notes to allow 
> turning off vsyscalls, so maybe it's time to write an ELF note parser in the 
> kernel.

That would be absolutely lovely for many other reasons as well, and we should also 
add a tool to tools/ to edit/expand/shrink those ELF notes on existing systems.

I.e. make it really easy to augment security policies on an existing distro, using 
any filesystem (not just ACL capable ones) and using the binary only. Linux 
binaries could carry capabilities information, etc. etc.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 10/26] x86, pkeys: notify userspace about protection key faults
  2015-09-25  7:11           ` Ingo Molnar
@ 2015-09-25 23:18             ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-25 23:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Thomas Gleixner

On 09/25/2015 12:11 AM, Ingo Molnar wrote:
>>> > > Btw., how does pkey support interact with hugepages?
>> > 
>> > Surprisingly little.  I've made sure that everything works with huge pages and 
>> > that the (huge) PTEs and VMAs get set up correctly, but I'm not sure I had to 
>> > touch the huge page code at all.  I have test code to ensure that it works the 
>> > same as with small pages, but everything worked pretty naturally.
> Yeah, so the reason I'm asking about expectations is that this code:
> 
> +       follow_ret = follow_pte(tsk->mm, address, &ptep, &ptl);
> +       if (!follow_ret) {
> +               /*
> +                * On a successful follow, make sure to
> +                * drop the lock.
> +                */
> +               pte = *ptep;
> +               pte_unmap_unlock(ptep, ptl);
> +               ret = pte_pkey(pte);
> 
> is visibly hugepage-unsafe: if a vma is hugepage mapped, there are no ptes, only 
> pmds - and the protection key index lives in the pmd. We don't seem to recover 
> that information properly.

You got me on this one.  I assumed that follow_pte() handled huge pages.
 It does not.

But, the code still worked.  Since follow_pte() fails for all huge
pages, it just falls back to pulling the protection key out of the VMA,
which _does_ work for huge pages.

I've actually removed the PTE walking and I just now use the VMA
directly.  I don't see a ton of additional value from walking the page
tables when we can get what we need from the VMA.

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 10/26] x86, pkeys: notify userspace about protection key faults
  2015-09-25 23:18             ` Dave Hansen
@ 2015-09-26  6:20               ` Ingo Molnar
  -1 siblings, 0 replies; 172+ messages in thread
From: Ingo Molnar @ 2015-09-26  6:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Thomas Gleixner


* Dave Hansen <dave@sr71.net> wrote:

> On 09/25/2015 12:11 AM, Ingo Molnar wrote:
> >>> > > Btw., how does pkey support interact with hugepages?
> >> > 
> >> > Surprisingly little.  I've made sure that everything works with huge pages and 
> >> > that the (huge) PTEs and VMAs get set up correctly, but I'm not sure I had to 
> >> > touch the huge page code at all.  I have test code to ensure that it works the 
> >> > same as with small pages, but everything worked pretty naturally.
> > Yeah, so the reason I'm asking about expectations is that this code:
> > 
> > +       follow_ret = follow_pte(tsk->mm, address, &ptep, &ptl);
> > +       if (!follow_ret) {
> > +               /*
> > +                * On a successful follow, make sure to
> > +                * drop the lock.
> > +                */
> > +               pte = *ptep;
> > +               pte_unmap_unlock(ptep, ptl);
> > +               ret = pte_pkey(pte);
> > 
> > is visibly hugepage-unsafe: if a vma is hugepage mapped, there are no ptes, only 
> > pmds - and the protection key index lives in the pmd. We don't seem to recover 
> > that information properly.
> 
> You got me on this one.  I assumed that follow_pte() handled huge pages.
>  It does not.
> 
> But, the code still worked.  Since follow_pte() fails for all huge
> pages, it just falls back to pulling the protection key out of the VMA,
> which _does_ work for huge pages.

That might be true for explicit hugetlb vmas, but what about transparent hugepages 
that can show up in regular vmas?

> I've actually removed the PTE walking and I just now use the VMA directly.  I 
> don't see a ton of additional value from walking the page tables when we can get 
> what we need from the VMA.

That's actually good, because it's also cheap, especially if we can get rid of the 
extra find_vma().

and we (thankfully) have no non-linear vmas to worry about anymore.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 10/26] x86, pkeys: notify userspace about protection key faults
  2015-09-26  6:20               ` Ingo Molnar
@ 2015-09-27 22:39                 ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-27 22:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Thomas Gleixner

On 09/25/2015 11:20 PM, Ingo Molnar wrote:
> * Dave Hansen <dave@sr71.net> wrote:
...
>> Since follow_pte() fails for all huge
>> pages, it just falls back to pulling the protection key out of the VMA,
>> which _does_ work for huge pages.
> 
> That might be true for explicit hugetlb vmas, but what about transparent hugepages 
> that can show up in regular vmas?

All PTEs (large or small) established under a given VMA have the same
protection key.  Any change in protection key for a range will either
change or split the VMA.

So I think it's safe to rely on the VMA entirely.  Well, at least as
safe as the PTE.  It's definitely a wee bit racy, which I'll elaborate
on when I repost the patches.

>> I've actually removed the PTE walking and I just now use the VMA directly.  I 
>> don't see a ton of additional value from walking the page tables when we can get 
>> what we need from the VMA.
> 
> That's actually good, because it's also cheap, especially if we can get rid of the 
> extra find_vma().
> 
> and we (thankfully) have no non-linear vmas to worry about anymore.

Yep.

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 10/26] x86, pkeys: notify userspace about protection key faults
  2015-09-27 22:39                 ` Dave Hansen
@ 2015-09-28  5:59                   ` Ingo Molnar
  -1 siblings, 0 replies; 172+ messages in thread
From: Ingo Molnar @ 2015-09-28  5:59 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Thomas Gleixner


* Dave Hansen <dave@sr71.net> wrote:

> On 09/25/2015 11:20 PM, Ingo Molnar wrote:
> > * Dave Hansen <dave@sr71.net> wrote:
> ...
> >> Since follow_pte() fails for all huge
> >> pages, it just falls back to pulling the protection key out of the VMA,
> >> which _does_ work for huge pages.
> > 
> > That might be true for explicit hugetlb vmas, but what about transparent hugepages 
> > that can show up in regular vmas?
> 
> All PTEs (large or small) established under a given VMA have the same
> protection key. [...]

So a 'pte' only ever maps a small page; the 'large' thing is called a pmd, so 
follow_pte() is not adequate. But with that removed everything should be fine, 
as the vma (protection) flags are size-independent.

> So I think it's safe to rely on the VMA entirely.  Well, at least as safe as the 
> PTE.  It's definitely a wee bit racy, which I'll elaborate on when I repost the 
> patches.

So the race I can see is wrt. mprotect(), and we should fix that, because the 
existing method of recovering the 'page fault reason', error_code, is not racy - 
so the extension of it (the protection key) should not be racy either.

By the time user-space processes the signal we might race with other threads, but 
at least the fault-address/error-reason information itself should be coherent.

This can be solved by getting the protection key while still under the down_read() 
of the vma - instead of your current solution of a second find_vma().
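
Concretely, something like this in the fault path (a sketch against this
series' vma_pkey(); not a drop-in patch):

	down_read(&mm->mmap_sem);
	vma = find_vma(mm, address);	/* the fault path does this anyway */
	/* ... normal fault handling ... */
	if (vma && (error_code & PF_PK))
		si_pkey = vma_pkey(vma);	/* coherent with this fault */
	up_read(&mm->mmap_sem);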

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 10/26] x86, pkeys: notify userspace about protection key faults
  2015-09-24 17:15       ` Dave Hansen
@ 2015-09-28 19:25         ` Christian Borntraeger
  -1 siblings, 0 replies; 172+ messages in thread
From: Christian Borntraeger @ 2015-09-28 19:25 UTC (permalink / raw)
  To: Dave Hansen, Ingo Molnar
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Thomas Gleixner

On 09/24/2015 07:15 PM, Dave Hansen wrote:
> Christian, can you tell us how big s390's storage protection keys are?
> See the discussion below about siginfo...

Dave, sorry for the late answer.
s390 storage keys are 4 bits for the protection key (plus 1 bit each for fetch 
protection, change, and reference) per physical page, so 1 byte is enough for us.

We do not have the storage keys per page-table entry; they are kept per page frame 
instead (shared among all mappers), so I am not sure the whole scheme will fit s390.
Having a signal for page protection errors might be useful for us - not sure yet.

Christian

PS: In the past we worked hard to get rid of storage key usage in Linux and are now using
software reference and change tracking to be closer to what others do, so it's a bit odd to
see others coming up with the same idea ;-)

> 
> On 09/24/2015 02:23 AM, Ingo Molnar wrote:
>>> +static u16 fetch_pkey(unsigned long address, struct task_struct *tsk)
>>> +{
> ...
>>> +		struct vm_area_struct *vma = find_vma(tsk->mm, address);
>>> +		if (vma) {
>>> +			ret = vma_pkey(vma);
>>> +		} else {
>>> +			WARN_ONCE(1, "no PTE or VMA @ %lx\n", address);
>>> +			ret = 0;
>>> +		}
>>> +	}
>>> +	return ret;
>>
>> Yeah, so I have three observations:
>>
>> 1)
>>
>> I don't think this warning is entirely right, because this is a fundamentally racy 
>> op.
>>
>> fetch_pkey(), called by force_sig_info_fault(), can be called while not holding 
>> the vma - and if we race with any other thread of the mm, the vma might be gone 
>> already.
>>
>> So any threaded app using pkeys and vmas in parallel could trigger that WARN_ON().
> 
> Agreed.  I'll remove the warning.
> 
>> 2)
>>
>> And note that this is a somewhat new scenario: in regular page faults, 
>> 'error_code' always carries a then-valid cause of the page fault with itself. So 
>> we can put that into the siginfo and can be sure that it's the reason for the 
>> fault.
>>
>> With the above pkey code, we fetch the pte separately from the fault, and without 
>> synchronizing with the fault - and we cannot do that, nor do we want to.
>>
>> So I think this code should just accept the fact that races may happen. Perhaps 
>> warn if we get here with only a single mm user. (but even that would be a bit racy 
>> as we don't serialize against exit())
> 
> Good point.
> 
>> 3)
>>
>> For user-space that somehow wants to handle pkeys dynamically and drive them via 
>> faults, this seems somewhat inefficient: we already do a find_vma() in the primary 
>> fault lookup - and with the typical pkey usecase it will find a vma, just with the 
>> wrong access permissions. But when we generate the siginfo here, why do we do a 
>> find_vma() again? Why not pass the vma to the siginfo generating function?
> 
> My assumption was that the signal generation case was pretty slow.
> find_vma() is almost guaranteed to hit the vmacache, and we already hold
> mmap_sem, so the cost is pretty tiny.
> 
> I'm happy to change it if you're really concerned, but I didn't think it
> would be worth the trouble of plumbing it down.
> 
>>> --- a/include/uapi/asm-generic/siginfo.h~pkeys-09-siginfo	2015-09-16 10:48:15.584161859 -0700
>>> +++ b/include/uapi/asm-generic/siginfo.h	2015-09-16 10:48:15.592162222 -0700
>>> @@ -95,6 +95,13 @@ typedef struct siginfo {
>>>  				void __user *_lower;
>>>  				void __user *_upper;
>>>  			} _addr_bnd;
>>> +			int _pkey; /* FIXME: protection key value??
>>> +				    * Do we really need this in here?
>>> +				    * userspace can get the PKRU value in
>>> +				    * the signal handler, but they do not
>>> +				    * easily have access to the PKEY value
>>> +				    * from the PTE.
>>> +				    */
>>>  		} _sigfault;
>>
>> A couple of comments:
>>
>> 1)
>>
>> Please use our ABI types - this one should be 'u32' I think.
>>
>> We could use 'u8' as well here, and mark another 3 bytes next to it as reserved 
>> for future flags. Right now protection keys use 4 bits, but do you really think 
>> they'll ever grow beyond 8 bits? PTE bits are a scarce resource in general.
> 
> I don't expect them to get bigger, at least with anything resembling the
> current architecture.  Agreed about the scarcity of PTE bits.
> 
> siginfo.h is shared everywhere, so I'd ideally like to put a type in
> there that all the other architectures can use.
> 
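
For illustration, a user-space consumer of that field would look something
like this (a sketch: SEGV_PKUERR and the si_pkey accessor follow this
patch's proposed naming, neither is ABI yet):

	static void handler(int sig, siginfo_t *si, void *ctx)
	{
		if (si->si_code == SEGV_PKUERR)
			fprintf(stderr, "pkey fault at %p, key %d\n",
				si->si_addr, si->si_pkey);
	}
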
>> 3)
>>
>> Please add suitable self-tests to tools/testing/selftests/x86/ that document 
>> the preferred usage of pkeys, demonstrate all implemented aspects of the new ABI and 
>> provoke a fault and print the resulting siginfo, etc.
>>
>>> @@ -206,7 +214,8 @@ typedef struct siginfo {
>>>  #define SEGV_MAPERR	(__SI_FAULT|1)	/* address not mapped to object */
>>>  #define SEGV_ACCERR	(__SI_FAULT|2)	/* invalid permissions for mapped object */
>>>  #define SEGV_BNDERR	(__SI_FAULT|3)  /* failed address bound checks */
>>> -#define NSIGSEGV	3
>>> +#define SEGV_PKUERR	(__SI_FAULT|4)  /* failed address bound checks */
>>> +#define NSIGSEGV	4
>>
>> You copy & pasted the MPX comment here, it should read something like:
>>
>>    #define SEGV_PKUERR	(__SI_FAULT|4)  /* failed protection keys checks */
> 
> Whoops.  Will fix.
> 


^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 10/26] x86, pkeys: notify userspace about protection key faults
  2015-09-28 19:25         ` Christian Borntraeger
@ 2015-09-28 19:32           ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-09-28 19:32 UTC (permalink / raw)
  To: Christian Borntraeger, Ingo Molnar
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Thomas Gleixner

On 09/28/2015 12:25 PM, Christian Borntraeger wrote:
> We do not have the storage keys per page-table entry; they are kept per page frame 
> instead (shared among all mappers), so I am not sure the whole scheme will fit s390.
> Having a signal for page protection errors might be useful for us - not sure yet.

Ugh, yeah, that's a pretty different architecture.  The stuff we have
here (syscall, VMA flags, etc...) is probably useful to you only for
controlling access to non-shared memory.



^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-09-24 19:10           ` Dave Hansen
@ 2015-10-01 11:17             ` Ingo Molnar
  -1 siblings, 0 replies; 172+ messages in thread
From: Ingo Molnar @ 2015-10-01 11:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Kees Cook


* Dave Hansen <dave@sr71.net> wrote:

> > If yes then this could be a significant security feature / usecase for pkeys: 
> > executable sections of shared libraries and binaries could be mapped with pkey 
> > access disabled. If I read the Intel documentation correctly then that should 
> > be possible.
> 
> Agreed.  I've even heard from some researchers who are interested in this:
> 
> https://www.infsec.cs.uni-saarland.de/wp-content/uploads/sites/2/2014/10/nuernberger2014ccs_disclosure.pdf

So could we try to add an (opt-in) kernel option that enables this transparently 
and automatically for all PROT_EXEC && !PROT_WRITE mappings, without any 
user-space changes and syscalls necessary?

Beyond the security improvement, this would enable this hardware feature on most 
x86 Linux distros automatically, on supported hardware, which is good for testing.

Assuming it boots up fine on a typical distro, i.e. assuming that there are no 
surprises where PROT_READ && PROT_EXEC sections are accessed as data.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-01 11:17             ` Ingo Molnar
@ 2015-10-01 20:39               ` Kees Cook
  -1 siblings, 0 replies; 172+ messages in thread
From: Kees Cook @ 2015-10-01 20:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dave Hansen, x86, LKML, Linux-MM, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Andy Lutomirski, Borislav Petkov

On Thu, Oct 1, 2015 at 4:17 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Dave Hansen <dave@sr71.net> wrote:
>
>> > If yes then this could be a significant security feature / usecase for pkeys:

Which CPUs (will) have pkeys?

>> > executable sections of shared libraries and binaries could be mapped with pkey
>> > access disabled. If I read the Intel documentation correctly then that should
>> > be possible.
>>
>> Agreed.  I've even heard from some researchers who are interested in this:
>>
>> https://www.infsec.cs.uni-saarland.de/wp-content/uploads/sites/2/2014/10/nuernberger2014ccs_disclosure.pdf
>
> So could we try to add an (opt-in) kernel option that enables this transparently
> and automatically for all PROT_EXEC && !PROT_WRITE mappings, without any
> user-space changes and syscalls necessary?

I would like this very much. :)

> Beyond the security improvement, this would enable this hardware feature on most
> x86 Linux distros automatically, on supported hardware, which is good for testing.
>
> Assuming it boots up fine on a typical distro, i.e. assuming that there are no
> surprises where PROT_READ && PROT_EXEC sections are accessed as data.

I can't wait to find out what implicitly expects PROT_READ from
PROT_EXEC mappings. :)

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-01 20:39               ` Kees Cook
@ 2015-10-01 20:45                 ` Andy Lutomirski
  -1 siblings, 0 replies; 172+ messages in thread
From: Andy Lutomirski @ 2015-10-01 20:45 UTC (permalink / raw)
  To: Kees Cook
  Cc: Ingo Molnar, Dave Hansen, x86, LKML, Linux-MM, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Andy Lutomirski, Borislav Petkov

On Thu, Oct 1, 2015 at 1:39 PM, Kees Cook <keescook@google.com> wrote:
> On Thu, Oct 1, 2015 at 4:17 AM, Ingo Molnar <mingo@kernel.org> wrote:
>>
>> * Dave Hansen <dave@sr71.net> wrote:
>>
>>> > If yes then this could be a significant security feature / usecase for pkeys:
>
> Which CPUs (will) have pkeys?
>
>>> > executable sections of shared libraries and binaries could be mapped with pkey
>>> > access disabled. If I read the Intel documentation correctly then that should
>>> > be possible.
>>>
>>> Agreed.  I've even heard from some researchers who are interested in this:
>>>
>>> https://www.infsec.cs.uni-saarland.de/wp-content/uploads/sites/2/2014/10/nuernberger2014ccs_disclosure.pdf
>>
>> So could we try to add an (opt-in) kernel option that enables this transparently
>> and automatically for all PROT_EXEC && !PROT_WRITE mappings, without any
>> user-space changes and syscalls necessary?
>
> I would like this very much. :)
>
>> Beyond the security improvement, this would enable this hardware feature on most
>> x86 Linux distros automatically, on supported hardware, which is good for testing.
>>
>> Assuming it boots up fine on a typical distro, i.e. assuming that there are no
>> surprises where PROT_READ && PROT_EXEC sections are accessed as data.
>
> I can't wait to find out what implicitly expects PROT_READ from
> PROT_EXEC mappings. :)

There's one annoying issue at least:

mprotect_pkey(..., PROT_READ | PROT_EXEC, 0) sets protection key 0.
mprotect_pkey(..., PROT_EXEC, 0) maybe sets protection key 15 or
whatever we use for this.  What does mprotect_pkey(..., PROT_EXEC, 0)
do?  What if the caller actually wants key 0?  What if some CPU vendor
some day implements --x for real?


Also, how do we do mprotect_pkey and say "don't change the key"?
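
To spell the ambiguity out (hypothetical call sites against the proposed
syscall):

	/* Both calls pass key 0 -- the second cannot distinguish
	 * "use the default key" from "use the implicit exec-only key",
	 * whichever the kernel would pick for PROT_EXEC-only. */
	mprotect_pkey(addr, len, PROT_READ | PROT_EXEC, 0);
	mprotect_pkey(addr, len, PROT_EXEC, 0);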

>
> -Kees
>
> --
> Kees Cook
> Chrome OS Security



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-01 20:39               ` Kees Cook
@ 2015-10-01 20:58                 ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-10-01 20:58 UTC (permalink / raw)
  To: Kees Cook, Ingo Molnar
  Cc: x86, LKML, Linux-MM, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Andy Lutomirski, Borislav Petkov

On 10/01/2015 01:39 PM, Kees Cook wrote:
> On Thu, Oct 1, 2015 at 4:17 AM, Ingo Molnar <mingo@kernel.org> wrote:
>> * Dave Hansen <dave@sr71.net> wrote:
>>>> If yes then this could be a significant security feature / usecase for pkeys:
> 
> Which CPUs (will) have pkeys?

It hasn't been announced publicly, so all I can say here is "future ones".

>>>> executable sections of shared libraries and binaries could be mapped with pkey
>>>> access disabled. If I read the Intel documentation correctly then that should
>>>> be possible.
>>>
>>> Agreed.  I've even heard from some researchers who are interested in this:
>>>
>>> https://www.infsec.cs.uni-saarland.de/wp-content/uploads/sites/2/2014/10/nuernberger2014ccs_disclosure.pdf
>>
>> So could we try to add an (opt-in) kernel option that enables this transparently
>> and automatically for all PROT_EXEC && !PROT_WRITE mappings, without any
>> user-space changes and syscalls necessary?
> 
> I would like this very much. :)

I'll go hack something together and see what breaks.


^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-01 20:39               ` Kees Cook
                                 ` (2 preceding siblings ...)
@ 2015-10-01 22:33               ` Dave Hansen
  2015-10-01 22:35                   ` Kees Cook
                                   ` (3 more replies)
  -1 siblings, 4 replies; 172+ messages in thread
From: Dave Hansen @ 2015-10-01 22:33 UTC (permalink / raw)
  To: Kees Cook, Ingo Molnar
  Cc: x86, LKML, Linux-MM, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Andy Lutomirski, Borislav Petkov

[-- Attachment #1: Type: text/plain, Size: 554 bytes --]

On 10/01/2015 01:39 PM, Kees Cook wrote:
> On Thu, Oct 1, 2015 at 4:17 AM, Ingo Molnar <mingo@kernel.org> wrote:
>> So could we try to add an (opt-in) kernel option that enables this transparently
>> and automatically for all PROT_EXEC && !PROT_WRITE mappings, without any
>> user-space changes and syscalls necessary?
> 
> I would like this very much. :)

Here it is in a quite fugly form (well, it's not opt-in).  Init crashes
if I boot with this, though.

I'll see if I can turn it in to a bit more of an opt-in and see what's
actually going wrong.



[-- Attachment #2: pkeys-95-rewire-mprotect-to-use-pkeys.patch --]
[-- Type: text/x-patch, Size: 8255 bytes --]



---

 b/arch/x86/include/asm/fpu/internal.h |    4 ++++
 b/arch/x86/kernel/fpu/core.c          |    4 ++++
 b/arch/x86/kernel/fpu/xstate.c        |   16 +++++++++++++++-
 b/arch/x86/mm/fault.c                 |    8 ++++++--
 b/include/linux/mm_types.h            |    1 +
 b/kernel/fork.c                       |    3 ++-
 b/kernel/sched/core.c                 |    3 +++
 b/mm/mmap.c                           |    8 +++++++-
 b/mm/mprotect.c                       |   27 ++++++++++++++++++++++++++-
 9 files changed, 68 insertions(+), 6 deletions(-)

diff -puN mm/mprotect.c~pkeys-95-rewire-mprotect-to-use-pkeys mm/mprotect.c
--- a/mm/mprotect.c~pkeys-95-rewire-mprotect-to-use-pkeys	2015-10-01 15:21:25.183874598 -0700
+++ b/mm/mprotect.c	2015-10-01 15:28:14.741262888 -0700
@@ -24,6 +24,7 @@
 #include <linux/migrate.h>
 #include <linux/perf_event.h>
 #include <linux/ksm.h>
+#include <linux/debugfs.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
@@ -453,10 +454,34 @@ out:
 	return error;
 }
 
+u32 __read_mostly mprotect_hack_pkey = 1;
+int mprotect_hack_pkey_init(void)
+{
+       debugfs_create_u32("mprotect_hack_pkey",  S_IRUSR | S_IWUSR,
+                       NULL, &mprotect_hack_pkey);
+       return 0;
+}
+late_initcall(mprotect_hack_pkey_init);
+
+int pkey_for_access_protect = 1;
+int pkey_for_write_protect = 2;
 SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
 		unsigned long, prot)
 {
-	return do_mprotect_key(start, len, prot, 0);
+	int ret;
+	unsigned long newprot = prot;
+	u32 pkey_hack = READ_ONCE(mprotect_hack_pkey);
+	u16 pkey = 0;
+
+	if (!pkey_hack)
+		return do_mprotect_key(start, len, prot, 0);
+
+	if ((prot & PROT_EXEC) && !(prot & PROT_WRITE))
+		pkey = pkey_for_access_protect;
+
+	ret = do_mprotect_key(start, len, newprot, pkey);
+
+	return ret;
 }
 
 SYSCALL_DEFINE4(mprotect_key, unsigned long, start, size_t, len,
diff -puN include/linux/mm_types.h~pkeys-95-rewire-mprotect-to-use-pkeys include/linux/mm_types.h
--- a/include/linux/mm_types.h~pkeys-95-rewire-mprotect-to-use-pkeys	2015-10-01 15:21:25.185874687 -0700
+++ b/include/linux/mm_types.h	2015-10-01 15:21:25.227876573 -0700
@@ -486,6 +486,7 @@ struct mm_struct {
 	/* address of the bounds directory */
 	void __user *bd_addr;
 #endif
+	u32 fake_mprotect_pkey;
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff -puN kernel/fork.c~pkeys-95-rewire-mprotect-to-use-pkeys kernel/fork.c
--- a/kernel/fork.c~pkeys-95-rewire-mprotect-to-use-pkeys	2015-10-01 15:21:25.187874777 -0700
+++ b/kernel/fork.c	2015-10-01 15:21:25.228876618 -0700
@@ -927,6 +927,7 @@ static struct mm_struct *dup_mm(struct t
 
 	mm->hiwater_rss = get_mm_rss(mm);
 	mm->hiwater_vm = mm->total_vm;
+	mm->fake_mprotect_pkey = 0;
 
 	if (mm->binfmt && !try_module_get(mm->binfmt->module))
 		goto free_pt;
@@ -1700,7 +1701,7 @@ long _do_fork(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
-
+	//printk("%s()\n", __func__);
 	/*
 	 * Determine whether and which event to report to ptracer.  When
 	 * called from kernel_thread or CLONE_UNTRACED is explicitly
diff -puN arch/x86/kernel/fpu/xstate.c~pkeys-95-rewire-mprotect-to-use-pkeys arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~pkeys-95-rewire-mprotect-to-use-pkeys	2015-10-01 15:21:25.197875226 -0700
+++ b/arch/x86/kernel/fpu/xstate.c	2015-10-01 15:21:25.228876618 -0700
@@ -41,6 +41,17 @@ u64 xfeatures_mask __read_mostly;
 static unsigned int xstate_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
 static unsigned int xstate_sizes[XFEATURE_MAX]   = { [ 0 ... XFEATURE_MAX - 1] = -1};
 static unsigned int xstate_comp_offsets[sizeof(xfeatures_mask)*8];
+void hack_fpstate_for_pkru(struct xregs_state *xstate)
+{
+        void *__pkru;
+        xstate->header.xfeatures |= XFEATURE_MASK_PKRU;
+        __pkru = ((char *)xstate) + xstate_offsets[XFEATURE_PKRU];
+	/*
+	 * Access disable PKEY 1 and
+	 * Write disable PKEY 2
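+	 * (PKRU encodes two bits per key: AD in bit 2*key and WD in
+	 * bit 2*key+1, so 0x24 = bit 2 (AD, key 1) | bit 5 (WD, key 2))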
+	 */
+        *(u32 *)__pkru = 0x00000024;
+}
 
 /*
  * Clear all of the X86_FEATURE_* bits that are unavailable
@@ -321,7 +332,10 @@ static void __init setup_init_fpu_buf(vo
 		init_fpstate.xsave.header.xcomp_bv = (u64)1 << 63 | xfeatures_mask;
 		init_fpstate.xsave.header.xfeatures = xfeatures_mask;
 	}
-
+	{
+		void hack_fpstate_for_pkru(struct xregs_state *xstate);
+		hack_fpstate_for_pkru(&init_fpstate.xsave);
+	}
 	/*
 	 * Init all the features state with header_bv being 0x0
 	 */
diff -puN arch/x86/mm/fault.c~pkeys-95-rewire-mprotect-to-use-pkeys arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-95-rewire-mprotect-to-use-pkeys	2015-10-01 15:21:25.204875540 -0700
+++ b/arch/x86/mm/fault.c	2015-10-01 15:21:25.229876663 -0700
@@ -902,8 +902,10 @@ static inline bool bad_area_access_from_
 {
 	if (!boot_cpu_has(X86_FEATURE_OSPKE))
 		return false;
-	if (error_code & PF_PK)
+	if (error_code & PF_PK) {
+		printk("%s() PF_PK\n", __func__);
 		return true;
+	}
 	/* this checks permission keys on the VMA: */
 	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE)))
 		return true;
@@ -1095,8 +1097,10 @@ access_error(unsigned long error_code, s
 	 * to, for instance, confuse a protection-key-denied
 	 * write with one for which we should do a COW.
 	 */
-	if (error_code & PF_PK)
+	if (error_code & PF_PK) {
+		printk("%s() PF_PK\n", __func__);
 		return 1;
+	}
 	/*
 	 * Make sure to check the VMA so that we do not perform
 	 * faults just to hit a PF_PK as soon as we fill in a
diff -puN arch/x86/kernel/fpu/core.c~pkeys-95-rewire-mprotect-to-use-pkeys arch/x86/kernel/fpu/core.c
--- a/arch/x86/kernel/fpu/core.c~pkeys-95-rewire-mprotect-to-use-pkeys	2015-10-01 15:21:25.207875675 -0700
+++ b/arch/x86/kernel/fpu/core.c	2015-10-01 15:21:25.229876663 -0700
@@ -262,6 +262,10 @@ static void fpu_copy(struct fpu *dst_fpu
 		fpregs_deactivate(src_fpu);
 	}
 	preempt_enable();
+	{
+		void hack_fpstate_for_pkru(struct xregs_state *xstate);
+		hack_fpstate_for_pkru(&dst_fpu->state.xsave);
+	}
 }
 
 int fpu__copy(struct fpu *dst_fpu, struct fpu *src_fpu)
diff -puN arch/x86/include/asm/fpu/internal.h~pkeys-95-rewire-mprotect-to-use-pkeys arch/x86/include/asm/fpu/internal.h
--- a/arch/x86/include/asm/fpu/internal.h~pkeys-95-rewire-mprotect-to-use-pkeys	2015-10-01 15:21:25.209875765 -0700
+++ b/arch/x86/include/asm/fpu/internal.h	2015-10-01 15:21:25.230876707 -0700
@@ -335,6 +335,10 @@ static inline void copy_xregs_to_kernel(
 
 	/* We should never fault when copying to a kernel buffer: */
 	WARN_ON_FPU(err);
+	{
+		void hack_fpstate_for_pkru(struct xregs_state *xstate);
+		hack_fpstate_for_pkru(xstate);
+	}
 }
 
 /*
diff -puN kernel/sched/core.c~pkeys-95-rewire-mprotect-to-use-pkeys kernel/sched/core.c
--- a/kernel/sched/core.c~pkeys-95-rewire-mprotect-to-use-pkeys	2015-10-01 15:21:25.216876079 -0700
+++ b/kernel/sched/core.c	2015-10-01 15:21:25.232876797 -0700
@@ -2644,6 +2644,9 @@ context_switch(struct rq *rq, struct tas
 	/* Here we just switch the register state and the stack. */
 	switch_to(prev, next, prev);
 	barrier();
+	if (read_pkru() && printk_ratelimit()) {
+		printk("pid: %d pkru: 0x%x\n", current->pid, read_pkru());
+	}
 
 	return finish_task_switch(prev);
 }
diff -puN mm/mmap.c~pkeys-95-rewire-mprotect-to-use-pkeys mm/mmap.c
--- a/mm/mmap.c~pkeys-95-rewire-mprotect-to-use-pkeys	2015-10-01 15:21:25.223876393 -0700
+++ b/mm/mmap.c	2015-10-01 15:25:44.327508557 -0700
@@ -1267,6 +1267,8 @@ unsigned long do_mmap(struct file *file,
 			unsigned long flags, vm_flags_t vm_flags,
 			unsigned long pgoff, unsigned long *populate)
 {
+	extern int pkey_for_access_protect;
+	u16 pkey = 0;
 	struct mm_struct *mm = current->mm;
 
 	*populate = 0;
@@ -1311,7 +1313,11 @@ unsigned long do_mmap(struct file *file,
 	 * to. we assume access permissions have been handled by the open
 	 * of the memory object, so we don't do any here.
 	 */
-	vm_flags |= calc_vm_prot_bits(prot, 0) | calc_vm_flag_bits(flags) |
+	if ((prot & PROT_EXEC) && !(prot & PROT_WRITE)) {
+		pkey = pkey_for_access_protect;
+		trace_printk("hacking mmap() to use pkey %d\n", pkey);
+	}
+	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
 			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
 
 	if (flags & MAP_LOCKED)
_
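
For reference, here is how the hardcoded 0x00000024 that
hack_fpstate_for_pkru() writes into PKRU decodes.  PKRU carries two
bits per protection key: bit 2k access-disables (AD) and bit 2k+1
write-disables (WD) key k.  The macros below are illustrative, not
part of the patch:

#define PKRU_AD_BIT(pkey)	(1u << ((pkey) * 2))
#define PKRU_WD_BIT(pkey)	(1u << ((pkey) * 2 + 1))

/* Access-disable pkey 1, write-disable pkey 2: */
unsigned int pkru_val = PKRU_AD_BIT(1) | PKRU_WD_BIT(2);	/* == 0x24 */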

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-01 22:33               ` Dave Hansen
@ 2015-10-01 22:35                   ` Kees Cook
  2015-10-01 22:48                   ` Linus Torvalds
                                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 172+ messages in thread
From: Kees Cook @ 2015-10-01 22:35 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ingo Molnar, x86, LKML, Linux-MM, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Andy Lutomirski, Borislav Petkov

On Thu, Oct 1, 2015 at 3:33 PM, Dave Hansen <dave@sr71.net> wrote:
> On 10/01/2015 01:39 PM, Kees Cook wrote:
>> On Thu, Oct 1, 2015 at 4:17 AM, Ingo Molnar <mingo@kernel.org> wrote:
>>> So could we try to add an (opt-in) kernel option that enables this transparently
>>> and automatically for all PROT_EXEC && !PROT_WRITE mappings, without any
>>> user-space changes and syscalls necessary?
>>
>> I would like this very much. :)
>
> Here it is in a quite fugly form (well, it's not opt-in).  Init crashes
> if I boot with this, though.

*laugh* Okay... well, we've got some work to do, I guess. :)

(And which init?)

> I'll see if I can turn it in to a bit more of an opt-in and see what's
> actually going wrong.

Cool, thanks!

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-01 22:35                   ` Kees Cook
@ 2015-10-01 22:39                     ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-10-01 22:39 UTC (permalink / raw)
  To: Kees Cook
  Cc: Ingo Molnar, x86, LKML, Linux-MM, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Andy Lutomirski, Borislav Petkov

On 10/01/2015 03:35 PM, Kees Cook wrote:
> On Thu, Oct 1, 2015 at 3:33 PM, Dave Hansen <dave@sr71.net> wrote:
>> On 10/01/2015 01:39 PM, Kees Cook wrote:
>>> On Thu, Oct 1, 2015 at 4:17 AM, Ingo Molnar <mingo@kernel.org> wrote:
>>>> So could we try to add an (opt-in) kernel option that enables this transparently
>>>> and automatically for all PROT_EXEC && !PROT_WRITE mappings, without any
>>>> user-space changes and syscalls necessary?
>>>
>>> I would like this very much. :)
>>
>> Here it is in a quite fugly form (well, it's not opt-in).  Init crashes
>> if I boot with this, though.
> 
> *laugh* Okay... well, we've got some work to do, I guess. :)
> 
> (And which init?)

systemd for better or worse.


^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-01 22:33               ` Dave Hansen
@ 2015-10-01 22:48                   ` Linus Torvalds
  2015-10-01 22:48                   ` Linus Torvalds
                                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 172+ messages in thread
From: Linus Torvalds @ 2015-10-01 22:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kees Cook, Ingo Molnar, x86, LKML, Linux-MM, Andrew Morton,
	Peter Zijlstra, Andy Lutomirski, Borislav Petkov

On Thu, Oct 1, 2015 at 6:33 PM, Dave Hansen <dave@sr71.net> wrote:
>
> Here it is in a quite fugly form (well, it's not opt-in).  Init crashes
> if I boot with this, though.
>
> I'll see if I can turn it in to a bit more of an opt-in and see what's
> actually going wrong.

It's quite likely that you will find that compilers put read-only
constants in the text section, knowing that executable means readable.

So it's entirely possible that it's pretty much all over.

That said, I don't understand your patch. Why check PROT_WRITE? We've
had "execute but not write" forever. It's "execute and not *read*"
that is interesting.

So I wonder if your testing is just bogus. But maybe I'm mis-reading this?

                Linus

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-01 22:48                   ` Linus Torvalds
@ 2015-10-01 22:56                     ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-10-01 22:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kees Cook, Ingo Molnar, x86, LKML, Linux-MM, Andrew Morton,
	Peter Zijlstra, Andy Lutomirski, Borislav Petkov

On 10/01/2015 03:48 PM, Linus Torvalds wrote:
> On Thu, Oct 1, 2015 at 6:33 PM, Dave Hansen <dave@sr71.net> wrote:
>>
>> Here it is in a quite fugly form (well, it's not opt-in).  Init crashes
>> if I boot with this, though.
>>
>> I'll see if I can turn it in to a bit more of an opt-in and see what's
>> actually going wrong.
...
> That said, I don't understand your patch. Why check PROT_WRITE? We've
> had :"execute but not write" forever. It's "execute and not *read*"
> that is interesting.

I was thinking that almost anybody doing a PROT_WRITE|PROT_EXEC really
*is* going to write to it so they'll notice pretty fast if we completely
deny them access to it.

Also, a quick ftrace showed that most mmap() callers that set PROT_EXEC
also set PROT_READ.  I'm just assuming that folks are setting PROT_READ
but aren't _really_ going to read it, so we can safely deny them all
access other than exec.

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-01 22:33               ` Dave Hansen
@ 2015-10-01 22:57                   ` Andy Lutomirski
  2015-10-01 22:48                   ` Linus Torvalds
                                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 172+ messages in thread
From: Andy Lutomirski @ 2015-10-01 22:57 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kees Cook, Ingo Molnar, x86, LKML, Linux-MM, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Andy Lutomirski, Borislav Petkov,
	Paolo Bonzini, kvm list

On Thu, Oct 1, 2015 at 3:33 PM, Dave Hansen <dave@sr71.net> wrote:
> On 10/01/2015 01:39 PM, Kees Cook wrote:
>> On Thu, Oct 1, 2015 at 4:17 AM, Ingo Molnar <mingo@kernel.org> wrote:
>>> So could we try to add an (opt-in) kernel option that enables this transparently
>>> and automatically for all PROT_EXEC && !PROT_WRITE mappings, without any
>>> user-space changes and syscalls necessary?
>>
>> I would like this very much. :)
>
> Here it is in a quite fugly form (well, it's not opt-in).  Init crashes
> if I boot with this, though.

Somebody really ought to rework things so that a crash in init prints
out a normal indication of the unhandled signal and optionally leaves
everything else running.

Also...

EPT seems to have separate R, W, and X flags.  I wonder if it would
make sense to add a KVM paravirt feature that maps the entire guest
physical space an extra time at a monstrous offset with R cleared in
the EPT and passes through a #PF or other notification (KVM-specific
thing? #VE?) on a read fault.

This wouldn't even need a whole duplicate paging hierarchy -- it would
just duplicate the EPT PML4 entries, so it would add exactly zero
runtime memory usage.

The guest would use it by treating the high bit of the physical
address as a "may read" bit.

This reminds me -- we should probably wire up X86_TRAP_VE with a stub
that OOPSes until someone figures out some more useful thing to do.
We're probably not doing anyone any favors by unconditionally
promoting them to double-faults.

--Andy

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-01 22:56                     ` Dave Hansen
@ 2015-10-02  1:38                       ` Linus Torvalds
  -1 siblings, 0 replies; 172+ messages in thread
From: Linus Torvalds @ 2015-10-02  1:38 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kees Cook, Ingo Molnar, x86, LKML, Linux-MM, Andrew Morton,
	Peter Zijlstra, Andy Lutomirski, Borislav Petkov

On Thu, Oct 1, 2015 at 6:56 PM, Dave Hansen <dave@sr71.net> wrote:
>
> Also, a quick ftrace showed that most mmap() callers that set PROT_EXEC
> also set PROT_READ.  I'm just assuming that folks are setting PROT_READ
> but aren't _really_ going to read it, so we can safely deny them all
> access other than exec.

That's a completely insane assumption. There are tons of reasons to
have code and read-only data in the same segment, and it's very
traditional. Just assuming that you only execute out of something that
has PROT_EXEC | PROT_READ is insane.

No, what you *should* look at is to use the protection keys to
actually enforce a plain PROT_EXEC. That has never worked before
(because traditionally R implies X, and then we got NX).

That would at least allow people who know they don't intersperse
read-only constants in the code to use PROT_EXEC only.

Of course, there may well be users who use PROT_EXEC that actually *do*
do reads, and just relied on the old hardware behavior. So it's not
guaranteed to work either without any extra flags. But at least it's
worth a try, unlike the "yeah, the user asked for read, but the user
doesn't know what he's doing" thinking that is just crazy talk.

           Linus
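
For what it's worth, enforcing a plain PROT_EXEC with a protection key
ends up looking roughly like this from user space.  This sketch uses
the pkey_alloc()/pkey_mprotect() interface that eventually landed
upstream (Linux 4.9, glibc 2.27), not the syscalls proposed in this
series; instruction fetches are not checked against PKRU, so an
access-disabled key yields a true --x mapping:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
	size_t len = 4096;
	/* Stand-in for a page we fill with generated code: */
	void *code = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int pkey;

	if (code == MAP_FAILED)
		return 1;
	/* ... emit code into the buffer here ... */

	/* Allocate a key whose rights deny all data access: */
	pkey = pkey_alloc(0, PKEY_DISABLE_ACCESS);
	if (pkey < 0) {
		perror("pkey_alloc");
		return 1;
	}
	/* Execute-only: data reads of *code now fault with PF_PK set. */
	if (pkey_mprotect(code, len, PROT_EXEC, pkey) < 0) {
		perror("pkey_mprotect");
		return 1;
	}
	return 0;
}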

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-01 22:33               ` Dave Hansen
@ 2015-10-02  6:09                   ` Ingo Molnar
  2015-10-01 22:48                   ` Linus Torvalds
                                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 172+ messages in thread
From: Ingo Molnar @ 2015-10-02  6:09 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kees Cook, x86, LKML, Linux-MM, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Andy Lutomirski, Borislav Petkov


* Dave Hansen <dave@sr71.net> wrote:

> On 10/01/2015 01:39 PM, Kees Cook wrote:
> > On Thu, Oct 1, 2015 at 4:17 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >> So could we try to add an (opt-in) kernel option that enables this transparently
> >> and automatically for all PROT_EXEC && !PROT_WRITE mappings, without any
> >> user-space changes and syscalls necessary?
> > 
> > I would like this very much. :)
> 
> Here it is in a quite fugly form (well, it's not opt-in).  Init crashes
> if I boot with this, though.
> 
> I'll see if I can turn it in to a bit more of an opt-in and see what's
> actually going wrong.

So the reality of modern Linux distros is that, according to some limited 
strace-ing around, pure PROT_EXEC usage does not seem to exist: 99% of executable 
mappings are mapped via PROT_EXEC|PROT_READ.

So the most usable kernel testing approach would be to enable these types of pkeys 
for a child task via some mechanism and inherit it to all children (including 
inheriting it over non-suid exec) - but not to any other task.

You could hijack a new personality bit just for debug purposes - see the (totally 
untested) patch below.

Depending on user-space's assumptions it might not end up being anything usable we 
can apply, but it would be a great testing tool if it worked to a certain degree.

I.e. allow the system to boot up without pkeys set for any task, then set the
personality of a shell process to PER_LINUX_PKEYS and see which binaries (if any!)
will start up without segfaulting.

This way you don't have to debug SystemD, which is extremely fragile and 
passive-aggressive towards kernels that don't behave in precisely the fashion 
under which SystemD is being developed.

Thanks,

	Ingo

========>

Absolutely-Not-Signed-off-by: Ingo Molnar <mingo@kernel.org>

 include/uapi/linux/personality.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h
index aa169c4339d2..bead47213419 100644
--- a/include/uapi/linux/personality.h
+++ b/include/uapi/linux/personality.h
@@ -8,6 +8,7 @@
  * These occupy the top three bytes.
  */
 enum {
+	PROT_READ_EXEC_HACK =	0x0010000,	/* PROT_READ|PROT_EXEC == PROT_EXEC hack */
 	UNAME26	=               0x0020000,
 	ADDR_NO_RANDOMIZE = 	0x0040000,	/* disable randomization of VA space */
 	FDPIC_FUNCPTRS =	0x0080000,	/* userspace function ptrs point to descriptors
@@ -41,6 +42,7 @@ enum {
 enum {
 	PER_LINUX =		0x0000,
 	PER_LINUX_32BIT =	0x0000 | ADDR_LIMIT_32BIT,
+	PER_LINUX_PKEYS =	0x0000 | PROT_READ_EXEC_HACK,
 	PER_LINUX_FDPIC =	0x0000 | FDPIC_FUNCPTRS,
 	PER_SVR4 =		0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO,
 	PER_SVR3 =		0x0002 | STICKY_TIMEOUTS | SHORT_INODE,

^ permalink raw reply related	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-01 20:45                 ` Andy Lutomirski
@ 2015-10-02  6:23                   ` Ingo Molnar
  -1 siblings, 0 replies; 172+ messages in thread
From: Ingo Molnar @ 2015-10-02  6:23 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Dave Hansen, x86, LKML, Linux-MM, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Andy Lutomirski, Borislav Petkov


* Andy Lutomirski <luto@amacapital.net> wrote:

> >> Assuming it boots up fine on a typical distro, i.e. assuming that there are no
> >> surprises where PROT_READ && PROT_EXEC sections are accessed as data.
> >
> > I can't wait to find out what implicitly expects PROT_READ from
> > PROT_EXEC mappings. :)

So what seems to happen is that there are no pure PROT_EXEC mappings in practice - 
they are only omnibus PROT_READ|PROT_EXEC mappings, an unknown proportion of which 
truly relies on PROT_READ:

  $ for C in firefox ls perf libreoffice google-chrome Xorg xterm \
      konsole; do echo; echo "# $C:"; strace -e trace=mmap -f $C -h 2>&1 | cut -d, -f3 | \
      grep PROT | sort | uniq -c; done

# firefox:
     13  PROT_READ
     82  PROT_READ|PROT_EXEC
    184  PROT_READ|PROT_WRITE
      2  PROT_READ|PROT_WRITE|PROT_EXEC

# ls:
      2  PROT_READ
      7  PROT_READ|PROT_EXEC
     17  PROT_READ|PROT_WRITE

# perf:
      1  PROT_READ
     20  PROT_READ|PROT_EXEC
     44  PROT_READ|PROT_WRITE

# libreoffice:
      2  PROT_NONE
     87  PROT_READ
    148  PROT_READ|PROT_EXEC
    339  PROT_READ|PROT_WRITE

# google-chrome:
     39  PROT_READ
    121  PROT_READ|PROT_EXEC
    345  PROT_READ|PROT_WRITE

# Xorg:
      1  PROT_READ
     22  PROT_READ|PROT_EXEC
     39  PROT_READ|PROT_WRITE

# xterm:
      1  PROT_READ
     25  PROT_READ|PROT_EXEC
     46  PROT_READ|PROT_WRITE

# konsole:
      1  PROT_READ
    101  PROT_READ|PROT_EXEC
    175  PROT_READ|PROT_WRITE

So whatever kernel side method we come up with, it's not something that I expect 
to become production quality. "Proper" conversion to pkeys has to be driven from 
the user-space side.

That does not mean we can not try! :-)

> There's one annoying issue at least:
> 
> mprotect_pkey(..., PROT_READ | PROT_EXEC, 0) sets protection key 0.
> mprotect_pkey(..., PROT_EXEC, 0) maybe sets protection key 15 or
> whatever we use for this.  What does mprotect_pkey(..., PROT_EXEC, 0)
> do?  What if the caller actually wants key 0?  What if some CPU vendor
> some day implements --x for real?

That comes from the hardcoded "user-space has 4 bits to itself, not managed by the 
kernel" assumption in the whole design. So no layering between different 
user-space libraries using pkeys in a different fashion, no transparent kernel use 
of pkeys (such as it may be), etc.

I'm not sure it's _worth_ managing these 4 bits, but '16 separate keys' does seem
to me to be above a certain resource threshold that should be more explicitly
managed than telling user-space: "it's all yours!".

> Also, how do we do mprotect_pkey and say "don't change the key"?

So if we start managing keys as a resource (i.e. alloc/free up to 16 of them), and 
provide APIs for user-space to do all that, then user-space is not supposed to 
touch keys it has not allocated for itself - just like it's not supposed to write 
to fds it has not opened.

Such an allocation method can still 'mess up', and if the kernel allocates a key 
for its purposes it should not assume that user-space cannot change it, but at 
least for non-buggy code there's no interaction and it would work out fine.

Thanks,

	Ingo
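
(For reference, the pkey_mprotect() interface that eventually landed
upstream answers the "don't change the key" question by convention:
passing a pkey of -1 makes it behave exactly like plain mprotect(),
leaving the default key assignment alone:

pkey_mprotect(addr, len, PROT_READ | PROT_EXEC, -1);	/* == mprotect() */

Any other value assigns that key explicitly.)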

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-01 22:48                   ` Linus Torvalds
@ 2015-10-02  7:09                     ` Ingo Molnar
  -1 siblings, 0 replies; 172+ messages in thread
From: Ingo Molnar @ 2015-10-02  7:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Kees Cook, x86, LKML, Linux-MM, Andrew Morton,
	Peter Zijlstra, Andy Lutomirski, Borislav Petkov


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Oct 1, 2015 at 6:33 PM, Dave Hansen <dave@sr71.net> wrote:
> >
> > Here it is in a quite fugly form (well, it's not opt-in).  Init crashes if I 
> > boot with this, though.
> >
> > I'll see if I can turn it in to a bit more of an opt-in and see what's 
> > actually going wrong.
> 
> It's quite likely that you will find that compilers put read-only constants in 
> the text section, knowing that executable means readable.

At least with pkeys enabling true --x mappings, that compiler practice becomes a 
(mild) security problem: it provides a readable and executable return target for 
stack/buffer overflow attacks - FWIW. (It's a limited concern because the true 
code areas are executable already.)

I'd expect such readonly data to eventually move out into the regular data 
sections, the moment the kernel gives a tool to distros to enforce true PROT_EXEC 
mappings.

> So it's entirely possible that it's pretty much all over.

I'd expect that too.

> That said, I don't understand your patch. Why check PROT_WRITE? We've had
> :"execute but not write" forever. It's "execute and not *read*" that is
> interesting.

Yeah, but almost none of user-space seems to be using it.

> So I wonder if your testing is just bogus. But maybe I'm mis-reading this?

I don't think you are mis-reading it: my (hacky! bad! not signed off!) debug idea 
was to fudge PROT_EXEC|PROT_READ bits into pure PROT_EXEC only - at least to get 
pkeys used in a much more serious fashion than standalone testcases, without 
having to change the distro itself.

You are probably right that true data reads from executable sections are very 
common, so this might not be a viable technique even for testing purposes.

But worth a try.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-01 22:48                   ` Linus Torvalds
@ 2015-10-02 11:49                     ` Paolo Bonzini
  -1 siblings, 0 replies; 172+ messages in thread
From: Paolo Bonzini @ 2015-10-02 11:49 UTC (permalink / raw)
  To: Linus Torvalds, Dave Hansen
  Cc: Kees Cook, Ingo Molnar, x86, LKML, Linux-MM, Andrew Morton,
	Peter Zijlstra, Andy Lutomirski, Borislav Petkov



On 02/10/2015 00:48, Linus Torvalds wrote:
> It's quite likely that you will find that compilers put read-only
> constants in the text section, knowing that executable means readable.

Not on x86 (because it has large immediates; RISC machines and s390 do
put large constants in the text section).

But at the very least jump tables reside in the .text section.

Paolo

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-02 11:49                     ` Paolo Bonzini
@ 2015-10-02 11:58                       ` Linus Torvalds
  -1 siblings, 0 replies; 172+ messages in thread
From: Linus Torvalds @ 2015-10-02 11:58 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Dave Hansen, Kees Cook, Ingo Molnar, x86, LKML, Linux-MM,
	Andrew Morton, Peter Zijlstra, Andy Lutomirski, Borislav Petkov

On Fri, Oct 2, 2015 at 7:49 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 02/10/2015 00:48, Linus Torvalds wrote:
>> It's quite likely that you will find that compilers put read-only
>> constants in the text section, knowing that executable means readable.
>
> Not on x86 (because it has large immediates; RISC machines and s390 do
> put large constants in the text section).
>
> But at the very least jump tables reside in the .text section.

Yes, at least traditionally gcc put things like the jump tables for
switch() statements immediately next to the code. That caused lots of
pain on the P4, where the L1 I$ and D$ were exclusive. I think that
caused gcc to then put the jump tables further away, and it might be
in a separate section these days - but it might also just be
"sufficiently aligned" that the L1 cache issue isn't in play any more.

Anyway, because of the P4 exclusive L1 I/D$ issue we can pretty much
rest easy knowing that the data accesses and text accesses should be
separated by at least one cacheline (maybe even 128 bytes - I think
the L1 used 64-byte line size, but it was sub-sections of a 128-byte
bigger line - but that might have been in the L2 only).

But I could easily see the compiler/linker still putting them in the
same ELF segment.

              Linus

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-02 11:58                       ` Linus Torvalds
@ 2015-10-02 12:14                         ` Paolo Bonzini
  -1 siblings, 0 replies; 172+ messages in thread
From: Paolo Bonzini @ 2015-10-02 12:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Kees Cook, Ingo Molnar, x86, LKML, Linux-MM,
	Andrew Morton, Peter Zijlstra, Andy Lutomirski, Borislav Petkov



On 02/10/2015 13:58, Linus Torvalds wrote:
> On Fri, Oct 2, 2015 at 7:49 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 02/10/2015 00:48, Linus Torvalds wrote:
>>> It's quite likely that you will find that compilers put read-only
>>> constants in the text section, knowing that executable means readable.
>>
>> Not on x86 (because it has large immediates; RISC machines and s390 do
>> put large constants in the text section).
>>
> >> But at the very least jump tables reside in the .text section.
> 
> Yes, at least traditionally gcc put things like the jump tables for
> switch() statements immediately next to the code. That caused lots of
> pain on the P4, where the L1 I$ and D$ were exclusive. I think that
> caused gcc to then put the jump tables further away, and it might be
> in a separate section these days - but it might also just be
> "sufficiently aligned" that the L1 cache issue isn't in play any more.
> 
> Anyway, because of the P4 exclusive L1 I/D$ issue we can pretty much
> rest easy knowing that the data accesses and text accesses should be
> separated by at least one cacheline (maybe even 128 bytes - I think
> the L1 used 64-byte line size, but it was sub-sections of a 128-byte
> bigger line - but that might have been in the L2 only).
> 
> But I could easily see the compiler/linker still putting them in the
> same ELF segment.

You're entirely right, it puts them in .rodata actually.  But .rodata is
in the same segment as .text:

$ readelf --segments /bin/true
...
 Section to Segment mapping:
  Segment Sections...
   00     
   01     .interp 
   02     .interp .note.ABI-tag .note.gnu.build-id .gnu.hash .dynsym
          .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init
          .plt .text .fini .rodata .eh_frame_hdr .eh_frame 
   03     .init_array .fini_array .jcr .data.rel.ro .dynamic .got .data .bss 
   04     .dynamic 
   05     .note.ABI-tag .note.gnu.build-id 
   06     .eh_frame_hdr 
   07     
   08     .init_array .fini_array .jcr .data.rel.ro .dynamic .got 


Paolo

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-02  6:23                   ` Ingo Molnar
@ 2015-10-02 17:50                     ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-10-02 17:50 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski
  Cc: Kees Cook, x86, LKML, Linux-MM, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Andy Lutomirski, Borislav Petkov

On 10/01/2015 11:23 PM, Ingo Molnar wrote:
>> > Also, how do we do mprotect_pkey and say "don't change the key"?
> So if we start managing keys as a resource (i.e. alloc/free up to 16 of them), and 
> provide APIs for user-space to do all that, then user-space is not supposed to 
> touch keys it has not allocated for itself - just like it's not supposed to write 
> to fds it has not opened.

I like that.  It gives us at least a "soft" indicator to userspace about
what keys it should or shouldn't be using.

> Such an allocation method can still 'mess up', and if the kernel allocates a key 
> for its purposes it should not assume that user-space cannot change it, but at 
> least for non-buggy code there's no interaction and it would work out fine.

Yeah.  It also provides a clean interface so that future hardware could
enforce kernel "ownership" of a key which could protect against
even buggy code.

So, we add a pair of syscalls,

	unsigned long sys_alloc_pkey(unsigned long flags??)
	unsigned long sys_free_pkey(unsigned long pkey)

keep the metadata in the mm, and then make sure that userspace allocated
it before it is allowed to do an mprotect_pkey() with it.

mprotect_pkey(addr, flags, pkey)
{
	if (!(mm->pkeys_allocated & (1 << pkey)))
		return -EINVAL;
}

That should be pretty easy to implement.  The only real overhead is the
16 bits we need to keep in the mm somewhere.

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-02  1:38                       ` Linus Torvalds
@ 2015-10-02 18:08                         ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-10-02 18:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kees Cook, Ingo Molnar, x86, LKML, Linux-MM, Andrew Morton,
	Peter Zijlstra, Andy Lutomirski, Borislav Petkov

On 10/01/2015 06:38 PM, Linus Torvalds wrote:
> On Thu, Oct 1, 2015 at 6:56 PM, Dave Hansen <dave@sr71.net> wrote:
>>
>> Also, a quick ftrace showed that most mmap() callers that set PROT_EXEC
>> also set PROT_READ.  I'm just assuming that folks are setting PROT_READ
>> but aren't _really_ going to read it, so we can safely deny them all
>> access other than exec.
> 
> That's a completely insane assumption. There are tons of reasons to
> have code and read-only data in the same segment, and it's very
> traditional. Just assuming that you only execute out of something that
> has PROT_EXEC | PROT_READ is insane.

Yes, it's insane, and I confirmed that ld.so actually reads some stuff
out of the first page of the r-x part of the executable.

But, it did find a bug in my code where I wouldn't allow instruction
fetches to fault in pages in a pkey-protected area, so it wasn't a
completely worthless exercise.

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-02 12:14                         ` Paolo Bonzini
@ 2015-10-03  6:46                           ` Ingo Molnar
  -1 siblings, 0 replies; 172+ messages in thread
From: Ingo Molnar @ 2015-10-03  6:46 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Linus Torvalds, Dave Hansen, Kees Cook, x86, LKML, Linux-MM,
	Andrew Morton, Peter Zijlstra, Andy Lutomirski, Borislav Petkov


* Paolo Bonzini <pbonzini@redhat.com> wrote:

> 
> 
> On 02/10/2015 13:58, Linus Torvalds wrote:
> > On Fri, Oct 2, 2015 at 7:49 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >> On 02/10/2015 00:48, Linus Torvalds wrote:
> >>> It's quite likely that you will find that compilers put read-only
> >>> constants in the text section, knowing that executable means readable.
> >>
> >> Not on x86 (because it has large immediates; RISC machines and s390 do
> >> put large constants in the text section).
> >>
> >> But at the very least jump tables reside in the .text section.
> > 
> > Yes, at least traditionally gcc put things like the jump tables for
> > switch() statements immediately next to the code. That caused lots of
> > pain on the P4, where the L1 I$ and D$ were exclusive. I think that
> > caused gcc to then put the jump tables further away, and it might be
> > in a separate section these days - but it might also just be
> > "sufficiently aligned" that the L1 cache issue isn't in play any more.
> > 
> > Anyway, because of the P4 exclusive L1 I/D$ issue we can pretty much
> > rest easy knowing that the data accesses and text accesses should be
> > separated by at least one cacheline (maybe even 128 bytes - I think
> > the P4 used a 64-byte line size, but it was sub-sections of a 128-byte
> > bigger line - but that might have been in the L2 only).
> > 
> > But I could easily see the compiler/linker still putting them in the
> > same ELF segment.
> 
> You're entirely right, it puts them in .rodata actually.  But .rodata is
> in the same segment as .text:
> 
> $ readelf --segments /bin/true
> ...
>  Section to Segment mapping:
>   Segment Sections...
>    00     
>    01     .interp 
>    02     .interp .note.ABI-tag .note.gnu.build-id .gnu.hash .dynsym
>           .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init
>           .plt .text .fini .rodata .eh_frame_hdr .eh_frame 
>    03     .init_array .fini_array .jcr .data.rel.ro .dynamic .got .data .bss 
>    04     .dynamic 
>    05     .note.ABI-tag .note.gnu.build-id 
>    06     .eh_frame_hdr 
>    07     
>    08     .init_array .fini_array .jcr .data.rel.ro .dynamic .got 

Is there an easy(-ish) way (i.e. using compiler/linker flags, not linker scripts) 
to build the ELF binary in such a way that non-code data:

          .rodata .eh_frame_hdr .eh_frame 

... gets put into a separate (readonly and non-executable) segment? That would 
enable things from the distro side AFAICS, right?

(assuming I'm reading the ELF dump right.)

Or does this need binutils surgery?
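
For reference, one self-contained way to see which PT_LOAD segments a
running binary actually got, and with which permissions, is to walk its
own program headers via the aux vector (standard ELF/glibc interfaces,
nothing from the patch set):

#include <elf.h>
#include <link.h>
#include <stdio.h>
#include <sys/auxv.h>

int main(void)
{
	/* AT_PHDR/AT_PHNUM describe this executable's program headers */
	ElfW(Phdr) *phdr = (ElfW(Phdr) *)getauxval(AT_PHDR);
	unsigned long i, phnum = getauxval(AT_PHNUM);

	for (i = 0; i < phnum; i++) {
		if (phdr[i].p_type != PT_LOAD)
			continue;
		/* an 'r-x' line here is .text and .rodata sharing a segment */
		printf("LOAD vaddr %#10lx size %#8lx  %c%c%c\n",
		       (unsigned long)phdr[i].p_vaddr,
		       (unsigned long)phdr[i].p_memsz,
		       (phdr[i].p_flags & PF_R) ? 'r' : '-',
		       (phdr[i].p_flags & PF_W) ? 'w' : '-',
		       (phdr[i].p_flags & PF_X) ? 'x' : '-');
	}
	return 0;
}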

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-02  7:09                     ` Ingo Molnar
@ 2015-10-03  6:59                       ` Ingo Molnar
  -1 siblings, 0 replies; 172+ messages in thread
From: Ingo Molnar @ 2015-10-03  6:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Kees Cook, x86, LKML, Linux-MM, Andrew Morton,
	Peter Zijlstra, Andy Lutomirski, Borislav Petkov


* Ingo Molnar <mingo@kernel.org> wrote:

> > It's quite likely that you will find that compilers put read-only constants in 
> > the text section, knowing that executable means readable.
> 
> At least with pkeys enabling true --x mappings, that compiler practice becomes a 
> (mild) security problem: it provides a readable and executable return target for 
> stack/buffer overflow attacks - FWIW. (It's a limited concern because the true 
> code areas are executable already.)

Btw., it's not just security, there will also be a robustness advantage to creating 
true PROT_EXEC mappings: right now, if buggy user-space code accidentally 
references into an executable section (say, via a negative index into a table put 
into .rodata), the code will not crash; it will happily read from the .text area.

But if we mapped .text with true PROT_EXEC (and the CPU enforced that) then we'd 
get a nice segfault.

This has additional security benefits as well, beyond not providing readable ROP 
sites - which in fact look more significant than the ROP readability angle I 
mentioned initially.

So to sum it up, if we use true --x (non-readable PROT_EXEC) mappings using pkeys, 
we get the following benefits:

 - Overflows and other out of bounds accesses from .rodata (and other data
   sections near .text) will be caught by the CPU instead of silent data flow 
   corruption. This has robustness (and thus security) advantages.

 - True --x code is not readable, thus not 'soft-discoverable' via information 
   leaks for ROP purposes.

 - The version fingerprinting of unknown remote target binaries via information 
   leaks becomes harder as well.

 - The local (and remote) guessing of ASLR offsets via information leaks gets
   harder as well.

 - We get to test pkeys much more seriously than the opt-in special uses! :-)

Intel sent me pkeys test hardware, so I can give it a go in practice as well and 
see how well it works.
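
Concretely, the 'transparent' use means nothing more than a plain
execute-only mapping request; a sketch (whether the kernel backs pure
PROT_EXEC with a pkey is exactly what is being proposed above - today
such a mapping is silently readable on x86):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	int fd = open("/bin/true", O_RDONLY);
	void *code;

	if (fd < 0)
		return 1;
	/* pure PROT_EXEC, no PROT_READ: with pkeys-backed mappings the
	 * CPU would fault on any data access to this region */
	code = mmap(NULL, 4096, PROT_EXEC, MAP_PRIVATE, fd, 0);
	if (code == MAP_FAILED)
		perror("mmap");
	return 0;
}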

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-02 17:50                     ` Dave Hansen
@ 2015-10-03  7:27                       ` Ingo Molnar
  -1 siblings, 0 replies; 172+ messages in thread
From: Ingo Molnar @ 2015-10-03  7:27 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Kees Cook, x86, LKML, Linux-MM, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Andy Lutomirski, Borislav Petkov


* Dave Hansen <dave@sr71.net> wrote:

> On 10/01/2015 11:23 PM, Ingo Molnar wrote:
> >> > Also, how do we do mprotect_pkey and say "don't change the key"?
> >
> > So if we start managing keys as a resource (i.e. alloc/free up to 16 of them), 
> > and provide APIs for user-space to do all that, then user-space is not 
> > supposed to touch keys it has not allocated for itself - just like it's not 
> > supposed to write to fds it has not opened.
> 
> I like that.  It gives us at least a "soft" indicator to userspace about what 
> keys it should or shouldn't be using.

Yes. A 16-bit allocation bitmap would solve this nicely.

> > Such an allocation method can still 'mess up', and if the kernel allocates a key 
> > for its purposes it should not assume that user-space cannot change it, but at 
> > least for non-buggy code there's no interaction and it would work out fine.
> 
> Yeah.  It also provides a clean interface so that future hardware could
> enforce kernel "ownership" of a key which could protect against
> even buggy code.
> 
> So, we add a pair of syscalls,
> 
> 	unsigned long sys_alloc_pkey(unsigned long flags??)
> 	unsigned long sys_free_pkey(unsigned long pkey)
> 
> keep the metadata in the mm, and then make sure that userspace allocated
> it before it is allowed to do an mprotect_pkey() with it.

Yeah, so such an interface would allow the clean, transparent usage of pkeys for 
pure PROT_EXEC mappings.

I'd expect the --x/PROT_EXEC mappings to be _by far_ more frequently used than 
pure pkeys - but we still need the management interface to keep the kernel's use 
of pkeys separate from user-space's use.

If all the necessary tooling changes are propagated through then in fact I'd 
expect every pkeys capable Linux system to use pkeys, for almost every user-space 
task.

To have maximum future flexibility for pkeys I'd suggest the following additional 
changes to the syscall ABI:

 - Please name them with a pkey_ prefix, along the sys_pkey_* nomenclature, so 
   that it becomes an easily identified 'family' of system calls.

 - I'd also suggest providing an initial value with the 'alloc' call. It's true 
   that user-space can do this itself in assembly, OTOH there's no reason not to 
   provide a C interface for this.

 - Make the pkey identifier 'int', not 'long', like fds are. There's very little
   expectation to ever have more than 4 billion pkeys per mm, right?

 - How far do we want the kernel to manage this? Any reason we don't want a
   'set pkey' operation, if user-space wants to use pure C interfaces? That could 
   be vDSO accelerated as well, to use the unprivileged op. An advantage of such
   an interface would be that it would enable the kernel to more actively manage
   the actual mappings as well in the future: for example to automatically not
   allow accidental RWX mappings. Such an interface would also allow the future
   introduction of privileged pkey mappings on the hardware side, without having
   to change user-space, since everything goes via the kernel interface.

 - Along similar considerations, also add a sys_pkey_query() system call to query 
   the mapping of a specific pkey. (returns -EBADF or so if the key is not mapped
   at the moment.) This too could be vDSO accelerated in the future.

I.e. something like:

     unsigned long sys_pkey_alloc (unsigned long flags, unsigned long init_val)
     unsigned long sys_pkey_set   (int pkey, unsigned long new_val)
     unsigned long sys_pkey_get   (int pkey)
     unsigned long sys_pkey_free  (int pkey)

Optional suggestion:

 - _Maybe_ also allow the 'remote managed' setup of pkeys: of non-local tasks - 
   but I'm not sure about that: it looks expensive and complex, and a TID argument 
   can always be added later if there's some real need.

> That should be pretty easy to implement.  The only real overhead is the 16 bits 
> we need to keep in the mm somewhere.

Yes.

Note that if we use the C syscall interface suggestions I outlined above, we could 
in the future also change to have a full table, and manage it explicitly - without 
user-space changes - if the hardware side is tweaked to allow kernel side pkeys.
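
To make the shape of that interface concrete, user-space usage might
look like the sketch below. Everything here is hypothetical - the
prototypes mirror the proposal in this mail, while the wrapper names,
flag value and helper are invented for illustration (it compiles as a
unit, but links only once such syscall wrappers exist):

#include <sys/mman.h>

/* hypothetical wrappers around the proposed syscalls */
long pkey_alloc(unsigned long flags, unsigned long init_val);
long pkey_set(int pkey, unsigned long new_val);
long pkey_free(int pkey);
long mprotect_pkey(void *addr, unsigned long len, int prot, int pkey);

#define PKEY_DENY_WRITE	0x2	/* invented flag value */

static long protect_region(void *buf, unsigned long len)
{
	long pkey = pkey_alloc(0, PKEY_DENY_WRITE);

	if (pkey < 0)
		return pkey;		/* all 16 keys taken */
	if (mprotect_pkey(buf, len, PROT_READ | PROT_WRITE, pkey) < 0) {
		pkey_free(pkey);
		return -1;
	}
	/* ... later, re-enable writes for this thread only: */
	pkey_set(pkey, 0);
	return pkey;
}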

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-09-24  9:49         ` Ingo Molnar
@ 2015-10-03  8:17           ` Ingo Molnar
  -1 siblings, 0 replies; 172+ messages in thread
From: Ingo Molnar @ 2015-10-03  8:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Kees Cook,
	Brian Gerst


* Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Dave Hansen <dave@sr71.net> wrote:
> 
> > > Another question, related to enumeration as well: I'm wondering whether 
> > > there's any way for the kernel to allocate a bit or two for its own purposes - 
> > > such as protecting crypto keys? Or is the facility fundamentally intended for 
> > > user-space use only?
> > 
> > No, that's not possible with the current setup.
> 
> Ok, then another question, have you considered the following usecase:

So, I'm wondering about the following additional usecase:

Right now the native x86 PTE format allows two protection related bits for 
user-space pages:

  _PAGE_BIT_RW:                   if 0 the page is read-only,  if 1 then it's read-write
  _PAGE_BIT_NX:                   if 0 the page is executable, if 1 then it's not executable

As discussed previously, pkeys allows 'true execute only (--x)' mappings.

Another possibility would be 'true write-only (-w-)' mappings.

This too could in theory be introduced 'transparently', via 'pure PROT_WRITE' 
mappings (i.e. no PROT_READ|PROT_EXEC bits set). Assuming the amount of user-space 
with implicit 'PROT_WRITE implies PROT_READ' assumptions is not unmanageable for a 
distro willing to try this.

Usage of this would be more limited than of pure PROT_EXEC mappings, but it's a 
nonzero set:

 - Write-only log buffers that are normally mmap()-ed from a file.

 - Write-only write() IO buffers that are only accessed via write().
   (kernel-space accesses ignore pkey values.)

   glibc's buffered IO might possibly make use of this, for write-only
   fopen()ed files.

 - Language runtimes could improve their security by eliminating W+X mappings of 
   JIT-ed code; instead they could use two alias mappings: one alias is a 
   true-exec (--x) mapping, the other (separately mapped, separately randomized)
   mapping is a true write-only (-w-) mapping for generated code.

In addition to the security advantage, another advantage would be increased 
robustness: no accidental corruption of IO (or JIT) buffers via read-only 
codepaths.

Another advantage would be that it would utilize pkeys without having to teach 
applications to use new system calls.
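
The log-buffer case would then be as simple as the sketch below -
assuming 'pure PROT_WRITE' were wired up as described. (As the
follow-ups below note, current pkeys hardware cannot actually express
-w-, so read this as the intent rather than something the hardware
delivers today.)

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("app.log", O_RDWR | O_CREAT, 0600);
	char *log;

	if (fd < 0 || ftruncate(fd, 1 << 20) < 0)
		return 1;
	/* pure PROT_WRITE: under the proposal, stray reads of the log
	 * buffer would fault instead of silently succeeding */
	log = mmap(NULL, 1 << 20, PROT_WRITE, MAP_SHARED, fd, 0);
	if (log == MAP_FAILED)
		return 1;
	log[0] = '#';	/* writes are the only intended access */
	return 0;
}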

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-03  7:27                       ` Ingo Molnar
@ 2015-10-06 23:28                         ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-10-06 23:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Kees Cook, x86, LKML, Linux-MM, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Andy Lutomirski, Borislav Petkov

On 10/03/2015 12:27 AM, Ingo Molnar wrote:
>  - I'd also suggest providing an initial value with the 'alloc' call. It's true 
>    that user-space can do this itself in assembly, OTOH there's no reason not to 
>    provide a C interface for this.

You mean an initial value for the rights register (PKRU), correct?

So init_val would be something like

	PKEY_DENY_ACCESS
	PKEY_DENY_WRITE

and it would refer only to the key that was allocated.

>  - Along similar considerations, also add a sys_pkey_query() system call to query 
>    the mapping of a specific pkey. (returns -EBADF or so if the key is not mapped
>    at the moment.) This too could be vDSO accelerated in the future.

Do you mean whether the key is being used on a mapping (VMA) or rather
whether the key is currently allocated (has been returned from
sys_pkey_alloc() in the past)?

> I.e. something like:
> 
>      unsigned long sys_pkey_alloc (unsigned long flags, unsigned long init_val)
>      unsigned long sys_pkey_set   (int pkey, unsigned long new_val)
>      unsigned long sys_pkey_get   (int pkey)
>      unsigned long sys_pkey_free  (int pkey)
> 
> Optional suggestion:
> 
>  - _Maybe_ also allow the 'remote managed' setup of pkeys: of non-local tasks - 
>    but I'm not sure about that: it looks expensive and complex, and a TID argument 
>    can always be added later if there's some real need.

Yeah, let's see how the stuff above looks first.
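
For reference, the unprivileged access mentioned above ('user-space can
do this itself in assembly') is the RDPKRU/WRPKRU instruction pair; a
sketch, with the instructions emitted as raw opcode bytes since
assemblers of this vintage may not know the mnemonics:

/* Both instructions require ECX = 0.  RDPKRU returns PKRU in EAX and
 * clears EDX; WRPKRU loads PKRU from EAX. */
static inline unsigned int rdpkru(void)
{
	unsigned int eax, edx;

	asm volatile(".byte 0x0f, 0x01, 0xee"	/* rdpkru */
		     : "=a" (eax), "=d" (edx)
		     : "c" (0));
	return eax;
}

static inline void wrpkru(unsigned int pkru)
{
	asm volatile(".byte 0x0f, 0x01, 0xef"	/* wrpkru */
		     : /* no outputs */
		     : "a" (pkru), "c" (0), "d" (0));
}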


^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-06 23:28                         ` Dave Hansen
@ 2015-10-07  7:11                           ` Ingo Molnar
  -1 siblings, 0 replies; 172+ messages in thread
From: Ingo Molnar @ 2015-10-07  7:11 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Kees Cook, x86, LKML, Linux-MM, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Andy Lutomirski, Borislav Petkov


* Dave Hansen <dave@sr71.net> wrote:

> On 10/03/2015 12:27 AM, Ingo Molnar wrote:
> >  - I'd also suggest providing an initial value with the 'alloc' call. It's true 
> >    that user-space can do this itself in assembly, OTOH there's no reason not to 
> >    provide a C interface for this.
> 
> You mean an initial value for the rights register (PKRU), correct?
> 
> So init_val would be something like
> 
> 	PKEY_DENY_ACCESS
> 	PKEY_DENY_WRITE
> 
> and it would refer only to the key that was allocated.

Correct.

> >  - Along similar considerations, also add a sys_pkey_query() system call to query 
> >    the mapping of a specific pkey. (returns -EBADF or so if the key is not mapped
> >    at the moment.) This too could be vDSO accelerated in the future.
> 
> Do you mean whether the key is being used on a mapping (VMA) or rather
> whether the key is currently allocated (has been returned from
> sys_pkey_alloc() in the past)?

So in my mind 'pkeys' are an array of 16 values. The hardware allows us to map any 
'protection key value' to any of the 16 indices.

The query interface would only query this array, i.e. it would tell us what 
current protection value a given pkey index has - if it's allocated. So 
sys_pkey_query(6) would return the current protection key value for index 6. If 
the index has not been allocated yet, it would return -EBADF or so.

This is what 'managed pkeys' means in essence.

Allocation/freeing of pkeys is a relatively rare operation, and pkeys get 
inherited across fork()/clone() (which further cuts down on management 
activities), but it looks simple in any case.
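
In code, the query semantics described here come out as roughly the
sketch below ('pkeys_allocated' and passing PKRU in as an argument are
stand-ins for wherever the real state lives):

#include <errno.h>

static unsigned short pkeys_allocated;	/* stand-in for mm state */

/* return the two PKRU bits (access-disable, write-disable) for an
 * allocated key, -EBADF for an index that was never allocated */
static long pkey_query(unsigned int pkru, int pkey)
{
	if (pkey < 0 || pkey >= 16)
		return -EINVAL;
	if (!(pkeys_allocated & (1U << pkey)))
		return -EBADF;
	return (pkru >> (2 * pkey)) & 0x3;
}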

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-03  8:17           ` Ingo Molnar
@ 2015-10-07 20:24             ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-10-07 20:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: x86, linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Kees Cook,
	Brian Gerst

On 10/03/2015 01:17 AM, Ingo Molnar wrote:
> Right now the native x86 PTE format allows two protection related bits for 
> user-space pages:
> 
>   _PAGE_BIT_RW:                   if 0 the page is read-only,  if 1 then it's read-write
>   _PAGE_BIT_NX:                   if 0 the page is executable, if 1 then it's not executable
> 
> As discussed previously, pkeys allows 'true execute only (--x)' mappings.
> 
> Another possibility would be 'true write-only (-w-)' mappings.

How would those work?

Protection Keys has a Write-Disable and an Access-Disable bit.  But,
Access-Disable denies _all_ data access to the region.  There's no way
to allow only writes.

Or am I missing something?
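
The limitation is visible directly in the PKRU encoding: each key gets
exactly two bits, and the access-disable bit gates reads and writes
together. A sketch (bit layout per the SDM, macro names invented):

/* PKRU holds two bits per key: AD at bit 2*key, WD at bit 2*key+1 */
#define PKRU_AD_BIT(k)	(1U << (2 * (k)))	/* access disable */
#define PKRU_WD_BIT(k)	(1U << (2 * (k) + 1))	/* write disable  */

/*
 * The reachable states for one key:
 *
 *   AD=0 WD=0  ->  reads and writes allowed
 *   AD=0 WD=1  ->  read-only
 *   AD=1 WD=*  ->  no data access at all
 *
 * No combination permits writes while denying reads, hence no -w-.
 */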

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-07 20:24             ` Dave Hansen
@ 2015-10-07 20:39               ` Andy Lutomirski
  -1 siblings, 0 replies; 172+ messages in thread
From: Andy Lutomirski @ 2015-10-07 20:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ingo Molnar, X86 ML, linux-kernel, linux-mm, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Andy Lutomirski, Borislav Petkov,
	Kees Cook, Brian Gerst

On Wed, Oct 7, 2015 at 1:24 PM, Dave Hansen <dave@sr71.net> wrote:
> On 10/03/2015 01:17 AM, Ingo Molnar wrote:
>> Right now the native x86 PTE format allows two protection related bits for
>> user-space pages:
>>
>>   _PAGE_BIT_RW:                   if 0 the page is read-only,  if 1 then it's read-write
>>   _PAGE_BIT_NX:                   if 0 the page is executable, if 1 then it's not executable
>>
>> As discussed previously, pkeys allows 'true execute only (--x)' mappings.
>>
>> Another possibility would be 'true write-only (-w-)' mappings.
>
> How would those work?
>
> Protection Keys has a Write-Disable and an Access-Disable bit.  But,
> Access-Disable denies _all_ data access to the region.  There's no way
> to allow only writes.

Weird.  I wonder why Intel did that.

I also wonder whether EPT can do write-only.

--Andy

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-07 20:39               ` Andy Lutomirski
@ 2015-10-07 20:47                 ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-10-07 20:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, X86 ML, linux-kernel, linux-mm, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Andy Lutomirski, Borislav Petkov,
	Kees Cook, Brian Gerst

On 10/07/2015 01:39 PM, Andy Lutomirski wrote:
> On Wed, Oct 7, 2015 at 1:24 PM, Dave Hansen <dave@sr71.net> wrote:
>> On 10/03/2015 01:17 AM, Ingo Molnar wrote:
>>> Right now the native x86 PTE format allows two protection related bits for
>>> user-space pages:
>>>
>>>   _PAGE_BIT_RW:                   if 0 the page is read-only,  if 1 then it's read-write
>>>   _PAGE_BIT_NX:                   if 0 the page is executable, if 1 then it's not executable
>>>
>>> As discussed previously, pkeys allows 'true execute only (--x)' mappings.
>>>
>>> Another possibility would be 'true write-only (-w-)' mappings.
>>
>> How would those work?
>>
>> Protection Keys has a Write-Disable and an Access-Disable bit.  But,
>> Access-Disable denies _all_ data access to the region.  There's no way
>> to allow only writes.
> 
> Weird.  I wonder why Intel did that.
> 
> I also wonder whether EPT can do write-only.

The SDM makes it look that way.  There appear to be completely separate
r/w/x bits.  r=0/w=0/x=0 means !present.

The bit 0 definition says, for instance:

	Read access; indicates whether reads are allowed from the
	4-KByte page referenced by this entry


^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-03  7:27                       ` Ingo Molnar
@ 2015-10-16 15:12                         ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-10-16 15:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Kees Cook, x86, LKML, Linux-MM, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Andy Lutomirski, Borislav Petkov

On 10/03/2015 12:27 AM, Ingo Molnar wrote:
>  - Along similar considerations, also add a sys_pkey_query() system call to query 
>    the mapping of a specific pkey. (returns -EBADF or so if the key is not mapped
>    at the moment.) This too could be vDSO accelerated in the future.
> 
> I.e. something like:
> 
>      unsigned long sys_pkey_alloc (unsigned long flags, unsigned long init_val)
>      unsigned long sys_pkey_set   (int pkey, unsigned long new_val)
>      unsigned long sys_pkey_get   (int pkey)
>      unsigned long sys_pkey_free  (int pkey)

The pkey_set() operation is going to get a wee bit interesting with signals.

pkey_set() will modify the _current_ context's PKRU which includes the
register itself and the kernel XSAVE buffer (if active).  But, since the
PKRU state is saved/restored with the XSAVE state, we will blow away any
state set during the signal.

I _think_ the right move here is to either keep a 'shadow' version of
PKRU inside the kernel (for each thread) and always update the task's
XSAVE PKRU state when returning from a signal handler.  Or, _copy_ the
signal's PKRU state into the main process's PKRU state when returning
from a signal.

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-16 15:12                         ` Dave Hansen
@ 2015-10-21 18:55                           ` Andy Lutomirski
  -1 siblings, 0 replies; 172+ messages in thread
From: Andy Lutomirski @ 2015-10-21 18:55 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ingo Molnar, Kees Cook, x86, LKML, Linux-MM, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Andy Lutomirski, Borislav Petkov

On Fri, Oct 16, 2015 at 8:12 AM, Dave Hansen <dave@sr71.net> wrote:
> On 10/03/2015 12:27 AM, Ingo Molnar wrote:
>>  - Along similar considerations, also add a sys_pkey_query() system call to query
>>    the mapping of a specific pkey. (returns -EBADF or so if the key is not mapped
>>    at the moment.) This too could be vDSO accelerated in the future.
>>
>> I.e. something like:
>>
>>      unsigned long sys_pkey_alloc (unsigned long flags, unsigned long init_val)
>>      unsigned long sys_pkey_set   (int pkey, unsigned long new_val)
>>      unsigned long sys_pkey_get   (int pkey)
>>      unsigned long sys_pkey_free  (int pkey)
>
> The pkey_set() operation is going to get a wee bit interesting with signals.
>
> pkey_set() will modify the _current_ context's PKRU which includes the
> register itself and the kernel XSAVE buffer (if active).  But, since the
> PKRU state is saved/restored with the XSAVE state, we will blow away any
> state set during the signal.
>
> I _think_ the right move here is to either keep a 'shadow' version of
> PKRU inside the kernel (for each thread) and always update the task's
> XSAVE PKRU state when returning from a signal handler.  Or, _copy_ the
> signal's PKRU state in to the main process's PKRU state when returning
> from a signal.

Ick.  Or we could just declare that signals don't affect the PKRU
state by default and mask it off in sigreturn.

In fact, maybe we should add a general xfeature (or whatever it's
called these days) to the xstate in the signal context that controls
which pieces are restored.  Then user code can tweak it if needed in
signal handlers.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-21 18:55                           ` Andy Lutomirski
@ 2015-10-21 19:11                             ` Dave Hansen
  -1 siblings, 0 replies; 172+ messages in thread
From: Dave Hansen @ 2015-10-21 19:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Kees Cook, x86, LKML, Linux-MM, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Andy Lutomirski, Borislav Petkov

On 10/21/2015 11:55 AM, Andy Lutomirski wrote:
> On Fri, Oct 16, 2015 at 8:12 AM, Dave Hansen <dave@sr71.net> wrote:
>> On 10/03/2015 12:27 AM, Ingo Molnar wrote:
>>>  - Along similar considerations, also add a sys_pkey_query() system call to query
>>>    the mapping of a specific pkey. (returns -EBADF or so if the key is not mapped
>>>    at the moment.) This too could be vDSO accelerated in the future.
>>>
>>> I.e. something like:
>>>
>>>      unsigned long sys_pkey_alloc (unsigned long flags, unsigned long init_val)
>>>      unsigned long sys_pkey_set   (int pkey, unsigned long new_val)
>>>      unsigned long sys_pkey_get   (int pkey)
>>>      unsigned long sys_pkey_free  (int pkey)
>>
>> The pkey_set() operation is going to get a wee bit interesting with signals.
>>
>> pkey_set() will modify the _current_ context's PKRU which includes the
>> register itself and the kernel XSAVE buffer (if active).  But, since the
>> PKRU state is saved/restored with the XSAVE state, we will blow away any
>> state set during the signal.
>>
>> I _think_ the right move here is to either keep a 'shadow' version of
>> PKRU inside the kernel (for each thread) and always update the task's
>> XSAVE PKRU state when returning from a signal handler.  Or, _copy_ the
>> signal's PKRU state in to the main process's PKRU state when returning
>> from a signal.
> 
> Ick.  Or we could just declare that signals don't affect the PKRU
> state by default and mask it off in sigreturn.

Yeah, I've been messing with it in a few forms and it's pretty ugly.

I think it will be easier if we say the PKRU rights are not inherited by
signals and changes during a signal are tossed out.  Signal handlers are
special anyway and folks have to be careful writing them.

> In fact, maybe we should add a general xfeature (or whatever it's
> called these days) to the xstate in the signal context that controls
> which pieces are restored.  Then user code can tweak it if needed in
> signal handlers.

Yeah, that's probably a good idea long-term.  We're only getting more
and more things managed by XSAVE and it's going to be increasingly
interesting to glue real semantics back on top.


^ permalink raw reply	[flat|nested] 172+ messages in thread

* Re: [PATCH 26/26] x86, pkeys: Documentation
  2015-10-21 19:11                             ` Dave Hansen
@ 2015-10-21 23:22                               ` Andy Lutomirski
  -1 siblings, 0 replies; 172+ messages in thread
From: Andy Lutomirski @ 2015-10-21 23:22 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ingo Molnar, Kees Cook, x86, LKML, Linux-MM, Linus Torvalds,
	Andrew Morton, Peter Zijlstra, Andy Lutomirski, Borislav Petkov

On Wed, Oct 21, 2015 at 12:11 PM, Dave Hansen <dave@sr71.net> wrote:
> On 10/21/2015 11:55 AM, Andy Lutomirski wrote:
>> On Fri, Oct 16, 2015 at 8:12 AM, Dave Hansen <dave@sr71.net> wrote:
>>> On 10/03/2015 12:27 AM, Ingo Molnar wrote:
>>>>  - Along similar considerations, also add a sys_pkey_query() system call to query
>>>>    the mapping of a specific pkey. (returns -EBADF or so if the key is not mapped
>>>>    at the moment.) This too could be vDSO accelerated in the future.
>>>>
>>>> I.e. something like:
>>>>
>>>>      unsigned long sys_pkey_alloc (unsigned long flags, unsigned long init_val)
>>>>      unsigned long sys_pkey_set   (int pkey, unsigned long new_val)
>>>>      unsigned long sys_pkey_get   (int pkey)
>>>>      unsigned long sys_pkey_free  (int pkey)
>>>
>>> The pkey_set() operation is going to get a wee bit interesting with signals.
>>>
>>> pkey_set() will modify the _current_ context's PKRU which includes the
>>> register itself and the kernel XSAVE buffer (if active).  But, since the
>>> PKRU state is saved/restored with the XSAVE state, we will blow away any
>>> state set during the signal.
>>>
>>> I _think_ the right move here is to either keep a 'shadow' version of
>>> PKRU inside the kernel (for each thread) and always update the task's
>>> XSAVE PKRU state when returning from a signal handler.  Or, _copy_ the
>>> signal's PKRU state in to the main process's PKRU state when returning
>>> from a signal.
>>
>> Ick.  Or we could just declare that signals don't affect the PKRU
>> state by default and mask it off in sigreturn.
>
> Yeah, I've been messing with it in a few forms and it's pretty ugly.
>
> I think it will be easier if we say the PKRU rights are not inherited by
> signals and changes during a signal are tossed out.  Signal handlers are
> special anyway and folks have to be careful writing them.

This is somewhat related to something I've been pondering in a
different context: fsbase and gsbase.

If a program changes fsbase using wrfsbase, should a signal handler
override it?  And should a change made in a signal handler carry over
after sigreturn?  Arguably, for fsbase and gsbase, the answer is no --
anyone who uses them for userspace threading (which is presumably why
they happened in the first place, even though userspace threading has
possibly dubious value) probably wants their context switches to stick
across signal invocations.
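
For concreteness, the user-level context switch in question is a
one-liner (this assumes both CPU support for FSGSBASE *and* kernel
enablement via CR4.FSGSBASE; without the latter it faults with #UD):

	static inline void switch_tls(unsigned long next_fsbase)
	{
		asm volatile("wrfsbase %0" : : "r" (next_fsbase));
	}

The question above is then: if a signal lands between two such
switches, should sigreturn put back the fsbase that was live at
delivery time, undoing the switch?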

So I think that propagating PKRU into the signal handler and keeping
the in-register value on sigreturn by default is probably a reasonable
choice.

(OTOH, there's an argument for allowing programs to reset PKRU on
signal delivery: you could sort of arrange for signal handlers to be
more privileged than the code that invokes them.  But that's doable
with some asm regardless.)
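
The asm in question is tiny -- with the rdpkru()/wrpkru() helpers
sketched earlier in the thread, a handler can raise its own
privilege on entry:

	static void privileged_handler(int sig)
	{
		unsigned int saved = rdpkru();

		wrpkru(0);	/* lift all pkey restrictions */
		/* ... touch memory the interrupted code could not ... */
		wrpkru(saved);	/* put the restrictions back */
	}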

>
>> In fact, maybe we should add a general xfeature (or whatever it's
>> called these days) to the xstate in the signal context that controls
>> which pieces are restored.  Then user code can tweak it if needed in
>> signal handlers.
>
> Yeah, that's probably a good idea long-term.  We're only getting more
> and more things managed by XSAVE and it's going to be increasingly
> interesting to glue real semantics back on top.
>

Should we maybe extend copy_user_to_fpregs_zeroing to have a pair of
masks, where one mask indicates which features are copied and another
indicates which are preserved?  It looks like we already allow some
control over which bits are restored from sigcontext versus being
restored to their init state.
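
In that scheme, each xfeature would land in exactly one of three
buckets on sigreturn.  A sketch of the bookkeeping (the names here
are made up, not from the series):

	/* u64 as in <linux/types.h>; purely illustrative */
	static u64 xfeatures_to_init(u64 sigframe_xbv, u64 copy_mask,
				     u64 preserve_mask, u64 xfeatures_mask)
	{
		u64 from_frame = sigframe_xbv & copy_mask;    /* XRSTOR from sigframe */
		u64 kept_live  = preserve_mask & ~from_frame; /* left as the handler set them */

		/* everything else goes back to its init state */
		return xfeatures_mask & ~(from_frame | kept_live);
	}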

We might need to add some kind of extended ucontext area for this.  I
don't know if we're starting to run out of space.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 172+ messages in thread

Thread overview: 172+ messages
2015-09-16 17:49 [PATCH 00/26] [RFCv2] x86: Memory Protection Keys Dave Hansen
2015-09-16 17:49 ` Dave Hansen
2015-09-16 17:49 ` [PATCH 04/26] x86, pku: define new CR4 bit Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-16 17:49 ` [PATCH 03/26] x86, pkeys: cpuid bit definition Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-16 17:49 ` [PATCH 02/26] x86, pkeys: Add Kconfig option Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-16 17:49 ` [PATCH 01/26] x86, fpu: add placeholder for Processor Trace XSAVE state Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-16 17:49 ` [PATCH 07/26] x86, pkeys: new page fault error code bit: PF_PK Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-16 17:49 ` [PATCH 05/26] x86, pkey: add PKRU xsave fields and data structure(s) Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-22 19:53   ` Thomas Gleixner
2015-09-22 19:53     ` Thomas Gleixner
2015-09-22 19:58     ` Dave Hansen
2015-09-22 19:58       ` Dave Hansen
2015-09-16 17:49 ` [PATCH 06/26] x86, pkeys: PTE bits for storing protection key Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-16 17:49 ` [PATCH 10/26] x86, pkeys: notify userspace about protection key faults Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-22 20:03   ` Thomas Gleixner
2015-09-22 20:03     ` Thomas Gleixner
2015-09-22 20:21     ` Dave Hansen
2015-09-22 20:21       ` Dave Hansen
2015-09-22 20:27       ` Thomas Gleixner
2015-09-22 20:27         ` Thomas Gleixner
2015-09-22 20:29         ` Dave Hansen
2015-09-22 20:29           ` Dave Hansen
2015-09-23  8:05           ` Ingo Molnar
2015-09-23  8:05             ` Ingo Molnar
2015-09-24  9:23   ` Ingo Molnar
2015-09-24  9:23     ` Ingo Molnar
2015-09-24  9:30     ` Ingo Molnar
2015-09-24  9:30       ` Ingo Molnar
2015-09-24 17:41       ` Dave Hansen
2015-09-24 17:41         ` Dave Hansen
2015-09-25  7:11         ` Ingo Molnar
2015-09-25  7:11           ` Ingo Molnar
2015-09-25 23:18           ` Dave Hansen
2015-09-25 23:18             ` Dave Hansen
2015-09-26  6:20             ` Ingo Molnar
2015-09-26  6:20               ` Ingo Molnar
2015-09-27 22:39               ` Dave Hansen
2015-09-27 22:39                 ` Dave Hansen
2015-09-28  5:59                 ` Ingo Molnar
2015-09-28  5:59                   ` Ingo Molnar
2015-09-24 17:15     ` Dave Hansen
2015-09-24 17:15       ` Dave Hansen
2015-09-28 19:25       ` Christian Borntraeger
2015-09-28 19:25         ` Christian Borntraeger
2015-09-28 19:32         ` Dave Hansen
2015-09-28 19:32           ` Dave Hansen
2015-09-16 17:49 ` [PATCH 09/26] x86, pkeys: arch-specific protection bits Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-16 17:49 ` [PATCH 08/26] x86, pkeys: store protection in high VMA flags Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-16 17:49 ` [PATCH 11/26] x86, pkeys: add functions for set/fetch PKRU Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-22 20:05   ` Thomas Gleixner
2015-09-22 20:05     ` Thomas Gleixner
2015-09-22 20:22     ` Dave Hansen
2015-09-22 20:22       ` Dave Hansen
2015-09-16 17:49 ` [PATCH 14/26] x86, pkeys: check VMAs and PTEs for protection keys Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-16 17:49 ` [PATCH 12/26] mm: factor out VMA fault permission checking Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-16 17:49 ` [PATCH 13/26] mm: simplify get_user_pages() PTE bit handling Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-16 17:49 ` [PATCH 15/26] x86, pkeys: optimize fault handling in access_error() Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-16 17:49 ` [PATCH 16/26] x86, pkeys: dump PKRU with other kernel registers Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-16 17:49 ` [PATCH 17/26] x86, pkeys: dump PTE pkey in /proc/pid/smaps Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-16 17:49 ` [PATCH 19/26] [NEWSYSCALL] mm, multi-arch: pass a protection key in to calc_vm_flag_bits() Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-16 17:49 ` [PATCH 20/26] [NEWSYSCALL] mm: implement new mprotect_pkey() system call Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-16 17:49 ` [PATCH 18/26] x86, pkeys: add Kconfig prompt to existing config option Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-16 17:49 ` [PATCH 22/26] [HIJACKPROT] mm: Pass the 4-bit protection key in via PROT_ bits to syscalls Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-16 17:49 ` [PATCH 21/26] [NEWSYSCALL] x86: wire up mprotect_key() system call Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-16 17:49 ` [PATCH 24/26] [HIJACKPROT] x86, pkeys: mask off pkeys bits in mprotect() Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-16 17:49 ` [PATCH 25/26] x86, pkeys: actually enable Memory Protection Keys in CPU Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-16 17:49 ` [PATCH 23/26] [HIJACKPROT] x86, pkeys: add x86 version of arch_validate_prot() Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-16 17:49 ` [PATCH 26/26] x86, pkeys: Documentation Dave Hansen
2015-09-16 17:49   ` Dave Hansen
2015-09-20  8:55   ` Ingo Molnar
2015-09-20  8:55     ` Ingo Molnar
2015-09-21  4:34     ` Dave Hansen
2015-09-21  4:34       ` Dave Hansen
2015-09-24  9:49       ` Ingo Molnar
2015-09-24  9:49         ` Ingo Molnar
2015-09-24 19:10         ` Dave Hansen
2015-09-24 19:10           ` Dave Hansen
2015-09-24 19:17           ` Andy Lutomirski
2015-09-24 19:17             ` Andy Lutomirski
2015-09-25  7:16             ` Ingo Molnar
2015-09-25  7:16               ` Ingo Molnar
2015-09-25  6:15           ` Ingo Molnar
2015-09-25  6:15             ` Ingo Molnar
2015-10-01 11:17           ` Ingo Molnar
2015-10-01 11:17             ` Ingo Molnar
2015-10-01 20:39             ` Kees Cook
2015-10-01 20:39               ` Kees Cook
2015-10-01 20:45               ` Andy Lutomirski
2015-10-01 20:45                 ` Andy Lutomirski
2015-10-02  6:23                 ` Ingo Molnar
2015-10-02  6:23                   ` Ingo Molnar
2015-10-02 17:50                   ` Dave Hansen
2015-10-02 17:50                     ` Dave Hansen
2015-10-03  7:27                     ` Ingo Molnar
2015-10-03  7:27                       ` Ingo Molnar
2015-10-06 23:28                       ` Dave Hansen
2015-10-06 23:28                         ` Dave Hansen
2015-10-07  7:11                         ` Ingo Molnar
2015-10-07  7:11                           ` Ingo Molnar
2015-10-16 15:12                       ` Dave Hansen
2015-10-16 15:12                         ` Dave Hansen
2015-10-21 18:55                         ` Andy Lutomirski
2015-10-21 18:55                           ` Andy Lutomirski
2015-10-21 19:11                           ` Dave Hansen
2015-10-21 19:11                             ` Dave Hansen
2015-10-21 23:22                             ` Andy Lutomirski
2015-10-21 23:22                               ` Andy Lutomirski
2015-10-01 20:58               ` Dave Hansen
2015-10-01 20:58                 ` Dave Hansen
2015-10-01 22:33               ` Dave Hansen
2015-10-01 22:35                 ` Kees Cook
2015-10-01 22:35                   ` Kees Cook
2015-10-01 22:39                   ` Dave Hansen
2015-10-01 22:39                     ` Dave Hansen
2015-10-01 22:48                 ` Linus Torvalds
2015-10-01 22:48                   ` Linus Torvalds
2015-10-01 22:56                   ` Dave Hansen
2015-10-01 22:56                     ` Dave Hansen
2015-10-02  1:38                     ` Linus Torvalds
2015-10-02  1:38                       ` Linus Torvalds
2015-10-02 18:08                       ` Dave Hansen
2015-10-02 18:08                         ` Dave Hansen
2015-10-02  7:09                   ` Ingo Molnar
2015-10-02  7:09                     ` Ingo Molnar
2015-10-03  6:59                     ` Ingo Molnar
2015-10-03  6:59                       ` Ingo Molnar
2015-10-02 11:49                   ` Paolo Bonzini
2015-10-02 11:49                     ` Paolo Bonzini
2015-10-02 11:58                     ` Linus Torvalds
2015-10-02 11:58                       ` Linus Torvalds
2015-10-02 12:14                       ` Paolo Bonzini
2015-10-02 12:14                         ` Paolo Bonzini
2015-10-03  6:46                         ` Ingo Molnar
2015-10-03  6:46                           ` Ingo Molnar
2015-10-01 22:57                 ` Andy Lutomirski
2015-10-01 22:57                   ` Andy Lutomirski
2015-10-02  6:09                 ` Ingo Molnar
2015-10-02  6:09                   ` Ingo Molnar
2015-10-03  8:17         ` Ingo Molnar
2015-10-03  8:17           ` Ingo Molnar
2015-10-07 20:24           ` Dave Hansen
2015-10-07 20:24             ` Dave Hansen
2015-10-07 20:39             ` Andy Lutomirski
2015-10-07 20:39               ` Andy Lutomirski
2015-10-07 20:47               ` Dave Hansen
2015-10-07 20:47                 ` Dave Hansen
2015-09-16 17:51 ` Fwd: [PATCH 00/26] [RFCv2] x86: Memory Protection Keys Dave Hansen
