* [PATCH 00/25] x86: Memory Protection Keys
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, linux-api, linux-arch

I have addressed all known issues and review comments.  I believe
these patches are ready to be pulled into the x86 tree.  Note that
this is also the first time anyone has seen the new 'selftests'
code.  If there are issues limited to it, I'd prefer to fix those
up separately, post-merge.

Changes from RFCv2 (Thanks Ingo and Thomas for most of these):

 * fixed a few minor compile warnings
 * changed 'nopku' interaction with CPUID bits.  Now, we do not
   clear the PKU CPUID bit; we just skip enabling the feature.
 * changed __pkru_allows_write() to also check the access-disable
   bit
 * removed the unused write_pkru()
 * made si_pkey a u64 and added some patch description details.
   Also made it share space in siginfo with MPX and clarified
   comments.
 * gave some real text to the Processor Trace xsave state
 * made vma_pkey() less ugly (and actually much better optimized)
 * added SEGV_PKUERR to copy_siginfo_to_user()
 * removed the page table walk when filling in si_pkey, and added
   some big fat comments about it being inherently racy
 * added self-test code

MM reviewers, if you are going to look at one thing, please look
at patch 14, which adds a bunch of additional vma/pte permission
checks.

This code contains a new system call: mprotect_key().  It needs
the usual amount of rigor around new interfaces.  Review there
would be much appreciated.
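
For a rough idea of the interface, a userspace call would look
something like the sketch below.  The syscall number and exact
argument order here are assumptions for illustration only; the
authoritative prototype is the one added by the patches:

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* hypothetical syscall number -- see the syscall table patches */
	#define __NR_mprotect_key_SKETCH 325

	int main(void)
	{
		int pkey = 1;	/* assume this key was set up elsewhere */
		void *ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

		/* like mprotect(), plus a protection key for the range */
		return syscall(__NR_mprotect_key_SKETCH, ptr, 4096,
			       PROT_READ, pkey);
	}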

This code is not runnable by anyone outside of Intel unless they
have special hardware or a fancy simulator.  If you are interested
in running this for real, please get in touch with me.  Hardware
is available to a very small but nonzero number of people.

This set is also available here (with the new syscall):

	git://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-pkeys.git pkeys-v006

=== diffstat ===

(note that over half of this is kselftests)

 Documentation/kernel-parameters.txt           |    3 
 Documentation/x86/protection-keys.txt         |   54 +
 arch/powerpc/include/asm/mman.h               |    5 
 arch/powerpc/include/asm/mmu_context.h        |   11 
 arch/s390/include/asm/mmu_context.h           |   11 
 arch/unicore32/include/asm/mmu_context.h      |   11 
 arch/x86/Kconfig                              |   15 
 arch/x86/entry/syscalls/syscall_32.tbl        |    1 
 arch/x86/entry/syscalls/syscall_64.tbl        |    1 
 arch/x86/include/asm/cpufeature.h             |   54 +
 arch/x86/include/asm/disabled-features.h      |   12 
 arch/x86/include/asm/fpu/types.h              |   16 
 arch/x86/include/asm/fpu/xstate.h             |    4 
 arch/x86/include/asm/mmu_context.h            |   71 ++
 arch/x86/include/asm/pgtable.h                |   45 +
 arch/x86/include/asm/pgtable_types.h          |   34 -
 arch/x86/include/asm/required-features.h      |    4 
 arch/x86/include/asm/special_insns.h          |   32 +
 arch/x86/include/uapi/asm/mman.h              |   23 
 arch/x86/include/uapi/asm/processor-flags.h   |    2 
 arch/x86/kernel/cpu/common.c                  |   42 +
 arch/x86/kernel/fpu/xstate.c                  |    7 
 arch/x86/kernel/process_64.c                  |    2 
 arch/x86/kernel/setup.c                       |    9 
 arch/x86/mm/fault.c                           |  143 +++-
 arch/x86/mm/gup.c                             |   37 -
 drivers/char/agp/frontend.c                   |    2 
 drivers/staging/android/ashmem.c              |    9 
 fs/proc/task_mmu.c                            |    5 
 include/asm-generic/mm_hooks.h                |   11 
 include/linux/mm.h                            |   13 
 include/linux/mman.h                          |    6 
 include/uapi/asm-generic/siginfo.h            |   17 
 kernel/signal.c                               |    4 
 mm/Kconfig                                    |   11 
 mm/gup.c                                      |   28 
 mm/memory.c                                   |    4 
 mm/mmap.c                                     |    2 
 mm/mprotect.c                                 |   20 
 mm/nommu.c                                    |    2 
 tools/testing/selftests/x86/Makefile          |    3 
 tools/testing/selftests/x86/pkey-helpers.h    |  182 +++++
 tools/testing/selftests/x86/protection_keys.c |  828 ++++++++++++++++++++++++++
 43 files changed, 1705 insertions(+), 91 deletions(-)

Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org


* [PATCH 02/25] x86, pkeys: Add Kconfig option
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

I don't have a strong opinion on whether we need a Kconfig prompt
or not.  Protection Keys has relatively little code associated
with it, and it is not a heavyweight feature to keep enabled.
However, I can imagine that folks would still appreciate being
able to disable it.

We will hide the prompt for now.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/Kconfig |    4 ++++
 1 file changed, 4 insertions(+)

diff -puN arch/x86/Kconfig~pkeys-01-kconfig arch/x86/Kconfig
--- a/arch/x86/Kconfig~pkeys-01-kconfig	2015-09-28 11:39:41.883997985 -0700
+++ b/arch/x86/Kconfig	2015-09-28 11:39:41.887998167 -0700
@@ -1694,6 +1694,10 @@ config X86_INTEL_MPX
 
 	  If unsure, say N.
 
+config X86_INTEL_MEMORY_PROTECTION_KEYS
+	def_bool y
+	depends on CPU_SUP_INTEL && X86_64
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI
_


* [PATCH 03/25] x86, pkeys: cpuid bit definition
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

There are two CPUID bits for protection keys.  One indicates
whether the CPU supports the feature, and the other will appear
set once the OS enables protection keys.  Specifically:

	Bit 04: OSPKE. If 1, OS has set CR4.PKE to enable
	Protection keys (and the RDPKRU/WRPKRU instructions)

This is because userspace cannot see CR4 contents, but it can
see CPUID contents.

X86_FEATURE_PKU is referred to as "PKU" in the hardware documentation:

	CPUID.(EAX=07H,ECX=0H):ECX.PKU [bit 3]

X86_FEATURE_OSPKE is "OSPKE":

	CPUID.(EAX=07H,ECX=0H):ECX.OSPKE [bit 4]
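
Since userspace can see these bits, an application can probe for
the feature at runtime.  A minimal sketch using GCC's <cpuid.h>
(a real check should first verify that leaf 0x7 exists at all):

	#include <cpuid.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		/* CPUID leaf 0x7, subleaf 0; feature bits land in ECX */
		__cpuid_count(7, 0, eax, ebx, ecx, edx);
		printf("PKU:   %d\n", !!(ecx & (1 << 3)));
		printf("OSPKE: %d\n", !!(ecx & (1 << 4)));
		return 0;
	}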

These are the first CPU features which need to look at the
ECX word in CPUID leaf 0x7, so this patch also includes
fetching that word into the cpuinfo->x86_capability[] array.

Add it to the disabled-features mask when its config option is
off.  Even though we are not using it here, we also extend the
REQUIRED_MASK_BIT_SET() macro to keep it mirroring the
DISABLED_MASK_BIT_SET() version.

This means that in almost all code, you should use:

	cpu_has(c, X86_FEATURE_PKU)

and *not* the CONFIG option.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/include/asm/cpufeature.h        |   54 +++++++++++++++++------------
 b/arch/x86/include/asm/disabled-features.h |   12 ++++++
 b/arch/x86/include/asm/required-features.h |    4 ++
 b/arch/x86/kernel/cpu/common.c             |    1 
 4 files changed, 50 insertions(+), 21 deletions(-)

diff -puN arch/x86/include/asm/cpufeature.h~pkeys-01-cpuid arch/x86/include/asm/cpufeature.h
--- a/arch/x86/include/asm/cpufeature.h~pkeys-01-cpuid	2015-09-28 11:39:42.296016728 -0700
+++ b/arch/x86/include/asm/cpufeature.h	2015-09-28 11:39:42.305017137 -0700
@@ -12,7 +12,7 @@
 #include <asm/disabled-features.h>
 #endif
 
-#define NCAPINTS	13	/* N 32-bit words worth of info */
+#define NCAPINTS	14	/* N 32-bit words worth of info */
 #define NBUGINTS	1	/* N 32-bit bug flags */
 
 /*
@@ -254,6 +254,10 @@
 /* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:1 (edx), word 12 */
 #define X86_FEATURE_CQM_OCCUP_LLC (12*32+ 0) /* LLC occupancy monitoring if 1 */
 
+/* Intel-defined CPU features, CPUID level 0x00000007:0 (ecx), word 13 */
+#define X86_FEATURE_PKU		(13*32+ 3) /* Protection Keys for Userspace */
+#define X86_FEATURE_OSPKE	(13*32+ 4) /* OS Protection Keys Enable */
+
 /*
  * BUG word(s)
  */
@@ -294,28 +298,36 @@ extern const char * const x86_bug_flags[
 	 test_bit(bit, (unsigned long *)((c)->x86_capability))
 
 #define REQUIRED_MASK_BIT_SET(bit)					\
-	 ( (((bit)>>5)==0 && (1UL<<((bit)&31) & REQUIRED_MASK0)) ||	\
-	   (((bit)>>5)==1 && (1UL<<((bit)&31) & REQUIRED_MASK1)) ||	\
-	   (((bit)>>5)==2 && (1UL<<((bit)&31) & REQUIRED_MASK2)) ||	\
-	   (((bit)>>5)==3 && (1UL<<((bit)&31) & REQUIRED_MASK3)) ||	\
-	   (((bit)>>5)==4 && (1UL<<((bit)&31) & REQUIRED_MASK4)) ||	\
-	   (((bit)>>5)==5 && (1UL<<((bit)&31) & REQUIRED_MASK5)) ||	\
-	   (((bit)>>5)==6 && (1UL<<((bit)&31) & REQUIRED_MASK6)) ||	\
-	   (((bit)>>5)==7 && (1UL<<((bit)&31) & REQUIRED_MASK7)) ||	\
-	   (((bit)>>5)==8 && (1UL<<((bit)&31) & REQUIRED_MASK8)) ||	\
-	   (((bit)>>5)==9 && (1UL<<((bit)&31) & REQUIRED_MASK9)) )
+	 ( (((bit)>>5)==0  && (1UL<<((bit)&31) & REQUIRED_MASK0 )) ||	\
+	   (((bit)>>5)==1  && (1UL<<((bit)&31) & REQUIRED_MASK1 )) ||	\
+	   (((bit)>>5)==2  && (1UL<<((bit)&31) & REQUIRED_MASK2 )) ||	\
+	   (((bit)>>5)==3  && (1UL<<((bit)&31) & REQUIRED_MASK3 )) ||	\
+	   (((bit)>>5)==4  && (1UL<<((bit)&31) & REQUIRED_MASK4 )) ||	\
+	   (((bit)>>5)==5  && (1UL<<((bit)&31) & REQUIRED_MASK5 )) ||	\
+	   (((bit)>>5)==6  && (1UL<<((bit)&31) & REQUIRED_MASK6 )) ||	\
+	   (((bit)>>5)==7  && (1UL<<((bit)&31) & REQUIRED_MASK7 )) ||	\
+	   (((bit)>>5)==8  && (1UL<<((bit)&31) & REQUIRED_MASK8 )) ||	\
+	   (((bit)>>5)==9  && (1UL<<((bit)&31) & REQUIRED_MASK9 )) ||	\
+	   (((bit)>>5)==10 && (1UL<<((bit)&31) & REQUIRED_MASK10)) ||	\
+	   (((bit)>>5)==11 && (1UL<<((bit)&31) & REQUIRED_MASK11)) ||	\
+	   (((bit)>>5)==12 && (1UL<<((bit)&31) & REQUIRED_MASK12)) ||	\
+	   (((bit)>>5)==13 && (1UL<<((bit)&31) & REQUIRED_MASK13)) )
 
 #define DISABLED_MASK_BIT_SET(bit)					\
-	 ( (((bit)>>5)==0 && (1UL<<((bit)&31) & DISABLED_MASK0)) ||	\
-	   (((bit)>>5)==1 && (1UL<<((bit)&31) & DISABLED_MASK1)) ||	\
-	   (((bit)>>5)==2 && (1UL<<((bit)&31) & DISABLED_MASK2)) ||	\
-	   (((bit)>>5)==3 && (1UL<<((bit)&31) & DISABLED_MASK3)) ||	\
-	   (((bit)>>5)==4 && (1UL<<((bit)&31) & DISABLED_MASK4)) ||	\
-	   (((bit)>>5)==5 && (1UL<<((bit)&31) & DISABLED_MASK5)) ||	\
-	   (((bit)>>5)==6 && (1UL<<((bit)&31) & DISABLED_MASK6)) ||	\
-	   (((bit)>>5)==7 && (1UL<<((bit)&31) & DISABLED_MASK7)) ||	\
-	   (((bit)>>5)==8 && (1UL<<((bit)&31) & DISABLED_MASK8)) ||	\
-	   (((bit)>>5)==9 && (1UL<<((bit)&31) & DISABLED_MASK9)) )
+	 ( (((bit)>>5)==0  && (1UL<<((bit)&31) & DISABLED_MASK0 )) ||	\
+	   (((bit)>>5)==1  && (1UL<<((bit)&31) & DISABLED_MASK1 )) ||	\
+	   (((bit)>>5)==2  && (1UL<<((bit)&31) & DISABLED_MASK2 )) ||	\
+	   (((bit)>>5)==3  && (1UL<<((bit)&31) & DISABLED_MASK3 )) ||	\
+	   (((bit)>>5)==4  && (1UL<<((bit)&31) & DISABLED_MASK4 )) ||	\
+	   (((bit)>>5)==5  && (1UL<<((bit)&31) & DISABLED_MASK5 )) ||	\
+	   (((bit)>>5)==6  && (1UL<<((bit)&31) & DISABLED_MASK6 )) ||	\
+	   (((bit)>>5)==7  && (1UL<<((bit)&31) & DISABLED_MASK7 )) ||	\
+	   (((bit)>>5)==8  && (1UL<<((bit)&31) & DISABLED_MASK8 )) ||	\
+	   (((bit)>>5)==9  && (1UL<<((bit)&31) & DISABLED_MASK9 )) ||	\
+	   (((bit)>>5)==10 && (1UL<<((bit)&31) & DISABLED_MASK10)) ||	\
+	   (((bit)>>5)==11 && (1UL<<((bit)&31) & DISABLED_MASK11)) ||	\
+	   (((bit)>>5)==12 && (1UL<<((bit)&31) & DISABLED_MASK12)) ||	\
+	   (((bit)>>5)==13 && (1UL<<((bit)&31) & DISABLED_MASK13)) )
 
 #define cpu_has(c, bit)							\
 	(__builtin_constant_p(bit) && REQUIRED_MASK_BIT_SET(bit) ? 1 :	\
diff -puN arch/x86/include/asm/disabled-features.h~pkeys-01-cpuid arch/x86/include/asm/disabled-features.h
--- a/arch/x86/include/asm/disabled-features.h~pkeys-01-cpuid	2015-09-28 11:39:42.298016819 -0700
+++ b/arch/x86/include/asm/disabled-features.h	2015-09-28 11:39:42.305017137 -0700
@@ -28,6 +28,14 @@
 # define DISABLE_CENTAUR_MCR	0
 #endif /* CONFIG_X86_64 */
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+# define DISABLE_PKU		(1<<(X86_FEATURE_PKU))
+# define DISABLE_OSPKE		(1<<(X86_FEATURE_OSPKE))
+#else
+# define DISABLE_PKU		0
+# define DISABLE_OSPKE		0
+#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -41,5 +49,9 @@
 #define DISABLED_MASK7	0
 #define DISABLED_MASK8	0
 #define DISABLED_MASK9	(DISABLE_MPX)
+#define DISABLED_MASK10	0
+#define DISABLED_MASK11	0
+#define DISABLED_MASK12	0
+#define DISABLED_MASK13	(DISABLE_PKU|DISABLE_OSPKE)
 
 #endif /* _ASM_X86_DISABLED_FEATURES_H */
diff -puN arch/x86/include/asm/required-features.h~pkeys-01-cpuid arch/x86/include/asm/required-features.h
--- a/arch/x86/include/asm/required-features.h~pkeys-01-cpuid	2015-09-28 11:39:42.300016910 -0700
+++ b/arch/x86/include/asm/required-features.h	2015-09-28 11:39:42.306017183 -0700
@@ -92,5 +92,9 @@
 #define REQUIRED_MASK7	0
 #define REQUIRED_MASK8	0
 #define REQUIRED_MASK9	0
+#define REQUIRED_MASK10	0
+#define REQUIRED_MASK11	0
+#define REQUIRED_MASK12	0
+#define REQUIRED_MASK13	0
 
 #endif /* _ASM_X86_REQUIRED_FEATURES_H */
diff -puN arch/x86/kernel/cpu/common.c~pkeys-01-cpuid arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~pkeys-01-cpuid	2015-09-28 11:39:42.302017001 -0700
+++ b/arch/x86/kernel/cpu/common.c	2015-09-28 11:39:42.306017183 -0700
@@ -619,6 +619,7 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
 		cpuid_count(0x00000007, 0, &eax, &ebx, &ecx, &edx);
 
 		c->x86_capability[9] = ebx;
+		c->x86_capability[13] = ecx;
 	}
 
 	/* Extended state features: level 0x0000000d */
_


* [PATCH 01/25] x86, fpu: add placeholder for Processor Trace XSAVE state
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

There is an XSAVE state component for Intel Processor Trace.  But
we do not use it and do not expect to ever use it.

We add a placeholder for it in the code so that it is not a
mystery, and also so that we do not need an explicit enum
initialization for Protection Keys in a moment.

Why will we never use it?  According to Andi Kleen:

	The XSAVE support assumes that there is a single buffer
	for each thread. But perf generally doesn't work this
	way, it usually has only a single perf event per CPU per
	user, and when tracing multiple threads on that CPU it
	inherits perf event buffers between different threads. So
	XSAVE per thread cannot handle this inheritance case
	directly.

	Using multiple XSAVE areas (another one per perf event)
	would defeat some of the state caching that the CPUs do.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/include/asm/fpu/types.h |    1 +
 b/arch/x86/kernel/fpu/xstate.c     |   10 ++++++++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/fpu/types.h~pt-xstate-bit arch/x86/include/asm/fpu/types.h
--- a/arch/x86/include/asm/fpu/types.h~pt-xstate-bit	2015-09-28 11:39:41.443977969 -0700
+++ b/arch/x86/include/asm/fpu/types.h	2015-09-28 11:39:41.448978197 -0700
@@ -108,6 +108,7 @@ enum xfeature {
 	XFEATURE_OPMASK,
 	XFEATURE_ZMM_Hi256,
 	XFEATURE_Hi16_ZMM,
+	XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
 
 	XFEATURE_MAX,
 };
diff -puN arch/x86/kernel/fpu/xstate.c~pt-xstate-bit arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~pt-xstate-bit	2015-09-28 11:39:41.445978060 -0700
+++ b/arch/x86/kernel/fpu/xstate.c	2015-09-28 11:39:41.449978242 -0700
@@ -13,6 +13,11 @@
 
 #include <asm/tlbflush.h>
 
+/*
+ * Although we spell it out in here, the Processor Trace
+ * xfeature is completely unused.  We use other mechanisms
+ * to save/restore PT state in Linux.
+ */
 static const char *xfeature_names[] =
 {
 	"x87 floating point registers"	,
@@ -23,7 +28,7 @@ static const char *xfeature_names[] =
 	"AVX-512 opmask"		,
 	"AVX-512 Hi256"			,
 	"AVX-512 ZMM_Hi256"		,
-	"unknown xstate feature"	,
+	"Processor Trace (unused)"	,
 };
 
 /*
@@ -469,7 +474,8 @@ static void check_xstate_against_struct(
 	 * numbers.
 	 */
 	if ((nr < XFEATURE_YMM) ||
-	    (nr >= XFEATURE_MAX)) {
+	    (nr >= XFEATURE_MAX) ||
+	    (nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR)) {
 		WARN_ONCE(1, "no structure for xstate: %d\n", nr);
 		XSTATE_WARN_ON(1);
 	}
_

* [PATCH 06/25] x86, pkeys: PTE bits for storing protection key
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Previous documentation has referred to these 4 bits as "ignored".
That means that software could have made use of them.  But, as
far as I know, the kernel never used them.

They are still ignored when protection keys are not enabled, so
they could theoretically still get used for software purposes.

We also implement "empty" versions so that code that references
them can be optimized away by the compiler when the config
option is not enabled.
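
To make the bit layout concrete: recovering the key from a pte
value is just a shift and a mask of those four bits.  This helper
is only an illustrative sketch, not the accessor the series adds:

	/* sketch -- the real accessors come later in the series */
	static inline u16 pte_pkey_sketch(pte_t pte)
	{
		return (pte_val(pte) >> _PAGE_BIT_PKEY_BIT0) & 0xf;
	}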

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/include/asm/pgtable_types.h |   17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff -puN arch/x86/include/asm/pgtable_types.h~pkeys-04-ptebits arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~pkeys-04-ptebits	2015-09-28 11:39:43.661078823 -0700
+++ b/arch/x86/include/asm/pgtable_types.h	2015-09-28 11:39:43.665079005 -0700
@@ -25,7 +25,11 @@
 #define _PAGE_BIT_SPLITTING	_PAGE_BIT_SOFTW2 /* only valid on a PSE pmd */
 #define _PAGE_BIT_HIDDEN	_PAGE_BIT_SOFTW3 /* hidden by kmemcheck */
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
-#define _PAGE_BIT_NX           63       /* No execute: only valid after cpuid check */
+#define _PAGE_BIT_PKEY_BIT0	59       /* Protection Keys, bit 1/4 */
+#define _PAGE_BIT_PKEY_BIT1	60       /* Protection Keys, bit 2/4 */
+#define _PAGE_BIT_PKEY_BIT2	61       /* Protection Keys, bit 3/4 */
+#define _PAGE_BIT_PKEY_BIT3	62       /* Protection Keys, bit 4/4 */
+#define _PAGE_BIT_NX		63       /* No execute: only valid after cpuid check */
 
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
 /* - if the user mapped it with PROT_NONE; pte_present gives true */
@@ -47,6 +51,17 @@
 #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
 #define _PAGE_CPA_TEST	(_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
 #define _PAGE_SPLITTING	(_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+#define _PAGE_PKEY_BIT0	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT0)
+#define _PAGE_PKEY_BIT1	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT1)
+#define _PAGE_PKEY_BIT2	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT2)
+#define _PAGE_PKEY_BIT3	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT3)
+#else
+#define _PAGE_PKEY_BIT0	(_AT(pteval_t, 0))
+#define _PAGE_PKEY_BIT1	(_AT(pteval_t, 0))
+#define _PAGE_PKEY_BIT2	(_AT(pteval_t, 0))
+#define _PAGE_PKEY_BIT3	(_AT(pteval_t, 0))
+#endif
 #define __HAVE_ARCH_PTE_SPECIAL
 
 #ifdef CONFIG_KMEMCHECK
_


* [PATCH 05/25] x86, pkey: add PKRU xsave fields and data structure(s)
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The protection keys register (PKRU) is saved and restored using
xsave.  Define the data structure that we will use to access it
inside the xsave buffer.

Note that we also have to widen the printk of the xsave feature
masks, since this is feature 0x200 and we previously printed only
two hex characters.
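
Besides being saved and restored through the xsave buffer, the
register can be read directly with the new RDPKRU instruction
(%ecx must be 0; the value arrives in %eax and %edx is cleared).
A sketch of what an accessor can look like, with the instruction
bytes open-coded for the benefit of old assemblers:

	static inline u32 rdpkru_sketch(void)
	{
		u32 pkru, edx;

		/* RDPKRU: opcode 0f 01 ee */
		asm volatile(".byte 0x0f,0x01,0xee"
			     : "=a" (pkru), "=d" (edx)
			     : "c" (0));
		return pkru;
	}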

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/include/asm/fpu/types.h  |   16 ++++++++++++++++
 b/arch/x86/include/asm/fpu/xstate.h |    4 +++-
 b/arch/x86/kernel/fpu/xstate.c      |    7 ++++++-
 3 files changed, 25 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/fpu/types.h~pkeys-03-xsave arch/x86/include/asm/fpu/types.h
--- a/arch/x86/include/asm/fpu/types.h~pkeys-03-xsave	2015-09-28 11:39:43.198057761 -0700
+++ b/arch/x86/include/asm/fpu/types.h	2015-09-28 11:39:43.205058079 -0700
@@ -109,6 +109,7 @@ enum xfeature {
 	XFEATURE_ZMM_Hi256,
 	XFEATURE_Hi16_ZMM,
 	XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
+	XFEATURE_PKRU,
 
 	XFEATURE_MAX,
 };
@@ -121,6 +122,7 @@ enum xfeature {
 #define XFEATURE_MASK_OPMASK		(1 << XFEATURE_OPMASK)
 #define XFEATURE_MASK_ZMM_Hi256		(1 << XFEATURE_ZMM_Hi256)
 #define XFEATURE_MASK_Hi16_ZMM		(1 << XFEATURE_Hi16_ZMM)
+#define XFEATURE_MASK_PKRU		(1 << XFEATURE_PKRU)
 
 #define XFEATURE_MASK_FPSSE		(XFEATURE_MASK_FP | XFEATURE_MASK_SSE)
 #define XFEATURE_MASK_AVX512		(XFEATURE_MASK_OPMASK \
@@ -213,6 +215,20 @@ struct avx_512_hi16_state {
 	struct reg_512_bit		hi16_zmm[16];
 } __packed;
 
+/*
+ * State component 9: 32-bit PKRU register.
+ */
+struct pkru {
+	u32 pkru;
+} __packed;
+
+struct pkru_state {
+	union {
+		struct pkru		pkru;
+		u8			pad_to_8_bytes[8];
+	};
+} __packed;
+
 struct xstate_header {
 	u64				xfeatures;
 	u64				xcomp_bv;
diff -puN arch/x86/include/asm/fpu/xstate.h~pkeys-03-xsave arch/x86/include/asm/fpu/xstate.h
--- a/arch/x86/include/asm/fpu/xstate.h~pkeys-03-xsave	2015-09-28 11:39:43.200057851 -0700
+++ b/arch/x86/include/asm/fpu/xstate.h	2015-09-28 11:39:43.205058079 -0700
@@ -27,7 +27,9 @@
 				 XFEATURE_MASK_Hi16_ZMM)
 
 /* Supported features which require eager state saving */
-#define XFEATURE_MASK_EAGER	(XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR)
+#define XFEATURE_MASK_EAGER	(XFEATURE_MASK_BNDREGS | \
+				 XFEATURE_MASK_BNDCSR | \
+				 XFEATURE_MASK_PKRU)
 
 /* All currently supported features */
 #define XCNTXT_MASK	(XFEATURE_MASK_LAZY | XFEATURE_MASK_EAGER)
diff -puN arch/x86/kernel/fpu/xstate.c~pkeys-03-xsave arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~pkeys-03-xsave	2015-09-28 11:39:43.201057897 -0700
+++ b/arch/x86/kernel/fpu/xstate.c	2015-09-28 11:39:43.205058079 -0700
@@ -29,6 +29,8 @@ static const char *xfeature_names[] =
 	"AVX-512 Hi256"			,
 	"AVX-512 ZMM_Hi256"		,
 	"Processor Trace (unused)"	,
+	"Protection Keys User registers",
+	"unknown xstate feature"	,
 };
 
 /*
@@ -57,6 +59,7 @@ void fpu__xstate_clear_all_cpu_caps(void
 	setup_clear_cpu_cap(X86_FEATURE_AVX512ER);
 	setup_clear_cpu_cap(X86_FEATURE_AVX512CD);
 	setup_clear_cpu_cap(X86_FEATURE_MPX);
+	setup_clear_cpu_cap(X86_FEATURE_PKU);
 }
 
 /*
@@ -235,7 +238,7 @@ static void __init print_xstate_feature(
 	const char *feature_name;
 
 	if (cpu_has_xfeatures(xstate_mask, &feature_name))
-		pr_info("x86/fpu: Supporting XSAVE feature 0x%02Lx: '%s'\n", xstate_mask, feature_name);
+		pr_info("x86/fpu: Supporting XSAVE feature 0x%03Lx: '%s'\n", xstate_mask, feature_name);
 }
 
 /*
@@ -251,6 +254,7 @@ static void __init print_xstate_features
 	print_xstate_feature(XFEATURE_MASK_OPMASK);
 	print_xstate_feature(XFEATURE_MASK_ZMM_Hi256);
 	print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
+	print_xstate_feature(XFEATURE_MASK_PKRU);
 }
 
 /*
@@ -467,6 +471,7 @@ static void check_xstate_against_struct(
 	XCHECK_SZ(sz, nr, XFEATURE_OPMASK,    struct avx_512_opmask_state);
 	XCHECK_SZ(sz, nr, XFEATURE_ZMM_Hi256, struct avx_512_zmm_uppers_state);
 	XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM,  struct avx_512_hi16_state);
+	XCHECK_SZ(sz, nr, XFEATURE_PKRU,      struct pkru_state);
 
 	/*
 	 * Make *SURE* to add any feature numbers in below if
_


* [PATCH 04/25] x86, pku: define new CR4 bit
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

There is a new bit in CR4 for enabling protection keys.  We
will actually enable it later in the series.
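
When the enabling does happen, it will presumably boil down to a
feature check plus setting this bit, roughly along these lines
(a sketch with an illustrative function name; cr4_set_bits() is
the kernel's existing helper for CR4 updates):

	static void setup_pku_sketch(struct cpuinfo_x86 *c)
	{
		/* nothing to do if the CPU lacks protection keys */
		if (!cpu_has(c, X86_FEATURE_PKU))
			return;
		cr4_set_bits(X86_CR4_PKE);
	}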

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/include/uapi/asm/processor-flags.h |    2 ++
 1 file changed, 2 insertions(+)

diff -puN arch/x86/include/uapi/asm/processor-flags.h~pkeys-02-cr4 arch/x86/include/uapi/asm/processor-flags.h
--- a/arch/x86/include/uapi/asm/processor-flags.h~pkeys-02-cr4	2015-09-28 11:39:42.787039064 -0700
+++ b/arch/x86/include/uapi/asm/processor-flags.h	2015-09-28 11:39:42.790039200 -0700
@@ -118,6 +118,8 @@
 #define X86_CR4_SMEP		_BITUL(X86_CR4_SMEP_BIT)
 #define X86_CR4_SMAP_BIT	21 /* enable SMAP support */
 #define X86_CR4_SMAP		_BITUL(X86_CR4_SMAP_BIT)
+#define X86_CR4_PKE_BIT		22 /* enable Protection Keys support */
+#define X86_CR4_PKE		_BITUL(X86_CR4_PKE_BIT)
 
 /*
  * x86-64 Task Priority Register, CR8
_

* [PATCH 08/25] x86, pkeys: store protection in high VMA flags
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

vma->vm_flags is an 'unsigned long', so it has space for only 32
flags on 32-bit architectures.  The high 32 bits are unused on
64-bit platforms.  We've steered away from using those unused
high VMA bits for things because we would have difficulty
supporting them on 32-bit.

Protection Keys are not available in 32-bit mode, so there is
no concern about supporting this feature in 32-bit mode or on
32-bit CPUs.

This patch carves out 4 bits from the high half of
vma->vm_flags and allows architectures to set a config option
to make them available.

Sparse complains about these constants unless we explicitly
call them "UL".
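
On x86, the intent is for these four bits to hold a VMA's
protection key, which can then be read back with a shift and a
mask, roughly like this (a sketch with illustrative names; the
real vma_pkey() comes later in the series):

	/* VM_HIGH_ARCH_0 is bit 32 of vm_flags (64-bit only) */
	#define VM_PKEY_SHIFT_SKETCH	32

	static inline int vma_pkey_sketch(struct vm_area_struct *vma)
	{
		return (vma->vm_flags >> VM_PKEY_SHIFT_SKETCH) & 0xf;
	}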

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/Kconfig   |    1 +
 b/include/linux/mm.h |    7 +++++++
 b/mm/Kconfig         |    3 +++
 3 files changed, 11 insertions(+)

diff -puN arch/x86/Kconfig~pkeys-07-eat-high-vma-flags arch/x86/Kconfig
--- a/arch/x86/Kconfig~pkeys-07-eat-high-vma-flags	2015-09-28 11:39:44.493116671 -0700
+++ b/arch/x86/Kconfig	2015-09-28 11:39:44.500116990 -0700
@@ -152,6 +152,7 @@ config X86
 	select VIRT_TO_BUS
 	select X86_DEV_DMA_OPS			if X86_64
 	select X86_FEATURE_NAMES		if PROC_FS
+	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
 
 config INSTRUCTION_DECODER
 	def_bool y
diff -puN include/linux/mm.h~pkeys-07-eat-high-vma-flags include/linux/mm.h
--- a/include/linux/mm.h~pkeys-07-eat-high-vma-flags	2015-09-28 11:39:44.495116762 -0700
+++ b/include/linux/mm.h	2015-09-28 11:39:44.501117035 -0700
@@ -157,6 +157,13 @@ extern unsigned int kobjsize(const void
 #define VM_NOHUGEPAGE	0x40000000	/* MADV_NOHUGEPAGE marked this vma */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
 
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+#define VM_HIGH_ARCH_0  0x100000000UL	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_1  0x200000000UL	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_2  0x400000000UL	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_3  0x800000000UL	/* bit only usable on 64-bit architectures */
+#endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
+
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
 #elif defined(CONFIG_PPC)
diff -puN mm/Kconfig~pkeys-07-eat-high-vma-flags mm/Kconfig
--- a/mm/Kconfig~pkeys-07-eat-high-vma-flags	2015-09-28 11:39:44.497116853 -0700
+++ b/mm/Kconfig	2015-09-28 11:39:44.502117081 -0700
@@ -680,3 +680,6 @@ config ZONE_DEVICE
 
 config FRAME_VECTOR
 	bool
+
+config ARCH_USES_HIGH_VMA_FLAGS
+	bool
_


* [PATCH 07/25] x86, pkeys: new page fault error code bit: PF_PK
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Note: "PK" is how the Intel SDM refers to this bit, so we also
use that nomenclature.

This only defines the bit; it does not plumb it anywhere to be
handled.
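
Once it does get plumbed through, consumers in the fault path can
test it like any other error code bit.  A sketch of the eventual
shape (SEGV_PKUERR is added elsewhere in this series):

	/* sketch of an eventual consumer -- not part of this patch */
	static int fault_si_code_sketch(unsigned long error_code)
	{
		if (error_code & PF_PK)
			return SEGV_PKUERR; /* blocked by a protection key */
		return SEGV_ACCERR;
	}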

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/mm/fault.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff -puN arch/x86/mm/fault.c~pkeys-05-pfec arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-05-pfec	2015-09-28 11:39:44.073097565 -0700
+++ b/arch/x86/mm/fault.c	2015-09-28 11:39:44.076097701 -0700
@@ -33,6 +33,7 @@
  *   bit 2 ==	 0: kernel-mode access	1: user-mode access
  *   bit 3 ==				1: use of reserved bit detected
  *   bit 4 ==				1: fault was an instruction fetch
+ *   bit 5 ==				1: protection keys block access
  */
 enum x86_pf_error_code {
 
@@ -41,6 +42,7 @@ enum x86_pf_error_code {
 	PF_USER		=		1 << 2,
 	PF_RSVD		=		1 << 3,
 	PF_INSTR	=		1 << 4,
+	PF_PK		=		1 << 5,
 };
 
 /*
@@ -916,7 +918,10 @@ static int spurious_fault_check(unsigned
 
 	if ((error_code & PF_INSTR) && !pte_exec(*pte))
 		return 0;
-
+	/*
+	 * Note: We do not do lazy flushing on protection key
+	 * changes, so no spurious fault will ever set PF_PK.
+	 */
 	return 1;
 }
 
_

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 10/25] x86, pkeys: pass VMA down in to fault signal generation code
  2015-09-28 19:18 ` Dave Hansen
@ 2015-09-28 19:18   ` Dave Hansen
  -1 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

During a page fault, we look up the VMA to ensure that the fault
is in a region with a valid mapping.  But, in the top-level page
fault code we don't need the VMA for much else.  Once we have
decided that an access is bad, we are going to send a signal no
matter what and do not need the VMA any more.  So we do not pass
it down in to the signal generation code.

But, for protection keys, we need the VMA.  It tells us *which*
protection key we violated if we get a PF_PK.  So, we need to
pass the VMA down and fill in siginfo->si_pkey.
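
The fill itself happens in the next patch; the shape of it is
roughly:

	/* sketch: what the next patch does once the VMA is available */
	info.si_pkey = vma_pkey(vma);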

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/mm/fault.c |   50 ++++++++++++++++++++++++++++----------------------
 1 file changed, 28 insertions(+), 22 deletions(-)

diff -puN arch/x86/mm/fault.c~pkeys-08-pass-down-vma arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-08-pass-down-vma	2015-09-28 11:39:45.444159933 -0700
+++ b/arch/x86/mm/fault.c	2015-09-28 11:39:45.448160115 -0700
@@ -171,7 +171,8 @@ is_prefetch(struct pt_regs *regs, unsign
 
 static void
 force_sig_info_fault(int si_signo, int si_code, unsigned long address,
-		     struct task_struct *tsk, int fault)
+		     struct task_struct *tsk, struct vm_area_struct *vma,
+		     int fault)
 {
 	unsigned lsb = 0;
 	siginfo_t info;
@@ -656,6 +657,8 @@ no_context(struct pt_regs *regs, unsigne
 	struct task_struct *tsk = current;
 	unsigned long flags;
 	int sig;
+	/* No context means no VMA to pass down */
+	struct vm_area_struct *vma = NULL;
 
 	/* Are we prepared to handle this kernel fault? */
 	if (fixup_exception(regs)) {
@@ -679,7 +682,8 @@ no_context(struct pt_regs *regs, unsigne
 			tsk->thread.cr2 = address;
 
 			/* XXX: hwpoison faults will set the wrong code. */
-			force_sig_info_fault(signal, si_code, address, tsk, 0);
+			force_sig_info_fault(signal, si_code, address,
+					     tsk, vma, 0);
 		}
 
 		/*
@@ -756,7 +760,8 @@ show_signal_msg(struct pt_regs *regs, un
 
 static void
 __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
-		       unsigned long address, int si_code)
+		       unsigned long address, struct vm_area_struct *vma,
+		       int si_code)
 {
 	struct task_struct *tsk = current;
 
@@ -799,7 +804,7 @@ __bad_area_nosemaphore(struct pt_regs *r
 		tsk->thread.error_code	= error_code;
 		tsk->thread.trap_nr	= X86_TRAP_PF;
 
-		force_sig_info_fault(SIGSEGV, si_code, address, tsk, 0);
+		force_sig_info_fault(SIGSEGV, si_code, address, tsk, vma, 0);
 
 		return;
 	}
@@ -812,14 +817,14 @@ __bad_area_nosemaphore(struct pt_regs *r
 
 static noinline void
 bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
-		     unsigned long address)
+		     unsigned long address, struct vm_area_struct *vma)
 {
-	__bad_area_nosemaphore(regs, error_code, address, SEGV_MAPERR);
+	__bad_area_nosemaphore(regs, error_code, address, vma, SEGV_MAPERR);
 }
 
 static void
 __bad_area(struct pt_regs *regs, unsigned long error_code,
-	   unsigned long address, int si_code)
+	   unsigned long address,  struct vm_area_struct *vma, int si_code)
 {
 	struct mm_struct *mm = current->mm;
 
@@ -829,25 +834,25 @@ __bad_area(struct pt_regs *regs, unsigne
 	 */
 	up_read(&mm->mmap_sem);
 
-	__bad_area_nosemaphore(regs, error_code, address, si_code);
+	__bad_area_nosemaphore(regs, error_code, address, vma, si_code);
 }
 
 static noinline void
 bad_area(struct pt_regs *regs, unsigned long error_code, unsigned long address)
 {
-	__bad_area(regs, error_code, address, SEGV_MAPERR);
+	__bad_area(regs, error_code, address, NULL, SEGV_MAPERR);
 }
 
 static noinline void
 bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
-		      unsigned long address)
+		      unsigned long address, struct vm_area_struct *vma)
 {
-	__bad_area(regs, error_code, address, SEGV_ACCERR);
+	__bad_area(regs, error_code, address, vma, SEGV_ACCERR);
 }
 
 static void
 do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
-	  unsigned int fault)
+	  struct vm_area_struct *vma, unsigned int fault)
 {
 	struct task_struct *tsk = current;
 	int code = BUS_ADRERR;
@@ -874,12 +879,13 @@ do_sigbus(struct pt_regs *regs, unsigned
 		code = BUS_MCEERR_AR;
 	}
 #endif
-	force_sig_info_fault(SIGBUS, code, address, tsk, fault);
+	force_sig_info_fault(SIGBUS, code, address, tsk, vma, fault);
 }
 
 static noinline void
 mm_fault_error(struct pt_regs *regs, unsigned long error_code,
-	       unsigned long address, unsigned int fault)
+	       unsigned long address, struct vm_area_struct *vma,
+	       unsigned int fault)
 {
 	if (fatal_signal_pending(current) && !(error_code & PF_USER)) {
 		no_context(regs, error_code, address, 0, 0);
@@ -903,9 +909,9 @@ mm_fault_error(struct pt_regs *regs, uns
 	} else {
 		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
 			     VM_FAULT_HWPOISON_LARGE))
-			do_sigbus(regs, error_code, address, fault);
+			do_sigbus(regs, error_code, address, vma, fault);
 		else if (fault & VM_FAULT_SIGSEGV)
-			bad_area_nosemaphore(regs, error_code, address);
+			bad_area_nosemaphore(regs, error_code, address, vma);
 		else
 			BUG();
 	}
@@ -1116,7 +1122,7 @@ __do_page_fault(struct pt_regs *regs, un
 		 * Don't take the mm semaphore here. If we fixup a prefetch
 		 * fault we could otherwise deadlock:
 		 */
-		bad_area_nosemaphore(regs, error_code, address);
+		bad_area_nosemaphore(regs, error_code, address, NULL);
 
 		return;
 	}
@@ -1129,7 +1135,7 @@ __do_page_fault(struct pt_regs *regs, un
 		pgtable_bad(regs, error_code, address);
 
 	if (unlikely(smap_violation(error_code, regs))) {
-		bad_area_nosemaphore(regs, error_code, address);
+		bad_area_nosemaphore(regs, error_code, address, NULL);
 		return;
 	}
 
@@ -1138,7 +1144,7 @@ __do_page_fault(struct pt_regs *regs, un
 	 * in a region with pagefaults disabled then we must not take the fault
 	 */
 	if (unlikely(faulthandler_disabled() || !mm)) {
-		bad_area_nosemaphore(regs, error_code, address);
+		bad_area_nosemaphore(regs, error_code, address, NULL);
 		return;
 	}
 
@@ -1182,7 +1188,7 @@ __do_page_fault(struct pt_regs *regs, un
 	if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
 		if ((error_code & PF_USER) == 0 &&
 		    !search_exception_tables(regs->ip)) {
-			bad_area_nosemaphore(regs, error_code, address);
+			bad_area_nosemaphore(regs, error_code, address, NULL);
 			return;
 		}
 retry:
@@ -1230,7 +1236,7 @@ retry:
 	 */
 good_area:
 	if (unlikely(access_error(error_code, vma))) {
-		bad_area_access_error(regs, error_code, address);
+		bad_area_access_error(regs, error_code, address, vma);
 		return;
 	}
 
@@ -1268,7 +1274,7 @@ good_area:
 
 	up_read(&mm->mmap_sem);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
-		mm_fault_error(regs, error_code, address, fault);
+		mm_fault_error(regs, error_code, address, vma, fault);
 		return;
 	}
 
_

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 09/25] x86, pkeys: arch-specific protection bits
  2015-09-28 19:18 ` Dave Hansen
@ 2015-09-28 19:18   ` Dave Hansen
  -1 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Lots of things seem to do:

        vma->vm_page_prot = vm_get_page_prot(flags);

and the ptes get created right from things we pull out
of ->vm_page_prot.  So it is very convenient if we can
store the protection key in flags and vm_page_prot, just
like the existing permission bits (_PAGE_RW/PRESENT).  It
greatly reduces the amount of plumbing and arch-specific
hacking we have to do in generic code.

This also takes the new PROT_PKEY{0,1,2,3} flags and
turns *those* in to VM_ flags for vma->vm_flags.

The protection key values are stored in 4 places:
	1. "prot" argument to system calls
	2. vma->vm_flags, filled from the mmap "prot"
	3. vma->vm_page_prot, filled from vma->vm_flags
	4. the PTE itself.

The pseudocode for these four steps is as follows:

	mmap(PROT_PKEY*)
	vma->vm_flags 	  = ... | arch_calc_vm_prot_bits(mmap_prot);
	vma->vm_page_prot = ... | arch_vm_get_page_prot(vma->vm_flags);
	pte = pfn | vma->vm_page_prot

Note that this provides a new definition for x86:

	arch_vm_get_page_prot()
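
A worked example (for illustration): pkey 5 is binary 0101, so
arch_calc_vm_prot_bits() sets VM_PKEY_BIT0|VM_PKEY_BIT2 in
vma->vm_flags, arch_vm_get_page_prot() translates those to
_PAGE_PKEY_BIT0|_PAGE_PKEY_BIT2 in vma->vm_page_prot, and every
pte built from vm_page_prot then carries those same two bits.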

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/include/asm/mmu_context.h   |   20 ++++++++++++++++++++
 b/arch/x86/include/asm/pgtable_types.h |   12 ++++++++++--
 b/arch/x86/include/uapi/asm/mman.h     |   16 ++++++++++++++++
 b/include/linux/mm.h                   |    6 ++++++
 4 files changed, 52 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/mmu_context.h~pkeys-08-store-pkey-in-vma arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~pkeys-08-store-pkey-in-vma	2015-09-28 11:39:44.957137779 -0700
+++ b/arch/x86/include/asm/mmu_context.h	2015-09-28 11:39:44.965138143 -0700
@@ -243,4 +243,24 @@ static inline void arch_unmap(struct mm_
 		mpx_notify_unmap(mm, vma, start, end);
 }
 
+static inline u16 vma_pkey(struct vm_area_struct *vma)
+{
+	u16 pkey = 0;
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+	unsigned long vma_pkey_mask = VM_PKEY_BIT0 | VM_PKEY_BIT1 |
+				      VM_PKEY_BIT2 | VM_PKEY_BIT3;
+	/*
+	 * ffs is one-based, not zero-based, so bias back down by 1.
+	 */
+	int vm_pkey_shift = __builtin_ffsl(vma_pkey_mask) - 1;
+	/*
+	 * gcc generates better code if we do this rather than:
+	 * pkey = (flags & mask) >> shift
+	 */
+	pkey = (vma->vm_flags >> vm_pkey_shift) &
+	       (vma_pkey_mask >> vm_pkey_shift);
+#endif
+	return pkey;
+}
+
 #endif /* _ASM_X86_MMU_CONTEXT_H */
diff -puN arch/x86/include/asm/pgtable_types.h~pkeys-08-store-pkey-in-vma arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~pkeys-08-store-pkey-in-vma	2015-09-28 11:39:44.959137870 -0700
+++ b/arch/x86/include/asm/pgtable_types.h	2015-09-28 11:39:44.965138143 -0700
@@ -111,7 +111,12 @@
 #define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED |	\
 			 _PAGE_DIRTY)
 
-/* Set of bits not changed in pte_modify */
+/*
+ * Set of bits not changed in pte_modify.  The pte's
+ * protection key is treated like _PAGE_RW, for
+ * instance, and is *not* included in this mask since
+ * pte_modify() does modify it.
+ */
 #define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
 			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
 			 _PAGE_SOFT_DIRTY)
@@ -227,7 +232,10 @@ enum page_cache_mode {
 /* PTE_PFN_MASK extracts the PFN from a (pte|pmd|pud|pgd)val_t */
 #define PTE_PFN_MASK		((pteval_t)PHYSICAL_PAGE_MASK)
 
-/* PTE_FLAGS_MASK extracts the flags from a (pte|pmd|pud|pgd)val_t */
+/*
+ *  PTE_FLAGS_MASK extracts the flags from a (pte|pmd|pud|pgd)val_t
+ *  This includes the protection key value.
+ */
 #define PTE_FLAGS_MASK		(~PTE_PFN_MASK)
 
 typedef struct pgprot { pgprotval_t pgprot; } pgprot_t;
diff -puN arch/x86/include/uapi/asm/mman.h~pkeys-08-store-pkey-in-vma arch/x86/include/uapi/asm/mman.h
--- a/arch/x86/include/uapi/asm/mman.h~pkeys-08-store-pkey-in-vma	2015-09-28 11:39:44.960137915 -0700
+++ b/arch/x86/include/uapi/asm/mman.h	2015-09-28 11:39:44.966138188 -0700
@@ -6,6 +6,22 @@
 #define MAP_HUGE_2MB    (21 << MAP_HUGE_SHIFT)
 #define MAP_HUGE_1GB    (30 << MAP_HUGE_SHIFT)
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+/*
+ * Take the 4 protection key bits out of the vma->vm_flags
+ * value and turn them in to the bits that we can put in
+ * to a pte.
+ *
+ * Only override these if Protection Keys are available
+ * (which is only on 64-bit).
+ */
+#define arch_vm_get_page_prot(vm_flags)	__pgprot(	\
+		((vm_flags) & VM_PKEY_BIT0 ? _PAGE_PKEY_BIT0 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT1 ? _PAGE_PKEY_BIT1 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT2 ? _PAGE_PKEY_BIT2 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))
+#endif
+
 #include <asm-generic/mman.h>
 
 #endif /* _ASM_X86_MMAN_H */
diff -puN include/linux/mm.h~pkeys-08-store-pkey-in-vma include/linux/mm.h
--- a/include/linux/mm.h~pkeys-08-store-pkey-in-vma	2015-09-28 11:39:44.962138006 -0700
+++ b/include/linux/mm.h	2015-09-28 11:39:44.967138234 -0700
@@ -166,6 +166,12 @@ extern unsigned int kobjsize(const void
 
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
+#if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)
+# define VM_PKEY_BIT0	VM_HIGH_ARCH_0	/* A protection key is a 4-bit value */
+# define VM_PKEY_BIT1	VM_HIGH_ARCH_1
+# define VM_PKEY_BIT2	VM_HIGH_ARCH_2
+# define VM_PKEY_BIT3	VM_HIGH_ARCH_3
+#endif
 #elif defined(CONFIG_PPC)
 # define VM_SAO		VM_ARCH_1	/* Strong Access Ordering (powerpc) */
 #elif defined(CONFIG_PARISC)
_

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 13/25] mm: factor out VMA fault permission checking
  2015-09-28 19:18 ` Dave Hansen
@ 2015-09-28 19:18   ` Dave Hansen
  -1 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

This code matches a fault condition up with the VMA and ensures
that the VMA allows the fault to be handled instead of just
erroring out.

We will be extending this in a moment to comprehend protection
keys.
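
A sketch of where that extension lands (the real check arrives
with the protection-keys patches later in the series):

	/* sketch only: vma_permits_fault() grows an arch hook */
	if (!arch_vma_access_permitted(vma, fault_flags & FAULT_FLAG_WRITE))
		return false;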

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/mm/gup.c |   15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff -puN mm/gup.c~pkeys-10-pte-fault mm/gup.c
--- a/mm/gup.c~pkeys-10-pte-fault	2015-09-28 11:39:46.790221164 -0700
+++ b/mm/gup.c	2015-09-28 11:39:46.794221345 -0700
@@ -554,6 +554,17 @@ next_page:
 }
 EXPORT_SYMBOL(__get_user_pages);
 
+bool vma_permits_fault(struct vm_area_struct *vma, unsigned int fault_flags)
+{
+	vm_flags_t vm_flags =
+		(fault_flags & FAULT_FLAG_WRITE) ? VM_WRITE : VM_READ;
+
+	if (!(vm_flags & vma->vm_flags))
+		return false;
+
+	return true;
+}
+
 /*
  * fixup_user_fault() - manually resolve a user page fault
  * @tsk:	the task_struct to use for page fault accounting, or
@@ -585,15 +596,13 @@ int fixup_user_fault(struct task_struct
 		     unsigned long address, unsigned int fault_flags)
 {
 	struct vm_area_struct *vma;
-	vm_flags_t vm_flags;
 	int ret;
 
 	vma = find_extend_vma(mm, address);
 	if (!vma || address < vma->vm_start)
 		return -EFAULT;
 
-	vm_flags = (fault_flags & FAULT_FLAG_WRITE) ? VM_WRITE : VM_READ;
-	if (!(vm_flags & vma->vm_flags))
+	if (!vma_permits_fault(vma, fault_flags))
 		return -EFAULT;
 
 	ret = handle_mm_fault(mm, vma, address, fault_flags);
_

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 12/25] x86, pkeys: add functions to fetch PKRU
  2015-09-28 19:18 ` Dave Hansen
@ 2015-09-28 19:18   ` Dave Hansen
  -1 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm


From: Dave Hansen <dave.hansen@linux.intel.com>

This adds the raw instruction to access PKRU as well as some
accessor functions that correctly handle when the CPU does not
support the instruction.  We don't use it here, but we will use
read_pkru() in the next patch.
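
For reference (not part of this patch): the ".byte 0x0f,0x01,0xee"
sequence below is the RDPKRU instruction, and PKRU gives each of
the 16 keys an access-disable and a write-disable bit.  A
hypothetical helper to interpret the value read here might look
like:

	/* illustration only, assuming the standard PKRU bit layout */
	static inline bool pkey_allows_read(u32 pkru, u16 pkey)
	{
		/* bit 2*pkey: access-disable; bit 2*pkey+1: write-disable */
		return !(pkru & (1U << (pkey * 2)));
	}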

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

---

 b/arch/x86/include/asm/pgtable.h       |    8 ++++++++
 b/arch/x86/include/asm/special_insns.h |   20 ++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff -puN arch/x86/include/asm/pgtable.h~pkeys-13-kernel-pkru-instructions arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h~pkeys-13-kernel-pkru-instructions	2015-09-28 11:39:46.356201421 -0700
+++ b/arch/x86/include/asm/pgtable.h	2015-09-28 11:39:46.361201648 -0700
@@ -95,6 +95,14 @@ static inline int pte_dirty(pte_t pte)
 	return pte_flags(pte) & _PAGE_DIRTY;
 }
 
+
+static inline u32 read_pkru(void)
+{
+	if (boot_cpu_has(X86_FEATURE_OSPKE))
+		return __read_pkru();
+	return 0;
+}
+
 static inline int pte_young(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_ACCESSED;
diff -puN arch/x86/include/asm/special_insns.h~pkeys-13-kernel-pkru-instructions arch/x86/include/asm/special_insns.h
--- a/arch/x86/include/asm/special_insns.h~pkeys-13-kernel-pkru-instructions	2015-09-28 11:39:46.357201466 -0700
+++ b/arch/x86/include/asm/special_insns.h	2015-09-28 11:39:46.361201648 -0700
@@ -98,6 +98,26 @@ static inline void native_write_cr8(unsi
 }
 #endif
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+static inline u32 __read_pkru(void)
+{
+	unsigned int eax, edx;
+	unsigned int ecx = 0;
+	unsigned int pkru;
+
+	asm volatile(".byte 0x0f,0x01,0xee\n\t"
+		     : "=a" (eax), "=d" (edx)
+		     : "c" (ecx));
+	pkru = eax;
+	return pkru;
+}
+#else
+static inline u32 __read_pkru(void)
+{
+	return 0;
+}
+#endif
+
 static inline void native_wbinvd(void)
 {
 	asm volatile("wbinvd": : :"memory");
_

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 11/25] x86, pkeys: notify userspace about protection key faults
  2015-09-28 19:18 ` Dave Hansen
@ 2015-09-28 19:18   ` Dave Hansen
  -1 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

A protection key fault is very similar to any other access error.
There must be a VMA, etc...  We even want to take the same action
(SIGSEGV) that we do with a normal access fault.

However, we do need to let userspace know that something is
different.  We do this the same way we did with SEGV_BNDERR
for Memory Protection eXtensions (MPX): define a new SEGV code:
SEGV_PKUERR.

We also add a siginfo field: si_pkey that reveals to userspace
which protection key was set on the PTE that we faulted on.
There is no other easy way for userspace to figure this out.
They could parse smaps but that would be a bit cruel.

Note though that *ALL* protection key faults have to be generated
by a valid, present PTE at some point.  But this code does no PTE
lookups, which seems odd.  The reason is that we take advantage of
the way we generate PTEs from VMAs.  All PTEs under a VMA share
some attributes.  For instance, they are _all_ either PROT_READ
*OR* PROT_NONE.  They also always share a protection key, so we
never have to walk the page tables; we just use the VMA.

We share space in siginfo with _addr_bnd.  #BR faults from
MPX are completely separate from page faults (#PF) that trigger
from protection key violations, so we never need both at the same
time.

Note that _pkey is a 64-bit value.  The current hardware only
supports 4-bit protection keys.  We do this because there is
_plenty_ of space in _sigfault and it is possible that future
processors would support more than 4 bits of protection keys.
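
A sketch of the userspace side (assumes a libc that exposes the
new field; the handler is installed with sigaction() and
SA_SIGINFO):

	void handle_segv(int sig, siginfo_t *si, void *ctx)
	{
		if (si->si_code == SEGV_PKUERR)
			fprintf(stderr, "pkey %llu blocked access to %p\n",
				(unsigned long long)si->si_pkey, si->si_addr);
	}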

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/include/asm/pgtable_types.h |    5 ++
 b/arch/x86/mm/fault.c                  |   59 ++++++++++++++++++++++++++++++++-
 b/include/uapi/asm-generic/siginfo.h   |   17 ++++++---
 b/kernel/signal.c                      |    4 ++
 4 files changed, 79 insertions(+), 6 deletions(-)

diff -puN arch/x86/include/asm/pgtable_types.h~pkeys-09-siginfo arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~pkeys-09-siginfo	2015-09-28 11:39:45.859178812 -0700
+++ b/arch/x86/include/asm/pgtable_types.h	2015-09-28 11:39:45.868179221 -0700
@@ -64,6 +64,11 @@
 #endif
 #define __HAVE_ARCH_PTE_SPECIAL
 
+#define _PAGE_PKEY_MASK (_PAGE_PKEY_BIT0 | \
+			 _PAGE_PKEY_BIT1 | \
+			 _PAGE_PKEY_BIT2 | \
+			 _PAGE_PKEY_BIT3)
+
 #ifdef CONFIG_KMEMCHECK
 #define _PAGE_HIDDEN	(_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN)
 #else
diff -puN arch/x86/mm/fault.c~pkeys-09-siginfo arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-09-siginfo	2015-09-28 11:39:45.861178903 -0700
+++ b/arch/x86/mm/fault.c	2015-09-28 11:39:45.868179221 -0700
@@ -15,12 +15,14 @@
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
 
+#include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
 #include <asm/pgalloc.h>		/* pgd_*(), ...			*/
 #include <asm/kmemcheck.h>		/* kmemcheck_*(), ...		*/
 #include <asm/fixmap.h>			/* VSYSCALL_ADDR		*/
 #include <asm/vsyscall.h>		/* emulate_vsyscall		*/
 #include <asm/vm86.h>			/* struct vm86			*/
+#include <asm/mmu_context.h>		/* vma_pkey()			*/
 
 #define CREATE_TRACE_POINTS
 #include <asm/trace/exceptions.h>
@@ -169,6 +171,56 @@ is_prefetch(struct pt_regs *regs, unsign
 	return prefetch;
 }
 
+/*
+ * A protection key fault means that the PKRU value did not allow
+ * access to some PTE.  Userspace can figure out what PKRU was
+ * from the XSAVE state, and this function fills out a field in
+ * siginfo so userspace can discover which protection key was set
+ * on the PTE.
+ *
+ * If we get here, we know that the hardware signaled a PF_PK
+ * fault and that there was a VMA once we got in the fault
+ * handler.  It does *not* guarantee that the VMA we find here
+ * was the one that we faulted on.
+ *
+ * 1. T1   : mprotect_key(foo, PAGE_SIZE, pkey=4);
+ * 2. T1   : set PKRU to deny access to pkey=4, touches page
+ * 3. T1   : faults...
+ * 4.    T2: mprotect_key(foo, PAGE_SIZE, pkey=5);
+ * 5. T1   : enters fault handler, takes mmap_sem, etc...
+ * 6. T1   : reaches here, sees vma_pkey(vma)=5, when we really
+ *	     faulted on a pte with its pkey=4.
+ */
+static void fill_sig_info_pkey(int si_code, siginfo_t *info,
+		struct vm_area_struct *vma)
+{
+	/* This is effectively an #ifdef */
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return;
+
+	/* Fault not from Protection Keys: nothing to do */
+	if (si_code != SEGV_PKUERR)
+		return;
+	/*
+	 * force_sig_info_fault() is called from a number of
+	 * contexts, some of which have a VMA and some of which
+	 * do not.  The PF_PK handling happens after we have a
+	 * valid VMA, so we should never reach this without a
+	 * valid VMA.
+	 */
+	if (!vma) {
+		WARN_ONCE(1, "PKU fault with no VMA passed in");
+		info->si_pkey = 0;
+		return;
+	}
+	/*
+	 * si_pkey should be thought of as a strong hint, but not
+	 * absolutely guaranteed to be 100% accurate because of
+	 * the race explained above.
+	 */
+	info->si_pkey = vma_pkey(vma);
+}
+
 static void
 force_sig_info_fault(int si_signo, int si_code, unsigned long address,
 		     struct task_struct *tsk, struct vm_area_struct *vma,
@@ -187,6 +239,8 @@ force_sig_info_fault(int si_signo, int s
 		lsb = PAGE_SHIFT;
 	info.si_addr_lsb = lsb;
 
+	fill_sig_info_pkey(si_code, &info, vma);
+
 	force_sig_info(si_signo, &info, tsk);
 }
 
@@ -847,7 +901,10 @@ static noinline void
 bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
 		      unsigned long address, struct vm_area_struct *vma)
 {
-	__bad_area(regs, error_code, address, vma, SEGV_ACCERR);
+	if (boot_cpu_has(X86_FEATURE_OSPKE) && (error_code & PF_PK))
+		__bad_area(regs, error_code, address, vma, SEGV_PKUERR);
+	else
+		__bad_area(regs, error_code, address, vma, SEGV_ACCERR);
 }
 
 static void
diff -puN include/uapi/asm-generic/siginfo.h~pkeys-09-siginfo include/uapi/asm-generic/siginfo.h
--- a/include/uapi/asm-generic/siginfo.h~pkeys-09-siginfo	2015-09-28 11:39:45.863178994 -0700
+++ b/include/uapi/asm-generic/siginfo.h	2015-09-28 11:39:45.869179266 -0700
@@ -91,10 +91,15 @@ typedef struct siginfo {
 			int _trapno;	/* TRAP # which caused the signal */
 #endif
 			short _addr_lsb; /* LSB of the reported address */
-			struct {
-				void __user *_lower;
-				void __user *_upper;
-			} _addr_bnd;
+			union {
+				/* used when si_code=SEGV_BNDERR */
+				struct {
+					void __user *_lower;
+					void __user *_upper;
+				} _addr_bnd;
+				/* used when si_code=SEGV_PKUERR */
+				u64 _pkey;
+			};
 		} _sigfault;
 
 		/* SIGPOLL */
@@ -137,6 +142,7 @@ typedef struct siginfo {
 #define si_addr_lsb	_sifields._sigfault._addr_lsb
 #define si_lower	_sifields._sigfault._addr_bnd._lower
 #define si_upper	_sifields._sigfault._addr_bnd._upper
+#define si_pkey		_sifields._sigfault._pkey
 #define si_band		_sifields._sigpoll._band
 #define si_fd		_sifields._sigpoll._fd
 #ifdef __ARCH_SIGSYS
@@ -206,7 +212,8 @@ typedef struct siginfo {
 #define SEGV_MAPERR	(__SI_FAULT|1)	/* address not mapped to object */
 #define SEGV_ACCERR	(__SI_FAULT|2)	/* invalid permissions for mapped object */
 #define SEGV_BNDERR	(__SI_FAULT|3)  /* failed address bound checks */
-#define NSIGSEGV	3
+#define SEGV_PKUERR	(__SI_FAULT|4)  /* failed protection key checks */
+#define NSIGSEGV	4
 
 /*
  * SIGBUS si_codes
diff -puN kernel/signal.c~pkeys-09-siginfo kernel/signal.c
--- a/kernel/signal.c~pkeys-09-siginfo	2015-09-28 11:39:45.864179039 -0700
+++ b/kernel/signal.c	2015-09-28 11:39:45.870179312 -0700
@@ -2758,6 +2758,10 @@ int copy_siginfo_to_user(siginfo_t __use
 			err |= __put_user(from->si_upper, &to->si_upper);
 		}
 #endif
+#ifdef SEGV_PKUERR
+		if (from->si_signo == SIGSEGV && from->si_code == SEGV_PKUERR)
+			err |= __put_user(from->si_pkey, &to->si_pkey);
+#endif
 		break;
 	case __SI_CHLD:
 		err |= __put_user(from->si_pid, &to->si_pid);
_

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 12/25] x86, pkeys: add functions to fetch PKRU
@ 2015-09-28 19:18   ` Dave Hansen
  0 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm


From: Dave Hansen <dave.hansen@linux.intel.com>

This adds the raw instruction to access PKRU as well as some
accessor functions that correctly handle when the CPU does not
support the instruction.  We don't use it here, but we will use
read_pkru() in the next patch.

eigned-off-by: Dave Hansen <dave.hansen@linux.intel.com>

---

 b/arch/x86/include/asm/pgtable.h       |    8 ++++++++
 b/arch/x86/include/asm/special_insns.h |   20 ++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff -puN arch/x86/include/asm/pgtable.h~pkeys-13-kernel-pkru-instructions arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h~pkeys-13-kernel-pkru-instructions	2015-09-28 11:39:46.356201421 -0700
+++ b/arch/x86/include/asm/pgtable.h	2015-09-28 11:39:46.361201648 -0700
@@ -95,6 +95,14 @@ static inline int pte_dirty(pte_t pte)
 	return pte_flags(pte) & _PAGE_DIRTY;
 }
 
+
+static inline u32 read_pkru(void)
+{
+	if (boot_cpu_has(X86_FEATURE_OSPKE))
+		return __read_pkru();
+	return 0;
+}
+
 static inline int pte_young(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_ACCESSED;
diff -puN arch/x86/include/asm/special_insns.h~pkeys-13-kernel-pkru-instructions arch/x86/include/asm/special_insns.h
--- a/arch/x86/include/asm/special_insns.h~pkeys-13-kernel-pkru-instructions	2015-09-28 11:39:46.357201466 -0700
+++ b/arch/x86/include/asm/special_insns.h	2015-09-28 11:39:46.361201648 -0700
@@ -98,6 +98,26 @@ static inline void native_write_cr8(unsi
 }
 #endif
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+static inline u32 __read_pkru(void)
+{
+        unsigned int eax, edx;
+        unsigned int ecx = 0;
+        unsigned int pkru;
+
+        asm volatile(".byte 0x0f,0x01,0xee\n\t"
+                     : "=a" (eax), "=d" (edx)
+                     : "c" (ecx));
+        pkru = eax;
+        return pkru;
+}
+#else
+static inline u32 __read_pkru(void)
+{
+	return 0;
+}
+#endif
+
 static inline void native_wbinvd(void)
 {
 	asm volatile("wbinvd": : :"memory");
_

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 11/25] x86, pkeys: notify userspace about protection key faults
@ 2015-09-28 19:18   ` Dave Hansen
  0 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

A protection key fault is very similar to any other access error.
There must be a VMA, etc...  We even want to take the same action
(SIGSEGV) that we do with a normal access fault.

However, we do need to let userspace know that something is
different.  We do this the same way what we did with SEGV_BNDERR
with Memory Protection eXtensions (MPX): define a new SEGV code:
SEGV_PKUERR.

We also add a siginfo field: si_pkey that reveals to userspace
which protection key was set on the PTE that we faulted on.
There is no other easy way for userspace to figure this out.
They could parse smaps but that would be a bit cruel.

Note though that *ALL* protection key faults have to be generated
by a valid, present PTE at some point.  But this code does no PTE
lookups which seeds odd.  The reason is that we take advantage of
the way we generate PTEs from VMAs.  All PTEs under a VMA share
some attributes.  For instance, they are _all_ either PROT_READ
*OR* PROT_NONE.  They also always share a protection key, so we
never have to walk the page tables; we just use the VMA.

We share space with in siginfo with _addr_bnd.  #BR faults from
MPX are completely separate from page faults (#PF) that trigger
from protection key violations, so we never need both at the same
time.

Note that _pkey is a 64-bit value.  The current hardware only
supports 4-bit protection keys.  We do this because there is
_plenty_ of space in _sigfault and it is possible that future
processors would support more than 4 bits of protection keys.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/include/asm/pgtable_types.h |    5 ++
 b/arch/x86/mm/fault.c                  |   59 ++++++++++++++++++++++++++++++++-
 b/include/uapi/asm-generic/siginfo.h   |   17 ++++++---
 b/kernel/signal.c                      |    4 ++
 4 files changed, 79 insertions(+), 6 deletions(-)

diff -puN arch/x86/include/asm/pgtable_types.h~pkeys-09-siginfo arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~pkeys-09-siginfo	2015-09-28 11:39:45.859178812 -0700
+++ b/arch/x86/include/asm/pgtable_types.h	2015-09-28 11:39:45.868179221 -0700
@@ -64,6 +64,11 @@
 #endif
 #define __HAVE_ARCH_PTE_SPECIAL
 
+#define _PAGE_PKEY_MASK (_PAGE_PKEY_BIT0 | \
+			 _PAGE_PKEY_BIT1 | \
+			 _PAGE_PKEY_BIT2 | \
+			 _PAGE_PKEY_BIT3)
+
 #ifdef CONFIG_KMEMCHECK
 #define _PAGE_HIDDEN	(_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN)
 #else
diff -puN arch/x86/mm/fault.c~pkeys-09-siginfo arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-09-siginfo	2015-09-28 11:39:45.861178903 -0700
+++ b/arch/x86/mm/fault.c	2015-09-28 11:39:45.868179221 -0700
@@ -15,12 +15,14 @@
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
 
+#include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
 #include <asm/pgalloc.h>		/* pgd_*(), ...			*/
 #include <asm/kmemcheck.h>		/* kmemcheck_*(), ...		*/
 #include <asm/fixmap.h>			/* VSYSCALL_ADDR		*/
 #include <asm/vsyscall.h>		/* emulate_vsyscall		*/
 #include <asm/vm86.h>			/* struct vm86			*/
+#include <asm/mmu_context.h>		/* vma_pkey()			*/
 
 #define CREATE_TRACE_POINTS
 #include <asm/trace/exceptions.h>
@@ -169,6 +171,56 @@ is_prefetch(struct pt_regs *regs, unsign
 	return prefetch;
 }
 
+/*
+ * A protection key fault means that the PKRU value did not allow
+ * access to some PTE.  Userspace can figure out what PKRU was
+ * from the XSAVE state, and this function fills out a field in
+ * siginfo so userspace can discover which protection key was set
+ * on the PTE.
+ *
+ * If we get here, we know that the hardware signaled a PF_PK
+ * fault and that there was a VMA once we got in the fault
+ * handler.  It does *not* guarantee that the VMA we find here
+ * was the one that we faulted on.
+ *
+ * 1. T1   : mprotect_key(foo, PAGE_SIZE, pkey=4);
+ * 2. T1   : set PKRU to deny access to pkey=4, touches page
+ * 3. T1   : faults...
+ * 4.    T2: mprotect_key(foo, PAGE_SIZE, pkey=5);
+ * 5. T1   : enters fault handler, takes mmap_sem, etc...
+ * 6. T1   : reaches here, sees vma_pkey(vma)=5, when we really
+ *	     faulted on a pte with its pkey=4.
+ */
+static void fill_sig_info_pkey(int si_code, siginfo_t *info,
+		struct vm_area_struct *vma)
+{
+	/* This is effectively an #ifdef */
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return;
+
+	/* Fault not from Protection Keys: nothing to do */
+	if (si_code != SEGV_PKUERR)
+		return;
+	/*
+	 * force_sig_info_fault() is called from a number of
+	 * contexts, some of which have a VMA and some of which
+	 * do not.  The PF_PK handing happens after we have a
+	 * valid VMA, so we should never reach this without a
+	 * valid VMA.
+	 */
+	if (!vma) {
+		WARN_ONCE(1, "PKU fault with no VMA passed in");
+		info->si_pkey = 0;
+		return;
+	}
+	/*
+	 * si_pkey should be thought of as a strong hint, but not
+	 * absolutely guranteed to be 100% accurate because of
+	 * the race explained above.
+	 */
+	info->si_pkey = vma_pkey(vma);
+}
+
 static void
 force_sig_info_fault(int si_signo, int si_code, unsigned long address,
 		     struct task_struct *tsk, struct vm_area_struct *vma,
@@ -187,6 +239,8 @@ force_sig_info_fault(int si_signo, int s
 		lsb = PAGE_SHIFT;
 	info.si_addr_lsb = lsb;
 
+	fill_sig_info_pkey(si_code, &info, vma);
+
 	force_sig_info(si_signo, &info, tsk);
 }
 
@@ -847,7 +901,10 @@ static noinline void
 bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
 		      unsigned long address, struct vm_area_struct *vma)
 {
-	__bad_area(regs, error_code, address, vma, SEGV_ACCERR);
+	if (boot_cpu_has(X86_FEATURE_OSPKE) && (error_code & PF_PK))
+		__bad_area(regs, error_code, address, vma, SEGV_PKUERR);
+	else
+		__bad_area(regs, error_code, address, vma, SEGV_ACCERR);
 }
 
 static void
diff -puN include/uapi/asm-generic/siginfo.h~pkeys-09-siginfo include/uapi/asm-generic/siginfo.h
--- a/include/uapi/asm-generic/siginfo.h~pkeys-09-siginfo	2015-09-28 11:39:45.863178994 -0700
+++ b/include/uapi/asm-generic/siginfo.h	2015-09-28 11:39:45.869179266 -0700
@@ -91,10 +91,15 @@ typedef struct siginfo {
 			int _trapno;	/* TRAP # which caused the signal */
 #endif
 			short _addr_lsb; /* LSB of the reported address */
-			struct {
-				void __user *_lower;
-				void __user *_upper;
-			} _addr_bnd;
+			union {
+				/* used when si_code=SEGV_BNDERR */
+				struct {
+					void __user *_lower;
+					void __user *_upper;
+				} _addr_bnd;
+				/* used when si_code=SEGV_PKUERR */
+				u64 _pkey;
+			};
 		} _sigfault;
 
 		/* SIGPOLL */
@@ -137,6 +142,7 @@ typedef struct siginfo {
 #define si_addr_lsb	_sifields._sigfault._addr_lsb
 #define si_lower	_sifields._sigfault._addr_bnd._lower
 #define si_upper	_sifields._sigfault._addr_bnd._upper
+#define si_pkey		_sifields._sigfault._pkey
 #define si_band		_sifields._sigpoll._band
 #define si_fd		_sifields._sigpoll._fd
 #ifdef __ARCH_SIGSYS
@@ -206,7 +212,8 @@ typedef struct siginfo {
 #define SEGV_MAPERR	(__SI_FAULT|1)	/* address not mapped to object */
 #define SEGV_ACCERR	(__SI_FAULT|2)	/* invalid permissions for mapped object */
 #define SEGV_BNDERR	(__SI_FAULT|3)  /* failed address bound checks */
-#define NSIGSEGV	3
+#define SEGV_PKUERR	(__SI_FAULT|4)  /* failed protection key checks */
+#define NSIGSEGV	4
 
 /*
  * SIGBUS si_codes
diff -puN kernel/signal.c~pkeys-09-siginfo kernel/signal.c
--- a/kernel/signal.c~pkeys-09-siginfo	2015-09-28 11:39:45.864179039 -0700
+++ b/kernel/signal.c	2015-09-28 11:39:45.870179312 -0700
@@ -2758,6 +2758,10 @@ int copy_siginfo_to_user(siginfo_t __use
 			err |= __put_user(from->si_upper, &to->si_upper);
 		}
 #endif
+#ifdef SEGV_BNDERR
+		if (from->si_signo == SIGSEGV && from->si_code == SEGV_PKUERR)
+			err |= __put_user(from->si_pkey, &to->si_pkey);
+#endif
 		break;
 	case __SI_CHLD:
 		err |= __put_user(from->si_pid, &to->si_pid);
_

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 13/25] mm: factor out VMA fault permission checking
@ 2015-09-28 19:18   ` Dave Hansen
  0 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

This code matches a fault condition up with the VMA and ensures
that the VMA allows the fault to be handled instead of just
erroring out.

We will be extending this in a moment to comprehend protection
keys.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/mm/gup.c |   15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff -puN mm/gup.c~pkeys-10-pte-fault mm/gup.c
--- a/mm/gup.c~pkeys-10-pte-fault	2015-09-28 11:39:46.790221164 -0700
+++ b/mm/gup.c	2015-09-28 11:39:46.794221345 -0700
@@ -554,6 +554,17 @@ next_page:
 }
 EXPORT_SYMBOL(__get_user_pages);
 
+bool vma_permits_fault(struct vm_area_struct *vma, unsigned int fault_flags)
+{
+        vm_flags_t vm_flags =
+		(fault_flags & FAULT_FLAG_WRITE) ? VM_WRITE : VM_READ;
+
+	if (!(vm_flags & vma->vm_flags))
+		return false;
+
+	return true;
+}
+
 /*
  * fixup_user_fault() - manually resolve a user page fault
  * @tsk:	the task_struct to use for page fault accounting, or
@@ -585,15 +596,13 @@ int fixup_user_fault(struct task_struct
 		     unsigned long address, unsigned int fault_flags)
 {
 	struct vm_area_struct *vma;
-	vm_flags_t vm_flags;
 	int ret;
 
 	vma = find_extend_vma(mm, address);
 	if (!vma || address < vma->vm_start)
 		return -EFAULT;
 
-	vm_flags = (fault_flags & FAULT_FLAG_WRITE) ? VM_WRITE : VM_READ;
-	if (!(vm_flags & vma->vm_flags))
+	if (!vma_permits_fault(vma, fault_flags))
 		return -EFAULT;
 
 	ret = handle_mm_fault(mm, vma, address, fault_flags);
_

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 15/25] x86, pkeys: check VMAs and PTEs for protection keys
  2015-09-28 19:18 ` Dave Hansen
@ 2015-09-28 19:18   ` Dave Hansen
  -1 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Today, for normal faults and page table walks, we check the VMA
and/or PTE to ensure that it is compatible with the action.  For
instance, if we get a write fault on a non-writeable VMA, we
SIGSEGV.

We try to do the same thing for protection keys.  Basically, we
try to make sure that if a user does this:

	mprotect(ptr, size, PROT_NONE);
	*ptr = foo;

they see the same effects with protection keys when they do this:

	mprotect(ptr, size, PROT_READ|PROT_WRITE);
	set_pkey(ptr, size, 4);
	wrpkru(0xffffff3f); // access disable pkey 4
	*ptr = foo;

The state to do that checking is in the VMA, but we also
sometimes have to do it on the page tables only, like when doing
a get_user_pages_fast() where we have no VMA.

We add two functions and expose them to generic code:

	arch_pte_access_permitted(pte, write)
	arch_vma_access_permitted(vma, write)

These are, of course, backed up in x86 arch code with checks
against the PTE or VMA's protection key.

But, there are also cases where we do not want to respect
protection keys.  When we ptrace(), for instance, we do not want
to apply the tracer's PKRU permissions to the PTEs from the
process being traced.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/powerpc/include/asm/mmu_context.h   |   11 ++++++
 b/arch/s390/include/asm/mmu_context.h      |   11 ++++++
 b/arch/unicore32/include/asm/mmu_context.h |   11 ++++++
 b/arch/x86/include/asm/mmu_context.h       |   51 ++++++++++++++++++++++++++++-
 b/arch/x86/include/asm/pgtable.h           |   29 ++++++++++++++++
 b/arch/x86/mm/fault.c                      |   21 +++++++++++
 b/arch/x86/mm/gup.c                        |    3 +
 b/include/asm-generic/mm_hooks.h           |   11 ++++++
 b/mm/gup.c                                 |   17 ++++++++-
 b/mm/memory.c                              |    4 ++
 10 files changed, 165 insertions(+), 4 deletions(-)

diff -puN arch/powerpc/include/asm/mmu_context.h~pkeys-11-pte-fault arch/powerpc/include/asm/mmu_context.h
--- a/arch/powerpc/include/asm/mmu_context.h~pkeys-11-pte-fault	2015-09-28 11:39:47.619258875 -0700
+++ b/arch/powerpc/include/asm/mmu_context.h	2015-09-28 11:39:47.637259694 -0700
@@ -148,5 +148,16 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
diff -puN arch/s390/include/asm/mmu_context.h~pkeys-11-pte-fault arch/s390/include/asm/mmu_context.h
--- a/arch/s390/include/asm/mmu_context.h~pkeys-11-pte-fault	2015-09-28 11:39:47.621258967 -0700
+++ b/arch/s390/include/asm/mmu_context.h	2015-09-28 11:39:47.638259740 -0700
@@ -130,4 +130,15 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
 #endif /* __S390_MMU_CONTEXT_H */
diff -puN arch/unicore32/include/asm/mmu_context.h~pkeys-11-pte-fault arch/unicore32/include/asm/mmu_context.h
--- a/arch/unicore32/include/asm/mmu_context.h~pkeys-11-pte-fault	2015-09-28 11:39:47.622259012 -0700
+++ b/arch/unicore32/include/asm/mmu_context.h	2015-09-28 11:39:47.638259740 -0700
@@ -97,4 +97,15 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
 #endif
diff -puN arch/x86/include/asm/mmu_context.h~pkeys-11-pte-fault arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~pkeys-11-pte-fault	2015-09-28 11:39:47.624259103 -0700
+++ b/arch/x86/include/asm/mmu_context.h	2015-09-28 11:39:47.638259740 -0700
@@ -263,4 +263,53 @@ static inline u16 vma_pkey(struct vm_are
 	return pkey;
 }
 
-#endif /* _ASM_X86_MMU_CONTEXT_H */
+static inline bool __pkru_allows_pkey(u16 pkey, bool write)
+{
+	u32 pkru = read_pkru();
+
+	if (!__pkru_allows_read(pkru, pkey))
+		return false;
+	if (write && !__pkru_allows_write(pkru, pkey))
+		return false;
+
+	return true;
+}
+
+/*
+ * We only want to enforce protection keys on the current process
+ * because we effectively have no access to PKRU for other
+ * processes or any way to tell *which* PKRU in a threaded
+ * process we could use.
+ *
+ * So do not enforce things if the VMA is not from the current
+ * mm, or if we are in a kernel thread.
+ */
+static inline bool vma_is_foreign(struct vm_area_struct *vma)
+{
+	if (!current->mm)
+		return true;
+	/*
+	 * Should PKRU be enforced on the access to this VMA?  If
+	 * the VMA is from another process, then PKRU has no
+	 * relevance and should not be enforced.
+	 */
+	if (current->mm != vma->vm_mm)
+		return true;
+
+	return false;
+}
+
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* allow access if the VMA is not one from this process */
+	if (vma_is_foreign(vma))
+		return true;
+	return __pkru_allows_pkey(vma_pkey(vma), write);
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	return __pkru_allows_pkey(pte_pkey(pte), write);
+}
+
+#endif /* _ASM_X86_MMU_CONTEXT_H */
diff -puN arch/x86/include/asm/pgtable.h~pkeys-11-pte-fault arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h~pkeys-11-pte-fault	2015-09-28 11:39:47.626259194 -0700
+++ b/arch/x86/include/asm/pgtable.h	2015-09-28 11:39:47.639259785 -0700
@@ -889,6 +889,35 @@ static inline pte_t pte_swp_clear_soft_d
 }
 #endif
 
+#define PKRU_AD_BIT 0x1
+#define PKRU_WD_BIT 0x2
+
+static inline bool __pkru_allows_read(u32 pkru, u16 pkey)
+{
+	int pkru_pkey_bits = pkey * 2;
+	return !(pkru & (PKRU_AD_BIT << pkru_pkey_bits));
+}
+
+static inline bool __pkru_allows_write(u32 pkru, u16 pkey)
+{
+	int pkru_pkey_bits = pkey * 2;
+	/*
+	 * Access-disable disables writes too so we need to check
+	 * both bits here.
+	 */
+	return !(pkru & ((PKRU_AD_BIT|PKRU_WD_BIT) << pkru_pkey_bits));
+}
+
+static inline u16 pte_pkey(pte_t pte)
+{
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+	/* ifdef to avoid doing 59-bit shift on 32-bit values */
+	return (pte_flags(pte) & _PAGE_PKEY_MASK) >> _PAGE_BIT_PKEY_BIT0;
+#else
+	return 0;
+#endif
+}
+
 #include <asm-generic/pgtable.h>
 #endif	/* __ASSEMBLY__ */
 
diff -puN arch/x86/mm/fault.c~pkeys-11-pte-fault arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-11-pte-fault	2015-09-28 11:39:47.627259240 -0700
+++ b/arch/x86/mm/fault.c	2015-09-28 11:39:47.639259785 -0700
@@ -897,11 +897,21 @@ bad_area(struct pt_regs *regs, unsigned
 	__bad_area(regs, error_code, address, NULL, SEGV_MAPERR);
 }
 
+static inline bool bad_area_access_from_pkeys(unsigned long error_code,
+		struct vm_area_struct *vma)
+{
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return false;
+	if (error_code & PF_PK)
+		return true;
+	return false;
+}
+
 static noinline void
 bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
 		      unsigned long address, struct vm_area_struct *vma)
 {
-	if (boot_cpu_has(X86_FEATURE_OSPKE) && (error_code & PF_PK))
+	if (bad_area_access_from_pkeys(error_code, vma))
 		__bad_area(regs, error_code, address, vma, SEGV_PKUERR);
 	else
 		__bad_area(regs, error_code, address, vma, SEGV_ACCERR);
@@ -1073,6 +1083,15 @@ int show_unhandled_signals = 1;
 static inline int
 access_error(unsigned long error_code, struct vm_area_struct *vma)
 {
+	/*
+	 * Read or write was blocked by protection keys. We do
+	 * this check before any others because we do not want
+	 * to, for instance, confuse a protection-key-denied
+	 * write with one for which we should do a COW.
+	 */
+	if (error_code & PF_PK)
+		return 1;
+
 	if (error_code & PF_WRITE) {
 		/* write, present and write, not present: */
 		if (unlikely(!(vma->vm_flags & VM_WRITE)))
diff -puN arch/x86/mm/gup.c~pkeys-11-pte-fault arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c~pkeys-11-pte-fault	2015-09-28 11:39:47.629259330 -0700
+++ b/arch/x86/mm/gup.c	2015-09-28 11:39:47.640259831 -0700
@@ -10,6 +10,7 @@
 #include <linux/highmem.h>
 #include <linux/swap.h>
 
+#include <asm/mmu_context.h>
 #include <asm/pgtable.h>
 
 static inline pte_t gup_get_pte(pte_t *ptep)
@@ -73,6 +74,8 @@ static inline int pte_allows_gup(pte_t p
 		return 0;
 	if (write && !pte_write(pte))
 		return 0;
+	if (!arch_pte_access_permitted(pte, write))
+		return 0;
 	return 1;
 }
 
diff -puN include/asm-generic/mm_hooks.h~pkeys-11-pte-fault include/asm-generic/mm_hooks.h
--- a/include/asm-generic/mm_hooks.h~pkeys-11-pte-fault	2015-09-28 11:39:47.631259421 -0700
+++ b/include/asm-generic/mm_hooks.h	2015-09-28 11:39:47.640259831 -0700
@@ -26,4 +26,15 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
 #endif	/* _ASM_GENERIC_MM_HOOKS_H */
diff -puN mm/gup.c~pkeys-11-pte-fault mm/gup.c
--- a/mm/gup.c~pkeys-11-pte-fault	2015-09-28 11:39:47.632259467 -0700
+++ b/mm/gup.c	2015-09-28 11:39:47.641259876 -0700
@@ -13,6 +13,7 @@
 #include <linux/rwsem.h>
 #include <linux/hugetlb.h>
 
+#include <asm/mmu_context.h>
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
 
@@ -388,6 +389,8 @@ static int check_vma_flags(struct vm_are
 		if (!(vm_flags & VM_MAYREAD))
 			return -EFAULT;
 	}
+	if (!arch_vma_access_permitted(vma, (gup_flags & FOLL_WRITE)))
+		return -EFAULT;
 	return 0;
 }
 
@@ -556,12 +559,19 @@ EXPORT_SYMBOL(__get_user_pages);
 
 bool vma_permits_fault(struct vm_area_struct *vma, unsigned int fault_flags)
 {
-        vm_flags_t vm_flags =
-		(fault_flags & FAULT_FLAG_WRITE) ? VM_WRITE : VM_READ;
+	int write = (fault_flags & FAULT_FLAG_WRITE);
+	vm_flags_t vm_flags = write ? VM_WRITE : VM_READ;
 
 	if (!(vm_flags & vma->vm_flags))
 		return false;
 
+	/*
+	 * The architecture might have a hardware protection
+	 * mechanism other than read/write that can deny access
+	 */
+	if (!arch_vma_access_permitted(vma, write))
+		return false;
+
 	return true;
 }
 
@@ -1079,6 +1089,9 @@ static int gup_pte_range(pmd_t pmd, unsi
 			pte_protnone(pte) || (write && !pte_write(pte)))
 			goto pte_unmap;
 
+		if (!arch_pte_access_permitted(pte, write))
+			goto pte_unmap;
+
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
diff -puN mm/memory.c~pkeys-11-pte-fault mm/memory.c
--- a/mm/memory.c~pkeys-11-pte-fault	2015-09-28 11:39:47.634259558 -0700
+++ b/mm/memory.c	2015-09-28 11:39:47.642259922 -0700
@@ -64,6 +64,7 @@
 #include <linux/userfaultfd_k.h>
 
 #include <asm/io.h>
+#include <asm/mmu_context.h>
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
 #include <asm/tlb.h>
@@ -3342,6 +3343,9 @@ static int __handle_mm_fault(struct mm_s
 	pmd_t *pmd;
 	pte_t *pte;
 
+	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE))
+		return VM_FAULT_SIGSEGV;
+
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return hugetlb_fault(mm, vma, address, flags);
 
_
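
To make the PKRU arithmetic concrete, here is a stand-alone model of
__pkru_allows_pkey() with a worked example.  The layout (two bits
per key, AD then WD) comes straight from the pgtable.h hunk above;
everything else is illustrative.

	#include <assert.h>
	#include <stdbool.h>
	#include <stdint.h>

	#define PKRU_AD_BIT 0x1		/* access-disable */
	#define PKRU_WD_BIT 0x2		/* write-disable */

	static bool pkru_allows_pkey(uint32_t pkru, int pkey, bool write)
	{
		int shift = pkey * 2;

		if (pkru & (PKRU_AD_BIT << shift))
			return false;	/* all access denied */
		if (write && (pkru & (PKRU_WD_BIT << shift)))
			return false;	/* writes denied */
		return true;
	}

	int main(void)
	{
		/* 0x300 sets AD|WD for pkey 4 and nothing else */
		assert(!pkru_allows_pkey(0x300, 4, false));
		assert(pkru_allows_pkey(0x300, 3, true));
		return 0;
	}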

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 16/25] x86, pkeys: optimize fault handling in access_error()
  2015-09-28 19:18 ` Dave Hansen
@ 2015-09-28 19:18   ` Dave Hansen
  -1 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

We might not strictly have to make modifications to
access_error() to check the VMA here.

If we do not, we will do this:
1. app sets VMA pkey to K
2. app touches a !present page
3. do_page_fault(), allocates and maps page, sets pte.pkey=K
4. return to userspace
5. touch instruction reexecutes, but triggers PF_PK
6. do PKEY signal

What happens with this patch applied:
1. app sets VMA pkey to K
2. app touches a !present page
3. do_page_fault() notices that K is inaccessible
4. do PKEY signal

We basically skip the fault that does an allocation.

So what this lets us do is protect areas from even being
*populated* unless they are accessible according to protection
keys.  That seems handy to me and makes protection keys work
more like an mprotect()'d mapping.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/mm/fault.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff -puN arch/x86/mm/fault.c~pkeys-15-access_error arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-15-access_error	2015-09-28 11:39:48.287289263 -0700
+++ b/arch/x86/mm/fault.c	2015-09-28 11:39:48.290289400 -0700
@@ -904,6 +904,9 @@ static inline bool bad_area_access_from_
 		return false;
 	if (error_code & PF_PK)
 		return true;
+	/* this checks protection keys on the VMA: */
+	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE)))
+		return true;
 	return false;
 }
 
@@ -1091,6 +1094,13 @@ access_error(unsigned long error_code, s
 	 */
 	if (error_code & PF_PK)
 		return 1;
+	/*
+	 * Make sure to check the VMA so that we do not perform
+	 * faults just to hit a PF_PK as soon as we fill in a
+	 * page.
+	 */
+	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE)))
+		return 1;
 
 	if (error_code & PF_WRITE) {
 		/* write, present and write, not present: */
_
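
In userspace terms, the patch changes *when* the signal arrives, not
whether it arrives.  Reusing the illustrative interfaces from the
patch 15 description (set_pkey() and wrpkru() are sketch names, not
a final API):

	char *ptr = mmap(NULL, 4096, PROT_READ|PROT_WRITE,
			 MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	set_pkey(ptr, 4096, 4);	/* page not yet populated */
	wrpkru(0xffffff3f);	/* access disable pkey 4 */
	*ptr = 1;	/* signals before, not after, a page is allocated */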

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 14/25] mm: simplify get_user_pages() PTE bit handling
  2015-09-28 19:18 ` Dave Hansen
@ 2015-09-28 19:18   ` Dave Hansen
  -1 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The current get_user_pages() code is a wee bit more complicated
than it needs to be for pte bit checking.  Currently, it establishes
a mask of required pte _PAGE_* bits and ensures that the pte it
goes after has all those bits.

We need to use the bits for our _PAGE_PRESENT check since
pte_present() is also true for _PAGE_PROTNONE, and we have no
accessor for _PAGE_USER, so we need the bits there as well.

But we might as well just use pte_write() since we have it and
let the compiler work its magic on optimizing it.

This also consolidates the three identical copies of this code.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/mm/gup.c |   34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff -puN arch/x86/mm/gup.c~pkeys-16-gup-swizzle arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c~pkeys-16-gup-swizzle	2015-09-28 11:39:47.203239951 -0700
+++ b/arch/x86/mm/gup.c	2015-09-28 11:39:47.206240088 -0700
@@ -63,6 +63,19 @@ retry:
 #endif
 }
 
+static inline int pte_allows_gup(pte_t pte, int write)
+{
+	/*
+	 * pte_present() is also true for _PAGE_PROTNONE ptes, so we can
+	 * not use it here; _PAGE_PRESENT and _PAGE_USER must both be set.
+	 */
+	if ((pte_flags(pte) & (_PAGE_PRESENT|_PAGE_USER)) != (_PAGE_PRESENT|_PAGE_USER))
+		return 0;
+	if (write && !pte_write(pte))
+		return 0;
+	return 1;
+}
+
 /*
  * The performance critical leaf functions are made noinline otherwise gcc
  * inlines everything into a single function which results in too much
@@ -71,13 +84,8 @@ retry:
 static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
-	unsigned long mask;
 	pte_t *ptep;
 
-	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
-
 	ptep = pte_offset_map(&pmd, addr);
 	do {
 		pte_t pte = gup_get_pte(ptep);
@@ -88,8 +96,8 @@ static noinline int gup_pte_range(pmd_t
 			pte_unmap(ptep);
 			return 0;
 		}
-
-		if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
+		if (!pte_allows_gup(pte, write) ||
+		    pte_special(pte)) {
 			pte_unmap(ptep);
 			return 0;
 		}
@@ -117,15 +125,11 @@ static inline void get_head_page_multipl
 static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
-	unsigned long mask;
 	pte_t pte = *(pte_t *)&pmd;
 	struct page *head, *page;
 	int refs;
 
-	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
-	if ((pte_flags(pte) & mask) != mask)
+	if (!pte_allows_gup(pte, write))
 		return 0;
 	/* hugepages are never "special" */
 	VM_BUG_ON(pte_flags(pte) & _PAGE_SPECIAL);
@@ -194,15 +198,11 @@ static int gup_pmd_range(pud_t pud, unsi
 static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
-	unsigned long mask;
 	pte_t pte = *(pte_t *)&pud;
 	struct page *head, *page;
 	int refs;
 
-	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
-	if ((pte_flags(pte) & mask) != mask)
+	if (!pte_allows_gup(pte, write))
 		return 0;
 	/* hugepages are never "special" */
 	VM_BUG_ON(pte_flags(pte) & _PAGE_SPECIAL);
_
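
As a quick sanity check on the consolidation, this stand-alone model
(illustrative bit values, with pte_write() reduced to an _PAGE_RW
test) exercises the old mask logic against the new helper over every
combination of the three bits:

	#include <assert.h>
	#include <stdbool.h>
	#include <stdint.h>

	#define _PAGE_PRESENT	0x1ULL	/* illustrative values */
	#define _PAGE_RW	0x2ULL
	#define _PAGE_USER	0x4ULL

	static bool old_allows_gup(uint64_t flags, bool write)
	{
		uint64_t mask = _PAGE_PRESENT | _PAGE_USER;

		if (write)
			mask |= _PAGE_RW;
		return (flags & mask) == mask;
	}

	static bool new_allows_gup(uint64_t flags, bool write)
	{
		if ((flags & (_PAGE_PRESENT|_PAGE_USER)) !=
		    (_PAGE_PRESENT|_PAGE_USER))
			return false;
		if (write && !(flags & _PAGE_RW))
			return false;
		return true;
	}

	int main(void)
	{
		uint64_t f;

		for (f = 0; f <= 7; f++) {
			assert(old_allows_gup(f, 0) == new_allows_gup(f, 0));
			assert(old_allows_gup(f, 1) == new_allows_gup(f, 1));
		}
		return 0;
	}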

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 17/25] x86, pkeys: dump PKRU with other kernel registers
  2015-09-28 19:18 ` Dave Hansen
@ 2015-09-28 19:18   ` Dave Hansen
  -1 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

I'm a bit ambivalent about whether this is needed or not.

Protection Keys never affect kernel mappings.  But, they can
affect whether the kernel will fault when it touches a user
mapping.  However, the kernel doesn't touch user mappings without
some careful choreography, and those accesses don't generally
result in oopses.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/kernel/process_64.c |    2 ++
 1 file changed, 2 insertions(+)

diff -puN arch/x86/kernel/process_64.c~pkeys-30-kernel-error-dumps arch/x86/kernel/process_64.c
--- a/arch/x86/kernel/process_64.c~pkeys-30-kernel-error-dumps	2015-09-28 11:39:48.695307824 -0700
+++ b/arch/x86/kernel/process_64.c	2015-09-28 11:39:48.698307960 -0700
@@ -116,6 +116,8 @@ void __show_regs(struct pt_regs *regs, i
 	printk(KERN_DEFAULT "DR0: %016lx DR1: %016lx DR2: %016lx\n", d0, d1, d2);
 	printk(KERN_DEFAULT "DR3: %016lx DR6: %016lx DR7: %016lx\n", d3, d6, d7);
 
+	if (boot_cpu_has(X86_FEATURE_OSPKE))
+		printk(KERN_DEFAULT "PKRU: %08x\n", read_pkru());
 }
 
 void release_thread(struct task_struct *dead_task)
_
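
On an OSPKE-capable CPU the register dump simply grows one line at
the end; the PKRU value below is illustrative:

	DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
	PKRU: 55555554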

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 19/25] x86, pkeys: add Kconfig prompt to existing config option
  2015-09-28 19:18 ` Dave Hansen
@ 2015-09-28 19:18   ` Dave Hansen
  -1 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

I don't have a strong opinion on whether we need this or not.
Protection Keys has relatively little code associated with it,
and it is not a heavyweight feature to keep enabled.  However,
I can imagine that folks would still appreciate being able to
disable it.

Here's the option if folks want it.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/Kconfig |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff -puN arch/x86/Kconfig~pkeys-40-kconfig-prompt arch/x86/Kconfig
--- a/arch/x86/Kconfig~pkeys-40-kconfig-prompt	2015-09-28 11:39:49.547346582 -0700
+++ b/arch/x86/Kconfig	2015-09-28 11:39:49.551346764 -0700
@@ -1696,8 +1696,18 @@ config X86_INTEL_MPX
 	  If unsure, say N.
 
 config X86_INTEL_MEMORY_PROTECTION_KEYS
+	prompt "Intel Memory Protection Keys"
 	def_bool y
+	# Note: only available in 64-bit mode
 	depends on CPU_SUP_INTEL && X86_64
+	---help---
+	  Memory Protection Keys provides a mechanism for enforcing
+	  page-based protections, but without requiring modification of the
+	  page tables when an application changes protection domains.
+
+	  For details, see Documentation/x86/protection-keys.txt
+
+	  If unsure, say y.
 
 config EFI
 	bool "EFI runtime service support"
_

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 18/25] x86, pkeys: dump PTE pkey in /proc/pid/smaps
  2015-09-28 19:18 ` Dave Hansen
@ 2015-09-28 19:18   ` Dave Hansen
  -1 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The protection key can now be just as important as read/write
permissions on a VMA.  We need some debug mechanism to help
figure out if it is in play.  smaps seems like a logical
place to expose it.

arch/x86/kernel/setup.c is a bit of a weirdo place to put
this code, but it already had seq_file.h and there was not
a much better existing place to put it.

We also use no #ifdef.  If protection keys are .config'd out, we
will get the same behavior as if we had used the weak generic
function.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/kernel/setup.c |    9 +++++++++
 b/fs/proc/task_mmu.c      |    5 +++++
 2 files changed, 14 insertions(+)

diff -puN arch/x86/kernel/setup.c~pkeys-40-smaps arch/x86/kernel/setup.c
--- a/arch/x86/kernel/setup.c~pkeys-40-smaps	2015-09-28 11:39:49.106326520 -0700
+++ b/arch/x86/kernel/setup.c	2015-09-28 11:39:49.111326748 -0700
@@ -111,6 +111,7 @@
 #include <asm/mce.h>
 #include <asm/alternative.h>
 #include <asm/prom.h>
+#include <asm/special_insns.h>
 
 /*
  * max_low_pfn_mapped: highest direct mapped pfn under 4GB
@@ -1264,3 +1265,11 @@ static int __init register_kernel_offset
 	return 0;
 }
 __initcall(register_kernel_offset_dumper);
+
+void arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
+{
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return;
+
+	seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
+}
diff -puN fs/proc/task_mmu.c~pkeys-40-smaps fs/proc/task_mmu.c
--- a/fs/proc/task_mmu.c~pkeys-40-smaps	2015-09-28 11:39:49.107326566 -0700
+++ b/fs/proc/task_mmu.c	2015-09-28 11:39:49.112326793 -0700
@@ -625,6 +625,10 @@ static void show_smap_vma_flags(struct s
 	seq_putc(m, '\n');
 }
 
+void __weak arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
+{
+}
+
 static int show_smap(struct seq_file *m, void *v, int is_pid)
 {
 	struct vm_area_struct *vma = v;
@@ -674,6 +678,7 @@ static int show_smap(struct seq_file *m,
 		   (vma->vm_flags & VM_LOCKED) ?
 			(unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);
 
+	arch_show_smap(m, vma);
 	show_smap_vma_flags(m, vma);
 	m_cache_vma(m, vma);
 	return 0;
_
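
The new line lands just before VmFlags in /proc/<pid>/smaps;
abridged example output, with an illustrative pkey value:

	MMUPageSize:           4 kB
	Locked:                0 kB
	ProtectionKey:         4
	VmFlags: rd wr mr mw me ac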

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 20/25] mm, multi-arch: pass a protection key in to calc_vm_prot_bits()
  2015-09-28 19:18 ` Dave Hansen
@ 2015-09-28 19:18   ` Dave Hansen
  -1 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave
  Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen, linux-api,
	linux-arch


From: Dave Hansen <dave.hansen@linux.intel.com>

This plumbs a protection key through calc_vm_prot_bits().  We
could have done this in calc_vm_flag_bits() instead, but I did not
feel super strongly which way to go.  It was pretty arbitrary
which one to use.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
---

 b/arch/powerpc/include/asm/mman.h  |    5 +++--
 b/drivers/char/agp/frontend.c      |    2 +-
 b/drivers/staging/android/ashmem.c |    9 +++++----
 b/include/linux/mman.h             |    6 +++---
 b/mm/mmap.c                        |    2 +-
 b/mm/mprotect.c                    |    2 +-
 b/mm/nommu.c                       |    2 +-
 7 files changed, 15 insertions(+), 13 deletions(-)

diff -puN arch/powerpc/include/asm/mman.h~pkeys-84-calc_vm_prot_bits arch/powerpc/include/asm/mman.h
--- a/arch/powerpc/include/asm/mman.h~pkeys-84-calc_vm_prot_bits	2015-09-28 11:39:49.962365460 -0700
+++ b/arch/powerpc/include/asm/mman.h	2015-09-28 11:39:49.976366097 -0700
@@ -18,11 +18,12 @@
  * This file is included by linux/mman.h, so we can't use cacl_vm_prot_bits()
  * here.  How important is the optimization?
  */
-static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot)
+static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
+		unsigned long pkey)
 {
 	return (prot & PROT_SAO) ? VM_SAO : 0;
 }
-#define arch_calc_vm_prot_bits(prot) arch_calc_vm_prot_bits(prot)
+#define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
 
 static inline pgprot_t arch_vm_get_page_prot(unsigned long vm_flags)
 {
diff -puN drivers/char/agp/frontend.c~pkeys-84-calc_vm_prot_bits drivers/char/agp/frontend.c
--- a/drivers/char/agp/frontend.c~pkeys-84-calc_vm_prot_bits	2015-09-28 11:39:49.964365551 -0700
+++ b/drivers/char/agp/frontend.c	2015-09-28 11:39:49.977366142 -0700
@@ -156,7 +156,7 @@ static pgprot_t agp_convert_mmap_flags(i
 {
 	unsigned long prot_bits;
 
-	prot_bits = calc_vm_prot_bits(prot) | VM_SHARED;
+	prot_bits = calc_vm_prot_bits(prot, 0) | VM_SHARED;
 	return vm_get_page_prot(prot_bits);
 }
 
diff -puN drivers/staging/android/ashmem.c~pkeys-84-calc_vm_prot_bits drivers/staging/android/ashmem.c
--- a/drivers/staging/android/ashmem.c~pkeys-84-calc_vm_prot_bits	2015-09-28 11:39:49.966365642 -0700
+++ b/drivers/staging/android/ashmem.c	2015-09-28 11:39:49.977366142 -0700
@@ -351,7 +351,8 @@ out:
 	return ret;
 }
 
-static inline vm_flags_t calc_vm_may_flags(unsigned long prot)
+static inline vm_flags_t calc_vm_may_flags(unsigned long prot,
+		unsigned long pkey)
 {
 	return _calc_vm_trans(prot, PROT_READ,  VM_MAYREAD) |
 	       _calc_vm_trans(prot, PROT_WRITE, VM_MAYWRITE) |
@@ -372,12 +373,12 @@ static int ashmem_mmap(struct file *file
 	}
 
 	/* requested protection bits must match our allowed protection mask */
-	if (unlikely((vma->vm_flags & ~calc_vm_prot_bits(asma->prot_mask)) &
-		     calc_vm_prot_bits(PROT_MASK))) {
+	if (unlikely((vma->vm_flags & ~calc_vm_prot_bits(asma->prot_mask, 0)) &
+		     calc_vm_prot_bits(PROT_MASK, 0))) {
 		ret = -EPERM;
 		goto out;
 	}
-	vma->vm_flags &= ~calc_vm_may_flags(~asma->prot_mask);
+	vma->vm_flags &= ~calc_vm_may_flags(~asma->prot_mask, 0);
 
 	if (!asma->file) {
 		char *name = ASHMEM_NAME_DEF;
diff -puN include/linux/mman.h~pkeys-84-calc_vm_prot_bits include/linux/mman.h
--- a/include/linux/mman.h~pkeys-84-calc_vm_prot_bits	2015-09-28 11:39:49.967365688 -0700
+++ b/include/linux/mman.h	2015-09-28 11:39:49.977366142 -0700
@@ -35,7 +35,7 @@ static inline void vm_unacct_memory(long
  */
 
 #ifndef arch_calc_vm_prot_bits
-#define arch_calc_vm_prot_bits(prot) 0
+#define arch_calc_vm_prot_bits(prot, pkey) 0
 #endif
 
 #ifndef arch_vm_get_page_prot
@@ -70,12 +70,12 @@ static inline int arch_validate_prot(uns
  * Combine the mmap "prot" argument into "vm_flags" used internally.
  */
 static inline unsigned long
-calc_vm_prot_bits(unsigned long prot)
+calc_vm_prot_bits(unsigned long prot, unsigned long pkey)
 {
 	return _calc_vm_trans(prot, PROT_READ,  VM_READ ) |
 	       _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) |
 	       _calc_vm_trans(prot, PROT_EXEC,  VM_EXEC) |
-	       arch_calc_vm_prot_bits(prot);
+	       arch_calc_vm_prot_bits(prot, pkey);
 }
 
 /*
diff -puN mm/mmap.c~pkeys-84-calc_vm_prot_bits mm/mmap.c
--- a/mm/mmap.c~pkeys-84-calc_vm_prot_bits	2015-09-28 11:39:49.969365779 -0700
+++ b/mm/mmap.c	2015-09-28 11:39:49.978366188 -0700
@@ -1311,7 +1311,7 @@ unsigned long do_mmap(struct file *file,
 	 * to. we assume access permissions have been handled by the open
 	 * of the memory object, so we don't do any here.
 	 */
-	vm_flags |= calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
+	vm_flags |= calc_vm_prot_bits(prot, 0) | calc_vm_flag_bits(flags) |
 			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
 
 	if (flags & MAP_LOCKED)
diff -puN mm/mprotect.c~pkeys-84-calc_vm_prot_bits mm/mprotect.c
--- a/mm/mprotect.c~pkeys-84-calc_vm_prot_bits	2015-09-28 11:39:49.971365870 -0700
+++ b/mm/mprotect.c	2015-09-28 11:39:49.979366234 -0700
@@ -373,7 +373,7 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
 	if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
 		prot |= PROT_EXEC;
 
-	vm_flags = calc_vm_prot_bits(prot);
+	vm_flags = calc_vm_prot_bits(prot, 0);
 
 	down_write(&current->mm->mmap_sem);
 
diff -puN mm/nommu.c~pkeys-84-calc_vm_prot_bits mm/nommu.c
--- a/mm/nommu.c~pkeys-84-calc_vm_prot_bits	2015-09-28 11:39:49.973365961 -0700
+++ b/mm/nommu.c	2015-09-28 11:39:49.980366279 -0700
@@ -1084,7 +1084,7 @@ static unsigned long determine_vm_flags(
 {
 	unsigned long vm_flags;
 
-	vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags);
+	vm_flags = calc_vm_prot_bits(prot, 0) | calc_vm_flag_bits(flags);
 	/* vm_flags |= mm->def_flags; */
 
 	if (!(capabilities & NOMMU_MAP_DIRECT)) {
_
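
Every caller converted here passes a pkey of 0, so behavior is
unchanged for now; a later pkey-aware path would just forward a real
key.  A sketch of the shape such a caller would take (the pkey
variable is hypothetical, nothing in this patch supplies one yet):

	vm_flags |= calc_vm_prot_bits(prot, pkey) |
		    calc_vm_flag_bits(flags) |
		    mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;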

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 22/25] x86: wire up mprotect_key() system call
@ 2015-09-28 19:18   ` Dave Hansen
  0 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen, linux-api


From: Dave Hansen <dave.hansen@linux.intel.com>

This is all that we need to get the new system call itself
working on x86.
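
Until a libc wrapper exists, the call can be reached with syscall(2).
A minimal sketch using the x86_64 number wired up below (the selftests
in patch 24 take the same approach; the wrapper name here is just for
illustration):

	#include <unistd.h>
	#include <sys/syscall.h>

	#define SYS_mprotect_key 325	/* from syscall_64.tbl below */

	static int mprotect_key(void *ptr, size_t len,
				unsigned long prot, unsigned long key)
	{
		return syscall(SYS_mprotect_key, ptr, len, prot, key);
	}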

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-api@vger.kernel.org
---

 b/arch/x86/entry/syscalls/syscall_32.tbl |    1 +
 b/arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 b/arch/x86/include/uapi/asm/mman.h       |    7 +++++++
 b/mm/Kconfig                             |    1 +
 4 files changed, 10 insertions(+)

diff -puN arch/x86/entry/syscalls/syscall_32.tbl~pkeys-16-x86-mprotect_key arch/x86/entry/syscalls/syscall_32.tbl
--- a/arch/x86/entry/syscalls/syscall_32.tbl~pkeys-16-x86-mprotect_key	2015-09-28 11:39:50.964411042 -0700
+++ b/arch/x86/entry/syscalls/syscall_32.tbl	2015-09-28 11:39:50.972411406 -0700
@@ -382,3 +382,4 @@
 373	i386	shutdown		sys_shutdown
 374	i386	userfaultfd		sys_userfaultfd
 375	i386	membarrier		sys_membarrier
+376	i386	mprotect_key		sys_mprotect_key
diff -puN arch/x86/entry/syscalls/syscall_64.tbl~pkeys-16-x86-mprotect_key arch/x86/entry/syscalls/syscall_64.tbl
--- a/arch/x86/entry/syscalls/syscall_64.tbl~pkeys-16-x86-mprotect_key	2015-09-28 11:39:50.965411087 -0700
+++ b/arch/x86/entry/syscalls/syscall_64.tbl	2015-09-28 11:39:50.972411406 -0700
@@ -331,6 +331,7 @@
 322	64	execveat		stub_execveat
 323	common	userfaultfd		sys_userfaultfd
 324	common	membarrier		sys_membarrier
+325	common	mprotect_key		sys_mprotect_key
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff -puN arch/x86/include/uapi/asm/mman.h~pkeys-16-x86-mprotect_key arch/x86/include/uapi/asm/mman.h
--- a/arch/x86/include/uapi/asm/mman.h~pkeys-16-x86-mprotect_key	2015-09-28 11:39:50.967411179 -0700
+++ b/arch/x86/include/uapi/asm/mman.h	2015-09-28 11:39:50.973411451 -0700
@@ -20,6 +20,13 @@
 		((vm_flags) & VM_PKEY_BIT1 ? _PAGE_PKEY_BIT1 : 0) |	\
 		((vm_flags) & VM_PKEY_BIT2 ? _PAGE_PKEY_BIT2 : 0) |	\
 		((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))
+
+#define arch_calc_vm_prot_bits(prot, key) (		\
+		((key) & 0x1 ? VM_PKEY_BIT0 : 0) |      \
+		((key) & 0x2 ? VM_PKEY_BIT1 : 0) |      \
+		((key) & 0x4 ? VM_PKEY_BIT2 : 0) |      \
+		((key) & 0x8 ? VM_PKEY_BIT3 : 0))
+
 #endif
 
 #include <asm-generic/mman.h>
diff -puN mm/Kconfig~pkeys-16-x86-mprotect_key mm/Kconfig
--- a/mm/Kconfig~pkeys-16-x86-mprotect_key	2015-09-28 11:39:50.969411269 -0700
+++ b/mm/Kconfig	2015-09-28 11:39:50.973411451 -0700
@@ -689,4 +689,5 @@ config NR_PROTECTION_KEYS
 	# Everything supports a _single_ key, so allow folks to
 	# at least call APIs that take keys, but require that the
 	# key be 0.
+	default 16 if X86_INTEL_MEMORY_PROTECTION_KEYS
 	default 1
_
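
As a worked instance of the arch_calc_vm_prot_bits() macro above, key 11
(binary 1011) selects bits 0, 1 and 3:

	/* arch_calc_vm_prot_bits(prot, 11) contributes: */
	VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT3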

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 23/25] x86, pkeys: actually enable Memory Protection Keys in CPU
  2015-09-28 19:18 ` Dave Hansen
@ 2015-09-28 19:18   ` Dave Hansen
  -1 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

This sets the bit in 'cr4' to actually enable the protection
keys feature.  We also include a boot-time disable for the
feature "nopku".

Setting X86_CR4_PKE will cause the X86_FEATURE_OSPKE cpuid
bit to appear set.  At this point in boot, identify_cpu()
has already run the actual CPUID instructions and populated
the "cpu features" structures.  We need to go back and
re-run identify_cpu() to make sure it gets updated values.

We *could* simply re-populate the 11th word of the cpuid
data, but this is probably quick enough.

Also note that with the cpu_has() check and X86_FEATURE_PKU
present in disabled-features.h, we do not need an #ifdef
for setup_pku().
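
Userspace can observe the result via CPUID leaf 7, subleaf 0: ECX bit 3
(PKU) reports the hardware feature, while ECX bit 4 (OSPKE) reports that
this patch's CR4.PKE write took effect.  A minimal sketch:

	unsigned int eax = 7, ebx, ecx = 0, edx;
	int ospke;

	asm volatile("cpuid"
		     : "+a" (eax), "=b" (ebx), "+c" (ecx), "=d" (edx));
	ospke = (ecx >> 4) & 1;	/* X86_FEATURE_OSPKE: kernel enabled CR4.PKE */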

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/Documentation/kernel-parameters.txt |    3 ++
 b/arch/x86/kernel/cpu/common.c        |   41 ++++++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+)

diff -puN arch/x86/kernel/cpu/common.c~pkeys-50-should-be-last-patch arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~pkeys-50-should-be-last-patch	2015-09-28 11:39:51.455433378 -0700
+++ b/arch/x86/kernel/cpu/common.c	2015-09-28 11:39:51.461433651 -0700
@@ -289,6 +289,46 @@ static __always_inline void setup_smap(s
 }
 
 /*
+ * Protection Keys are not available in 32-bit mode.
+ */
+static bool pku_disabled = false;
+static __always_inline void setup_pku(struct cpuinfo_x86 *c)
+{
+	if (!cpu_has(c, X86_FEATURE_PKU))
+		return;
+	if (pku_disabled)
+		return;
+
+	cr4_set_bits(X86_CR4_PKE);
+	/*
+	 * Setting X86_CR4_PKE will cause the X86_FEATURE_OSPKE
+	 * cpuid bit to be set.  We need to ensure that we
+	 * update that bit in this CPU's "cpu_info".
+	 */
+	get_cpu_cap(c);
+}
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+static __init int setup_disable_pku(char *arg)
+{
+	/*
+	 * Do not clear the X86_FEATURE_PKU bit.  All of the
+	 * runtime checks are against OSPKE so clearing the
+	 * bit does nothing.
+	 *
+	 * This way, we will see "pku" in cpuinfo, but not
+	 * "ospke", which is exactly what we want.  It shows
+	 * that the CPU has PKU, but the OS has not enabled it.
+	 * This happens to be exactly how a system would look
+	 * if we disabled the config option.
+	 */
+	pr_info("x86: 'nopku' specified, disabling Memory Protection Keys\n");
+	pku_disabled = true;
+	return 1;
+}
+__setup("nopku", setup_disable_pku);
+#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
+
+/*
  * Some CPU features depend on higher CPUID levels, which may not always
  * be available due to CPUID level capping or broken virtualization
  * software.  Add those features to this table to auto-disable them.
@@ -947,6 +987,7 @@ static void identify_cpu(struct cpuinfo_
 	init_hypervisor(c);
 	x86_init_rdrand(c);
 	x86_init_cache_qos(c);
+	setup_pku(c);
 
 	/*
 	 * Clear/Set all flags overriden by options, need do it
diff -puN Documentation/kernel-parameters.txt~pkeys-50-should-be-last-patch Documentation/kernel-parameters.txt
--- a/Documentation/kernel-parameters.txt~pkeys-50-should-be-last-patch	2015-09-28 11:39:51.457433469 -0700
+++ b/Documentation/kernel-parameters.txt	2015-09-28 11:39:51.462433696 -0700
@@ -955,6 +955,9 @@ bytes respectively. Such letter suffixes
 			See Documentation/x86/intel_mpx.txt for more
 			information about the feature.
 
+	nopku		[X86] Disable Memory Protection Keys CPU feature found
+			in some Intel CPUs.
+
 	eagerfpu=	[X86]
 			on	enable eager fpu restore
 			off	disable eager fpu restore
_

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 21/25] mm: implement new mprotect_key() system call
  2015-09-28 19:18 ` Dave Hansen
@ 2015-09-28 19:18   ` Dave Hansen
  -1 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen, linux-api


From: Dave Hansen <dave.hansen@linux.intel.com>

mprotect_key() is just like mprotect, except it also takes a
protection key as an argument.  On systems that do not support
protection keys, it still works, but requires that key=0.
Otherwise it does exactly what mprotect does.

I expect it to get used like this when you want to guarantee that
any mapping you create can *never* be accessed without the right
protection keys set up:

	pkey_deny_access(11); // random pkey
	int real_prot = PROT_READ|PROT_WRITE;
	ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	ret = mprotect_key(ptr, PAGE_SIZE, real_prot, 11);

This way, there is *no* window where the mapping is accessible
since it was always either PROT_NONE or had a protection key set.

We settled on 'unsigned long' for the type of the key here.  We
only need 4 bits on x86 today, but I figured that other
architectures might need some more space.
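
On a kernel built without protection key support,
CONFIG_NR_PROTECTION_KEYS stays at 1, so the key check below rejects any
nonzero key; roughly:

	ret = mprotect_key(ptr, PAGE_SIZE, PROT_READ, 1);
	/* fails with EINVAL unless CONFIG_NR_PROTECTION_KEYS > 1 */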

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-api@vger.kernel.org
---

 b/mm/Kconfig    |    7 +++++++
 b/mm/mprotect.c |   20 +++++++++++++++++---
 2 files changed, 24 insertions(+), 3 deletions(-)

diff -puN mm/Kconfig~pkeys-85-mprotect_pkey mm/Kconfig
--- a/mm/Kconfig~pkeys-85-mprotect_pkey	2015-09-28 11:39:50.527391162 -0700
+++ b/mm/Kconfig	2015-09-28 11:39:50.532391390 -0700
@@ -683,3 +683,10 @@ config FRAME_VECTOR
 
 config ARCH_USES_HIGH_VMA_FLAGS
 	bool
+
+config NR_PROTECTION_KEYS
+	int
+	# Everything supports a _single_ key, so allow folks to
+	# at least call APIs that take keys, but require that the
+	# key be 0.
+	default 1
diff -puN mm/mprotect.c~pkeys-85-mprotect_pkey mm/mprotect.c
--- a/mm/mprotect.c~pkeys-85-mprotect_pkey	2015-09-28 11:39:50.529391253 -0700
+++ b/mm/mprotect.c	2015-09-28 11:39:50.532391390 -0700
@@ -344,8 +344,8 @@ fail:
 	return error;
 }
 
-SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
-		unsigned long, prot)
+static int do_mprotect_key(unsigned long start, size_t len,
+		unsigned long prot, unsigned long key)
 {
 	unsigned long vm_flags, nstart, end, tmp, reqprot;
 	struct vm_area_struct *vma, *prev;
@@ -365,6 +365,8 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
 		return -ENOMEM;
 	if (!arch_validate_prot(prot))
 		return -EINVAL;
+	if (key >= CONFIG_NR_PROTECTION_KEYS)
+		return -EINVAL;
 
 	reqprot = prot;
 	/*
@@ -373,7 +375,7 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
 	if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
 		prot |= PROT_EXEC;
 
-	vm_flags = calc_vm_prot_bits(prot, 0);
+	vm_flags = calc_vm_prot_bits(prot, key);
 
 	down_write(&current->mm->mmap_sem);
 
@@ -443,3 +445,15 @@ out:
 	up_write(&current->mm->mmap_sem);
 	return error;
 }
+
+SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
+		unsigned long, prot)
+{
+	return do_mprotect_key(start, len, prot, 0);
+}
+
+SYSCALL_DEFINE4(mprotect_key, unsigned long, start, size_t, len,
+		unsigned long, prot, unsigned long, key)
+{
+	return do_mprotect_key(start, len, prot, key);
+}
_

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 24/25] x86, pkeys: add self-tests
  2015-09-28 19:18 ` Dave Hansen
@ 2015-09-28 19:18   ` Dave Hansen
  -1 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

This code should be a good demonstration of how to use the new
mprotect_key() system call as well as how to use protection keys
in general.

This code shows how to:
1. Manipulate the Protection Key Rights for Userspace (PKRU) register
   with wrpkru/rdpkru
2. Set a protection key on memory
3. Fetch and/or modify PKRU from the signal XSAVE state
4. Read the kernel-provided protection key in the siginfo

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/tools/testing/selftests/x86/Makefile          |    3 
 b/tools/testing/selftests/x86/pkey-helpers.h    |  182 +++++
 b/tools/testing/selftests/x86/protection_keys.c |  827 ++++++++++++++++++++++++
 3 files changed, 1011 insertions(+), 1 deletion(-)

diff -puN tools/testing/selftests/x86/Makefile~pkeys-40-selftests tools/testing/selftests/x86/Makefile
--- a/tools/testing/selftests/x86/Makefile~pkeys-40-selftests	2015-09-28 11:39:51.905453848 -0700
+++ b/tools/testing/selftests/x86/Makefile	2015-09-28 11:39:51.909454031 -0700
@@ -4,7 +4,8 @@ include ../lib.mk
 
 .PHONY: all all_32 all_64 warn_32bit_failure clean
 
-TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs ldt_gdt syscall_nt
+TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs ldt_gdt syscall_nt \
+			protection_keys
 TARGETS_C_32BIT_ONLY := entry_from_vm86 syscall_arg_fault sigreturn \
 			test_FCMOV test_FCOMI test_FISTTP
 
diff -puN /dev/null tools/testing/selftests/x86/pkey-helpers.h
--- /dev/null	2015-07-13 14:24:11.435656502 -0700
+++ b/tools/testing/selftests/x86/pkey-helpers.h	2015-09-28 11:39:51.909454031 -0700
@@ -0,0 +1,182 @@
+#define _GNU_SOURCE
+#include <string.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <signal.h>
+#include <assert.h>
+#include <stdlib.h>
+#include <ucontext.h>
+#include <sys/mman.h>
+
+#define NR_PKEYS 16
+
+#ifndef DEBUG_LEVEL
+#define DEBUG_LEVEL 0
+#endif
+#define dprintf_level(level, args...) do { if (level <= DEBUG_LEVEL) printf(args); } while (0)
+#define dprintf1(args...) dprintf_level(1, args)
+#define dprintf2(args...) dprintf_level(2, args)
+#define dprintf3(args...) dprintf_level(3, args)
+#define dprintf4(args...) dprintf_level(4, args)
+
+extern unsigned int shadow_pkru;
+static inline unsigned int __rdpkru(void)
+{
+        unsigned int eax, edx;
+	unsigned int ecx = 0;
+	unsigned int pkru;
+
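+	/* the .byte sequence encodes RDPKRU (0f 01 ee) */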
+        asm volatile(".byte 0x0f,0x01,0xee\n\t"
+                     : "=a" (eax), "=d" (edx)
+		     : "c" (ecx));
+	pkru = eax;
+	return pkru;
+}
+
+static inline unsigned int rdpkru(void)
+{
+	unsigned int pkru = __rdpkru();
+	dprintf4("pkru: %x shadow: %x\n", pkru, shadow_pkru);
+	assert(pkru == shadow_pkru);
+	return pkru;
+}
+
+static inline void __wrpkru(unsigned int pkru)
+{
+        unsigned int eax = pkru;
+	unsigned int ecx = 0;
+	unsigned int edx = 0;
+
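+	/* the .byte sequence encodes WRPKRU (0f 01 ef) */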
+        asm volatile(".byte 0x0f,0x01,0xef\n\t"
+                     : : "a" (eax), "c" (ecx), "d" (edx));
+	assert(pkru == __rdpkru());
+}
+
+static inline void wrpkru(unsigned int pkru)
+{
+	dprintf4("%s() changing %08x to %08x\n", __func__, __rdpkru(), pkru);
+	// will do the shadow check for us:
+	rdpkru();
+	__wrpkru(pkru);
+	shadow_pkru = pkru;
+	dprintf4("%s(%08x) pkru: %08x\n", __func__, pkru, __rdpkru());
+}
+
+/*
+ * These are technically racy. since something could
+ * change PKRU between the read and the write.
+ */
+static inline void __pkey_access_allow(int pkey, int do_allow)
+{
+	unsigned int pkru = rdpkru();
+	int bit = pkey * 2;
+
+	if (do_allow)
+		pkru &= ~(1<<bit);	/* clear the Access Disable bit */
+	else
+		pkru |= (1<<bit);	/* set the Access Disable bit */
+
+	wrpkru(pkru);
+	dprintf4("pkru now: %08x\n", rdpkru());
+}
+static inline void __pkey_write_allow(int pkey, int do_allow_write)
+{
+	long pkru = rdpkru();
+	int bit = pkey * 2 + 1;
+
+	if (do_allow_write)
+		pkru &= ~(1<<bit);	/* clear the Write Disable bit */
+	else
+		pkru |= (1<<bit);	/* set the Write Disable bit */
+
+	wrpkru(pkru);
+	dprintf4("pkru now: %08x\n", rdpkru());
+}
+#define pkey_access_allow(pkey) __pkey_access_allow(pkey, 1)
+#define pkey_access_deny(pkey)  __pkey_access_allow(pkey, 0)
+#define pkey_write_allow(pkey)  __pkey_write_allow(pkey, 1)
+#define pkey_write_deny(pkey)   __pkey_write_allow(pkey, 0)
+
+#define PROT_PKEY0     0x10            /* protection key value (bit 0) */
+#define PROT_PKEY1     0x20            /* protection key value (bit 1) */
+#define PROT_PKEY2     0x40            /* protection key value (bit 2) */
+#define PROT_PKEY3     0x80            /* protection key value (bit 3) */
+
+#define PAGE_SIZE 4096
+#define MB	(1<<20)
+
+static inline void __cpuid(unsigned int *eax, unsigned int *ebx,
+                                unsigned int *ecx, unsigned int *edx)
+{
+	/* ecx is often an input as well as an output. */
+	asm volatile(
+		"cpuid;"
+		: "=a" (*eax),
+		  "=b" (*ebx),
+		  "=c" (*ecx),
+		  "=d" (*edx)
+		: "0" (*eax), "2" (*ecx));
+}
+
+/* Intel-defined CPU features, CPUID level 0x00000007:0 (ecx) */
+#define X86_FEATURE_PKU        (1<<3) /* Protection Keys for Userspace */
+#define X86_FEATURE_OSPKE      (1<<4) /* OS Protection Keys Enable */
+
+static inline int cpu_has_pku(void)
+{
+	unsigned int eax;
+	unsigned int ebx;
+	unsigned int ecx;
+	unsigned int edx;
+	eax = 0x7;
+	ecx = 0x0;
+	__cpuid(&eax, &ebx, &ecx, &edx);
+
+	if (!(ecx & X86_FEATURE_PKU)) {
+		printf("cpu does not have PKU\n");
+		return 0;
+	}
+	if (!(ecx & X86_FEATURE_OSPKE)) {
+		printf("cpu does not have OSPKE\n");
+		return 0;
+	}
+	return 1;
+}
+
+#define XSTATE_PKRU_BIT	(9)
+#define XSTATE_PKRU	0x200
+
+int pkru_xstate_offset(void)
+{
+	unsigned int eax;
+	unsigned int ebx;
+	unsigned int ecx;
+	unsigned int edx;
+	int xstate_offset = 0;
+	int xstate_size = 0;
+	unsigned long XSTATE_CPUID = 0xd;
+	int leaf;
+
+	// assume that XSTATE_PKRU is set in XCR0
+	leaf = XSTATE_PKRU_BIT;
+	{
+		eax = XSTATE_CPUID;
+		// ecx selects the CPUID 0xd sub-leaf for this xstate component
+		ecx = leaf;
+		__cpuid(&eax, &ebx, &ecx, &edx);
+
+		//printf("leaf[%d] offset: %d size: %d\n", leaf, ebx, eax);
+		if (leaf == XSTATE_PKRU_BIT) {
+			xstate_offset = ebx;
+			xstate_size = eax;
+		}
+	}
+
+	if (xstate_size == 0) {
+		printf("could not find size/offset of PKRU in xsave state\n");
+		return 0;
+	}
+
+	return xstate_offset;
+}
diff -puN /dev/null tools/testing/selftests/x86/protection_keys.c
--- /dev/null	2015-07-13 14:24:11.435656502 -0700
+++ b/tools/testing/selftests/x86/protection_keys.c	2015-09-28 11:39:51.910454076 -0700
@@ -0,0 +1,827 @@
+/*
+ * Tests x86 Memory Protection Keys (see Documentation/x86/protection-keys.txt)
+ *
+ * There are examples in here of:
+ *  * how to set protection keys on memory
+ *  * how to set/clear bits in PKRU (the rights register)
+ *  * how to handle SEGV_PKRU signals and extract pkey-relevant
+ *    information from the siginfo
+ *
+ * Things to add:
+ *	make sure KSM and KSM COW breaking works
+ *	prefault pages in at malloc, or not
+ *	protect MPX bounds tables with protection keys?
+ *	make sure VMA splitting/merging is working correctly
+ *	OOMs can destroy mm->mmap (see exit_mmap()), so make sure it is immune to pkeys
+ *
+ * Compile like this:
+ * 	gcc      -o protection_keys    -O2 -g -std=gnu99 -pthread -Wall protection_keys.c -lrt -ldl -lm
+ *	gcc -m32 -o protection_keys_32 -O2 -g -std=gnu99 -pthread -Wall protection_keys.c -lrt -ldl -lm
+ */
+#define _GNU_SOURCE
+#include <errno.h>
+#include <linux/futex.h>
+#include <sys/time.h>
+#include <sys/syscall.h>
+#include <string.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <signal.h>
+#include <assert.h>
+#include <stdlib.h>
+#include <ucontext.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/ptrace.h>
+
+#include "pkey-helpers.h"
+
+unsigned int shadow_pkru;
+
+#define HPAGE_SIZE	(1UL<<21)
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
+#define ALIGN(x, align_to)    (((x) + ((align_to)-1)) & ~((align_to)-1))
+#define ALIGN_PTR(p, ptr_align_to)    ((typeof(p))ALIGN((unsigned long)(p), ptr_align_to))
+
+extern void abort_hooks(void);
+#define pkey_assert(condition) do {		\
+	if (!(condition)) {			\
+		abort_hooks();			\
+		perror("errno at assert");	\
+		assert(condition);		\
+	}					\
+} while (0)
+#define raw_assert(cond) assert(cond)
+
+
+/*
+ * Delivered si_code values are plain integers: the kernel strips the
+ * internal __SI_FAULT bits before copying siginfo to userspace.
+ */
+#define SEGV_BNDERR     3  /* failed address bound checks */
+#define SEGV_PKUERR     4
+
+void cat_into_file(char *str, char *file)
+{
+	int fd = open(file, O_RDWR);
+	int ret;
+	// these need to be raw because they are called under
+	// pkey_assert()
+	raw_assert(fd >= 0);
+	ret = write(fd, str, strlen(str));
+	if (ret != strlen(str)) {
+		perror("write to file failed");
+		fprintf(stderr, "filename: '%s'\n", file);
+		raw_assert(0);
+	}
+	close(fd);
+}
+
+void tracing_on(void)
+{
+#ifdef CONTROL_TRACING
+	char pidstr[32];
+	sprintf(pidstr, "%d", getpid());
+	//cat_into_file("20000", "/sys/kernel/debug/tracing/buffer_size_kb");
+	cat_into_file("0", "/sys/kernel/debug/tracing/tracing_on");
+	cat_into_file("\n", "/sys/kernel/debug/tracing/trace");
+	if (1) {
+		cat_into_file("function_graph", "/sys/kernel/debug/tracing/current_tracer");
+		cat_into_file("1", "/sys/kernel/debug/tracing/options/funcgraph-proc");
+	} else {
+		cat_into_file("nop", "/sys/kernel/debug/tracing/current_tracer");
+	}
+	cat_into_file(pidstr, "/sys/kernel/debug/tracing/set_ftrace_pid");
+	cat_into_file("1", "/sys/kernel/debug/tracing/tracing_on");
+#endif
+}
+
+void tracing_off(void)
+{
+#ifdef CONTROL_TRACING
+	cat_into_file("0", "/sys/kernel/debug/tracing/tracing_on");
+#endif
+}
+
+void abort_hooks(void)
+{
+	fprintf(stderr, "running %s()...\n", __func__);
+	tracing_off();
+}
+
+static char *si_code_str(int si_code)
+{
+	if (si_code == SEGV_MAPERR)
+		return "SEGV_MAPERR";
+	if (si_code == SEGV_ACCERR)
+		return "SEGV_ACCERR";
+	if (si_code == SEGV_BNDERR)
+		return "SEGV_BNDERR";
+	if (si_code == SEGV_PKUERR)
+		return "SEGV_PKUERR";
+	return "UNKNOWN";
+}
+
+// I'm addicted to the kernel types
+#define  u8 uint8_t
+#define u16 uint16_t
+#define u32 uint32_t
+#define u64 uint64_t
+
+#ifdef __i386__
+#define SYS_mprotect_key 376
+#define REG_IP_IDX REG_EIP
+#define si_pkey_offset 0x08
+#else
+#define SYS_mprotect_key 325
+#define REG_IP_IDX REG_RIP
+#define si_pkey_offset 0x20
+#endif
+
+void dump_mem(void *dumpme, int len_bytes)
+{
+	char *c = (void *)dumpme;
+	int i;
+	for (i = 0; i < len_bytes; i+= sizeof(u64)) {
+		dprintf1("dump[%03d]: %016jx\n", i, *(u64 *)(c + i));
+	}
+}
+
+
+int pkru_faults = 0;
+int last_si_pkey = -1;
+void handler(int signum, siginfo_t* si, void* vucontext)
+{
+	ucontext_t* uctxt = vucontext;
+	int trapno;
+	unsigned long ip;
+	char *fpregs;
+	u32 *pkru_ptr;
+	u64 si_pkey;
+	int pkru_offset;
+
+	trapno = uctxt->uc_mcontext.gregs[REG_TRAPNO];
+	ip = uctxt->uc_mcontext.gregs[REG_IP_IDX];
+	fpregset_t fpregset = uctxt->uc_mcontext.fpregs;
+	fpregs = (void *)fpregset;
+	pkru_offset = pkru_xstate_offset();
+	pkru_ptr = (void *)(&fpregs[pkru_offset]);
+
+	/*
+	 * If we got a PKRU fault, we *HAVE* to have at least one bit set in
+	 * here.
+	 */
+	dprintf1("pkru_xstate_offset: %d\n", pkru_xstate_offset());
+	dump_mem(pkru_ptr - 8, 24);
+	assert(*pkru_ptr);
+
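+	/* siginfo has no named pkey field yet; read si_pkey at its raw offset */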
+	si_pkey = *(u64 *)(((u8 *)si) + si_pkey_offset);
+	last_si_pkey = si_pkey;
+
+	dprintf1("\n===================SIGSEGV============================\n");
+	dprintf2("%s() trapno: %d ip: 0x%lx info->si_code: %s/%d\n", __func__, trapno, ip,
+			si_code_str(si->si_code), si->si_code);
+	if ((si->si_code == SEGV_MAPERR) ||
+	    (si->si_code == SEGV_ACCERR) ||
+	    (si->si_code == SEGV_BNDERR)) {
+		printf("non-PK si_code, exiting...\n");
+		exit(4);
+	}
+
+	//printf("pkru_xstate_offset(): %d\n", pkru_xstate_offset());
+	dprintf1("signal pkru from xsave: %08x\n", *pkru_ptr);
+	// need __ version so we do not do shadow_pkru checking
+	dprintf1("signal pkru from  pkru: %08x\n", __rdpkru());
+	dprintf1("si_pkey from siginfo: %jx\n", si_pkey);
+	*pkru_ptr = 0;
+	dprintf1("WARNING: set PRKU=0 to allow faulting instruction to continue\n");
+	pkru_faults++;
+	dprintf1("======================================================\n\n");
+	return;
+
+	/* not reached: older debugging code kept below for reference */
+	if (trapno == 14) {
+		fprintf(stderr,
+			"ERROR: In signal handler, page fault, trapno = %d, ip = %016lx\n",
+			trapno, ip);
+		fprintf(stderr, "si_addr %p\n", si->si_addr);
+		fprintf(stderr, "REG_ERR: %lx\n", (unsigned long)uctxt->uc_mcontext.gregs[REG_ERR]);
+		//sleep(999);
+		exit(1);
+	} else {
+		fprintf(stderr,"unexpected trap %d! at 0x%lx\n", trapno, ip);
+		fprintf(stderr, "si_addr %p\n", si->si_addr);
+		fprintf(stderr, "REG_ERR: %lx\n", (unsigned long)uctxt->uc_mcontext.gregs[REG_ERR]);
+		exit(2);
+	}
+}
+
+int wait_all_children()
+{
+        int status;
+        return waitpid(-1, &status, 0);
+}
+
+void sig_chld(int x)
+{
+        dprintf2("[%d] SIGCHLD: %d\n", getpid(), x);
+}
+
+void setup_sigsegv_handler()
+{
+	int r,rs;
+	struct sigaction newact;
+	struct sigaction oldact;
+
+	/* #PF is mapped to sigsegv */
+	int signum  = SIGSEGV;
+
+	newact.sa_handler = 0;   /* void(*)(int)*/
+	newact.sa_sigaction = handler; /* void (*)(int, siginfo_t*, void*) */
+
+	/*sigset_t - signals to block while in the handler */
+	/* get the old signal mask. */
+	rs = sigprocmask(SIG_SETMASK, 0, &newact.sa_mask);
+	pkey_assert(rs == 0);
+
+	/* call sa_sigaction, not sa_handler*/
+	newact.sa_flags = SA_SIGINFO;
+
+	newact.sa_restorer = 0;  /* void(*)(), obsolete */
+	r = sigaction(signum, &newact, &oldact);
+	r = sigaction(SIGALRM, &newact, &oldact);
+	pkey_assert(r == 0);
+}
+
+void setup_handlers(void)
+{
+	signal(SIGCHLD, &sig_chld);
+	setup_sigsegv_handler();
+}
+
+void tag_each_buffer_page(void *buf, int nr_pages, unsigned long tag)
+{
+	int i;
+
+	for (i = 0; i < nr_pages; i++) {
+		unsigned long *tag_at = (buf + i * PAGE_SIZE);
+		*tag_at = tag;
+	}
+}
+
+pid_t fork_lazy_child(void *buf)
+{
+	pid_t forkret;
+
+	// Tag each page before forking so both sides start identical
+	tag_each_buffer_page(buf, NR_PKEYS, 0xDEADBEEFUL);
+
+	forkret = fork();
+	pkey_assert(forkret >= 0);
+	dprintf3("[%d] fork() ret: %d\n", getpid(), forkret);
+
+	// Tag the buffers in both parent and child
+	tag_each_buffer_page(buf, NR_PKEYS, getpid());
+
+	if (!forkret) {
+		/* in the child */
+		while (1) {
+			dprintf1("child sleeping...\n");
+			sleep(30);
+		}
+	}
+	return forkret;
+}
+
+void davecmp(void *_a, void *_b, int len)
+{
+	int i;
+	unsigned long *a = _a;
+	unsigned long *b = _b;
+	for (i = 0; i < len / sizeof(*a); i++) {
+		if (a[i] == b[i])
+			continue;
+
+		dprintf3("[%3d]: a: %016lx b: %016lx\n", i, a[i], b[i]);
+	}
+}
+
+void dumpit(char *f)
+{
+	int fd = open(f, O_RDONLY);
+	char buf[100];
+	int nr_read;
+
+	dprintf2("maps fd: %d\n", fd);
+	do {
+		nr_read = read(fd, &buf[0], sizeof(buf));
+		write(1, buf, nr_read);
+	} while (nr_read > 0);
+	close(fd);
+}
+
+int mprotect_pkey(void *ptr, size_t size, unsigned long orig_prot, unsigned long pkey)
+{
+	int sret;
+	pkey_assert(pkey < NR_PKEYS);
+
+	// do not let 'prot' protection key bits be set here
+	assert(orig_prot < 0x10);
+	errno = 0;
+	sret = syscall(SYS_mprotect_key, ptr, size, orig_prot, pkey);
+	if (errno) {
+		dprintf1("SYS_mprotect_key sret: %d\n", sret);
+		dprintf1("SYS_mprotect_key prot: 0x%lx\n", orig_prot);
+		dprintf1("SYS_mprotect_key failed, errno: %d\n", errno);
+		assert(0);
+	}
+	return sret;
+}
+
+struct pkey_malloc_record {
+	void *ptr;
+	long size;
+};
+struct pkey_malloc_record *pkey_malloc_records;
+long nr_pkey_malloc_records;
+void record_pkey_malloc(void *ptr, long size)
+{
+	long i;
+	struct pkey_malloc_record *rec = NULL;
+
+	for (i = 0; i < nr_pkey_malloc_records; i++) {
+		rec = &pkey_malloc_records[i];
+		// look for a free record (ptr cleared by free_pkey_malloc())
+		if (!rec->ptr)
+			break;
+		rec = NULL;	// stays NULL if no free slot is found
+	}
+	if (!rec) {
+		// every record is full
+		size_t old_nr_records = nr_pkey_malloc_records;
+		size_t new_nr_records = (nr_pkey_malloc_records * 2 + 1);
+		size_t new_size = new_nr_records * sizeof(struct pkey_malloc_record);
+		dprintf1("new_nr_records: %zd\n", new_nr_records);
+		dprintf1("new_size: %zd\n", new_size);
+		pkey_malloc_records = realloc(pkey_malloc_records, new_size);
+		pkey_assert(pkey_malloc_records != NULL);
+		rec = &pkey_malloc_records[nr_pkey_malloc_records];
+	// realloc() does not initialize memory, so zero it from
+		// the first new record all the way to the end.
+		for (i = 0; i < new_nr_records - old_nr_records; i++)
+			memset(rec + i, 0, sizeof(*rec));
+	}
+	dprintf3("filling malloc record[%d/%p]: {%p, %ld}\n",
+		(int)(rec - pkey_malloc_records), rec, ptr, size);
+	rec->ptr = ptr;
+	rec->size = size;
+	nr_pkey_malloc_records++;
+}
+
+void free_pkey_malloc(void *ptr)
+{
+	long i;
+	int ret;
+	dprintf3("%s(%p)\n", __func__, ptr);
+	for (i = 0; i < nr_pkey_malloc_records; i++) {
+		struct pkey_malloc_record *rec = &pkey_malloc_records[i];
+		dprintf4("looking for ptr %p at record[%ld/%p]: {%p, %ld}\n",
+				ptr, i, rec, rec->ptr, rec->size);
+		if ((ptr <  rec->ptr) ||
+		    (ptr >= rec->ptr + rec->size))
+			continue;
+
+		dprintf3("found ptr %p at record[%ld/%p]: {%p, %ld}\n",
+				ptr, i, rec, rec->ptr, rec->size);
+		nr_pkey_malloc_records--;
+		ret = munmap(rec->ptr, rec->size);
+		dprintf3("munmap ret: %d\n", ret);
+		pkey_assert(!ret);
+		dprintf3("clearing rec->ptr, rec: %p\n", rec);
+		rec->ptr = NULL;
+		dprintf3("done clearing rec->ptr, rec: %p\n", rec);
+		return;
+	}
+	pkey_assert(false);
+}
+
+
+void *malloc_pkey_with_mprotect(long size, int prot, u16 pkey)
+{
+	void *ptr;
+	int ret;
+
+	dprintf1("doing %s(size=%ld, prot=0x%x, pkey=%d)\n", __func__, size, prot, pkey);
+	pkey_assert(pkey < NR_PKEYS);
+	ptr = mmap(NULL, size, prot, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+	pkey_assert(ptr != (void *)-1);
+	ret = mprotect_pkey(ptr, size, prot, pkey);
+	pkey_assert(!ret);
+	record_pkey_malloc(ptr, size);
+
+	dprintf1("%s() for pkey %d @ %p\n", __func__, pkey, ptr);
+	return ptr;
+}
+
+
+void *malloc_pkey_mmap_direct(long size, int prot, u16 pkey)
+{
+	void *ptr;
+
+	dprintf1("doing %s(size=%ld, prot=0x%x, pkey=%d)\n", __func__, size, prot, pkey);
+	pkey_assert(pkey < NR_PKEYS);
+	// encode the pkey in the mmap() prot bits (PROT_PKEY* from pkey-helpers.h)
+	prot |= ((pkey & 0x1) ? PROT_PKEY0 : 0) |
+		((pkey & 0x2) ? PROT_PKEY1 : 0) |
+		((pkey & 0x4) ? PROT_PKEY2 : 0) |
+		((pkey & 0x8) ? PROT_PKEY3 : 0);
+	ptr = mmap(NULL, size, prot, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+	pkey_assert(ptr != (void *)-1);
+
+	record_pkey_malloc(ptr, size);
+
+	dprintf1("mmap()'d for pkey %d @ %p\n", pkey, ptr);
+	return ptr;
+}
+
+void *malloc_pkey_anon_huge(long size, int prot, u16 pkey)
+{
+	int ret;
+	void *ptr;
+
+	dprintf1("doing %s(size=%ld, prot=0x%x, pkey=%d)\n", __func__, size, prot, pkey);
+	// Guarantee we can fit at least one huge page in the resulting
+	// allocation by allocating space for 2:
+	size = ALIGN(size, HPAGE_SIZE * 2);
+	ptr = mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+	pkey_assert(ptr != (void *)-1);
+	record_pkey_malloc(ptr, size);
+	mprotect_pkey(ptr, size, prot, pkey);
+
+	dprintf1("unaligned ptr: %p\n", ptr);
+	ptr = ALIGN_PTR(ptr, HPAGE_SIZE);
+	dprintf1("  aligned ptr: %p\n", ptr);
+	ret = madvise(ptr, HPAGE_SIZE, MADV_HUGEPAGE);
+	dprintf1("MADV_HUGEPAGE ret: %d\n", ret);
+	ret = madvise(ptr, HPAGE_SIZE, MADV_WILLNEED);
+	dprintf1("MADV_WILLNEED ret: %d\n", ret);
+	memset(ptr, 0, HPAGE_SIZE);
+
+	dprintf1("mmap()'d thp for pkey %d @ %p\n", pkey, ptr);
+	return ptr;
+}
+
+void *malloc_pkey_hugetlb(long size, int prot, u16 pkey)
+{
+	void *ptr;
+	int flags = MAP_ANONYMOUS|MAP_PRIVATE|MAP_HUGETLB;
+
+	dprintf1("doing %s(%ld, %x, %x)\n", __func__, size, prot, pkey);
+	size = ALIGN(size, HPAGE_SIZE * 2);
+	pkey_assert(pkey < NR_PKEYS);
+	ptr = mmap(NULL, size, PROT_NONE, flags, -1, 0);
+	pkey_assert(ptr != (void *)-1);
+	mprotect_pkey(ptr, size, prot, pkey);
+
+	record_pkey_malloc(ptr, size);
+
+	dprintf1("mmap()'d hugetlbfs for pkey %d @ %p\n", pkey, ptr);
+	return ptr;
+}
+
+void *malloc_pkey_mmap_dax(long size, int prot, u16 pkey)
+{
+	void *ptr;
+	int fd;
+
+	dprintf1("doing %s(size=%ld, prot=0x%x, pkey=%d)\n", __func__, size, prot, pkey);
+	pkey_assert(pkey < NR_PKEYS);
+	fd = open("/dax/foo", O_RDWR);
+	assert(fd >= 0);
+
+	ptr = mmap(0, size, prot, MAP_SHARED, fd, 0);
+	pkey_assert(ptr != (void *)-1);
+
+	mprotect_pkey(ptr, size, prot, pkey);
+
+	record_pkey_malloc(ptr, size);
+
+	dprintf1("mmap()'d for pkey %d @ %p\n", pkey, ptr);
+	close(fd);
+	return ptr;
+}
+
+//void *malloc_pkey_with_mprotect(long size, int prot, u16 pkey)
+void *(*pkey_malloc[])(long size, int prot, u16 pkey) = {
+
+	malloc_pkey_with_mprotect,
+	malloc_pkey_anon_huge,
+	malloc_pkey_hugetlb,
+// can not do direct with the mprotect_pkey() API
+//	malloc_pkey_mmap_direct,
+//	malloc_pkey_mmap_dax,
+};
+
+void *malloc_pkey(long size, int prot, u16 pkey)
+{
+	void *ret;
+	static int malloc_type = 0;
+	int nr_malloc_types = ARRAY_SIZE(pkey_malloc);
+
+	pkey_assert(pkey < NR_PKEYS);
+	pkey_assert(malloc_type < nr_malloc_types);
+	ret = pkey_malloc[malloc_type](size, prot, pkey);
+	pkey_assert(ret != (void *)-1);
+	malloc_type++;
+	if (malloc_type >= nr_malloc_types)
+		malloc_type = (random()%nr_malloc_types);
+
+	dprintf3("%s(%ld, prot=%x, pkey=%x) returning: %p\n", __func__, size, prot, pkey, ret);
+	return ret;
+}
+
+int last_pkru_faults = 0;
+void expected_pk_fault(int pkey)
+{
+	dprintf2("%s(): last_pkru_faults: %d pkru_faults: %d\n",
+			__func__, last_pkru_faults, pkru_faults);
+	dprintf2("%s(%d): last_si_pkey: %d\n", __func__, pkey, last_si_pkey);
+	pkey_assert(last_pkru_faults + 1 == pkru_faults);
+	pkey_assert(last_si_pkey == pkey);
+	/*
+	 * The signal handler should have cleared out PKRU to let the
+	 * test program continue.  We now have to restore it.
+	 */
+	if (__rdpkru() != 0) {
+		pkey_assert(0);
+	}
+	__wrpkru(shadow_pkru);
+	dprintf1("%s() set PKRU=%x to restore state after signal nuked it\n",
+			__func__, shadow_pkru);
+	last_pkru_faults = pkru_faults;
+	last_si_pkey = -1;
+}
+
+int test_fds[10] = { -1 };
+int nr_test_fds;
+void __save_test_fd(int fd)
+{
+	pkey_assert(fd >= 0);
+	pkey_assert(nr_test_fds < ARRAY_SIZE(test_fds));
+	test_fds[nr_test_fds] = fd;
+	nr_test_fds++;
+}
+
+int get_test_read_fd(void)
+{
+	int test_fd = open("/etc/passwd", O_RDONLY);
+	__save_test_fd(test_fd);
+	return test_fd;
+}
+
+void close_test_fds(void)
+{
+	int i;
+
+	for (i = 0; i < nr_test_fds; i++) {
+		if (test_fds[i] < 0)
+			continue;
+		close(test_fds[i]);
+		test_fds[i] = -1;
+	}
+	nr_test_fds = 0;
+}
+
+void* malloc_one_page_of_each_pkey(void)
+{
+	int prot = PROT_READ|PROT_WRITE;
+	void *ret;
+	int i;
+
+	ret = mmap(NULL, PAGE_SIZE * NR_PKEYS, prot, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+	pkey_assert(ret != (void *)-1);
+	for (i = 0; i < NR_PKEYS; i++) {
+		int mprotect_ret;
+		mprotect_ret = mprotect_pkey(ret + i * PAGE_SIZE, PAGE_SIZE, prot, i);
+		pkey_assert(!mprotect_ret);
+	}
+	return ret;
+}
+
+__attribute__((noinline)) int read_ptr(int *ptr)
+{
+	return *ptr;
+}
+
+void test_read_of_write_disabled_region(int *ptr, u16 pkey)
+{
+	int ptr_contents;
+	dprintf1("disabling write access to PKEY[1], doing read\n");
+	pkey_write_deny(pkey);
+	ptr_contents = read_ptr(ptr);
+	dprintf1("*ptr: %d\n", ptr_contents);
+	dprintf1("\n");
+}
+void test_read_of_access_disabled_region(int *ptr, u16 pkey)
+{
+	int ptr_contents;
+	dprintf1("disabling access to PKEY[%02d], doing read @ %p\n", pkey, ptr);
+	pkey_access_deny(pkey);
+	ptr_contents = read_ptr(ptr);
+	dprintf1("*ptr: %d\n", ptr_contents);
+	expected_pk_fault(pkey);
+}
+void test_write_of_write_disabled_region(int *ptr, u16 pkey)
+{
+	dprintf1("disabling write access to PKEY[%02d], doing write\n", pkey);
+	pkey_write_deny(pkey);
+	*ptr = __LINE__;
+	expected_pk_fault(pkey);
+}
+void test_write_of_access_disabled_region(int *ptr, u16 pkey)
+{
+	dprintf1("disabling access to PKEY[%02d], doing write\n", pkey);
+	pkey_access_deny(pkey);
+	*ptr = __LINE__;
+	expected_pk_fault(pkey);
+}
+void test_kernel_write_of_access_disabled_region(int *ptr, u16 pkey)
+{
+	int ret;
+	int test_fd = get_test_read_fd();
+
+	dprintf1("disabling access to PKEY[%02d], having kernel read() to buffer\n", pkey);
+	pkey_access_deny(pkey);
+	ret = read(test_fd, ptr, 1);
+	dprintf1("read ret: %d\n", ret);
+	pkey_assert(ret);
+}
+void test_kernel_write_of_write_disabled_region(int *ptr, u16 pkey)
+{
+	int ret;
+	int test_fd = get_test_read_fd();
+
+	pkey_write_deny(pkey);
+	ret = read(test_fd, ptr, 100);
+	dprintf1("read ret: %d\n", ret);
+	if (ret < 0 && (DEBUG_LEVEL > 0))
+		perror("read");
+	pkey_assert(ret);
+}
+
+void test_kernel_gup_of_access_disabled_region(int *ptr, u16 pkey)
+{
+	int pipe_ret, vmsplice_ret;
+	struct iovec iov;
+	int pipe_fds[2];
+
+	pipe_ret = pipe(pipe_fds);
+
+	pkey_assert(pipe_ret == 0);
+	dprintf1("disabling access to PKEY[%02d], having kernel vmsplice from buffer\n", pkey);
+	pkey_access_deny(pkey);
+	iov.iov_base = ptr;
+	iov.iov_len = PAGE_SIZE;
+	vmsplice_ret = vmsplice(pipe_fds[1], &iov, 1, SPLICE_F_GIFT);
+	dprintf1("vmsplice() ret: %d\n", vmsplice_ret);
+	pkey_assert(vmsplice_ret == -1);
+
+	close(pipe_fds[0]);
+	close(pipe_fds[1]);
+}
+
+void test_kernel_gup_write_to_write_disabled_region(int *ptr, u16 pkey)
+{
+	int ignored = 0xdada;
+	int futex_ret;
+	int some_int = __LINE__;
+
+	dprintf1("disabling write to PKEY[%02d], doing futex gunk in buffer\n", pkey);
+	*ptr = some_int;
+	pkey_write_deny(pkey);
+	futex_ret = syscall(SYS_futex, ptr, FUTEX_WAIT, some_int-1, NULL, &ignored, ignored);
+	if (DEBUG_LEVEL > 0)
+		perror("futex");
+	dprintf1("futex() ret: %d\n", futex_ret);
+	//pkey_assert(vmsplice_ret == -1);
+}
+
+void test_ptrace_of_child(int *ptr, u16 pkey)
+{
+	void *buf = malloc_one_page_of_each_pkey();
+	pid_t child_pid = fork_lazy_child(buf);
+	void *ignored = 0;
+	long ret;
+	int i;
+	int status;
+
+	dprintf1("[%d] child pid: %d\n", getpid(), child_pid);
+
+	ret = ptrace(PTRACE_ATTACH, child_pid, ignored, ignored);
+	if (ret)
+		perror("attach");
+	dprintf1("[%d] attach ret: %ld %d\n", getpid(), ret, __LINE__);
+	pkey_assert(ret != -1);
+	ret = waitpid(child_pid, &status, WUNTRACED);
+	if ((ret != child_pid) || !(WIFSTOPPED(status)) ) {
+		fprintf(stderr, "weird waitpid result %ld stat %x\n", ret, status);
+		pkey_assert(0);
+	}
+	dprintf2("waitpid ret: %ld\n", ret);
+	dprintf2("waitpid status: %d\n", status);
+
+	//if (0)
+	for (i = 1; i < NR_PKEYS; i++) {
+		pkey_access_deny(i);
+		pkey_write_deny(i);
+	}
+	for (i = 0; i < NR_PKEYS; i++) {
+		void *peek_at = buf + i * PAGE_SIZE;
+		long peek_result;
+
+		//ret = ptrace(PTRACE_POKEDATA, child_pid, peek_at, data);
+		//pkey_assert(ret != -1);
+		//printf("poke at %p: %ld\n", peek_at, ret);
+
+		ret = ptrace(PTRACE_PEEKDATA, child_pid, peek_at, ignored);
+		pkey_assert(ret != -1);
+
+		peek_result = *(long *)peek_at;
+		// for the *peek_at access
+		if (i >= 1) // did not disable access to pkey 0
+			expected_pk_fault(i);
+
+		dprintf1("peek at pkey[%2d] @ %p: %lx (local: %ld) pkru: %08x\n", i, peek_at, ret, peek_result, rdpkru());
+	}
+	ret = ptrace(PTRACE_DETACH, child_pid, ignored, 0);
+	pkey_assert(ret != -1);
+
+	ret = kill(child_pid, SIGKILL);
+	pkey_assert(ret != -1);
+
+	ret = munmap(buf, PAGE_SIZE * NR_PKEYS);
+	pkey_assert(!ret);
+}
+
+void (*pkey_tests[])(int *ptr, u16 pkey) = {
+	test_read_of_write_disabled_region,
+	test_read_of_access_disabled_region,
+	test_write_of_write_disabled_region,
+	test_write_of_access_disabled_region,
+	test_kernel_write_of_access_disabled_region,
+	test_kernel_write_of_write_disabled_region,
+	test_kernel_gup_of_access_disabled_region,
+	test_kernel_gup_write_to_write_disabled_region,
+//	test_ptrace_of_child,
+};
+
+void run_tests_once(void)
+{
+	static int iteration_nr = 1;
+	int *ptr;
+	int prot = PROT_READ|PROT_WRITE;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(pkey_tests); i++) {
+		int orig_pkru_faults = pkru_faults;
+		// reset pkru:
+		wrpkru(0);
+
+		u16 pkey = 1 + (rand() % 15);
+		dprintf1("================\n");
+		dprintf1("test %d starting with pkey: %d\n", i, pkey);
+		tracing_on();
+		ptr = malloc_pkey(PAGE_SIZE, prot, pkey);
+		//dumpit("/proc/self/maps");
+		pkey_tests[i](ptr, pkey);
+		//sleep(999);
+		dprintf1("freeing test memory: %p\n", ptr);
+		free_pkey_malloc(ptr);
+
+		dprintf1("pkru_faults: %d\n", pkru_faults);
+		dprintf1("orig_pkru_faults: %d\n", orig_pkru_faults);
+
+		tracing_off();
+		close_test_fds();
+		//system("dmesg -c");
+		//sleep(2);
+		printf("test %d PASSED (itertation %d)\n", i, iteration_nr);
+		dprintf1("================\n\n");
+	}
+	iteration_nr++;
+}
+
+int main()
+{
+	int nr_iterations = 5;
+	setup_handlers();
+	printf("has pku: %d\n", cpu_has_pku());
+	printf("pkru: %x\n", rdpkru());
+	pkey_assert(cpu_has_pku());
+	pkey_assert(!rdpkru());
+
+	cat_into_file("10", "/proc/sys/vm/nr_hugepages");
+
+	while (nr_iterations-- > 0)
+		run_tests_once();
+
+	printf("done (all tests OK)\n");
+	return 0;
+}
+
_

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 25/25] x86, pkeys: Documentation
  2015-09-28 19:18 ` Dave Hansen
@ 2015-09-28 19:18   ` Dave Hansen
  -1 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>


Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/Documentation/x86/protection-keys.txt |   54 ++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff -puN /dev/null Documentation/x86/protection-keys.txt
--- /dev/null	2015-07-13 14:24:11.435656502 -0700
+++ b/Documentation/x86/protection-keys.txt	2015-09-28 11:40:16.120555350 -0700
@@ -0,0 +1,54 @@
+Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature
+which will be found on future Intel CPUs.
+
+Memory Protection Keys provides a mechanism for enforcing page-based
+protections, but without requiring modification of the page tables
+when an application changes protection domains.  It works by
+dedicating 4 previously ignored bits in each page table entry to a
+"protection key", giving 16 possible keys.
+
+There is also a new user-accessible register (PKRU) with two separate
+bits (Access Disable and Write Disable) for each key.  Being a CPU
+register, PKRU is inherently thread-local, potentially giving each
+thread a different set of protections from every other thread.
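+
+The two bits for key 'k' live at PKRU bit 2k (Access Disable) and bit
+2k+1 (Write Disable).  A sketch of mask helpers (illustrative only,
+not a proposed API):
+
+	#define PKEY_DISABLE_ACCESS(k)	(1u << (2 * (k)))
+	#define PKEY_DISABLE_WRITE(k)	(1u << (2 * (k) + 1))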
+
+There are two new instructions (RDPKRU/WRPKRU) for reading and writing
+to the new register.  The feature is only available in 64-bit mode,
+even though there is theoretically space in the PAE PTEs.  These
+permissions are enforced on data access only and have no effect on
+instruction fetches.
+
+The kernel attempts to make protection keys consistent with the
+behavior of a plain mprotect().  For instance if you do this:
+
+	mprotect(ptr, size, PROT_NONE);
+	something(ptr);
+
+you can expect the same effects with protection keys when doing this:
+
+	mprotect(ptr, size, PROT_READ|PROT_WRITE);
+	set_pkey(ptr, size, 4);
+	wrpkru(0x00000100); // set the Access Disable bit for pkey 4
+	something(ptr);
+
+That should be true whether something() is a direct access to 'ptr'
+like:
+
+	*ptr = foo;
+
+or when the kernel does the access on the application's behalf like
+with a read():
+
+	read(fd, ptr, 1);
+
+The kernel will send a SIGSEGV in both cases, but si_code will be set
+to SEGV_PKUERR when violating protection keys versus SEGV_ACCERR when
+the plain mprotect() permissions are violated.
+
+=========================== Config Option ===========================
+
+Enabling the config option adds approximately 1.5kb of text and 50
+bytes of
+data to the executable.  A workload which does large O_DIRECT reads
+of holes in XFS files was run to exercise get_user_pages_fast().  No
+performance delta was observed with the config option
+enabled or disabled.
_

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 24/25] x86, pkeys: add self-tests
@ 2015-09-28 19:18   ` Dave Hansen
  0 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-09-28 19:18 UTC (permalink / raw)
  To: dave; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

This code should be a good demonstration of how to use the new
mprotect_pkey() system call as well as how to use protection keys
in general.

This code shows how to:
1. Manipulate the Protection Keys Rights User (PKRU) register with
   wrpkru/rdpkru
2. Set a protection key on memory
3. Fetch and/or modify PKRU from the signal XSAVE state
4. Read the kernel-provided protection key in the siginfo

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/tools/testing/selftests/x86/Makefile          |    3 
 b/tools/testing/selftests/x86/pkey-helpers.h    |  182 +++++
 b/tools/testing/selftests/x86/protection_keys.c |  827 ++++++++++++++++++++++++
 3 files changed, 1011 insertions(+), 1 deletion(-)

diff -puN tools/testing/selftests/x86/Makefile~pkeys-40-selftests tools/testing/selftests/x86/Makefile
--- a/tools/testing/selftests/x86/Makefile~pkeys-40-selftests	2015-09-28 11:39:51.905453848 -0700
+++ b/tools/testing/selftests/x86/Makefile	2015-09-28 11:39:51.909454031 -0700
@@ -4,7 +4,8 @@ include ../lib.mk
 
 .PHONY: all all_32 all_64 warn_32bit_failure clean
 
-TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs ldt_gdt syscall_nt
+TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs ldt_gdt syscall_nt \
+			protection_keys
 TARGETS_C_32BIT_ONLY := entry_from_vm86 syscall_arg_fault sigreturn \
 			test_FCMOV test_FCOMI test_FISTTP
 
diff -puN /dev/null tools/testing/selftests/x86/pkey-helpers.h
--- /dev/null	2015-07-13 14:24:11.435656502 -0700
+++ b/tools/testing/selftests/x86/pkey-helpers.h	2015-09-28 11:39:51.909454031 -0700
@@ -0,0 +1,182 @@
+#define _GNU_SOURCE
+#include <string.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <signal.h>
+#include <assert.h>
+#include <stdlib.h>
+#include <ucontext.h>
+#include <sys/mman.h>
+
+#define NR_PKEYS 16
+
+#ifndef DEBUG_LEVEL
+#define DEBUG_LEVEL 0
+#endif
+#define dprintf_level(level, args...) do { if (level <= DEBUG_LEVEL) printf(args); } while (0)
+#define dprintf1(args...) dprintf_level(1, args)
+#define dprintf2(args...) dprintf_level(2, args)
+#define dprintf3(args...) dprintf_level(3, args)
+#define dprintf4(args...) dprintf_level(4, args)
+
+extern unsigned int shadow_pkru;
+static inline unsigned int __rdpkru(void)
+{
+	unsigned int eax, edx;
+	unsigned int ecx = 0;
+	unsigned int pkru;
+
+	asm volatile(".byte 0x0f,0x01,0xee\n\t"
+		     : "=a" (eax), "=d" (edx)
+		     : "c" (ecx));
+	pkru = eax;
+	return pkru;
+}
+
+static inline unsigned int rdpkru(void)
+{
+	unsigned int pkru = __rdpkru();
+	dprintf4("pkru: %x shadow: %x\n", pkru, shadow_pkru);
+	assert(pkru == shadow_pkru);
+	return pkru;
+}
+
+static inline void __wrpkru(unsigned int pkru)
+{
+	unsigned int eax = pkru;
+	unsigned int ecx = 0;
+	unsigned int edx = 0;
+
+	asm volatile(".byte 0x0f,0x01,0xef\n\t"
+		     : : "a" (eax), "c" (ecx), "d" (edx));
+	assert(pkru == __rdpkru());
+}
+
+static inline void wrpkru(unsigned int pkru)
+{
+	dprintf4("%s() changing %08x to %08x\n", __func__, __rdpkru(), pkru);
+	// will do the shadow check for us:
+	rdpkru();
+	__wrpkru(pkru);
+	shadow_pkru = pkru;
+	dprintf4("%s(%08x) pkru: %08x\n", __func__, pkru, __rdpkru());
+}
+
+/*
+ * These are technically racy. since something could
+ * change PKRU between the read and the write.
+ */
+static inline void __pkey_access_allow(int pkey, int do_allow)
+{
+	unsigned int pkru = rdpkru();
+	int bit = pkey * 2;
+
+	if (do_allow)
+		pkru &= ~(1<<bit);
+	else
+		pkru |= (1<<bit);
+
+	wrpkru(pkru);
+	dprintf4("pkru now: %08x\n", rdpkru());
+}
+static inline void __pkey_write_allow(int pkey, int do_allow_write)
+{
+	unsigned int pkru = rdpkru();
+	int bit = pkey * 2 + 1;
+
+	if (do_allow_write)
+		pkru &= ~(1<<bit);
+	else
+		pkru |= (1<<bit);
+
+	wrpkru(pkru);
+	dprintf4("pkru now: %08x\n", rdpkru());
+}
+#define pkey_access_allow(pkey) __pkey_access_allow(pkey, 1)
+#define pkey_access_deny(pkey)  __pkey_access_allow(pkey, 0)
+#define pkey_write_allow(pkey)  __pkey_write_allow(pkey, 1)
+#define pkey_write_deny(pkey)   __pkey_write_allow(pkey, 0)
+
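+/*
+ * Typical usage (a sketch): deny, touch the memory to provoke the
+ * fault, then restore access:
+ *
+ *	pkey_access_deny(pkey);
+ *	x = *ptr;		// expect SEGV_PKUERR here
+ *	pkey_access_allow(pkey);
+ */
+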
+#define PROT_PKEY0     0x10            /* protection key value (bit 0) */
+#define PROT_PKEY1     0x20            /* protection key value (bit 1) */
+#define PROT_PKEY2     0x40            /* protection key value (bit 2) */
+#define PROT_PKEY3     0x80            /* protection key value (bit 3) */
+
+#define PAGE_SIZE 4096
+#define MB	(1<<20)
+
+static inline void __cpuid(unsigned int *eax, unsigned int *ebx,
+                                unsigned int *ecx, unsigned int *edx)
+{
+	/* ecx is often an input as well as an output. */
+	asm volatile(
+		"cpuid;"
+		: "=a" (*eax),
+		  "=b" (*ebx),
+		  "=c" (*ecx),
+		  "=d" (*edx)
+		: "0" (*eax), "2" (*ecx));
+}
+
+/* Intel-defined CPU features, CPUID level 0x00000007:0 (ecx) */
+#define X86_FEATURE_PKU        (1<<3) /* Protection Keys for Userspace */
+#define X86_FEATURE_OSPKE      (1<<4) /* OS Protection Keys Enable */
+
+static inline int cpu_has_pku(void)
+{
+	unsigned int eax;
+	unsigned int ebx;
+	unsigned int ecx;
+	unsigned int edx;
+	eax = 0x7;
+	ecx = 0x0;
+	__cpuid(&eax, &ebx, &ecx, &edx);
+
+	if (!(ecx & X86_FEATURE_PKU)) {
+		printf("cpu does not have PKU\n");
+		return 0;
+	}
+	if (!(ecx & X86_FEATURE_OSPKE)) {
+		printf("cpu does not have OSPKE\n");
+		return 0;
+	}
+	return 1;
+}
+
+#define XSTATE_PKRU_BIT	(9)
+#define XSTATE_PKRU	0x200
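+// PKRU is xstate component 9, so its XCR0/XSTATE_BV mask is 1<<9 == 0x200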
+
+int pkru_xstate_offset(void)
+{
+	unsigned int eax;
+	unsigned int ebx;
+	unsigned int ecx;
+	unsigned int edx;
+	int xstate_offset;
+	int xstate_size;
+	unsigned long XSTATE_CPUID = 0xd;
+
+	// assume that XSTATE_PKRU is set in XCR0
+	// CPUID leaf 0xd, subleaf 9 reports the PKRU state component:
+	// eax is its size, ebx its offset into the XSAVE buffer
+	eax = XSTATE_CPUID;
+	ecx = XSTATE_PKRU_BIT;
+	__cpuid(&eax, &ebx, &ecx, &edx);
+	xstate_offset = ebx;
+	xstate_size = eax;
+
+	if (xstate_size == 0) {
+		printf("could not find size/offset of PKRU in xsave state\n");
+		return 0;
+	}
+
+	return xstate_offset;
+}
diff -puN /dev/null tools/testing/selftests/x86/protection_keys.c
--- /dev/null	2015-07-13 14:24:11.435656502 -0700
+++ b/tools/testing/selftests/x86/protection_keys.c	2015-09-28 11:39:51.910454076 -0700
@@ -0,0 +1,827 @@
+/*
+ * Tests x86 Memory Protection Keys (see Documentation/x86/protection-keys.txt)
+ *
+ * There are examples in here of:
+ *  * how to set protection keys on memory
+ *  * how to set/clear bits in PKRU (the rights register)
+ *  * how to handle SEGV_PKUERR signals and extract pkey-relevant
+ *    information from the siginfo
+ *
+ * Things to add:
+ *	make sure KSM and KSM COW breaking works
+ *	prefault pages in at malloc, or not
+ *	protect MPX bounds tables with protection keys?
+ *	make sure VMA splitting/merging is working correctly
+ *	OOMs can destroy mm->mmap (see exit_mmap()), so make sure it is immune to pkeys
+ *
+ * Compile like this:
+ * 	gcc      -o protection_keys    -O2 -g -std=gnu99 -pthread -Wall protection_keys.c -lrt -ldl -lm
+ *	gcc -m32 -o protection_keys_32 -O2 -g -std=gnu99 -pthread -Wall protection_keys.c -lrt -ldl -lm
+ */
+#define _GNU_SOURCE
+#include <errno.h>
+#include <linux/futex.h>
+#include <sys/time.h>
+#include <sys/syscall.h>
+#include <string.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <signal.h>
+#include <assert.h>
+#include <stdlib.h>
+#include <ucontext.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/ptrace.h>
+
+#include "pkey-helpers.h"
+
+unsigned int shadow_pkru;
+
+#define HPAGE_SIZE	(1UL<<21)
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
+#define ALIGN(x, align_to)    (((x) + ((align_to)-1)) & ~((align_to)-1))
+#define ALIGN_PTR(p, ptr_align_to)    ((typeof(p))ALIGN((unsigned long)(p), ptr_align_to))
+
+extern void abort_hooks(void);
+#define pkey_assert(condition) do {		\
+	if (!(condition)) {			\
+		abort_hooks();			\
+		perror("errno at assert");	\
+		assert(condition);		\
+	}					\
+} while (0)
+#define raw_assert(cond) assert(cond)
+
+
+/* userspace-visible si_code values; the kernel strips its internal
+ * __SI_FAULT bits before delivering siginfo */
+#define SEGV_BNDERR     3  /* failed address bound checks */
+#define SEGV_PKUERR     4  /* failed protection-key checks */
+
+void cat_into_file(char *str, char *file)
+{
+	int fd = open(file, O_RDWR);
+	int ret;
+	// these need to be raw because they are called under
+	// pkey_assert()
+	raw_assert(fd >= 0);
+	ret = write(fd, str, strlen(str));
+	if (ret != strlen(str)) {
+		perror("write to file failed");
+		fprintf(stderr, "filename: '%s'\n", file);
+		raw_assert(0);
+	}
+	close(fd);
+}
+
+void tracing_on(void)
+{
+#ifdef CONTROL_TRACING
+	char pidstr[32];
+	sprintf(pidstr, "%d", getpid());
+	//cat_into_file("20000", "/sys/kernel/debug/tracing/buffer_size_kb");
+	cat_into_file("0", "/sys/kernel/debug/tracing/tracing_on");
+	cat_into_file("\n", "/sys/kernel/debug/tracing/trace");
+	if (1) {
+		cat_into_file("function_graph", "/sys/kernel/debug/tracing/current_tracer");
+		cat_into_file("1", "/sys/kernel/debug/tracing/options/funcgraph-proc");
+	} else {
+		cat_into_file("nop", "/sys/kernel/debug/tracing/current_tracer");
+	}
+	cat_into_file(pidstr, "/sys/kernel/debug/tracing/set_ftrace_pid");
+	cat_into_file("1", "/sys/kernel/debug/tracing/tracing_on");
+#endif
+}
+
+void tracing_off(void)
+{
+#ifdef CONTROL_TRACING
+	cat_into_file("0", "/sys/kernel/debug/tracing/tracing_on");
+#endif
+}
+
+void abort_hooks(void)
+{
+	fprintf(stderr, "running %s()...\n", __func__);
+	tracing_off();
+}
+
+static char *si_code_str(int si_code)
+{
+	if (si_code == SEGV_MAPERR)
+		return "SEGV_MAPERR";
+	if (si_code == SEGV_ACCERR)
+		return "SEGV_ACCERR";
+	if (si_code == SEGV_BNDERR)
+		return "SEGV_BNDERR";
+	if (si_code == SEGV_PKUERR)
+		return "SEGV_PKUERR";
+	return "UNKNOWN";
+}
+
+// I'm addicted to the kernel types
+#define  u8 uint8_t
+#define u16 uint16_t
+#define u32 uint32_t
+#define u64 uint64_t
+
+#ifdef __i386__
+#define SYS_mprotect_key 376
+#define REG_IP_IDX REG_EIP
+#define si_pkey_offset 0x08
+#else
+#define SYS_mprotect_key 325
+#define REG_IP_IDX REG_RIP
+#define si_pkey_offset 0x20
+#endif
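+
+/*
+ * si_pkey_offset is the byte offset of the u64 si_pkey field within
+ * siginfo (it shares space with the MPX bound fields); handler() reads
+ * it directly since libc headers do not know about the field yet.
+ */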
+
+void dump_mem(void *dumpme, int len_bytes)
+{
+	char *c = (void *)dumpme;
+	int i;
+	for (i = 0; i < len_bytes; i+= sizeof(u64)) {
+		dprintf1("dump[%03d]: %016jx\n", i, *(u64 *)(c + i));
+	}
+}
+
+
+int pkru_faults = 0;
+int last_si_pkey = -1;
+void handler(int signum, siginfo_t* si, void* vucontext)
+{
+	ucontext_t* uctxt = vucontext;
+	int trapno;
+	unsigned long ip;
+	char *fpregs;
+	u32 *pkru_ptr;
+	u64 si_pkey;
+	int pkru_offset;
+
+	trapno = uctxt->uc_mcontext.gregs[REG_TRAPNO];
+	ip = uctxt->uc_mcontext.gregs[REG_IP_IDX];
+	fpregset_t fpregset = uctxt->uc_mcontext.fpregs;
+	fpregs = (void *)fpregset;
+	pkru_offset = pkru_xstate_offset();
+	pkru_ptr = (void *)(&fpregs[pkru_offset]);
+
+	/*
+	 * If we got a PKRU fault, we *HAVE* to have at least one bit set in
+	 * here.
+	 */
+	dprintf1("pkru_xstate_offset: %d\n", pkru_xstate_offset());
+	dump_mem(pkru_ptr - 8, 24);
+	assert(*pkru_ptr);
+
+	si_pkey = *(u64 *)(((u8 *)si) + si_pkey_offset);
+	last_si_pkey = si_pkey;
+
+	dprintf1("\n===================SIGSEGV============================\n");
+	dprintf2("%s() trapno: %d ip: 0x%lx info->si_code: %s/%d\n", __func__, trapno, ip,
+			si_code_str(si->si_code), si->si_code);
+	if ((si->si_code == SEGV_MAPERR) ||
+	    (si->si_code == SEGV_ACCERR) ||
+	    (si->si_code == SEGV_BNDERR)) {
+		printf("non-PK si_code, exiting...\n");
+		exit(4);
+	}
+
+	//printf("pkru_xstate_offset(): %d\n", pkru_xstate_offset());
+	dprintf1("signal pkru from xsave: %08x\n", *pkru_ptr);
+	// need __ version so we do not do shadow_pkru checking
+	dprintf1("signal pkru from  pkru: %08x\n", __rdpkru());
+	dprintf1("si_pkey from siginfo: %jx\n", si_pkey);
+	*pkru_ptr = 0;
+	dprintf1("WARNING: set PRKU=0 to allow faulting instruction to continue\n");
+	pkru_faults++;
+	dprintf1("======================================================\n\n");
+	return;
+	/* the trap dump below is an intentionally unreachable debugging aid */
+	if (trapno == 14) {
+		fprintf(stderr,
+			"ERROR: In signal handler, page fault, trapno = %d, ip = %016lx\n",
+			trapno, ip);
+		fprintf(stderr, "si_addr %p\n", si->si_addr);
+		fprintf(stderr, "REG_ERR: %lx\n", (unsigned long)uctxt->uc_mcontext.gregs[REG_ERR]);
+		//sleep(999);
+		exit(1);
+	} else {
+		fprintf(stderr,"unexpected trap %d! at 0x%lx\n", trapno, ip);
+		fprintf(stderr, "si_addr %p\n", si->si_addr);
+		fprintf(stderr, "REG_ERR: %lx\n", (unsigned long)uctxt->uc_mcontext.gregs[REG_ERR]);
+		exit(2);
+	}
+}
+
+int wait_all_children(void)
+{
+	int status;
+	return waitpid(-1, &status, 0);
+}
+
+void sig_chld(int x)
+{
+	dprintf2("[%d] SIGCHLD: %d\n", getpid(), x);
+}
+
+void setup_sigsegv_handler()
+{
+	int r,rs;
+	struct sigaction newact;
+	struct sigaction oldact;
+
+	/* #PF is mapped to sigsegv */
+	int signum  = SIGSEGV;
+
+	newact.sa_handler = 0;   /* void(*)(int)*/
+	newact.sa_sigaction = handler; /* void (*)(int, siginfo_t*, void*) */
+
+	/*sigset_t - signals to block while in the handler */
+	/* get the old signal mask. */
+	rs = sigprocmask(SIG_SETMASK, 0, &newact.sa_mask);
+	pkey_assert(rs == 0);
+
+	/* call sa_sigaction, not sa_handler*/
+	newact.sa_flags = SA_SIGINFO;
+
+	newact.sa_restorer = 0;  /* void(*)(), obsolete */
+	r = sigaction(signum, &newact, &oldact);
+	pkey_assert(r == 0);
+	r = sigaction(SIGALRM, &newact, &oldact);
+	pkey_assert(r == 0);
+}
+
+void setup_handlers(void)
+{
+	signal(SIGCHLD, &sig_chld);
+	setup_sigsegv_handler();
+}
+
+void tag_each_buffer_page(void *buf, int nr_pages, unsigned long tag)
+{
+	int i;
+
+	for (i = 0; i < nr_pages; i++) {
+		unsigned long *tag_at = (buf + i * PAGE_SIZE);
+		*tag_at = tag;
+	}
+}
+
+pid_t fork_lazy_child(void *buf)
+{
+	pid_t forkret;
+
+	// Fault in and tag the pages before fork() so both sides inherit them
+	tag_each_buffer_page(buf, NR_PKEYS, 0xDEADBEEFUL);
+
+	forkret = fork();
+	pkey_assert(forkret >= 0);
+	dprintf3("[%d] fork() ret: %d\n", getpid(), forkret);
+
+	// Re-tag with our own pid; this runs in both parent and child
+	tag_each_buffer_page(buf, NR_PKEYS, getpid());
+
+	if (!forkret) {
+		/* in the child */
+		while (1) {
+			dprintf1("child sleeping...\n");
+			sleep(30);
+		}
+	}
+	return forkret;
+}
+
+void davecmp(void *_a, void *_b, int len)
+{
+	int i;
+	unsigned long *a = _a;
+	unsigned long *b = _b;
+	for (i = 0; i < len / sizeof(*a); i++) {
+		if (a[i] == b[i])
+			continue;
+
+		dprintf3("[%3d]: a: %016lx b: %016lx\n", i, a[i], b[i]);
+	}
+}
+
+void dumpit(char *f)
+{
+	int fd = open(f, O_RDONLY);
+	char buf[100];
+	int nr_read;
+
+	dprintf2("maps fd: %d\n", fd);
+	do {
+		nr_read = read(fd, &buf[0], sizeof(buf));
+		if (nr_read <= 0)
+			break;
+		write(1, buf, nr_read);
+	} while (nr_read > 0);
+	close(fd);
+}
+
+int mprotect_pkey(void *ptr, size_t size, unsigned long orig_prot, unsigned long pkey)
+{
+	int sret;
+	pkey_assert(pkey < NR_PKEYS);
+
+	// do not let 'prot' protection key bits be set here
+	assert(orig_prot < 0x10);
+	errno = 0;
+	sret = syscall(SYS_mprotect_key, ptr, size, orig_prot, pkey);
+	if (errno) {
+		dprintf1("SYS_mprotect_key sret: %d\n", sret);
+		dprintf1("SYS_mprotect_key prot: 0x%lx\n", orig_prot);
+		dprintf1("SYS_mprotect_key failed, errno: %d\n", errno);
+		assert(0);
+	}
+	return sret;
+}
+
+struct pkey_malloc_record {
+	void *ptr;
+	long size;
+};
+struct pkey_malloc_record *pkey_malloc_records;
+long nr_pkey_malloc_records;
+void record_pkey_malloc(void *ptr, long size)
+{
+	long i;
+	struct pkey_malloc_record *rec = NULL;
+
+	for (i = 0; i < nr_pkey_malloc_records; i++) {
+		rec = &pkey_malloc_records[i];
+		// find a free record
+		if (rec)
+			break;
+	}
+	if (!rec) {
+		// every record is full
+		size_t old_nr_records = nr_pkey_malloc_records;
+		size_t new_nr_records = (nr_pkey_malloc_records * 2 + 1);
+		size_t new_size = new_nr_records * sizeof(struct pkey_malloc_record);
+		dprintf1("new_nr_records: %zd\n", new_nr_records);
+		dprintf1("new_size: %zd\n", new_size);
+		pkey_malloc_records = realloc(pkey_malloc_records, new_size);
+		pkey_assert(pkey_malloc_records != NULL);
+		rec = &pkey_malloc_records[nr_pkey_malloc_records];
+	// realloc() does not initialize memory, so zero it from
+		// the first new record all the way to the end.
+		for (i = 0; i < new_nr_records - old_nr_records; i++)
+			memset(rec + i, 0, sizeof(*rec));
+	}
+	dprintf3("filling malloc record[%d/%p]: {%p, %ld}\n",
+		(int)(rec - pkey_malloc_records), rec, ptr, size);
+	rec->ptr = ptr;
+	rec->size = size;
+	nr_pkey_malloc_records++;
+}
+
+void free_pkey_malloc(void *ptr)
+{
+	long i;
+	int ret;
+	dprintf3("%s(%p)\n", __func__, ptr);
+	for (i = 0; i < nr_pkey_malloc_records; i++) {
+		struct pkey_malloc_record *rec = &pkey_malloc_records[i];
+		dprintf4("looking for ptr %p at record[%ld/%p]: {%p, %ld}\n",
+				ptr, i, rec, rec->ptr, rec->size);
+		if ((ptr <  rec->ptr) ||
+		    (ptr >= rec->ptr + rec->size))
+			continue;
+
+		dprintf3("found ptr %p at record[%ld/%p]: {%p, %ld}\n",
+				ptr, i, rec, rec->ptr, rec->size);
+		nr_pkey_malloc_records--;
+		ret = munmap(rec->ptr, rec->size);
+		dprintf3("munmap ret: %d\n", ret);
+		pkey_assert(!ret);
+		dprintf3("clearing rec->ptr, rec: %p\n", rec);
+		rec->ptr = NULL;
+		dprintf3("done clearing rec->ptr, rec: %p\n", rec);
+		return;
+	}
+	pkey_assert(false);
+}
+
+
+void *malloc_pkey_with_mprotect(long size, int prot, u16 pkey)
+{
+	void *ptr;
+	int ret;
+
+	dprintf1("doing %s(size=%ld, prot=0x%x, pkey=%d)\n", __func__, size, prot, pkey);
+	pkey_assert(pkey < NR_PKEYS);
+	ptr = mmap(NULL, size, prot, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+	pkey_assert(ptr != (void *)-1);
+	ret = mprotect_pkey(ptr, size, prot, pkey);
+	pkey_assert(!ret);
+	record_pkey_malloc(ptr, size);
+
+	dprintf1("%s() for pkey %d @ %p\n", __func__, pkey, ptr);
+	return ptr;
+}
+
+
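+/*
+ * prot_add_pkey() is used below but is not defined anywhere in this
+ * posting; a minimal sketch, assuming the PROT_PKEY0..3 bits from
+ * pkey-helpers.h encode the key directly in 'prot':
+ */
+static int prot_add_pkey(int prot, u16 pkey)
+{
+	pkey_assert(pkey < NR_PKEYS);
+	return prot | (pkey * PROT_PKEY0);	// pkey bits start at 0x10
+}
+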
+void *malloc_pkey_mmap_direct(long size, int prot, u16 pkey)
+{
+	void *ptr;
+
+	dprintf1("doing %s(size=%ld, prot=0x%x, pkey=%d)\n", __func__, size, prot, pkey);
+	pkey_assert(pkey < NR_PKEYS);
+	prot = prot_add_pkey(prot, pkey);
+	ptr = mmap(NULL, size, prot, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+	pkey_assert(ptr != (void *)-1);
+
+	record_pkey_malloc(ptr, size);
+
+	dprintf1("mmap()'d for pkey %d @ %p\n", pkey, ptr);
+	return ptr;
+}
+
+void *malloc_pkey_anon_huge(long size, int prot, u16 pkey)
+{
+	int ret;
+	void *ptr;
+
+	dprintf1("doing %s(size=%ld, prot=0x%x, pkey=%d)\n", __func__, size, prot, pkey);
+	// Guarantee we can fit at least one huge page in the resulting
+	// allocation by allocating space for 2:
+	size = ALIGN(size, HPAGE_SIZE * 2);
+	ptr = mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+	pkey_assert(ptr != (void *)-1);
+	record_pkey_malloc(ptr, size);
+	mprotect_pkey(ptr, size, prot, pkey);
+
+	dprintf1("unaligned ptr: %p\n", ptr);
+	ptr = ALIGN_PTR(ptr, HPAGE_SIZE);
+	dprintf1("  aligned ptr: %p\n", ptr);
+	ret = madvise(ptr, HPAGE_SIZE, MADV_HUGEPAGE);
+	dprintf1("MADV_HUGEPAGE ret: %d\n", ret);
+	ret = madvise(ptr, HPAGE_SIZE, MADV_WILLNEED);
+	dprintf1("MADV_WILLNEED ret: %d\n", ret);
+	memset(ptr, 0, HPAGE_SIZE);
+
+	dprintf1("mmap()'d thp for pkey %d @ %p\n", pkey, ptr);
+	return ptr;
+}
+
+void *malloc_pkey_hugetlb(long size, int prot, u16 pkey)
+{
+	void *ptr;
+	int flags = MAP_ANONYMOUS|MAP_PRIVATE|MAP_HUGETLB;
+
+	dprintf1("doing %s(%ld, %x, %x)\n", __func__, size, prot, pkey);
+	size = ALIGN(size, HPAGE_SIZE * 2);
+	pkey_assert(pkey < NR_PKEYS);
+	ptr = mmap(NULL, size, PROT_NONE, flags, -1, 0);
+	pkey_assert(ptr != (void *)-1);
+	mprotect_pkey(ptr, size, prot, pkey);
+
+	record_pkey_malloc(ptr, size);
+
+	dprintf1("mmap()'d hugetlbfs for pkey %d @ %p\n", pkey, ptr);
+	return ptr;
+}
+
+void *malloc_pkey_mmap_dax(long size, int prot, u16 pkey)
+{
+	void *ptr;
+	int fd;
+
+	dprintf1("doing %s(size=%ld, prot=0x%x, pkey=%d)\n", __func__, size, prot, pkey);
+	pkey_assert(pkey < NR_PKEYS);
+	fd = open("/dax/foo", O_RDWR);
+	assert(fd >= 0);
+
+	ptr = mmap(0, size, prot, MAP_SHARED, fd, 0);
+	pkey_assert(ptr != (void *)-1);
+
+	mprotect_pkey(ptr, size, prot, pkey);
+
+	record_pkey_malloc(ptr, size);
+
+	dprintf1("mmap()'d for pkey %d @ %p\n", pkey, ptr);
+	close(fd);
+	return ptr;
+}
+
+void *(*pkey_malloc[])(long size, int prot, u16 pkey) = {
+	malloc_pkey_with_mprotect,
+	malloc_pkey_anon_huge,
+	malloc_pkey_hugetlb,
+// can not do direct with the mprotect_pkey() API
+//	malloc_pkey_mmap_direct,
+//	malloc_pkey_mmap_dax,
+};
+
+void *malloc_pkey(long size, int prot, u16 pkey)
+{
+	void *ret;
+	static int malloc_type = 0;
+	int nr_malloc_types = ARRAY_SIZE(pkey_malloc);
+
+	pkey_assert(pkey < NR_PKEYS);
+	pkey_assert(malloc_type < nr_malloc_types);
+	ret = pkey_malloc[malloc_type](size, prot, pkey);
+	pkey_assert(ret != (void *)-1);
+	malloc_type++;
+	if (malloc_type >= nr_malloc_types)
+		malloc_type = (random()%nr_malloc_types);
+
+	dprintf3("%s(%ld, prot=%x, pkey=%x) returning: %p\n", __func__, size, prot, pkey, ret);
+	return ret;
+}
+
+int last_pkru_faults = 0;
+void expected_pk_fault(int pkey)
+{
+	dprintf2("%s(): last_pkru_faults: %d pkru_faults: %d\n",
+			__func__, last_pkru_faults, pkru_faults);
+	dprintf2("%s(%d): last_si_pkey: %d\n", __func__, pkey, last_si_pkey);
+	pkey_assert(last_pkru_faults + 1 == pkru_faults);
+	pkey_assert(last_si_pkey == pkey);
+	/*
+	 * The signal handler should have cleared out PKRU to let the
+	 * test program continue.  We now have to restore it.
+	 */
+	pkey_assert(__rdpkru() == 0);
+	__wrpkru(shadow_pkru);
+	dprintf1("%s() set PKRU=%x to restore state after signal nuked it\n",
+			__func__, shadow_pkru);
+	last_pkru_faults = pkru_faults;
+	last_si_pkey = -1;
+}
+
+int test_fds[10] = { -1 };
+int nr_test_fds;
+void __save_test_fd(int fd)
+{
+	pkey_assert(fd >= 0);
+	pkey_assert(nr_test_fds < ARRAY_SIZE(test_fds));
+	test_fds[nr_test_fds] = fd;
+	nr_test_fds++;
+}
+
+int get_test_read_fd(void)
+{
+	int test_fd = open("/etc/passwd", O_RDONLY);
+	__save_test_fd(test_fd);
+	return test_fd;
+}
+
+void close_test_fds(void)
+{
+	int i;
+
+	for (i = 0; i < nr_test_fds; i++) {
+		if (test_fds[i] < 0)
+			continue;
+		close(test_fds[i]);
+		test_fds[i] = -1;
+	}
+	nr_test_fds = 0;
+}
+
+void* malloc_one_page_of_each_pkey(void)
+{
+	int prot = PROT_READ|PROT_WRITE;
+	void *ret;
+	int i;
+
+	ret = mmap(NULL, PAGE_SIZE * NR_PKEYS, prot, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+	pkey_assert(ret != (void *)-1);
+	for (i = 0; i < NR_PKEYS; i++) {
+		int mprotect_ret;
+		mprotect_ret = mprotect_pkey(ret + i * PAGE_SIZE, PAGE_SIZE, prot, i);
+		pkey_assert(!mprotect_ret);
+	}
+	return ret;
+}
+
+__attribute__((noinline)) int read_ptr(int *ptr)
+{
+	return *ptr;
+}
+
+void test_read_of_write_disabled_region(int *ptr, u16 pkey)
+{
+	int ptr_contents;
+	dprintf1("disabling write access to PKEY[1], doing read\n");
+	pkey_write_deny(pkey);
+	ptr_contents = read_ptr(ptr);
+	dprintf1("*ptr: %d\n", ptr_contents);
+	dprintf1("\n");
+}
+void test_read_of_access_disabled_region(int *ptr, u16 pkey)
+{
+	int ptr_contents;
+	dprintf1("disabling access to PKEY[%02d], doing read @ %p\n", pkey, ptr);
+	pkey_access_deny(pkey);
+	ptr_contents = read_ptr(ptr);
+	dprintf1("*ptr: %d\n", ptr_contents);
+	expected_pk_fault(pkey);
+}
+void test_write_of_write_disabled_region(int *ptr, u16 pkey)
+{
+	dprintf1("disabling write access to PKEY[%02d], doing write\n", pkey);
+	pkey_write_deny(pkey);
+	*ptr = __LINE__;
+	expected_pk_fault(pkey);
+}
+void test_write_of_access_disabled_region(int *ptr, u16 pkey)
+{
+	dprintf1("disabling access to PKEY[%02d], doing write\n", pkey);
+	pkey_access_deny(pkey);
+	*ptr = __LINE__;
+	expected_pk_fault(pkey);
+}
+void test_kernel_write_of_access_disabled_region(int *ptr, u16 pkey)
+{
+	int ret;
+	int test_fd = get_test_read_fd();
+
+	dprintf1("disabling access to PKEY[%02d], having kernel read() to buffer\n", pkey);
+	pkey_access_deny(pkey);
+	ret = read(test_fd, ptr, 1);
+	dprintf1("read ret: %d\n", ret);
+	pkey_assert(ret);
+}
+void test_kernel_write_of_write_disabled_region(int *ptr, u16 pkey)
+{
+	int ret;
+	int test_fd = get_test_read_fd();
+
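+	/* like the access-disabled test above, but with only writes denied */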
+	pkey_write_deny(pkey);
+	ret = read(test_fd, ptr, 100);
+	dprintf1("read ret: %d\n", ret);
+	if (ret < 0 && (DEBUG_LEVEL > 0))
+		perror("read");
+	pkey_assert(ret);
+}
+
+void test_kernel_gup_of_access_disabled_region(int *ptr, u16 pkey)
+{
+	int pipe_ret, vmsplice_ret;
+	struct iovec iov;
+	int pipe_fds[2];
+
+	pipe_ret = pipe(pipe_fds);
+
+	pkey_assert(pipe_ret == 0);
+	dprintf1("disabling access to PKEY[%02d], having kernel vmsplice from buffer\n", pkey);
+	pkey_access_deny(pkey);
+	iov.iov_base = ptr;
+	iov.iov_len = PAGE_SIZE;
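+	/* vmsplice() pins the pages with get_user_pages(); with access
+	 * disabled that should fail, so expect a -1/EFAULT return */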
+	vmsplice_ret = vmsplice(pipe_fds[1], &iov, 1, SPLICE_F_GIFT);
+	dprintf1("vmsplice() ret: %d\n", vmsplice_ret);
+	pkey_assert(vmsplice_ret == -1);
+
+	close(pipe_fds[0]);
+	close(pipe_fds[1]);
+}
+
+void test_kernel_gup_write_to_write_disabled_region(int *ptr, u16 pkey)
+{
+	int ignored = 0xdada;
+	int futex_ret;
+	int some_int = __LINE__;
+
+	dprintf1("disabling write to PKEY[%02d], doing futex gunk in buffer\n", pkey);
+	*ptr = some_int;
+	pkey_write_deny(pkey);
+	futex_ret = syscall(SYS_futex, ptr, FUTEX_WAIT, some_int-1, NULL, &ignored, ignored);
+	if (DEBUG_LEVEL > 0)
+		perror("futex");
+	dprintf1("futex() ret: %d\n", futex_ret);
+	//pkey_assert(futex_ret == -1);
+}
+
+void test_ptrace_of_child(int *ptr, u16 pkey)
+{
+	void *buf = malloc_one_page_of_each_pkey();
+	pid_t child_pid = fork_lazy_child(buf);
+	void *ignored = 0;
+	long ret;
+	int i;
+	int status;
+
+	dprintf1("[%d] child pid: %d\n", getpid(), child_pid);
+
+	ret = ptrace(PTRACE_ATTACH, child_pid, ignored, ignored);
+	if (ret)
+		perror("attach");
+	dprintf1("[%d] attach ret: %ld %d\n", getpid(), ret, __LINE__);
+	pkey_assert(ret != -1);
+	ret = waitpid(child_pid, &status, WUNTRACED);
+	if ((ret != child_pid) || !WIFSTOPPED(status)) {
+		fprintf(stderr, "weird waitpid result %ld stat %x\n", ret, status);
+		pkey_assert(0);
+	}
+	dprintf2("waitpid ret: %ld\n", ret);
+	dprintf2("waitpid status: %d\n", status);
+
+	//if (0)
+	for (i = 1; i < NR_PKEYS; i++) {
+		pkey_access_deny(i);
+		pkey_write_deny(i);
+	}
+	for (i = 0; i < NR_PKEYS; i++) {
+		void *peek_at = buf + i * PAGE_SIZE;
+		long peek_result;
+
+		//ret = ptrace(PTRACE_POKEDATA, child_pid, peek_at, data);
+		//pkey_assert(ret != -1);
+		//printf("poke at %p: %ld\n", peek_at, ret);
+
+		ret = ptrace(PTRACE_PEEKDATA, child_pid, peek_at, ignored);
+		pkey_assert(ret != -1);
+
+		peek_result = *(long *)peek_at;
+		// for the *peek_at access
+		if (i >= 1) // did not disable access to pkey 0
+			expected_pk_fault(i);
+
+		dprintf1("peek at pkey[%2d] @ %p: %lx (local: %ld) pkru: %08x\n", i, peek_at, ret, peek_result, rdpkru());
+	}
+	ret = ptrace(PTRACE_DETACH, child_pid, ignored, 0);
+	pkey_assert(ret != -1);
+
+	ret = kill(child_pid, SIGKILL);
+	pkey_assert(ret != -1);
+
+	ret = munmap(buf, PAGE_SIZE * NR_PKEYS);
+	pkey_assert(!ret);
+}
+
+void (*pkey_tests[])(int *ptr, u16 pkey) = {
+	test_read_of_write_disabled_region,
+	test_read_of_access_disabled_region,
+	test_write_of_write_disabled_region,
+	test_write_of_access_disabled_region,
+	test_kernel_write_of_access_disabled_region,
+	test_kernel_write_of_write_disabled_region,
+	test_kernel_gup_of_access_disabled_region,
+	test_kernel_gup_write_to_write_disabled_region,
+//	test_ptrace_of_child,
+};
+
+void run_tests_once(void)
+{
+	static int iteration_nr = 1;
+	int *ptr;
+	int prot = PROT_READ|PROT_WRITE;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(pkey_tests); i++) {
+		int orig_pkru_faults = pkru_faults;
+		// reset pkru:
+		wrpkru(0);
+
+		u16 pkey = 1 + (rand() % 15);
+		dprintf1("================\n");
+		dprintf1("test %d starting with pkey: %d\n", i, pkey);
+		tracing_on();
+		ptr = malloc_pkey(PAGE_SIZE, prot, pkey);
+		//dumpit("/proc/self/maps");
+		pkey_tests[i](ptr, pkey);
+		//sleep(999);
+		dprintf1("freeing test memory: %p\n", ptr);
+		free_pkey_malloc(ptr);
+
+		dprintf1("pkru_faults: %d\n", pkru_faults);
+		dprintf1("orig_pkru_faults: %d\n", orig_pkru_faults);
+
+		tracing_off();
+		close_test_fds();
+		//system("dmesg -c");
+		//sleep(2);
+		printf("test %d PASSED (itertation %d)\n", i, iteration_nr);
+		dprintf1("================\n\n");
+	}
+	iteration_nr++;
+}
+
+int main()
+{
+	int nr_iterations = 5;
+	setup_handlers();
+	printf("has pku: %d\n", cpu_has_pku());
+	printf("pkru: %x\n", rdpkru());
+	pkey_assert(cpu_has_pku());
+	pkey_assert(!rdpkru());
+
+	cat_into_file("10", "/proc/sys/vm/nr_hugepages");
+
+	while (nr_iterations-- > 0)
+		run_tests_once();
+
+	printf("done (all tests OK)\n");
+	return 0;
+}
+
_

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 25/25] x86, pkeys: Documentation
  2015-09-28 19:18   ` Dave Hansen
@ 2015-09-28 20:34     ` Andi Kleen
  -1 siblings, 0 replies; 86+ messages in thread
From: Andi Kleen @ 2015-09-28 20:34 UTC (permalink / raw)
  To: Dave Hansen; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen

Dave Hansen <dave@sr71.net> writes:

> From: Dave Hansen <dave.hansen@linux.intel.com>

Do you have a manpage for the new syscall too?

-Andi

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 25/25] x86, pkeys: Documentation
  2015-09-28 20:34     ` Andi Kleen
  (?)
@ 2015-09-28 20:41     ` Dave Hansen
  -1 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-09-28 20:41 UTC (permalink / raw)
  To: Andi Kleen; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen

[-- Attachment #1: Type: text/plain, Size: 140 bytes --]

On 09/28/2015 01:34 PM, Andi Kleen wrote:
> Do you have a manpage for the new syscall too?

Yep, I just added it to the mprotect() manpage.

[-- Attachment #2: mprotect_key.patch --]
[-- Type: text/x-patch, Size: 1285 bytes --]

diff --git a/man2/mprotect.2 b/man2/mprotect.2
index ae305f6..5ba6c58 100644
--- a/man2/mprotect.2
+++ b/man2/mprotect.2
@@ -38,16 +38,19 @@
 .\"
 .TH MPROTECT 2 2015-07-23 "Linux" "Linux Programmer's Manual"
 .SH NAME
-mprotect \- set protection on a region of memory
+mprotect, mprotect_key \- set protection on a region of memory
 .SH SYNOPSIS
 .nf
 .B #include <sys/mman.h>
 .sp
 .BI "int mprotect(void *" addr ", size_t " len ", int " prot );
+.BI "int mprotect_key(void *" addr ", size_t " len ", int " prot , " unsigned long " key);
 .fi
 .SH DESCRIPTION
 .BR mprotect ()
-changes protection for the calling process's memory page(s)
+and
+.BR mprotect_key ()
+change protection for the calling process's memory page(s)
 containing any part of the address range in the
 interval [\fIaddr\fP,\ \fIaddr\fP+\fIlen\fP\-1].
 .I addr
@@ -74,10 +77,18 @@ The memory can be modified.
 .TP
 .B PROT_EXEC
 The memory can be executed.
+.PP
+.I key
+is the protection or storage key to assign to the memory.
+The number of keys supported is dependent on the architecture
+and is always at least one.
+The default key is 0.
 .SH RETURN VALUE
 On success,
 .BR mprotect ()
-returns zero.
+and
+.BR mprotect_key ()
+return zero.
 On error, \-1 is returned, and
 .I errno
 is set appropriately.

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH 21/25] mm: implement new mprotect_key() system call
@ 2015-09-29  6:39     ` Michael Ellerman
  0 siblings, 0 replies; 86+ messages in thread
From: Michael Ellerman @ 2015-09-29  6:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen, linux-api

On Mon, 2015-09-28 at 12:18 -0700, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> mprotect_key() is just like mprotect, except it also takes a
> protection key as an argument.  On systems that do not support
> protection keys, it still works, but requires that key=0.

I'm not sure how userspace is going to use the key=0 feature? ie. userspace
will still have to detect that keys are not supported and use key 0 everywhere.
At that point it could just as well skip the mprotect_key() syscalls entirely
couldn't it?

> I expect it to get used like this, if you want to guarantee that
> any mapping you create can *never* be accessed without the right
> protection keys set up.
> 
> 	pkey_deny_access(11); // random pkey
> 	int real_prot = PROT_READ|PROT_WRITE;
> 	ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
> 	ret = mprotect_key(ptr, PAGE_SIZE, real_prot, 11);
> 
> This way, there is *no* window where the mapping is accessible
> since it was always either PROT_NONE or had a protection key set.
> 
> We settled on 'unsigned long' for the type of the key here.  We
> only need 4 bits on x86 today, but I figured that other
> architectures might need some more space.

If the existing mprotect() syscall had a flags argument you could have just
used that. So is it worth just adding mprotect2() now and using it for this? ie:

int mprotect2(unsigned long start, size_t len, unsigned long prot, unsigned long flags) ..

And then you define bit zero of flags to say you're passing a pkey, and it's in
bits 1-63?

That way if other arches need to do something different you at least have the
flags available?
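
For illustration, the proposed encoding might look like this (a sketch
of the idea above, not code from the series):

	#define MPROTECT2_PKEY_PRESENT	0x1UL
	#define mprotect2_pkey_flags(pkey) \
		(MPROTECT2_PKEY_PRESENT | ((unsigned long)(pkey) << 1))
	#define mprotect2_flags_pkey(flags)	((flags) >> 1)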

cheers



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 21/25] mm: implement new mprotect_key() system call
  2015-09-29  6:39     ` Michael Ellerman
@ 2015-09-29 14:16       ` Dave Hansen
  -1 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-09-29 14:16 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen, linux-api

On 09/28/2015 11:39 PM, Michael Ellerman wrote:
> On Mon, 2015-09-28 at 12:18 -0700, Dave Hansen wrote:
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> mprotect_key() is just like mprotect, except it also takes a
>> protection key as an argument.  On systems that do not support
>> protection keys, it still works, but requires that key=0.
> 
> I'm not sure how userspace is going to use the key=0 feature? ie. userspace
> will still have to detect that keys are not supported and use key 0 everywhere.
> At that point it could just as well skip the mprotect_key() syscalls entirely
> couldn't it?

Yep.

Or, a new architecture could just skip mprotect() itself entirely and
only wire up mprotect_pkey().  I don't see this pkey=0 thing as an
important feature or anything.  I just wanted to call out the behavior.

>> I expect it to get used like this, if you want to guarantee that
>> any mapping you create can *never* be accessed without the right
>> protection keys set up.
>>
>> 	pkey_deny_access(11); // random pkey
>> 	int real_prot = PROT_READ|PROT_WRITE;
>> 	ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
>> 	ret = mprotect_key(ptr, PAGE_SIZE, real_prot, 11);
>>
>> This way, there is *no* window where the mapping is accessible
>> since it was always either PROT_NONE or had a protection key set.
>>
>> We settled on 'unsigned long' for the type of the key here.  We
>> only need 4 bits on x86 today, but I figured that other
>> architectures might need some more space.
> 
> If the existing mprotect() syscall had a flags argument you could have just
> used that. So is it worth just adding mprotect2() now and using it for this? ie:
> 
> int mprotect2(unsigned long start, size_t len, unsigned long prot, unsigned long flags) ..
> 
> And then you define bit zero of flags to say you're passing a pkey, and it's in
> bits 1-63?
> 
> That way if other arches need to do something different you at least have the
> flags available?

But what problem does that solve?

mprotect() itself has plenty of space in prot.  Do any of the other
architectures need to pass in more than just an integer key to implement
storage/protection keys?

I'd much rather have a set of (relatively) arch-specific system calls
implementing protection keys rather than a single one with one
arch-specific argument.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 01/25] x86, fpu: add placeholder for Processor Trace XSAVE state
  2015-09-28 19:18   ` Dave Hansen
@ 2015-10-01 11:01     ` Thomas Gleixner
  -1 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2015-10-01 11:01 UTC (permalink / raw)
  To: Dave Hansen; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen

On Mon, 28 Sep 2015, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> There is an XSAVE state component for Intel Processor Trace.  But,
> we do not use it and do not expect to ever use it.
> 
> We add a placeholder in the code for it so it is not a mystery and
> also so we do not need an explicit enum initialization for Protection
> Keys in a moment.
> 
> Why will we never use it?  According to Andi Kleen:
> 
> 	The XSAVE support assumes that there is a single buffer
> 	for each thread. But perf generally doesn't work this
> 	way, it usually has only a single perf event per CPU per
> 	user, and when tracing multiple threads on that CPU it
> 	inherits perf event buffers between different threads. So
> 	XSAVE per thread cannot handle this inheritance case
> 	directly.
> 
> 	Using multiple XSAVE areas (another one per perf event)
> 	would defeat some of the state caching that the CPUs do.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 02/25] x86, pkeys: Add Kconfig option
  2015-09-28 19:18   ` Dave Hansen
@ 2015-10-01 11:02     ` Thomas Gleixner
  -1 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2015-10-01 11:02 UTC (permalink / raw)
  To: Dave Hansen; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen

On Mon, 28 Sep 2015, Dave Hansen wrote:

> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> I don't have a strong opinion on whether we need a Kconfig prompt
> or not.  Protection Keys has relatively little code associated
> with it, and it is not a heavyweight feature to keep enabled.
> However, I can imagine that folks would still appreciate being
> able to disable it.
> 
> We will hide the prompt for now.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 03/25] x86, pkeys: cpuid bit definition
  2015-09-28 19:18   ` Dave Hansen
@ 2015-10-01 11:02     ` Thomas Gleixner
  -1 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2015-10-01 11:02 UTC (permalink / raw)
  To: Dave Hansen; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen

On Mon, 28 Sep 2015, Dave Hansen wrote:
> This means that in almost all code, you should use:
> 
> 	cpu_has(X86_FEATURE_PKU)
> 
> and *not* the CONFIG option.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 04/25] x86, pku: define new CR4 bit
  2015-09-28 19:18   ` Dave Hansen
@ 2015-10-01 11:03     ` Thomas Gleixner
  -1 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2015-10-01 11:03 UTC (permalink / raw)
  To: Dave Hansen; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen

On Mon, 28 Sep 2015, Dave Hansen wrote:

> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> There is a new bit in CR4 for enabling protection keys.  We
> will actually enable it later in the series.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 05/25] x86, pkey: add PKRU xsave fields and data structure(s)
  2015-09-28 19:18   ` Dave Hansen
@ 2015-10-01 11:50     ` Thomas Gleixner
  -1 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2015-10-01 11:50 UTC (permalink / raw)
  To: Dave Hansen; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen

On Mon, 28 Sep 2015, Dave Hansen wrote:
> +/*
> + * State component 9: 32-bit PKRU register.
> + */
> +struct pkru {
> +	u32 pkru;
> +} __packed;
> +
> +struct pkru_state {
> +	union {
> +		struct pkru		pkru;
> +		u8			pad_to_8_bytes[8];
> +	};

Why do you need two structs?

    struct pkru_state {
           u32 pkru;
           u32 pad;
    };

should be sufficient. So instead of

       xsave.pkru_state.pkru.pkru

you get the more obvious

       xsave.pkru_state.pkru

Hmm?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 06/25] x86, pkeys: PTE bits for storing protection key
  2015-09-28 19:18   ` Dave Hansen
@ 2015-10-01 11:51     ` Thomas Gleixner
  -1 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2015-10-01 11:51 UTC (permalink / raw)
  To: Dave Hansen; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen

On Mon, 28 Sep 2015, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> Previous documentation has referred to these 4 bits as "ignored".
> That means that software could have made use of them.  But, as
> far as I know, the kernel never used them.
> 
> They are still ignored when protection keys is not enabled, so
> they could theoretically still get used for software purposes.
> 
> We also implement "empty" versions so that code that references
> to them can be optimized away by the compiler when the config
> option is not enabled.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 07/25] x86, pkeys: new page fault error code bit: PF_PK
  2015-09-28 19:18   ` Dave Hansen
@ 2015-10-01 11:54     ` Thomas Gleixner
  -1 siblings, 0 replies; 86+ messages in thread
From: Thomas Gleixner @ 2015-10-01 11:54 UTC (permalink / raw)
  To: Dave Hansen; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen

On Mon, 28 Sep 2015, Dave Hansen wrote:
>  
>  /*
> @@ -916,7 +918,10 @@ static int spurious_fault_check(unsigned
>  
>  	if ((error_code & PF_INSTR) && !pte_exec(*pte))
>  		return 0;
> -
> +	/*
> +	 * Note: We do not do lazy flushing on protection key
> +	 * changes, so no spurious fault will ever set PF_PK.
> +	 */

It might be a bit more clear to have:

   	/* Comment .... */
  	if ((error_code & PF_PK))
  		return 1;

  	return 1;

That way the comment is associated with obviously redundant code, but
it's easier to read, especially if we add some new PF_ thingy after
that.

Other than that:

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 05/25] x86, pkey: add PKRU xsave fields and data structure(s)
  2015-10-01 11:50     ` Thomas Gleixner
@ 2015-10-01 17:17       ` Dave Hansen
  -1 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-10-01 17:17 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen

On 10/01/2015 04:50 AM, Thomas Gleixner wrote:
> On Mon, 28 Sep 2015, Dave Hansen wrote:
>> +/*
>> + * State component 9: 32-bit PKRU register.
>> + */
>> +struct pkru {
>> +	u32 pkru;
>> +} __packed;
>> +
>> +struct pkru_state {
>> +	union {
>> +		struct pkru		pkru;
>> +		u8			pad_to_8_bytes[8];
>> +	};
> 
> Why do you need two structs?
> 
>     struct pkru_state {
>            u32 pkru;
>            u32 pad;
>     };
> 
> should be sufficient. So instead of
> 
>        xsave.pkru_state.pkru.pkru
> 
> you get the more obvious
> 
>        xsave.pkru_state.pkru
> 
> Hmm?

I was trying to get across that PKRU itself and the "PKRU state" are
differently-sized.

But, it does just end up looking funky if we _use_ it.  I'll fix it up.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 07/25] x86, pkeys: new page fault error code bit: PF_PK
  2015-10-01 11:54     ` Thomas Gleixner
@ 2015-10-01 17:19       ` Dave Hansen
  -1 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-10-01 17:19 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen

On 10/01/2015 04:54 AM, Thomas Gleixner wrote:
> On Mon, 28 Sep 2015, Dave Hansen wrote:
>> >  
>> >  /*
>> > @@ -916,7 +918,10 @@ static int spurious_fault_check(unsigned
>> >  
>> >  	if ((error_code & PF_INSTR) && !pte_exec(*pte))
>> >  		return 0;
>> > -
>> > +	/*
>> > +	 * Note: We do not do lazy flushing on protection key
>> > +	 * changes, so no spurious fault will ever set PF_PK.
>> > +	 */
> It might be a bit more clear to have:
> 
>    	/* Comment .... */
>   	if ((error_code & PF_PK))
>   		return 1;
> 
>   	return 1;
> 
> That way the comment is associated to obviously redundant code, but
> it's easier to read, especially if we add some new PF_ thingy after
> that.

Agreed, that's a nicer way to do it.  I'll fix it up.
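
Folding the quoted comment into that suggestion, the tail of
spurious_fault_check() would presumably end up reading:

	/*
	 * Note: We do not do lazy flushing on protection key
	 * changes, so no spurious fault will ever set PF_PK.
	 */
	if ((error_code & PF_PK))
		return 1;

	return 1;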

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 15/25] x86, pkeys: check VMAs and PTEs for protection keys
  2015-09-28 19:18   ` Dave Hansen
@ 2015-10-22 20:57     ` Jerome Glisse
  -1 siblings, 0 replies; 86+ messages in thread
From: Jerome Glisse @ 2015-10-22 20:57 UTC (permalink / raw)
  To: Dave Hansen; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen

On Mon, Sep 28, 2015 at 12:18:23PM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> Today, for normal faults and page table walks, we check the VMA
> and/or PTE to ensure that it is compatible with the action.  For
> instance, if we get a write fault on a non-writeable VMA, we
> SIGSEGV.
> 
> We try to do the same thing for protection keys.  Basically, we
> try to make sure that if a user does this:
> 
> 	mprotect(ptr, size, PROT_NONE);
> 	*ptr = foo;
> 
> they see the same effects with protection keys when they do this:
> 
> 	mprotect(ptr, size, PROT_READ|PROT_WRITE);
> 	set_pkey(ptr, size, 4);
> 	wrpkru(0xffffff3f); // access disable pkey 4
> 	*ptr = foo;
> 
> The state to do that checking is in the VMA, but we also
> sometimes have to do it on the page tables only, like when doing
> a get_user_pages_fast() where we have no VMA.
> 
> We add two functions and expose them to generic code:
> 
> 	arch_pte_access_permitted(pte, write)
> 	arch_vma_access_permitted(vma, write)
> 
> These are, of course, backed up in x86 arch code with checks
> against the PTE or VMA's protection key.
> 
> But, there are also cases where we do not want to respect
> protection keys.  When we ptrace(), for instance, we do not want
> to apply the tracer's PKRU permissions to the PTEs from the
> process being traced.


Well, I am a bit puzzled here because this will not provide consistent
protection as far as GUP (get_user_pages) is concerned, assuming I
understand the pkru thing properly.  Those are registers local to the
CPU, and they are writable by a userspace thread, so a thread can
temporarily revoke access to a range while executing untrusted
subfunctions.
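
For illustration, user space can flip those access rights itself with
the WRPKRU instruction.  A minimal sketch (the rdpkru()/wrpkru()
helpers and the call_untrusted() wrapper are illustrative, not from
the series):

	static inline unsigned int rdpkru(void)
	{
		unsigned int eax, edx, ecx = 0;

		/* RDPKRU: returns PKRU in EAX, requires ECX=0 */
		asm volatile(".byte 0x0f,0x01,0xee\n\t"
			     : "=a" (eax), "=d" (edx) : "c" (ecx));
		return eax;
	}

	static inline void wrpkru(unsigned int pkru)
	{
		/* WRPKRU: writes EAX to PKRU, requires ECX=EDX=0 */
		asm volatile(".byte 0x0f,0x01,0xef\n\t"
			     : : "a" (pkru), "c" (0), "d" (0));
	}

	/* access-disable bit for pkey 4: 2 bits per key, AD is bit 0 */
	#define PKRU_AD_KEY4	(1u << (4 * 2))

	static void call_untrusted(void (*fn)(void))
	{
		unsigned int old = rdpkru();

		wrpkru(old | PKRU_AD_KEY4);	/* revoke pkey-4 access */
		fn();				/* faults if it touches pkey-4 data */
		wrpkru(old);			/* restore */
	}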

I have not read all the patches, but here I assume that for GUP you do
not first call arch_vma_access_permitted().  So the issue I see is that
GUP for a process might happen inside another process, and that process
might have different pkru protection keys, effectively randomly allowing
or forbidding a device driver to perform a GUP from, say, some workqueue
that just happens to be scheduled on a different processor/thread than
the one it is doing the GUP for.

The second and more fundamental thing I have an issue with is that these
pkru keys are centric to the CPU's point of view, i.e. this is a CPU
feature.  So I do not believe that a device driver should be forbidden
to do GUP based on pkru keys.

Tying this to the pkru reg value of whatever processor happens to be
running some device driver kernel function that tries to do a GUP seems
broken to me.

Sadly, setting properties like pkru keys per device is not something
that is easy to do.  I would do it on a per-device-file basis and allow
a userspace program to change them against the device file; a device
driver doing GUP would then use that to check against the pte key and
allow or forbid the GUP.

Also, doing it per device file makes it harder for programs to leverage
this feature, as they now have to think about every device file they
have open.  Maybe we need to keep a list of the devices used by a
process in the task struct and allow setting pkeys globally for all
devices, while allowing this common default to be overridden on a
per-device basis.

So at first I would just allow GUP to always work, and then come up with
a syscall to allow setting pkeys on a device file.  This is obviously a
lot more work, as you need to go over all the device drivers using GUP.

These are my thoughts so far.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 15/25] x86, pkeys: check VMAs and PTEs for protection keys
  2015-10-22 20:57     ` Jerome Glisse
@ 2015-10-22 21:23       ` Dave Hansen
  -1 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-10-22 21:23 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen

On 10/22/2015 01:57 PM, Jerome Glisse wrote:
> I have not read all the patches, but here I assume that for GUP you do
> not first call arch_vma_access_permitted().  So the issue I see is that
> GUP for a process might happen inside another process, and that process
> might have different pkru protection keys, effectively randomly allowing
> or forbidding a device driver to perform a GUP from, say, some workqueue
> that just happens to be scheduled on a different processor/thread than
> the one it is doing the GUP for.

There are some places where there is no real context from which we can
determine access rights.  ptrace is a good example.  We don't enforce
PKEYs when walking _another_ process's page tables.

Can you give an example of where a process might be doing a gup and it
is completely separate from the CPU context that it's being executed under?

> The second and more fundamental thing I have an issue with is that these
> pkru keys are centric to the CPU's point of view, i.e. this is a CPU
> feature.  So I do not believe that a device driver should be forbidden
> to do GUP based on pkru keys.

I don't think of it as something necessarily central to the CPU, but
something central to things that walk page tables.  We mark page tables
with PKEYs and things that walk them will have certain rights.

> Tying this to the pkru reg value of whatever processor happens to be
> running some device driver kernel function that tries to do a GUP seems
> broken to me.

That's one way to look at it.  Another way is that PKRU is specifying
some real _intent_ about whether we want access to be allowed to some
memory.

> So at first I would just allow GUP to always work, and then come up with
> a syscall to allow setting pkeys on a device file.  This is obviously a
> lot more work, as you need to go over all the device drivers using GUP.

I wouldn't be opposed to adding some context to the thread (like
pagefault_disable()) that indicates whether we should enforce protection
keys.  If we are in some asynchronous context, disassociated from the
running CPU's protection keys, we could set a flag.
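
A minimal sketch of that idea, purely hypothetical (neither the flag
nor the helpers below exist in this series):

	/* assumed-free bit in task_struct.flags, named for illustration */
	#define PF_NO_PKEY_ENFORCE	0x80000000

	static inline void pkey_enforce_disable(void)
	{
		current->flags |= PF_NO_PKEY_ENFORCE;
	}

	static inline void pkey_enforce_enable(void)
	{
		current->flags &= ~PF_NO_PKEY_ENFORCE;
	}

	/* the pkey permission checks could then bail out early: */
	static inline bool pkey_checks_enforced(void)
	{
		return !(current->flags & PF_NO_PKEY_ENFORCE);
	}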

I'd really appreciate it if you could point to some concrete examples
here that could actually cause a problem, like workqueues doing gups.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 15/25] x86, pkeys: check VMAs and PTEs for protection keys
  2015-10-22 21:23       ` Dave Hansen
@ 2015-10-22 22:25         ` Jerome Glisse
  -1 siblings, 0 replies; 86+ messages in thread
From: Jerome Glisse @ 2015-10-22 22:25 UTC (permalink / raw)
  To: Dave Hansen; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen

On Thu, Oct 22, 2015 at 02:23:08PM -0700, Dave Hansen wrote:
> On 10/22/2015 01:57 PM, Jerome Glisse wrote:
> > I have not read all the patches, but here I assume that for GUP you do
> > not first call arch_vma_access_permitted().  So the issue I see is that
> > GUP for a process might happen inside another process, and that process
> > might have different pkru protection keys, effectively randomly allowing
> > or forbidding a device driver to perform a GUP from, say, some workqueue
> > that just happens to be scheduled on a different processor/thread than
> > the one it is doing the GUP for.
> 
> There are some places where there is no real context from which we can
> determine access rights.  ptrace is a good example.  We don't enforce
> PKEYs when walking _another_ process's page tables.
> 
> Can you give an example of where a process might be doing a gup and it
> is completely separate from the CPU context that it's being executed under?

In drivers/iommu/amd_iommu_v2.c, though this is on the AMD platform.  I
also believe that in infiniband one can have a GUP call from a workqueue
that can run at any time.  In GPU drivers we also use GUP, though at
this point we do not allow another process to access a buffer that is
populated by GUP from another process.

I am also mainly talking here about what future GPUs will do, where you
will have the CPU service page faults from the GPU inside a workqueue
that can run at any point in time.

> 
> > The second and more fundamental thing I have an issue with is that
> > these pkru keys are centric to the CPU's point of view, i.e. this is
> > a CPU feature.  So I do not believe that a device driver should be
> > forbidden to do GUP based on pkru keys.
> 
> I don't think of it as something necessarily central to the CPU, but
> something central to things that walk page tables.  We mark page tables
> with PKEYs and things that walk them will have certain rights.

My point is that we are seeing devices that want to walk the page tables
and that do it from a workqueue inside the kernel, which can run against
another process than the one they are doing the walk for.

I am sure there is already an upstream device driver that does so; I
have not checked all of them to confirm, though.


> > Tying this to the pkru reg value of whatever processor happens to be
> > running some device driver kernel function that tries to do a GUP
> > seems broken to me.
> 
> That's one way to look at it.  Another way is that PKRU is specifying
> some real _intent_ about whether we want access to be allowed to some
> memory.

I think I expressed myself poorly here.  Yes, PKRU is about specifying
intent, but specifying it for a CPU thread, not for a device thread.
GPUs, for instance, have threads that run on behalf of a given process,
and I would rather see some kind of coherent way to specify that for
each device, just as you allow it to be specified on a per-CPU-thread
basis.


> > So at first I would just allow GUP to always work, and then come up
> > with a syscall to allow setting pkeys on a device file.  This is
> > obviously a lot more work, as you need to go over all the device
> > drivers using GUP.
> 
> I wouldn't be opposed to adding some context to the thread (like
> pagefault_disable()) that indicates whether we should enforce protection
> keys.  If we are in some asynchronous context, disassociated from the
> running CPU's protection keys, we could set a flag.

I was simply thinking of having a global set of pkeys against the
process mm struct which would be the default global setting for all
device GUP access.  This global set could be overridden by userspace on
a per-device basis, allowing some devices to have more access than
others.


> I'd really appreciate if you could point to some concrete examples here
> which could actually cause a problem, like workqueues doing gups.

Well, I could grep for all the current users of GUP, but I can tell you
that this is going to be the model for GPU threads, i.e. a kernel
workqueue is going to handle page faults on behalf of the GPU and will
perform the equivalent of GUP.  This also applies to the infiniband ODP
thing, which is upstream.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 15/25] x86, pkeys: check VMAs and PTEs for protection keys
  2015-10-22 22:25         ` Jerome Glisse
@ 2015-10-23  0:49         ` Dave Hansen
  -1 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2015-10-23  0:49 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: borntraeger, x86, linux-kernel, linux-mm, dave.hansen

[-- Attachment #1: Type: text/plain, Size: 2941 bytes --]

On 10/22/2015 03:25 PM, Jerome Glisse wrote:
> On Thu, Oct 22, 2015 at 02:23:08PM -0700, Dave Hansen wrote:
...
>> Can you give an example of where a process might be doing a gup and it
>> is completely separate from the CPU context that it's being executed under?
> 
> In drivers/iommu/amd_iommu_v2.c, though this is on the AMD platform.  I
> also believe that in infiniband one can have a GUP call from a workqueue
> that can run at any time.  In GPU drivers we also use GUP, though at
> this point we do not allow another process to access a buffer that is
> populated by GUP from another process.

From quick grepping, there are only a couple of callers that do
get_user_pages() on something that isn't current->mm.

We can fairly easily introduce something new, like

	get_foreign_user_pages()

That sets a flag to tell us to ignore the current PKRU state.

I've attached a patch that creates a variant of get_user_pages() for
when you're going after another process's mm.  This even makes a few of
the gup call sites look nicer because they're not passing 'current,
current->mm'.

>>> So at first I would just allow GUP to always work, and then come up with
>>> a syscall to allow setting pkeys on a device file.  This is obviously a
>>> lot more work, as you need to go over all the device drivers using GUP.
>>
>> I wouldn't be opposed to adding some context to the thread (like
>> pagefault_disable()) that indicates whether we should enforce protection
>> keys.  If we are in some asynchronous context, disassociated from the
>> running CPU's protection keys, we could set a flag.
> 
> I was simply thinking of having a global set of pkeys against the
> process mm struct which would be the default global setting for all
> device GUP access.  This global set could be overridden by userspace on
> a per-device basis, allowing some devices to have more access than
> others.

For now, I think leaving it permissive by default is probably OK.  A
device's access to memory is permissive after a gup anyway.

As you note, doing this is going to require another whole set of user
interfaces, so I'd rather revisit it later once we have a more concrete
need for it.  When we do, I see a few options (a rough sketch of option
2 follows the list):

1. Store a common PKRU value somewhere and activate it when servicing work
   outside of the context of the actual process.  Set this PKRU value
   with input from userspace and new user APIs.
2. When work is queued, copy the PKRU value and use it while servicing
   the work.
3. Do all out-of-context work with PKRU=0, or by disabling the PKRU
   checks conditionally.
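
A rough sketch of option 2, purely illustrative (it assumes a
write_pkru() counterpart to this series' read_pkru(); v6 dropped
write_pkru() as unused, so both names here are assumptions):

	struct pkru_work {
		struct work_struct work;
		u32 pkru;		/* submitter's PKRU snapshot */
	};

	static void pkru_work_fn(struct work_struct *w)
	{
		struct pkru_work *pw = container_of(w, struct pkru_work, work);
		u32 saved = read_pkru();

		write_pkru(pw->pkru);	/* act with the submitter's rights */
		/* ... get_user_pages() on behalf of the submitter ... */
		write_pkru(saved);
	}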

>> I'd really appreciate it if you could point to some concrete examples
>> here that could actually cause a problem, like workqueues doing gups.
> 
> Well, I could grep for all the current users of GUP, but I can tell you
> that this is going to be the model for GPU threads, i.e. a kernel
> workqueue is going to handle page faults on behalf of the GPU and will
> perform the equivalent of GUP.  This also applies to the infiniband ODP
> thing, which is upstream.



[-- Attachment #2: get_current_user_pages.patch --]
[-- Type: text/x-patch, Size: 17101 bytes --]



---

 b/arch/x86/mm/mpx.c                           |    4 +-
 b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c     |    4 +-
 b/drivers/gpu/drm/i915/i915_gem_userptr.c     |    2 -
 b/drivers/gpu/drm/radeon/radeon_ttm.c         |    4 +-
 b/drivers/gpu/drm/via/via_dmablit.c           |    3 --
 b/drivers/infiniband/core/umem.c              |    2 -
 b/drivers/infiniband/core/umem_odp.c          |    8 ++---
 b/drivers/infiniband/hw/mthca/mthca_memfree.c |    3 --
 b/drivers/infiniband/hw/qib/qib_user_pages.c  |    3 --
 b/drivers/infiniband/hw/usnic/usnic_uiom.c    |    2 -
 b/drivers/media/pci/ivtv/ivtv-yuv.c           |    8 ++---
 b/drivers/media/v4l2-core/videobuf-dma-sg.c   |    3 --
 b/drivers/virt/fsl_hypervisor.c               |    5 +--
 b/fs/exec.c                                   |    8 ++++-
 b/include/linux/mm.h                          |   14 +++++----
 b/mm/frame_vector.c                           |    2 -
 b/mm/gup.c                                    |   39 ++++++++++++++++++++------
 b/mm/mempolicy.c                              |    6 ++--
 b/security/tomoyo/domain.c                    |    9 +++++-
 19 files changed, 79 insertions(+), 50 deletions(-)

diff -puN mm/gup.c~get_current_user_pages mm/gup.c
--- a/mm/gup.c~get_current_user_pages	2015-10-22 16:03:24.957026355 -0700
+++ b/mm/gup.c	2015-10-22 16:46:58.181109179 -0700
@@ -752,11 +752,12 @@ EXPORT_SYMBOL(get_user_pages_locked);
  * according to the parameters "pages", "write", "force"
  * respectively.
  */
-__always_inline long __get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
-					       unsigned long start, unsigned long nr_pages,
+__always_inline long __get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 					       int write, int force, struct page **pages,
 					       unsigned int gup_flags)
 {
+	struct task_struct *tsk = current;
+	struct mm_struct *mm = tsk->mm;
 	long ret;
 	int locked = 1;
 	down_read(&mm->mmap_sem);
@@ -795,7 +796,7 @@ long get_user_pages_unlocked(struct task
 EXPORT_SYMBOL(get_user_pages_unlocked);
 
 /*
- * get_user_pages() - pin user pages in memory
+ * get_foreign_user_pages() - pin user pages in memory
  * @tsk:	the task_struct to use for page fault accounting, or
  *		NULL if faults are not to be recorded.
  * @mm:		mm_struct of target mm
@@ -849,14 +850,34 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
  * should use get_user_pages because it cannot pass
  * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault.
  */
-long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		unsigned long start, unsigned long nr_pages, int write,
-		int force, struct page **pages, struct vm_area_struct **vmas)
+long get_foreign_user_pages(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, unsigned long nr_pages,
+		int write, int force, struct page **pages,
+		struct vm_area_struct **vmas)
 {
-	return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
-				       pages, vmas, NULL, false, FOLL_TOUCH);
+	long ret;
+
+	/* disable protection key checks */
+	ret = __get_user_pages_locked(tsk, mm,
+				      start, nr_pages, write, force,
+				      pages, vmas, NULL, false, FOLL_TOUCH);
+	/* enable protection key checks */
+	return ret;
+}
+EXPORT_SYMBOL(get_foreign_user_pages);
+
+/*
+ * This is exactly the same as get_foreign_user_pages(), just
+ * with a less-flexible calling convention where we assume that
+ * the task and mm being operated on are the current task's.
+ */
+long get_current_user_pages(unsigned long start, unsigned long nr_pages,
+		int write, int force, struct page **pages,
+		struct vm_area_struct **vmas)
+{
+	return get_foreign_user_pages(current, current->mm,
+				      start, nr_pages, write, force,
+				      pages, vmas);
 }
-EXPORT_SYMBOL(get_user_pages);
+EXPORT_SYMBOL(get_current_user_pages);
 
 /**
  * populate_vma_page_range() -  populate a range of pages in the vma.
diff -puN arch/x86/mm/mpx.c~get_current_user_pages arch/x86/mm/mpx.c
--- a/arch/x86/mm/mpx.c~get_current_user_pages	2015-10-22 16:04:18.711461016 -0700
+++ b/arch/x86/mm/mpx.c	2015-10-22 16:04:33.661138119 -0700
@@ -546,8 +546,8 @@ static int mpx_resolve_fault(long __user
 	int nr_pages = 1;
 	int force = 0;
 
-	gup_ret = get_user_pages(current, current->mm, (unsigned long)addr,
-				 nr_pages, write, force, NULL, NULL);
+	gup_ret = get_current_user_pages((unsigned long)addr, nr_pages, write,
+			force, NULL, NULL);
 	/*
 	 * get_user_pages() returns number of pages gotten.
 	 * 0 means we failed to fault in and get anything,
diff -puN drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c~get_current_user_pages drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c~get_current_user_pages	2015-10-22 16:04:35.800235003 -0700
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c	2015-10-22 16:04:52.116974022 -0700
@@ -518,8 +518,8 @@ static int amdgpu_ttm_tt_pin_userptr(str
 		uint64_t userptr = gtt->userptr + pinned * PAGE_SIZE;
 		struct page **pages = ttm->pages + pinned;
 
-		r = get_user_pages(current, current->mm, userptr, num_pages,
-				   write, 0, pages, NULL);
+		r = get_current_user_pages(userptr, num_pages, write, 0, pages,
+				NULL);
 		if (r < 0)
 			goto release_pages;
 
diff -puN drivers/gpu/drm/radeon/radeon_ttm.c~get_current_user_pages drivers/gpu/drm/radeon/radeon_ttm.c
--- a/drivers/gpu/drm/radeon/radeon_ttm.c~get_current_user_pages	2015-10-22 16:04:53.241024932 -0700
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c	2015-10-22 16:05:04.892552652 -0700
@@ -554,8 +554,8 @@ static int radeon_ttm_tt_pin_userptr(str
 		uint64_t userptr = gtt->userptr + pinned * PAGE_SIZE;
 		struct page **pages = ttm->pages + pinned;
 
-		r = get_user_pages(current, current->mm, userptr, num_pages,
-				   write, 0, pages, NULL);
+		r = get_current_user_pages(userptr, num_pages, write, 0, pages,
+				NULL);
 		if (r < 0)
 			goto release_pages;
 
diff -puN drivers/gpu/drm/via/via_dmablit.c~get_current_user_pages drivers/gpu/drm/via/via_dmablit.c
--- a/drivers/gpu/drm/via/via_dmablit.c~get_current_user_pages	2015-10-22 16:05:05.882597493 -0700
+++ b/drivers/gpu/drm/via/via_dmablit.c	2015-10-22 16:05:15.053012839 -0700
@@ -239,8 +239,7 @@ via_lock_all_dma_pages(drm_via_sg_info_t
 	if (NULL == vsg->pages)
 		return -ENOMEM;
 	down_read(&current->mm->mmap_sem);
-	ret = get_user_pages(current, current->mm,
-			     (unsigned long)xfer->mem_addr,
+	ret = get_current_user_pages((unsigned long)xfer->mem_addr,
 			     vsg->num_pages,
 			     (vsg->direction == DMA_FROM_DEVICE),
 			     0, vsg->pages, NULL);
diff -puN drivers/infiniband/core/umem.c~get_current_user_pages drivers/infiniband/core/umem.c
--- a/drivers/infiniband/core/umem.c~get_current_user_pages	2015-10-22 16:05:15.898051112 -0700
+++ b/drivers/infiniband/core/umem.c	2015-10-22 16:05:24.188426599 -0700
@@ -188,7 +188,7 @@ struct ib_umem *ib_umem_get(struct ib_uc
 	sg_list_start = umem->sg_head.sgl;
 
 	while (npages) {
-		ret = get_user_pages(current, current->mm, cur_base,
+		ret = get_current_user_pages(cur_base,
 				     min_t(unsigned long, npages,
 					   PAGE_SIZE / sizeof (struct page *)),
 				     1, !umem->writable, page_list, vma_list);
diff -puN drivers/infiniband/hw/mthca/mthca_memfree.c~get_current_user_pages drivers/infiniband/hw/mthca/mthca_memfree.c
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c~get_current_user_pages	2015-10-22 16:05:25.008463740 -0700
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c	2015-10-22 16:05:36.492983894 -0700
@@ -472,8 +472,7 @@ int mthca_map_user_db(struct mthca_dev *
 		goto out;
 	}
 
-	ret = get_user_pages(current, current->mm, uaddr & PAGE_MASK, 1, 1, 0,
-			     pages, NULL);
+	ret = get_current_user_pages(uaddr & PAGE_MASK, 1, 1, 0, pages, NULL);
 	if (ret < 0)
 		goto out;
 
diff -puN drivers/infiniband/hw/qib/qib_user_pages.c~get_current_user_pages drivers/infiniband/hw/qib/qib_user_pages.c
--- a/drivers/infiniband/hw/qib/qib_user_pages.c~get_current_user_pages	2015-10-22 16:05:37.424026063 -0700
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c	2015-10-22 16:05:47.924501648 -0700
@@ -66,8 +66,7 @@ static int __qib_get_user_pages(unsigned
 	}
 
 	for (got = 0; got < num_pages; got += ret) {
-		ret = get_user_pages(current, current->mm,
-				     start_page + got * PAGE_SIZE,
+		ret = get_current_user_pages(start_page + got * PAGE_SIZE,
 				     num_pages - got, 1, 1,
 				     p + got, NULL);
 		if (ret < 0)
diff -puN drivers/infiniband/hw/usnic/usnic_uiom.c~get_current_user_pages drivers/infiniband/hw/usnic/usnic_uiom.c
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c~get_current_user_pages	2015-10-22 16:05:49.341565829 -0700
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c	2015-10-22 16:06:04.868269060 -0700
@@ -144,7 +144,7 @@ static int usnic_uiom_get_pages(unsigned
 	ret = 0;
 
 	while (npages) {
-		ret = get_user_pages(current, current->mm, cur_base,
+		ret = get_current_user_pages(cur_base,
 					min_t(unsigned long, npages,
 					PAGE_SIZE / sizeof(struct page *)),
 					1, !writable, page_list, NULL);
diff -puN drivers/media/pci/ivtv/ivtv-yuv.c~get_current_user_pages drivers/media/pci/ivtv/ivtv-yuv.c
--- a/drivers/media/pci/ivtv/ivtv-yuv.c~get_current_user_pages	2015-10-22 16:06:05.723307787 -0700
+++ b/drivers/media/pci/ivtv/ivtv-yuv.c	2015-10-22 16:06:43.060998869 -0700
@@ -76,12 +76,12 @@ static int ivtv_yuv_prep_user_dma(struct
 
 	/* Get user pages for DMA Xfer */
 	down_read(&current->mm->mmap_sem);
-	y_pages = get_user_pages(current, current->mm, y_dma.uaddr, y_dma.page_count, 0, 1, &dma->map[0], NULL);
+	y_pages = get_current_user_pages(y_dma.uaddr, y_dma.page_count, 0, 1, &dma->map[0], NULL);
 	uv_pages = 0; /* silence gcc. value is set and consumed only if: */
 	if (y_pages == y_dma.page_count) {
-		uv_pages = get_user_pages(current, current->mm,
-					  uv_dma.uaddr, uv_dma.page_count, 0, 1,
-					  &dma->map[y_pages], NULL);
+		uv_pages = get_current_user_pages(uv_dma.uaddr,
+				uv_dma.page_count, 0, 1,
+				&dma->map[y_pages], NULL);
 	}
 	up_read(&current->mm->mmap_sem);
 
diff -puN drivers/media/v4l2-core/videobuf-dma-sg.c~get_current_user_pages drivers/media/v4l2-core/videobuf-dma-sg.c
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c~get_current_user_pages	2015-10-22 16:06:43.743029759 -0700
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c	2015-10-22 16:06:53.716481470 -0700
@@ -181,8 +181,7 @@ static int videobuf_dma_init_user_locked
 	dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
 		data, size, dma->nr_pages);
 
-	err = get_user_pages(current, current->mm,
-			     data & PAGE_MASK, dma->nr_pages,
+	err = get_current_user_pages(data & PAGE_MASK, dma->nr_pages,
 			     rw == READ, 1, /* force */
 			     dma->pages, NULL);
 
diff -puN drivers/virt/fsl_hypervisor.c~get_current_user_pages drivers/virt/fsl_hypervisor.c
--- a/drivers/virt/fsl_hypervisor.c~get_current_user_pages	2015-10-22 16:06:55.280552310 -0700
+++ b/drivers/virt/fsl_hypervisor.c	2015-10-22 16:07:09.261185511 -0700
@@ -244,9 +244,8 @@ static long ioctl_memcpy(struct fsl_hv_i
 
 	/* Get the physical addresses of the source buffer */
 	down_read(&current->mm->mmap_sem);
-	num_pinned = get_user_pages(current, current->mm,
-		param.local_vaddr - lb_offset, num_pages,
-		(param.source == -1) ? READ : WRITE,
+	num_pinned = get_current_user_pages(param.local_vaddr - lb_offset,
+		num_pages, (param.source == -1) ? READ : WRITE,
 		0, pages, NULL);
 	up_read(&current->mm->mmap_sem);
 
diff -puN fs/exec.c~get_current_user_pages fs/exec.c
--- a/fs/exec.c~get_current_user_pages	2015-10-22 16:07:10.134225053 -0700
+++ b/fs/exec.c	2015-10-22 16:35:11.763231100 -0700
@@ -198,8 +198,12 @@ static struct page *get_arg_page(struct
 			return NULL;
 	}
 #endif
-	ret = get_user_pages(current, bprm->mm, pos,
-			1, write, 1, &page, NULL);
+	/*
+	 * We are doing an exec().  'current' is the process
+	 * doing the exec and bprm->mm is the new process's mm.
+	 */
+	ret = get_foreign_user_pages(current, bprm->mm, pos, 1, write,
+			1, &page, NULL);
 	if (ret <= 0)
 		return NULL;
 
diff -puN mm/mempolicy.c~get_current_user_pages mm/mempolicy.c
--- a/mm/mempolicy.c~get_current_user_pages	2015-10-22 16:07:19.296640031 -0700
+++ b/mm/mempolicy.c	2015-10-22 16:08:04.949707713 -0700
@@ -813,12 +813,12 @@ static void get_policy_nodemask(struct m
 	}
 }
 
-static int lookup_node(struct mm_struct *mm, unsigned long addr)
+static int lookup_node(unsigned long addr)
 {
 	struct page *p;
 	int err;
 
-	err = get_user_pages(current, mm, addr & PAGE_MASK, 1, 0, 0, &p, NULL);
+	err = get_current_user_pages(addr & PAGE_MASK, 1, 0, 0, &p, NULL);
 	if (err >= 0) {
 		err = page_to_nid(p);
 		put_page(p);
@@ -873,7 +873,7 @@ static long do_get_mempolicy(int *policy
 
 	if (flags & MPOL_F_NODE) {
 		if (flags & MPOL_F_ADDR) {
-			err = lookup_node(mm, addr);
+			err = lookup_node(addr);
 			if (err < 0)
 				goto out;
 			*policy = err;
diff -puN security/tomoyo/domain.c~get_current_user_pages security/tomoyo/domain.c
--- a/security/tomoyo/domain.c~get_current_user_pages	2015-10-22 16:08:06.037756992 -0700
+++ b/security/tomoyo/domain.c	2015-10-22 16:33:33.154780307 -0700
@@ -874,7 +874,14 @@ bool tomoyo_dump_page(struct linux_binpr
 	}
 	/* Same with get_arg_page(bprm, pos, 0) in fs/exec.c */
 #ifdef CONFIG_MMU
-	if (get_user_pages(current, bprm->mm, pos, 1, 0, 1, &page, NULL) <= 0)
+	/*
+	 * This is called at execve() time in order to dig around
+	 * in the argv/environment of the new process
+	 * (represented by bprm).  'current' is the process doing
+	 * the execve().
+	 */
+	if (get_foreign_user_pages(current, bprm->mm, pos, 1, 
+				0, 1, &page, NULL) <= 0)
 		return false;
 #else
 	page = bprm->page[pos / PAGE_SIZE];
diff -puN include/linux/mm.h~get_current_user_pages include/linux/mm.h
--- a/include/linux/mm.h~get_current_user_pages	2015-10-22 16:35:32.799180621 -0700
+++ b/include/linux/mm.h	2015-10-22 16:43:06.235641368 -0700
@@ -1198,12 +1198,14 @@ long __get_user_pages(struct task_struct
 		      unsigned long start, unsigned long nr_pages,
 		      unsigned int foll_flags, struct page **pages,
 		      struct vm_area_struct **vmas, int *nonblocking);
-long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		    unsigned long start, unsigned long nr_pages,
-		    int write, int force, struct page **pages,
-		    struct vm_area_struct **vmas);
-long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
-		    unsigned long start, unsigned long nr_pages,
+long get_foreign_user_pages(struct task_struct *tsk, struct mm_struct *mm,
+			    unsigned long start, unsigned long nr_pages,
+			    int write, int force, struct page **pages,
+			    struct vm_area_struct **vmas);
+long get_current_user_pages(unsigned long start, unsigned long nr_pages,
+			    int write, int force, struct page **pages,
+			    struct vm_area_struct **vmas);
+long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
 		    int write, int force, struct page **pages,
 		    int *locked);
 long __get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
diff -puN mm/frame_vector.c~get_current_user_pages mm/frame_vector.c
--- a/mm/frame_vector.c~get_current_user_pages	2015-10-22 16:37:35.449716063 -0700
+++ b/mm/frame_vector.c	2015-10-22 16:38:40.858667050 -0700
@@ -58,7 +58,7 @@ int get_vaddr_frames(unsigned long start
 	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
 		vec->got_ref = true;
 		vec->is_pfns = false;
-		ret = get_user_pages_locked(current, mm, start, nr_frames,
+		ret = get_user_pages_locked(start, nr_frames,
 			write, force, (struct page **)(vec->ptrs), &locked);
 		goto out;
 	}
diff -puN drivers/infiniband/core/umem_odp.c~get_current_user_pages drivers/infiniband/core/umem_odp.c
--- a/drivers/infiniband/core/umem_odp.c~get_current_user_pages	2015-10-22 16:43:10.019812135 -0700
+++ b/drivers/infiniband/core/umem_odp.c	2015-10-22 16:45:02.802901881 -0700
@@ -572,10 +572,10 @@ int ib_umem_odp_map_dma_pages(struct ib_
 		 * complex (and doesn't gain us much performance in most use
 		 * cases).
 		 */
-		npages = get_user_pages(owning_process, owning_mm, user_virt,
-					gup_num_pages,
-					access_mask & ODP_WRITE_ALLOWED_BIT, 0,
-					local_page_list, NULL);
+		npages = get_foreign_user_pages(owning_process, owning_mm,
+				user_virt, gup_num_pages,
+				access_mask & ODP_WRITE_ALLOWED_BIT,
+				0, local_page_list, NULL);
 		up_read(&owning_mm->mmap_sem);
 
 		if (npages < 0)
diff -puN drivers/gpu/drm/i915/i915_gem_userptr.c~get_current_user_pages drivers/gpu/drm/i915/i915_gem_userptr.c
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c~get_current_user_pages	2015-10-22 16:45:09.589208151 -0700
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c	2015-10-22 16:45:20.979722216 -0700
@@ -587,7 +587,7 @@ __i915_gem_userptr_get_pages_worker(stru
 
 		down_read(&mm->mmap_sem);
 		while (pinned < num_pages) {
-			ret = get_user_pages(work->task, mm,
+			ret = get_foreign_user_pages(work->task, mm,
 					     obj->userptr.ptr + pinned * PAGE_SIZE,
 					     num_pages - pinned,
 					     !obj->userptr.read_only, 0,
_

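For reviewers trying to keep the two variants straight, here is a
minimal sketch (not part of the patch) of how the split is intended
to be used, going only by the prototypes in the include/linux/mm.h
hunk above.  The pin_*() helpers below are made up for illustration:

	#include <linux/mm.h>
	#include <linux/sched.h>

	/*
	 * Hypothetical helper pinning one page in the *current*
	 * process's address space: no tsk/mm arguments, current
	 * and current->mm are implied by the interface.
	 */
	static long pin_my_page(unsigned long addr, struct page **pages)
	{
		long ret;

		down_read(&current->mm->mmap_sem);
		ret = get_current_user_pages(addr, 1, 1 /* write */,
					     0 /* force */, pages, NULL);
		up_read(&current->mm->mmap_sem);
		return ret;
	}

	/*
	 * Hypothetical helper pinning one page in *another*
	 * process's address space, like the tomoyo and i915
	 * callers above: the explicit tsk/mm arguments remain.
	 */
	static long pin_foreign_page(struct task_struct *tsk,
				     struct mm_struct *mm,
				     unsigned long addr,
				     struct page **pages)
	{
		long ret;

		down_read(&mm->mmap_sem);
		ret = get_foreign_user_pages(tsk, mm, addr, 1,
					     1 /* write */, 0 /* force */,
					     pages, NULL);
		up_read(&mm->mmap_sem);
		return ret;
	}

Both sketches return the number of pages pinned (1 here) or a
negative errno, and the caller is responsible for put_page() on each
pinned page when done with it.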

Thread overview: 86+ messages
2015-09-28 19:18 [PATCH 00/25] x86: Memory Protection Keys Dave Hansen
2015-09-28 19:18 ` [PATCH 03/25] x86, pkeys: cpuid bit definition Dave Hansen
2015-10-01 11:02   ` Thomas Gleixner
2015-09-28 19:18 ` [PATCH 02/25] x86, pkeys: Add Kconfig option Dave Hansen
2015-10-01 11:02   ` Thomas Gleixner
2015-09-28 19:18 ` [PATCH 01/25] x86, fpu: add placeholder for Processor Trace XSAVE state Dave Hansen
2015-10-01 11:01   ` Thomas Gleixner
2015-09-28 19:18 ` [PATCH 06/25] x86, pkeys: PTE bits for storing protection key Dave Hansen
2015-10-01 11:51   ` Thomas Gleixner
2015-09-28 19:18 ` [PATCH 04/25] x86, pku: define new CR4 bit Dave Hansen
2015-10-01 11:03   ` Thomas Gleixner
2015-09-28 19:18 ` [PATCH 05/25] x86, pkey: add PKRU xsave fields and data structure(s) Dave Hansen
2015-10-01 11:50   ` Thomas Gleixner
2015-10-01 17:17     ` Dave Hansen
2015-09-28 19:18 ` [PATCH 08/25] x86, pkeys: store protection in high VMA flags Dave Hansen
2015-09-28 19:18 ` [PATCH 07/25] x86, pkeys: new page fault error code bit: PF_PK Dave Hansen
2015-10-01 11:54   ` Thomas Gleixner
2015-10-01 17:19     ` Dave Hansen
2015-09-28 19:18 ` [PATCH 09/25] x86, pkeys: arch-specific protection bits Dave Hansen
2015-09-28 19:18 ` [PATCH 10/25] x86, pkeys: pass VMA down in to fault signal generation code Dave Hansen
2015-09-28 19:18 ` [PATCH 13/25] mm: factor out VMA fault permission checking Dave Hansen
2015-09-28 19:18 ` [PATCH 11/25] x86, pkeys: notify userspace about protection key faults Dave Hansen
2015-09-28 19:18 ` [PATCH 12/25] x86, pkeys: add functions to fetch PKRU Dave Hansen
2015-09-28 19:18 ` [PATCH 14/25] mm: simplify get_user_pages() PTE bit handling Dave Hansen
2015-09-28 19:18 ` [PATCH 15/25] x86, pkeys: check VMAs and PTEs for protection keys Dave Hansen
2015-10-22 20:57   ` Jerome Glisse
2015-10-22 21:23     ` Dave Hansen
2015-10-22 22:25       ` Jerome Glisse
2015-10-23  0:49         ` Dave Hansen
2015-09-28 19:18 ` [PATCH 16/25] x86, pkeys: optimize fault handling in access_error() Dave Hansen
2015-09-28 19:18 ` [PATCH 19/25] x86, pkeys: add Kconfig prompt to existing config option Dave Hansen
2015-09-28 19:18 ` [PATCH 18/25] x86, pkeys: dump PTE pkey in /proc/pid/smaps Dave Hansen
2015-09-28 19:18 ` [PATCH 17/25] x86, pkeys: dump PKRU with other kernel registers Dave Hansen
2015-09-28 19:18 ` [PATCH 20/25] mm, multi-arch: pass a protection key in to calc_vm_flag_bits() Dave Hansen
2015-09-28 19:18 ` [PATCH 23/25] x86, pkeys: actually enable Memory Protection Keys in CPU Dave Hansen
2015-09-28 19:18 ` [PATCH 21/25] mm: implement new mprotect_key() system call Dave Hansen
2015-09-29  6:39   ` Michael Ellerman
2015-09-29 14:16     ` Dave Hansen
2015-09-28 19:18 ` [PATCH 22/25] x86: wire up " Dave Hansen
2015-09-28 19:18 ` [PATCH 24/25] x86, pkeys: add self-tests Dave Hansen
2015-09-28 19:18 ` [PATCH 25/25] x86, pkeys: Documentation Dave Hansen
2015-09-28 20:34   ` Andi Kleen
2015-09-28 20:41     ` Dave Hansen
