linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/9] [v2] System Calls for Memory Protection Keys
@ 2016-06-07 20:47 Dave Hansen
  2016-06-07 20:47 ` [PATCH 1/9] x86, pkeys: add fault handling for PF_PK page fault bit Dave Hansen
                   ` (9 more replies)
  0 siblings, 10 replies; 41+ messages in thread
From: Dave Hansen @ 2016-06-07 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-api, linux-arch, linux-mm, torvalds, akpm, Dave Hansen, arnd

Are there any concerns with merging these into the x86 tree so
that they go upstream for 4.8?  The updates here are pretty
minor.

Changes from v1:
 * updates to alloc/free patch description calling out that
   "in-use" pkeys may still be pkey_free()'d successfully.
 * Fixed a bug in the selftest where the 'flags' argument was
   not passed to pkey_get().
 * Added all syscalls to generic syscalls header
 * Added extra checking to selftests so it doesn't fall over
   when 1G pages are made the hugetlbfs default.

--

Memory Protection Keys for User pages (pkeys) is a CPU feature
which will first appear on Skylake Servers, but will also be
supported on future non-server parts.  It provides a mechanism
for enforcing page-based protections, but without requiring
modification of the page tables when an application wishes
to change permissions.

Patches to implement execute-only mapping support using pkeys
were merged into 4.6.  But, to do anything else useful with
pkeys, an application needs to be able to set the pkey field in
the PTE (obviously has to be done in-kernel) and make changes to
the "rights" register (using unprivileged instructions).

An application also needs an allocator for the keys
themselves.  If two different parts of an application both want
to protect their data with pkeys, they first need to know which
key to use for their individual purposes.

This set introduces 5 system calls, in 3 logical groups:

1. PTE pkey setting (sys_pkey_mprotect(), patches #1-3)
2. Key allocation (sys_pkey_alloc() / sys_pkey_free(), patch #4)
3. Rights register manipulation (sys_pkey_set/get(), patch #5)

These patches build on top of "core" pkeys support already in
4.6.

I have manpages written for some of these syscalls, and have
had multiple rounds of reviews on the manpages list.

This set is also available here:

	git://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-pkeys.git pkeys-v037

I've written a set of unit tests for these interfaces, which is
available as the last patch in the series and integrated into
kselftests.

Note: this is based on a plain 4.6 kernel and will have a minor
merge conflict in the x86 selftests makefile with the new
MPX selftest if those get merged first.

=== diffstat ===

Dave Hansen (9):
      x86, pkeys: add fault handling for PF_PK page fault bit
      mm: implement new pkey_mprotect() system call
      x86, pkeys: make mprotect_key() mask off additional vm_flags
      x86: wire up mprotect_key() system call
      x86, pkeys: allocation/free syscalls
      x86, pkeys: add pkey set/get syscalls
      generic syscalls: wire up memory protection keys syscalls
      pkeys: add details of system call use to Documentation/
      x86, pkeys: add self-tests

 Documentation/x86/protection-keys.txt         |   63 +
 arch/alpha/include/uapi/asm/mman.h            |    5 +
 arch/mips/include/uapi/asm/mman.h             |    5 +
 arch/parisc/include/uapi/asm/mman.h           |    5 +
 arch/x86/entry/syscalls/syscall_32.tbl        |    5 +
 arch/x86/entry/syscalls/syscall_64.tbl        |    5 +
 arch/x86/include/asm/mmu.h                    |    8 +
 arch/x86/include/asm/mmu_context.h            |   25 +-
 arch/x86/include/asm/pkeys.h                  |   80 +-
 arch/x86/kernel/fpu/xstate.c                  |   73 +-
 arch/x86/mm/fault.c                           |    9 +
 arch/x86/mm/pkeys.c                           |   38 +-
 arch/xtensa/include/uapi/asm/mman.h           |    5 +
 include/linux/pkeys.h                         |   39 +-
 include/uapi/asm-generic/mman-common.h        |    5 +
 include/uapi/asm-generic/unistd.h             |   12 +-
 mm/mprotect.c                                 |  134 +-
 tools/testing/selftests/x86/Makefile          |    3 +-
 tools/testing/selftests/x86/pkey-helpers.h    |  187 +++
 tools/testing/selftests/x86/protection_keys.c | 1283 +++++++++++++++++
 20 files changed, 1958 insertions(+), 31 deletions(-)

Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: x86@kernel.org
Cc: torvalds@linux-foundation.org
Cc: akpm@linux-foundation.org
Cc: Arnd Bergmann <arnd@arndb.de>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 1/9] x86, pkeys: add fault handling for PF_PK page fault bit
  2016-06-07 20:47 [PATCH 0/9] [v2] System Calls for Memory Protection Keys Dave Hansen
@ 2016-06-07 20:47 ` Dave Hansen
  2016-06-07 20:47 ` [PATCH 2/9] mm: implement new pkey_mprotect() system call Dave Hansen
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 41+ messages in thread
From: Dave Hansen @ 2016-06-07 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-api, linux-arch, linux-mm, torvalds, akpm,
	Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

PF_PK means that a memory access violated the protection key
access restrictions.  It is unconditionally an access_error()
because the permissions set on the VMA don't matter (the PKRU
value overrides it), and we never "resolve" PK faults (like
how a COW can "resolve" a write fault).

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/arch/x86/mm/fault.c |    9 +++++++++
 1 file changed, 9 insertions(+)

diff -puN arch/x86/mm/fault.c~pkeys-105-add-pk-to-fault arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-105-add-pk-to-fault	2016-06-07 13:22:18.529937509 -0700
+++ b/arch/x86/mm/fault.c	2016-06-07 13:22:18.534937739 -0700
@@ -1112,6 +1112,15 @@ access_error(unsigned long error_code, s
 {
 	/* This is only called for the current mm, so: */
 	bool foreign = false;
+
+	/*
+	 * Read or write was blocked by protection keys.  This is
+	 * always an unconditional error and can never result in
+	 * a follow-up action to resolve the fault, like a COW.
+	 */
+	if (error_code & PF_PK)
+		return 1;
+
 	/*
 	 * Make sure to check the VMA so that we do not perform
 	 * faults just to hit a PF_PK as soon as we fill in a
_


* [PATCH 2/9] mm: implement new pkey_mprotect() system call
  2016-06-07 20:47 [PATCH 0/9] [v2] System Calls for Memory Protection Keys Dave Hansen
  2016-06-07 20:47 ` [PATCH 1/9] x86, pkeys: add fault handling for PF_PK page fault bit Dave Hansen
@ 2016-06-07 20:47 ` Dave Hansen
  2016-06-07 20:47 ` [PATCH 3/9] x86, pkeys: make mprotect_key() mask off additional vm_flags Dave Hansen
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 41+ messages in thread
From: Dave Hansen @ 2016-06-07 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-api, linux-arch, linux-mm, torvalds, akpm,
	Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

pkey_mprotect() is just like mprotect(), except it also takes a
protection key as an argument.  On systems that do not support
protection keys, it still works, but requires that key=0.
Otherwise it does exactly what mprotect does.

I expect it to be used like this when you want to guarantee
that a mapping you create can *never* be accessed without the
right protection keys set up:

	int real_prot = PROT_READ|PROT_WRITE;
	pkey = pkey_alloc(0, PKEY_DISABLE_ACCESS);
	ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);

This way, there is *no* window where the mapping is accessible
since it was always either PROT_NONE or had a protection key set.

We settled on 'unsigned long' for the type of the key here.  We
only need 4 bits on x86 today, but I figured that other
architectures might need some more space.

Semantically, we have a bit of a problem if we combine this
syscall with our previously-introduced execute-only support:
What do we do when we mix execute-only pkey use with
pkey_mprotect() use?  For instance:

	pkey_mprotect(ptr, PAGE_SIZE, PROT_WRITE, 6); // set pkey=6
	mprotect(ptr, PAGE_SIZE, PROT_EXEC);  // set pkey=X_ONLY_PKEY?
	mprotect(ptr, PAGE_SIZE, PROT_WRITE); // is pkey=6 again?

To solve that, we make the plain-mprotect()-initiated execute-only
support only apply to VMAs that have the default protection key (0)
set on them.

Proposed semantics:
1. protection key 0 is special and represents the default,
   unassigned protection key.  It is always allocated.
2. mprotect() never affects a mapping's pkey_mprotect()-assigned
   protection key. A protection key of 0 (even if set explicitly)
   represents an unassigned protection key.
   2a. mprotect(PROT_EXEC) on a mapping with an assigned protection
       key may or may not result in a mapping with execute-only
       properties.  pkey_mprotect() plus pkey_set() on all threads
       should be used to _guarantee_ execute-only semantics.
3. mprotect(PROT_EXEC) may result in an "execute-only" mapping. The
   kernel will internally attempt to allocate and dedicate a
   protection key for the purpose of execute-only mappings.  This
   may not be possible in cases where there are no free protection
   keys available.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: x86@kernel.org
Cc: torvalds@linux-foundation.org
Cc: akpm@linux-foundation.org
---

 b/arch/x86/include/asm/mmu_context.h |   15 ++++++++++-----
 b/arch/x86/include/asm/pkeys.h       |   11 +++++++++--
 b/arch/x86/kernel/fpu/xstate.c       |   15 ++++++++++++++-
 b/arch/x86/mm/pkeys.c                |    2 +-
 b/mm/mprotect.c                      |   27 +++++++++++++++++++++++----
 5 files changed, 57 insertions(+), 13 deletions(-)

diff -puN arch/x86/include/asm/mmu_context.h~pkeys-110-syscalls-mprotect_pkey arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~pkeys-110-syscalls-mprotect_pkey	2016-06-07 13:22:18.950956916 -0700
+++ b/arch/x86/include/asm/mmu_context.h	2016-06-07 13:22:18.960957377 -0700
@@ -4,6 +4,7 @@
 #include <asm/desc.h>
 #include <linux/atomic.h>
 #include <linux/mm_types.h>
+#include <linux/pkeys.h>
 
 #include <trace/events/tlb.h>
 
@@ -195,16 +196,20 @@ static inline void arch_unmap(struct mm_
 		mpx_notify_unmap(mm, vma, start, end);
 }
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 static inline int vma_pkey(struct vm_area_struct *vma)
 {
-	u16 pkey = 0;
-#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 	unsigned long vma_pkey_mask = VM_PKEY_BIT0 | VM_PKEY_BIT1 |
 				      VM_PKEY_BIT2 | VM_PKEY_BIT3;
-	pkey = (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
-#endif
-	return pkey;
+
+	return (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
+}
+#else
+static inline int vma_pkey(struct vm_area_struct *vma)
+{
+	return 0;
 }
+#endif
 
 static inline bool __pkru_allows_pkey(u16 pkey, bool write)
 {
diff -puN arch/x86/include/asm/pkeys.h~pkeys-110-syscalls-mprotect_pkey arch/x86/include/asm/pkeys.h
--- a/arch/x86/include/asm/pkeys.h~pkeys-110-syscalls-mprotect_pkey	2016-06-07 13:22:18.952957008 -0700
+++ b/arch/x86/include/asm/pkeys.h	2016-06-07 13:22:18.961957423 -0700
@@ -1,7 +1,12 @@
 #ifndef _ASM_X86_PKEYS_H
 #define _ASM_X86_PKEYS_H
 
-#define arch_max_pkey() (boot_cpu_has(X86_FEATURE_OSPKE) ? 16 : 1)
+#define PKEY_DEDICATED_EXECUTE_ONLY 15
+/*
+ * Consider the PKEY_DEDICATED_EXECUTE_ONLY key unavailable.
+ */
+#define arch_max_pkey() (boot_cpu_has(X86_FEATURE_OSPKE) ? \
+		PKEY_DEDICATED_EXECUTE_ONLY : 1)
 
 extern int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 		unsigned long init_val);
@@ -10,7 +15,6 @@ extern int arch_set_user_pkey_access(str
  * Try to dedicate one of the protection keys to be used as an
  * execute-only protection key.
  */
-#define PKEY_DEDICATED_EXECUTE_ONLY 15
 extern int __execute_only_pkey(struct mm_struct *mm);
 static inline int execute_only_pkey(struct mm_struct *mm)
 {
@@ -31,4 +35,7 @@ static inline int arch_override_mprotect
 	return __arch_override_mprotect_pkey(vma, prot, pkey);
 }
 
+extern int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+		unsigned long init_val);
+
 #endif /*_ASM_X86_PKEYS_H */
diff -puN arch/x86/kernel/fpu/xstate.c~pkeys-110-syscalls-mprotect_pkey arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~pkeys-110-syscalls-mprotect_pkey	2016-06-07 13:22:18.954957100 -0700
+++ b/arch/x86/kernel/fpu/xstate.c	2016-06-07 13:22:18.961957423 -0700
@@ -871,7 +871,7 @@ out:
  * not modfiy PKRU *itself* here, only the XSAVE state that will
  * be restored in to PKRU when we return back to userspace.
  */
-int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 		unsigned long init_val)
 {
 	struct xregs_state *xsave = &tsk->thread.fpu.state.xsave;
@@ -930,3 +930,16 @@ int arch_set_user_pkey_access(struct tas
 
 	return 0;
 }
+
+/*
+ * When setting a userspace-provided value, we need to ensure
+ * that it is valid.  The __ version can get used by
+ * kernel-internal uses like the execute-only support.
+ */
+int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+		unsigned long init_val)
+{
+	if (!validate_pkey(pkey))
+		return -EINVAL;
+	return __arch_set_user_pkey_access(tsk, pkey, init_val);
+}
diff -puN arch/x86/mm/pkeys.c~pkeys-110-syscalls-mprotect_pkey arch/x86/mm/pkeys.c
--- a/arch/x86/mm/pkeys.c~pkeys-110-syscalls-mprotect_pkey	2016-06-07 13:22:18.955957146 -0700
+++ b/arch/x86/mm/pkeys.c	2016-06-07 13:22:18.961957423 -0700
@@ -38,7 +38,7 @@ int __execute_only_pkey(struct mm_struct
 		return PKEY_DEDICATED_EXECUTE_ONLY;
 	}
 	preempt_enable();
-	ret = arch_set_user_pkey_access(current, PKEY_DEDICATED_EXECUTE_ONLY,
+	ret = __arch_set_user_pkey_access(current, PKEY_DEDICATED_EXECUTE_ONLY,
 			PKEY_DISABLE_ACCESS);
 	/*
 	 * If the PKRU-set operation failed somehow, just return
diff -puN mm/mprotect.c~pkeys-110-syscalls-mprotect_pkey mm/mprotect.c
--- a/mm/mprotect.c~pkeys-110-syscalls-mprotect_pkey	2016-06-07 13:22:18.957957238 -0700
+++ b/mm/mprotect.c	2016-06-07 13:22:18.962957469 -0700
@@ -352,8 +352,11 @@ fail:
 	return error;
 }
 
-SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
-		unsigned long, prot)
+/*
+ * pkey==-1 when doing a legacy mprotect()
+ */
+static int do_mprotect_pkey(unsigned long start, size_t len,
+		unsigned long prot, int pkey)
 {
 	unsigned long nstart, end, tmp, reqprot;
 	struct vm_area_struct *vma, *prev;
@@ -409,7 +412,7 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
 
 	for (nstart = start ; ; ) {
 		unsigned long newflags;
-		int pkey = arch_override_mprotect_pkey(vma, prot, -1);
+		int new_vma_pkey;
 
 		/* Here we know that vma->vm_start <= nstart < vma->vm_end. */
 
@@ -417,7 +420,8 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
 		if (rier && (vma->vm_flags & VM_MAYEXEC))
 			prot |= PROT_EXEC;
 
-		newflags = calc_vm_prot_bits(prot, pkey);
+		new_vma_pkey = arch_override_mprotect_pkey(vma, prot, pkey);
+		newflags = calc_vm_prot_bits(prot, new_vma_pkey);
 		newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
 
 		/* newflags >> 4 shift VM_MAY% in place of VM_% */
@@ -454,3 +458,18 @@ out:
 	up_write(&current->mm->mmap_sem);
 	return error;
 }
+
+SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
+		unsigned long, prot)
+{
+	return do_mprotect_pkey(start, len, prot, -1);
+}
+
+SYSCALL_DEFINE4(pkey_mprotect, unsigned long, start, size_t, len,
+		unsigned long, prot, int, pkey)
+{
+	if (!validate_pkey(pkey))
+		return -EINVAL;
+
+	return do_mprotect_pkey(start, len, prot, pkey);
+}
_


* [PATCH 3/9] x86, pkeys: make mprotect_key() mask off additional vm_flags
  2016-06-07 20:47 [PATCH 0/9] [v2] System Calls for Memory Protection Keys Dave Hansen
  2016-06-07 20:47 ` [PATCH 1/9] x86, pkeys: add fault handling for PF_PK page fault bit Dave Hansen
  2016-06-07 20:47 ` [PATCH 2/9] mm: implement new pkey_mprotect() system call Dave Hansen
@ 2016-06-07 20:47 ` Dave Hansen
  2016-06-07 20:47 ` [PATCH 4/9] x86: wire up mprotect_key() system call Dave Hansen
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 41+ messages in thread
From: Dave Hansen @ 2016-06-07 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-api, linux-arch, linux-mm, torvalds, akpm,
	Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Today, mprotect() takes 4 bits of data: PROT_READ/WRITE/EXEC/NONE.
Three of those bits (READ/WRITE/EXEC) get translated directly into
vma->vm_flags by calc_vm_prot_bits().  If a bit is unset in
mprotect()'s 'prot' argument then it must be cleared in vma->vm_flags
during the mprotect() call.

We do this clearing today by first calculating the VMA flags we
want set, then clearing the ones we do not want to inherit from
the original VMA:

	vm_flags = calc_vm_prot_bits(prot, key);
	...
	newflags = vm_flags;
	newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));

However, we *also* want to mask off the original VMA's vm_flags in
which we store the protection key.

To do that, this patch adds a new macro:

	ARCH_VM_PKEY_FLAGS

which allows the architecture to specify additional bits that it would
like cleared.  We use that to ensure that the VM_PKEY_BIT* bits get
cleared.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: x86@kernel.org
Cc: torvalds@linux-foundation.org
Cc: akpm@linux-foundation.org
---

 b/arch/x86/include/asm/pkeys.h |    2 ++
 b/include/linux/pkeys.h        |    1 +
 b/mm/mprotect.c                |   11 ++++++++++-
 3 files changed, 13 insertions(+), 1 deletion(-)

diff -puN arch/x86/include/asm/pkeys.h~pkeys-112-mask-off-correct-vm_flags arch/x86/include/asm/pkeys.h
--- a/arch/x86/include/asm/pkeys.h~pkeys-112-mask-off-correct-vm_flags	2016-06-07 13:22:19.466980702 -0700
+++ b/arch/x86/include/asm/pkeys.h	2016-06-07 13:22:19.473981025 -0700
@@ -38,4 +38,6 @@ static inline int arch_override_mprotect
 extern int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 		unsigned long init_val);
 
+#define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | VM_PKEY_BIT3)
+
 #endif /*_ASM_X86_PKEYS_H */
diff -puN include/linux/pkeys.h~pkeys-112-mask-off-correct-vm_flags include/linux/pkeys.h
--- a/include/linux/pkeys.h~pkeys-112-mask-off-correct-vm_flags	2016-06-07 13:22:19.468980794 -0700
+++ b/include/linux/pkeys.h	2016-06-07 13:22:19.473981025 -0700
@@ -16,6 +16,7 @@
 #define execute_only_pkey(mm) (0)
 #define arch_override_mprotect_pkey(vma, prot, pkey) (0)
 #define PKEY_DEDICATED_EXECUTE_ONLY 0
+#define ARCH_VM_PKEY_FLAGS 0
 #endif /* ! CONFIG_ARCH_HAS_PKEYS */
 
 /*
diff -puN mm/mprotect.c~pkeys-112-mask-off-correct-vm_flags mm/mprotect.c
--- a/mm/mprotect.c~pkeys-112-mask-off-correct-vm_flags	2016-06-07 13:22:19.469980840 -0700
+++ b/mm/mprotect.c	2016-06-07 13:22:19.473981025 -0700
@@ -411,6 +411,7 @@ static int do_mprotect_pkey(unsigned lon
 		prev = vma;
 
 	for (nstart = start ; ; ) {
+		unsigned long mask_off_old_flags;
 		unsigned long newflags;
 		int new_vma_pkey;
 
@@ -420,9 +421,17 @@ static int do_mprotect_pkey(unsigned lon
 		if (rier && (vma->vm_flags & VM_MAYEXEC))
 			prot |= PROT_EXEC;
 
+		/*
+		 * Each mprotect() call explicitly passes r/w/x permissions.
+		 * If a permission is not passed to mprotect(), it must be
+		 * cleared from the VMA.
+		 */
+		mask_off_old_flags = VM_READ | VM_WRITE | VM_EXEC |
+					ARCH_VM_PKEY_FLAGS;
+
 		new_vma_pkey = arch_override_mprotect_pkey(vma, prot, pkey);
 		newflags = calc_vm_prot_bits(prot, new_vma_pkey);
-		newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
+		newflags |= (vma->vm_flags & ~mask_off_old_flags);
 
 		/* newflags >> 4 shift VM_MAY% in place of VM_% */
 		if ((newflags & ~(newflags >> 4)) & (VM_READ | VM_WRITE | VM_EXEC)) {
_


* [PATCH 4/9] x86: wire up mprotect_key() system call
  2016-06-07 20:47 [PATCH 0/9] [v2] System Calls for Memory Protection Keys Dave Hansen
                   ` (2 preceding siblings ...)
  2016-06-07 20:47 ` [PATCH 3/9] x86, pkeys: make mprotect_key() mask off additional vm_flags Dave Hansen
@ 2016-06-07 20:47 ` Dave Hansen
  2016-06-07 20:47 ` [PATCH 5/9] x86, pkeys: allocation/free syscalls Dave Hansen
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 41+ messages in thread
From: Dave Hansen @ 2016-06-07 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-api, linux-arch, linux-mm, torvalds, akpm,
	Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

This is all that we need to get the new system call itself
working on x86.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: x86@kernel.org
Cc: torvalds@linux-foundation.org
Cc: akpm@linux-foundation.org
---

 b/arch/x86/entry/syscalls/syscall_32.tbl |    1 +
 b/arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 2 files changed, 2 insertions(+)

diff -puN arch/x86/entry/syscalls/syscall_32.tbl~pkeys-114-x86-mprotect_key arch/x86/entry/syscalls/syscall_32.tbl
--- a/arch/x86/entry/syscalls/syscall_32.tbl~pkeys-114-x86-mprotect_key	2016-06-07 13:22:19.935002276 -0700
+++ b/arch/x86/entry/syscalls/syscall_32.tbl	2016-06-07 13:22:19.940002506 -0700
@@ -386,3 +386,4 @@
 377	i386	copy_file_range		sys_copy_file_range
 378	i386	preadv2			sys_preadv2			compat_sys_preadv2
 379	i386	pwritev2		sys_pwritev2			compat_sys_pwritev2
+380	i386	pkey_mprotect		sys_pkey_mprotect
diff -puN arch/x86/entry/syscalls/syscall_64.tbl~pkeys-114-x86-mprotect_key arch/x86/entry/syscalls/syscall_64.tbl
--- a/arch/x86/entry/syscalls/syscall_64.tbl~pkeys-114-x86-mprotect_key	2016-06-07 13:22:19.936002322 -0700
+++ b/arch/x86/entry/syscalls/syscall_64.tbl	2016-06-07 13:22:19.940002506 -0700
@@ -335,6 +335,7 @@
 326	common	copy_file_range		sys_copy_file_range
 327	64	preadv2			sys_preadv2
 328	64	pwritev2		sys_pwritev2
+329	common	pkey_mprotect		sys_pkey_mprotect
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
_


* [PATCH 5/9] x86, pkeys: allocation/free syscalls
  2016-06-07 20:47 [PATCH 0/9] [v2] System Calls for Memory Protection Keys Dave Hansen
                   ` (3 preceding siblings ...)
  2016-06-07 20:47 ` [PATCH 4/9] x86: wire up mprotect_key() system call Dave Hansen
@ 2016-06-07 20:47 ` Dave Hansen
  2016-06-07 20:47 ` [PATCH 6/9] x86, pkeys: add pkey set/get syscalls Dave Hansen
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 41+ messages in thread
From: Dave Hansen @ 2016-06-07 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-api, linux-arch, linux-mm, torvalds, akpm,
	Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

This patch adds two new system calls:

	int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
	int pkey_free(int pkey);

These implement an "allocator" for the protection keys
themselves, which can be thought of as analogous to the allocator
that the kernel has for file descriptors.  The kernel tracks
which numbers are in use, and only allows operations on keys that
are valid.  A key which was not obtained by pkey_alloc() may not,
for instance, be passed to pkey_mprotect() (or the forthcoming
get/set syscalls).

These system calls are also very important given the kernel's use
of pkeys to implement execute-only support.  They help ensure
that userspace can never assume that it has control of a key
unless it first asks the kernel.

The 'init_access_rights' argument to pkey_alloc() specifies the
rights that will be established for the returned pkey.  For
instance:

	pkey = pkey_alloc(flags, PKEY_DISABLE_WRITE);

will allocate 'pkey', but will also set the bits in PKRU[1] such that
writing to 'pkey' is already denied.  This keeps userspace from
needing to have knowledge about manipulating PKRU with the
RDPKRU/WRPKRU instructions.  Userspace is still free to use these
instructions as it wishes, but this facility ensures it is no
longer required.

The kernel does _not_ enforce that this interface must be used for
changes to PKRU, even for keys it does not control.

The kernel does not prevent pkey_free() from successfully freeing
in-use pkeys (those still assigned to a memory range by
pkey_mprotect()).  It would be expensive to implement the checks
for this, so we instead say, "Just don't do it" since sane
software will never do it anyway.

This allocation mechanism could be implemented in userspace.
Even if we did it in userspace, we would still need additional
user/kernel interfaces to tell userspace which keys are being
used by the kernel internally (such as for execute-only
mappings).  Having the kernel provide this facility completely
removes the need for these additional interfaces, or having an
implementation of this in userspace at all.

Note that we have to make changes to all of the architectures
that do not use mman-common.h because we use the new
PKEY_DISABLE_ACCESS/WRITE macros in arch-independent code.

1. PKRU is the Protection Key Rights User register.  It is a
   usermode-accessible register that controls whether writes
   and/or access to each individual pkey is allowed or denied.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: x86@kernel.org
Cc: torvalds@linux-foundation.org
Cc: akpm@linux-foundation.org
---

 b/arch/alpha/include/uapi/asm/mman.h     |    5 +
 b/arch/mips/include/uapi/asm/mman.h      |    5 +
 b/arch/parisc/include/uapi/asm/mman.h    |    5 +
 b/arch/x86/entry/syscalls/syscall_32.tbl |    2 
 b/arch/x86/entry/syscalls/syscall_64.tbl |    2 
 b/arch/x86/include/asm/mmu.h             |    8 +++
 b/arch/x86/include/asm/mmu_context.h     |   10 +++
 b/arch/x86/include/asm/pkeys.h           |   79 ++++++++++++++++++++++++++++---
 b/arch/x86/kernel/fpu/xstate.c           |    3 +
 b/arch/x86/mm/pkeys.c                    |   38 +++++++++++---
 b/arch/xtensa/include/uapi/asm/mman.h    |    5 +
 b/include/linux/pkeys.h                  |   30 +++++++++--
 b/include/uapi/asm-generic/mman-common.h |    5 +
 b/mm/mprotect.c                          |   55 +++++++++++++++++++++
 14 files changed, 231 insertions(+), 21 deletions(-)

diff -puN arch/alpha/include/uapi/asm/mman.h~pkeys-116-syscalls-allocation arch/alpha/include/uapi/asm/mman.h
--- a/arch/alpha/include/uapi/asm/mman.h~pkeys-116-syscalls-allocation	2016-06-07 13:22:20.386023066 -0700
+++ b/arch/alpha/include/uapi/asm/mman.h	2016-06-07 13:22:20.412024264 -0700
@@ -78,4 +78,9 @@
 #define MAP_HUGE_SHIFT	26
 #define MAP_HUGE_MASK	0x3f
 
+#define PKEY_DISABLE_ACCESS	0x1
+#define PKEY_DISABLE_WRITE	0x2
+#define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
+				 PKEY_DISABLE_WRITE)
+
 #endif /* __ALPHA_MMAN_H__ */
diff -puN arch/mips/include/uapi/asm/mman.h~pkeys-116-syscalls-allocation arch/mips/include/uapi/asm/mman.h
--- a/arch/mips/include/uapi/asm/mman.h~pkeys-116-syscalls-allocation	2016-06-07 13:22:20.388023158 -0700
+++ b/arch/mips/include/uapi/asm/mman.h	2016-06-07 13:22:20.412024264 -0700
@@ -105,4 +105,9 @@
 #define MAP_HUGE_SHIFT	26
 #define MAP_HUGE_MASK	0x3f
 
+#define PKEY_DISABLE_ACCESS	0x1
+#define PKEY_DISABLE_WRITE	0x2
+#define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
+				 PKEY_DISABLE_WRITE)
+
 #endif /* _ASM_MMAN_H */
diff -puN arch/parisc/include/uapi/asm/mman.h~pkeys-116-syscalls-allocation arch/parisc/include/uapi/asm/mman.h
--- a/arch/parisc/include/uapi/asm/mman.h~pkeys-116-syscalls-allocation	2016-06-07 13:22:20.389023204 -0700
+++ b/arch/parisc/include/uapi/asm/mman.h	2016-06-07 13:22:20.412024264 -0700
@@ -75,4 +75,9 @@
 #define MAP_HUGE_SHIFT	26
 #define MAP_HUGE_MASK	0x3f
 
+#define PKEY_DISABLE_ACCESS	0x1
+#define PKEY_DISABLE_WRITE	0x2
+#define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
+				 PKEY_DISABLE_WRITE)
+
 #endif /* __PARISC_MMAN_H__ */
diff -puN arch/x86/entry/syscalls/syscall_32.tbl~pkeys-116-syscalls-allocation arch/x86/entry/syscalls/syscall_32.tbl
--- a/arch/x86/entry/syscalls/syscall_32.tbl~pkeys-116-syscalls-allocation	2016-06-07 13:22:20.391023296 -0700
+++ b/arch/x86/entry/syscalls/syscall_32.tbl	2016-06-07 13:22:20.413024311 -0700
@@ -387,3 +387,5 @@
 378	i386	preadv2			sys_preadv2			compat_sys_preadv2
 379	i386	pwritev2		sys_pwritev2			compat_sys_pwritev2
 380	i386	pkey_mprotect		sys_pkey_mprotect
+381	i386	pkey_alloc		sys_pkey_alloc
+382	i386	pkey_free		sys_pkey_free
diff -puN arch/x86/entry/syscalls/syscall_64.tbl~pkeys-116-syscalls-allocation arch/x86/entry/syscalls/syscall_64.tbl
--- a/arch/x86/entry/syscalls/syscall_64.tbl~pkeys-116-syscalls-allocation	2016-06-07 13:22:20.393023389 -0700
+++ b/arch/x86/entry/syscalls/syscall_64.tbl	2016-06-07 13:22:20.413024311 -0700
@@ -336,6 +336,8 @@
 327	64	preadv2			sys_preadv2
 328	64	pwritev2		sys_pwritev2
 329	common	pkey_mprotect		sys_pkey_mprotect
+330	common	pkey_alloc		sys_pkey_alloc
+331	common	pkey_free		sys_pkey_free
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff -puN arch/x86/include/asm/mmu_context.h~pkeys-116-syscalls-allocation arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~pkeys-116-syscalls-allocation	2016-06-07 13:22:20.394023435 -0700
+++ b/arch/x86/include/asm/mmu_context.h	2016-06-07 13:22:20.413024311 -0700
@@ -108,7 +108,16 @@ static inline void enter_lazy_tlb(struct
 static inline int init_new_context(struct task_struct *tsk,
 				   struct mm_struct *mm)
 {
+	#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+	if (cpu_feature_enabled(X86_FEATURE_OSPKE)) {
+		/* pkey 0 is the default and always allocated */
+		mm->context.pkey_allocation_map = 0x1;
+		/* -1 means unallocated or invalid */
+		mm->context.execute_only_pkey = -1;
+	}
+	#endif
 	init_new_context_ldt(tsk, mm);
+
 	return 0;
 }
 static inline void destroy_context(struct mm_struct *mm)
@@ -263,5 +272,4 @@ static inline bool arch_pte_access_permi
 {
 	return __pkru_allows_pkey(pte_flags_pkey(pte_flags(pte)), write);
 }
-
 #endif /* _ASM_X86_MMU_CONTEXT_H */
diff -puN arch/x86/include/asm/mmu.h~pkeys-116-syscalls-allocation arch/x86/include/asm/mmu.h
--- a/arch/x86/include/asm/mmu.h~pkeys-116-syscalls-allocation	2016-06-07 13:22:20.397023573 -0700
+++ b/arch/x86/include/asm/mmu.h	2016-06-07 13:22:20.414024357 -0700
@@ -23,6 +23,14 @@ typedef struct {
 	const struct vdso_image *vdso_image;	/* vdso image in use */
 
 	atomic_t perf_rdpmc_allowed;	/* nonzero if rdpmc is allowed */
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+	/*
+	 * One bit per protection key says whether userspace can
+	 * use it or not.  protected by mmap_sem.
+	 */
+	u16 pkey_allocation_map;
+	s16 execute_only_pkey;
+#endif
 } mm_context_t;
 
 #ifdef CONFIG_SMP
diff -puN arch/x86/include/asm/pkeys.h~pkeys-116-syscalls-allocation arch/x86/include/asm/pkeys.h
--- a/arch/x86/include/asm/pkeys.h~pkeys-116-syscalls-allocation	2016-06-07 13:22:20.398023619 -0700
+++ b/arch/x86/include/asm/pkeys.h	2016-06-07 13:22:20.414024357 -0700
@@ -1,12 +1,7 @@
 #ifndef _ASM_X86_PKEYS_H
 #define _ASM_X86_PKEYS_H
 
-#define PKEY_DEDICATED_EXECUTE_ONLY 15
-/*
- * Consider the PKEY_DEDICATED_EXECUTE_ONLY key unavailable.
- */
-#define arch_max_pkey() (boot_cpu_has(X86_FEATURE_OSPKE) ? \
-		PKEY_DEDICATED_EXECUTE_ONLY : 1)
+#define arch_max_pkey() (boot_cpu_has(X86_FEATURE_OSPKE) ? 16 : 1)
 
 extern int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 		unsigned long init_val);
@@ -40,4 +35,76 @@ extern int __arch_set_user_pkey_access(s
 
 #define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | VM_PKEY_BIT3)
 
+#define mm_pkey_allocation_map(mm)	(mm->context.pkey_allocation_map)
+#define mm_set_pkey_allocated(mm, pkey) do {		\
+	mm_pkey_allocation_map(mm) |= (1 << pkey);	\
+} while (0)
+#define mm_set_pkey_free(mm, pkey) do {			\
+	mm_pkey_allocation_map(mm) &= ~(1 << pkey);	\
+} while (0)
+
+/*
+ * This is called from mprotect_pkey().
+ *
+ * Returns true if the protection key is valid.
+ */
+static inline bool validate_pkey(int pkey)
+{
+	if (pkey < 0)
+		return false;
+	return (pkey < arch_max_pkey());
+}
+
+static inline
+bool mm_pkey_is_allocated(struct mm_struct *mm, unsigned long pkey)
+{
+	if (!validate_pkey(pkey))
+		return true;
+
+	return mm_pkey_allocation_map(mm) & (1 << pkey);
+}
+
+static inline
+int mm_pkey_alloc(struct mm_struct *mm)
+{
+	int all_pkeys_mask = ((1 << arch_max_pkey()) - 1);
+	int ret;
+
+	/*
+	 * Are we out of pkeys?  We must handle this specially
+	 * because ffz() behavior is undefined if there are no
+	 * zeros.
+	 */
+	if (mm_pkey_allocation_map(mm) == all_pkeys_mask)
+		return -1;
+
+	ret = ffz(mm_pkey_allocation_map(mm));
+
+	mm_set_pkey_allocated(mm, ret);
+
+	return ret;
+}
+
+static inline
+int mm_pkey_free(struct mm_struct *mm, int pkey)
+{
+	/*
+	 * pkey 0 is special, always allocated and can never
+	 * be freed.
+	 */
+	if (!pkey || !validate_pkey(pkey))
+		return -EINVAL;
+	if (!mm_pkey_is_allocated(mm, pkey))
+		return -EINVAL;
+
+	mm_set_pkey_free(mm, pkey);
+
+	return 0;
+}
+
+extern int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+		unsigned long init_val);
+extern int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+		unsigned long init_val);
+
 #endif /*_ASM_X86_PKEYS_H */
diff -puN arch/x86/kernel/fpu/xstate.c~pkeys-116-syscalls-allocation arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~pkeys-116-syscalls-allocation	2016-06-07 13:22:20.400023711 -0700
+++ b/arch/x86/kernel/fpu/xstate.c	2016-06-07 13:22:20.415024403 -0700
@@ -5,6 +5,7 @@
  */
 #include <linux/compat.h>
 #include <linux/cpu.h>
+#include <linux/mman.h>
 #include <linux/pkeys.h>
 
 #include <asm/fpu/api.h>
@@ -778,6 +779,7 @@ const void *get_xsave_field_ptr(int xsav
 	return get_xsave_addr(&fpu->state.xsave, xsave_state);
 }
 
+#ifdef CONFIG_ARCH_HAS_PKEYS
 
 /*
  * Set xfeatures (aka XSTATE_BV) bit for a feature that we want
@@ -943,3 +945,4 @@ int arch_set_user_pkey_access(struct tas
 		return -EINVAL;
 	return __arch_set_user_pkey_access(tsk, pkey, init_val);
 }
+#endif /* CONFIG_ARCH_HAS_PKEYS */
diff -puN arch/x86/mm/pkeys.c~pkeys-116-syscalls-allocation arch/x86/mm/pkeys.c
--- a/arch/x86/mm/pkeys.c~pkeys-116-syscalls-allocation	2016-06-07 13:22:20.402023804 -0700
+++ b/arch/x86/mm/pkeys.c	2016-06-07 13:22:20.415024403 -0700
@@ -21,8 +21,19 @@
 
 int __execute_only_pkey(struct mm_struct *mm)
 {
+	bool need_to_set_mm_pkey = false;
+	int execute_only_pkey = mm->context.execute_only_pkey;
 	int ret;
 
+	/* Do we need to assign a pkey for mm's execute-only maps? */
+	if (execute_only_pkey == -1) {
+		/* Go allocate one to use, which might fail */
+		execute_only_pkey = mm_pkey_alloc(mm);
+		if (!validate_pkey(execute_only_pkey))
+			return -1;
+		need_to_set_mm_pkey = true;
+	}
+
 	/*
 	 * We do not want to go through the relatively costly
 	 * dance to set PKRU if we do not need to.  Check it
@@ -32,22 +43,33 @@ int __execute_only_pkey(struct mm_struct
 	 * can make fpregs inactive.
 	 */
 	preempt_disable();
-	if (fpregs_active() &&
-	    !__pkru_allows_read(read_pkru(), PKEY_DEDICATED_EXECUTE_ONLY)) {
+	if (!need_to_set_mm_pkey &&
+	    fpregs_active() &&
+	    !__pkru_allows_read(read_pkru(), execute_only_pkey)) {
 		preempt_enable();
-		return PKEY_DEDICATED_EXECUTE_ONLY;
+		return execute_only_pkey;
 	}
 	preempt_enable();
-	ret = __arch_set_user_pkey_access(current, PKEY_DEDICATED_EXECUTE_ONLY,
+
+	/*
+	 * Set up PKRU so that it denies access for everything
+	 * other than execution.
+	 */
+	ret = __arch_set_user_pkey_access(current, execute_only_pkey,
 			PKEY_DISABLE_ACCESS);
 	/*
 	 * If the PKRU-set operation failed somehow, just return
 	 * 0 and effectively disable execute-only support.
 	 */
-	if (ret)
-		return 0;
+	if (ret) {
+		mm_set_pkey_free(mm, execute_only_pkey);
+		return -1;
+	}
 
-	return PKEY_DEDICATED_EXECUTE_ONLY;
+	/* We got one, store it and use it from here on out */
+	if (need_to_set_mm_pkey)
+		mm->context.execute_only_pkey = execute_only_pkey;
+	return execute_only_pkey;
 }
 
 static inline bool vma_is_pkey_exec_only(struct vm_area_struct *vma)
@@ -55,7 +77,7 @@ static inline bool vma_is_pkey_exec_only
 	/* Do this check first since the vm_flags should be hot */
 	if ((vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) != VM_EXEC)
 		return false;
-	if (vma_pkey(vma) != PKEY_DEDICATED_EXECUTE_ONLY)
+	if (vma_pkey(vma) != vma->vm_mm->context.execute_only_pkey)
 		return false;
 
 	return true;
diff -puN arch/xtensa/include/uapi/asm/mman.h~pkeys-116-syscalls-allocation arch/xtensa/include/uapi/asm/mman.h
--- a/arch/xtensa/include/uapi/asm/mman.h~pkeys-116-syscalls-allocation	2016-06-07 13:22:20.404023896 -0700
+++ b/arch/xtensa/include/uapi/asm/mman.h	2016-06-07 13:22:20.415024403 -0700
@@ -117,4 +117,9 @@
 #define MAP_HUGE_SHIFT	26
 #define MAP_HUGE_MASK	0x3f
 
+#define PKEY_DISABLE_ACCESS	0x1
+#define PKEY_DISABLE_WRITE	0x2
+#define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
+				 PKEY_DISABLE_WRITE)
+
 #endif /* _XTENSA_MMAN_H */
diff -puN include/linux/pkeys.h~pkeys-116-syscalls-allocation include/linux/pkeys.h
--- a/include/linux/pkeys.h~pkeys-116-syscalls-allocation	2016-06-07 13:22:20.405023942 -0700
+++ b/include/linux/pkeys.h	2016-06-07 13:22:20.416024449 -0700
@@ -4,11 +4,6 @@
 #include <linux/mm_types.h>
 #include <asm/mmu_context.h>
 
-#define PKEY_DISABLE_ACCESS	0x1
-#define PKEY_DISABLE_WRITE	0x2
-#define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
-				 PKEY_DISABLE_WRITE)
-
 #ifdef CONFIG_ARCH_HAS_PKEYS
 #include <asm/pkeys.h>
 #else /* ! CONFIG_ARCH_HAS_PKEYS */
@@ -17,7 +12,6 @@
 #define arch_override_mprotect_pkey(vma, prot, pkey) (0)
 #define PKEY_DEDICATED_EXECUTE_ONLY 0
 #define ARCH_VM_PKEY_FLAGS 0
-#endif /* ! CONFIG_ARCH_HAS_PKEYS */
 
 /*
  * This is called from mprotect_pkey().
@@ -31,4 +25,28 @@ static inline bool validate_pkey(int pke
 	return (pkey < arch_max_pkey());
 }
 
+static inline bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
+{
+	return (pkey == 0);
+}
+
+static inline int mm_pkey_alloc(struct mm_struct *mm)
+{
+	return -1;
+}
+
+static inline int mm_pkey_free(struct mm_struct *mm, int pkey)
+{
+	WARN_ONCE(1, "free of protection key when disabled");
+	return -EINVAL;
+}
+
+static inline int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+			unsigned long init_val)
+{
+	return 0;
+}
+
+#endif /* ! CONFIG_ARCH_HAS_PKEYS */
+
 #endif /* _LINUX_PKEYS_H */
diff -puN include/uapi/asm-generic/mman-common.h~pkeys-116-syscalls-allocation include/uapi/asm-generic/mman-common.h
--- a/include/uapi/asm-generic/mman-common.h~pkeys-116-syscalls-allocation	2016-06-07 13:22:20.407024034 -0700
+++ b/include/uapi/asm-generic/mman-common.h	2016-06-07 13:22:20.416024449 -0700
@@ -72,4 +72,9 @@
 #define MAP_HUGE_SHIFT	26
 #define MAP_HUGE_MASK	0x3f
 
+#define PKEY_DISABLE_ACCESS	0x1
+#define PKEY_DISABLE_WRITE	0x2
+#define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
+				 PKEY_DISABLE_WRITE)
+
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff -puN mm/mprotect.c~pkeys-116-syscalls-allocation mm/mprotect.c
--- a/mm/mprotect.c~pkeys-116-syscalls-allocation	2016-06-07 13:22:20.408024080 -0700
+++ b/mm/mprotect.c	2016-06-07 13:22:20.416024449 -0700
@@ -23,11 +23,13 @@
 #include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
 #include <linux/perf_event.h>
 #include <linux/ksm.h>
 #include <linux/pkeys.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
+#include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
 
 #include "internal.h"
@@ -385,6 +387,14 @@ static int do_mprotect_pkey(unsigned lon
 	if (down_write_killable(&current->mm->mmap_sem))
 		return -EINTR;
 
+	/*
+	 * If userspace did not allocate the pkey, do not let
+	 * them use it here.
+	 */
+	error = -EINVAL;
+	if ((pkey != -1) && !mm_pkey_is_allocated(current->mm, pkey))
+		goto out;
+
 	vma = find_vma(current->mm, start);
 	error = -ENOMEM;
 	if (!vma)
@@ -482,3 +492,48 @@ SYSCALL_DEFINE4(pkey_mprotect, unsigned
 
 	return do_mprotect_pkey(start, len, prot, pkey);
 }
+
+SYSCALL_DEFINE2(pkey_alloc, unsigned long, flags, unsigned long, init_val)
+{
+	int pkey;
+	int ret;
+
+	/* No flags supported yet. */
+	if (flags)
+		return -EINVAL;
+	/* check for unsupported init values */
+	if (init_val & ~PKEY_ACCESS_MASK)
+		return -EINVAL;
+
+	down_write(&current->mm->mmap_sem);
+	pkey = mm_pkey_alloc(current->mm);
+
+	ret = -ENOSPC;
+	if (pkey == -1)
+		goto out;
+
+	ret = arch_set_user_pkey_access(current, pkey, init_val);
+	if (ret) {
+		mm_pkey_free(current->mm, pkey);
+		goto out;
+	}
+	ret = pkey;
+out:
+	up_write(&current->mm->mmap_sem);
+	return ret;
+}
+
+SYSCALL_DEFINE1(pkey_free, int, pkey)
+{
+	int ret;
+
+	down_write(&current->mm->mmap_sem);
+	ret = mm_pkey_free(current->mm, pkey);
+	up_write(&current->mm->mmap_sem);
+
+	/*
+	 * We could provide warnings or errors if any VMA still
+	 * has the pkey set here.
+	 */
+	return ret;
+}
_

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-06-07 20:47 [PATCH 0/9] [v2] System Calls for Memory Protection Keys Dave Hansen
                   ` (4 preceding siblings ...)
  2016-06-07 20:47 ` [PATCH 5/9] x86, pkeys: allocation/free syscalls Dave Hansen
@ 2016-06-07 20:47 ` Dave Hansen
  2016-06-07 20:47 ` [PATCH 7/9] generic syscalls: wire up memory protection keys syscalls Dave Hansen
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 41+ messages in thread
From: Dave Hansen @ 2016-06-07 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-api, linux-arch, linux-mm, torvalds, akpm,
	Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

This establishes two more system calls for protection key management:

	unsigned long pkey_get(int pkey);
	int pkey_set(int pkey, unsigned long access_rights);

The return value from pkey_get() and the 'access_rights' passed
to pkey_set() use the same format: a bitmask containing
PKEY_DISABLE_WRITE and/or PKEY_DISABLE_ACCESS, or nothing set at all.

These can replace userspace's direct use of the new rdpkru/wrpkru
instructions.

With current hardware, the kernel cannot enforce that it has
control over a given key.  But, this at least allows the kernel
to indicate to userspace that userspace does not control a given
protection key.  This makes it more likely that situations like
using a pkey after sys_pkey_free() can be detected.

The kernel does _not_ enforce that this interface must be used for
changes to PKRU, whether or not a key has been "allocated".

This syscall interface could also theoretically be replaced with a
pair of vsyscalls.  The vsyscalls would just call WRPKRU/RDPKRU
directly in situations where they are drop-in equivalents for
what the kernel would be doing.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: x86@kernel.org
Cc: torvalds@linux-foundation.org
Cc: akpm@linux-foundation.org
---

 b/arch/x86/entry/syscalls/syscall_32.tbl |    2 +
 b/arch/x86/entry/syscalls/syscall_64.tbl |    2 +
 b/arch/x86/include/asm/pkeys.h           |    4 +-
 b/arch/x86/kernel/fpu/xstate.c           |   55 +++++++++++++++++++++++++++++--
 b/include/linux/pkeys.h                  |    8 ++++
 b/mm/mprotect.c                          |   41 +++++++++++++++++++++++
 6 files changed, 109 insertions(+), 3 deletions(-)

diff -puN arch/x86/entry/syscalls/syscall_32.tbl~pkeys-118-syscalls-set-get arch/x86/entry/syscalls/syscall_32.tbl
--- a/arch/x86/entry/syscalls/syscall_32.tbl~pkeys-118-syscalls-set-get	2016-06-07 13:22:21.150058284 -0700
+++ b/arch/x86/entry/syscalls/syscall_32.tbl	2016-06-07 13:22:21.162058838 -0700
@@ -389,3 +389,5 @@
 380	i386	pkey_mprotect		sys_pkey_mprotect
 381	i386	pkey_alloc		sys_pkey_alloc
 382	i386	pkey_free		sys_pkey_free
+383	i386	pkey_get		sys_pkey_get
+384	i386	pkey_set		sys_pkey_set
diff -puN arch/x86/entry/syscalls/syscall_64.tbl~pkeys-118-syscalls-set-get arch/x86/entry/syscalls/syscall_64.tbl
--- a/arch/x86/entry/syscalls/syscall_64.tbl~pkeys-118-syscalls-set-get	2016-06-07 13:22:21.152058377 -0700
+++ b/arch/x86/entry/syscalls/syscall_64.tbl	2016-06-07 13:22:21.162058838 -0700
@@ -338,6 +338,8 @@
 329	common	pkey_mprotect		sys_pkey_mprotect
 330	common	pkey_alloc		sys_pkey_alloc
 331	common	pkey_free		sys_pkey_free
+332	common	pkey_get		sys_pkey_get
+333	common	pkey_set		sys_pkey_set
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff -puN arch/x86/include/asm/pkeys.h~pkeys-118-syscalls-set-get arch/x86/include/asm/pkeys.h
--- a/arch/x86/include/asm/pkeys.h~pkeys-118-syscalls-set-get	2016-06-07 13:22:21.153058423 -0700
+++ b/arch/x86/include/asm/pkeys.h	2016-06-07 13:22:21.163058884 -0700
@@ -56,7 +56,7 @@ static inline bool validate_pkey(int pke
 }
 
 static inline
-bool mm_pkey_is_allocated(struct mm_struct *mm, unsigned long pkey)
+bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
 {
 	if (!validate_pkey(pkey))
 		return true;
@@ -107,4 +107,6 @@ extern int arch_set_user_pkey_access(str
 extern int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 		unsigned long init_val);
 
+extern unsigned long arch_get_user_pkey_access(struct task_struct *tsk,
+		int pkey);
 #endif /*_ASM_X86_PKEYS_H */
diff -puN arch/x86/kernel/fpu/xstate.c~pkeys-118-syscalls-set-get arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~pkeys-118-syscalls-set-get	2016-06-07 13:22:21.155058515 -0700
+++ b/arch/x86/kernel/fpu/xstate.c	2016-06-07 13:22:21.163058884 -0700
@@ -690,7 +690,7 @@ void fpu__resume_cpu(void)
  *
  * Note: does not work for compacted buffers.
  */
-void *__raw_xsave_addr(struct xregs_state *xsave, int xstate_feature_mask)
+static void *__raw_xsave_addr(struct xregs_state *xsave, int xstate_feature_mask)
 {
 	int feature_nr = fls64(xstate_feature_mask) - 1;
 
@@ -864,6 +864,7 @@ out:
 
 #define NR_VALID_PKRU_BITS (CONFIG_NR_PROTECTION_KEYS * 2)
 #define PKRU_VALID_MASK (NR_VALID_PKRU_BITS - 1)
+#define PKRU_INIT_STATE	0
 
 /*
  * This will go out and modify the XSAVE buffer so that PKRU is
@@ -882,6 +883,9 @@ int __arch_set_user_pkey_access(struct t
 	int pkey_shift = (pkey * PKRU_BITS_PER_PKEY);
 	u32 new_pkru_bits = 0;
 
+	/* Only support manipulating current task for now */
+	if (tsk != current)
+		return -EINVAL;
 	/*
 	 * This check implies XSAVE support.  OSPKE only gets
 	 * set if we enable XSAVE and we enable PKU in XCR0.
@@ -907,7 +911,7 @@ int __arch_set_user_pkey_access(struct t
 	 * state.
 	 */
 	if (!old_pkru_state)
-		new_pkru_state.pkru = 0;
+		new_pkru_state.pkru = PKRU_INIT_STATE;
 	else
 		new_pkru_state.pkru = old_pkru_state->pkru;
 
@@ -945,4 +949,51 @@ int arch_set_user_pkey_access(struct tas
 		return -EINVAL;
 	return __arch_set_user_pkey_access(tsk, pkey, init_val);
 }
+
+/*
+ * Figures out what the rights are currently for 'pkey'.
+ * Converts from PKRU's format to the user-visible PKEY_DISABLE_*
+ * format.
+ */
+unsigned long arch_get_user_pkey_access(struct task_struct *tsk, int pkey)
+{
+	struct fpu *fpu = &current->thread.fpu;
+	u32 pkru_reg;
+	int ret = 0;
+
+	/* Only support manipulating current task for now */
+	if (tsk != current)
+		return -1;
+	if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
+		return -1;
+	/*
+	 * The contents of PKRU itself are invalid.  Consult the
+	 * task's XSAVE buffer for PKRU contents.  This is much
+	 * more expensive than reading PKRU directly, but should
+	 * be rare or impossible with eagerfpu mode.
+	 */
+	if (!fpu->fpregs_active) {
+		struct xregs_state *xsave = &fpu->state.xsave;
+		struct pkru_state *pkru_state =
+			get_xsave_addr(xsave, XFEATURE_MASK_PKRU);
+		/*
+		 * PKRU is in its init state and not present in
+		 * the buffer in a saved form.
+		 */
+		if (!pkru_state)
+			return PKRU_INIT_STATE;
+
+		return pkru_state->pkru;
+	}
+	/*
+	 * Consult the user register directly.
+	 */
+	pkru_reg = read_pkru();
+	if (!__pkru_allows_read(pkru_reg, pkey))
+		ret |= PKEY_DISABLE_ACCESS;
+	if (!__pkru_allows_write(pkru_reg, pkey))
+		ret |= PKEY_DISABLE_WRITE;
+
+	return ret;
+}
 #endif /* CONFIG_ARCH_HAS_PKEYS */
diff -puN include/linux/pkeys.h~pkeys-118-syscalls-set-get include/linux/pkeys.h
--- a/include/linux/pkeys.h~pkeys-118-syscalls-set-get	2016-06-07 13:22:21.157058607 -0700
+++ b/include/linux/pkeys.h	2016-06-07 13:22:21.164058930 -0700
@@ -44,6 +44,14 @@ static inline int mm_pkey_free(struct mm
 static inline int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 			unsigned long init_val)
 {
+	return -EINVAL;
+}
+
+static inline
+unsigned long arch_get_user_pkey_access(struct task_struct *tsk, int pkey)
+{
+	if (pkey)
+		return -1;
 	return 0;
 }
 
diff -puN mm/mprotect.c~pkeys-118-syscalls-set-get mm/mprotect.c
--- a/mm/mprotect.c~pkeys-118-syscalls-set-get	2016-06-07 13:22:21.159058699 -0700
+++ b/mm/mprotect.c	2016-06-07 13:22:21.164058930 -0700
@@ -537,3 +537,44 @@ SYSCALL_DEFINE1(pkey_free, int, pkey)
 	 */
 	return ret;
 }
+
+SYSCALL_DEFINE2(pkey_get, int, pkey, unsigned long, flags)
+{
+	unsigned long ret = 0;
+
+	if (flags)
+		return -EINVAL;
+
+	down_write(&current->mm->mmap_sem);
+	if (!mm_pkey_is_allocated(current->mm, pkey))
+		ret = -EBADF;
+	up_write(&current->mm->mmap_sem);
+
+	if (ret)
+		return ret;
+
+	ret = arch_get_user_pkey_access(current, pkey);
+
+	return ret;
+}
+
+SYSCALL_DEFINE3(pkey_set, int, pkey, unsigned long, access_rights,
+		unsigned long, flags)
+{
+	unsigned long ret = 0;
+
+	if (flags)
+		return -EINVAL;
+
+	down_write(&current->mm->mmap_sem);
+	if (!mm_pkey_is_allocated(current->mm, pkey))
+		ret = -EBADF;
+	up_write(&current->mm->mmap_sem);
+
+	if (ret)
+		return ret;
+
+	ret = arch_set_user_pkey_access(current, pkey, access_rights);
+
+	return ret;
+}
_

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 7/9] generic syscalls: wire up memory protection keys syscalls
  2016-06-07 20:47 [PATCH 0/9] [v2] System Calls for Memory Protection Keys Dave Hansen
                   ` (5 preceding siblings ...)
  2016-06-07 20:47 ` [PATCH 6/9] x86, pkeys: add pkey set/get syscalls Dave Hansen
@ 2016-06-07 20:47 ` Dave Hansen
  2016-06-07 21:25   ` Arnd Bergmann
  2016-06-07 20:47 ` [PATCH 8/9] pkeys: add details of system call use to Documentation/ Dave Hansen
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2016-06-07 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-api, linux-arch, linux-mm, torvalds, akpm,
	Dave Hansen, dave.hansen, arnd


From: Dave Hansen <dave.hansen@linux.intel.com>

These new syscalls are implemented as generic code, so enable
them for architectures like arm64 which use the generic syscall
table.

According to Arnd:

	Even if the support is x86 specific for the foreseeable
	future, it may be good to reserve the number just in
	case.  The other architecture specific syscall lists are
	usually left to the individual arch maintainers, but most
	of the newer architectures share this table.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
---

 b/include/uapi/asm-generic/unistd.h |   12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff -puN include/uapi/asm-generic/unistd.h~pkeys-119-syscalls-generic include/uapi/asm-generic/unistd.h
--- a/include/uapi/asm-generic/unistd.h~pkeys-119-syscalls-generic	2016-06-07 13:22:21.703083776 -0700
+++ b/include/uapi/asm-generic/unistd.h	2016-06-07 13:22:21.707083961 -0700
@@ -724,9 +724,19 @@ __SYSCALL(__NR_copy_file_range, sys_copy
 __SC_COMP(__NR_preadv2, sys_preadv2, compat_sys_preadv2)
 #define __NR_pwritev2 287
 __SC_COMP(__NR_pwritev2, sys_pwritev2, compat_sys_pwritev2)
+#define __NR_pkey_mprotect 288
+__SYSCALL(__NR_pkey_mprotect, sys_pkey_mprotect)
+#define __NR_pkey_alloc 289
+__SYSCALL(__NR_pkey_alloc,    sys_pkey_alloc)
+#define __NR_pkey_free 290
+__SYSCALL(__NR_pkey_free,     sys_pkey_free)
+#define __NR_pkey_get 291
+__SYSCALL(__NR_pkey_get,      sys_pkey_get)
+#define __NR_pkey_set 292
+__SYSCALL(__NR_pkey_set,      sys_pkey_set)
 
 #undef __NR_syscalls
-#define __NR_syscalls 288
+#define __NR_syscalls 293
 
 /*
  * All syscalls below here should go away really,
_


* [PATCH 8/9] pkeys: add details of system call use to Documentation/
  2016-06-07 20:47 [PATCH 0/9] [v2] System Calls for Memory Protection Keys Dave Hansen
                   ` (6 preceding siblings ...)
  2016-06-07 20:47 ` [PATCH 7/9] generic syscalls: wire up memory protection keys syscalls Dave Hansen
@ 2016-06-07 20:47 ` Dave Hansen
  2016-06-07 20:47 ` [PATCH 9/9] x86, pkeys: add self-tests Dave Hansen
  2016-06-08  9:23 ` [PATCH 0/9] [v2] System Calls for Memory Protection Keys Michael Kerrisk (man-pages)
  9 siblings, 0 replies; 41+ messages in thread
From: Dave Hansen @ 2016-06-07 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-api, linux-arch, linux-mm, torvalds, akpm,
	Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

This spells out all of the pkey-related system calls that we have
and provides some example code fragments to demonstrate how we
expect them to be used.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: x86@kernel.org
Cc: torvalds@linux-foundation.org
Cc: akpm@linux-foundation.org
---

 b/Documentation/x86/protection-keys.txt |   63 ++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff -puN Documentation/x86/protection-keys.txt~pkeys-120-syscall-docs Documentation/x86/protection-keys.txt
--- a/Documentation/x86/protection-keys.txt~pkeys-120-syscall-docs	2016-06-07 13:22:22.121103045 -0700
+++ b/Documentation/x86/protection-keys.txt	2016-06-07 13:22:22.124103183 -0700
@@ -18,6 +18,69 @@ even though there is theoretically space
 permissions are enforced on data access only and have no effect on
 instruction fetches.
 
+=========================== Syscalls ===========================
+
+There are 5 system calls which directly interact with pkeys:
+
+	int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
+	int pkey_free(int pkey);
+	int pkey_mprotect(unsigned long start, size_t len,
+			  unsigned long prot, int pkey);
+	unsigned long pkey_get(int pkey);
+	int pkey_set(int pkey, unsigned long access_rights);
+
+Before a pkey can be used, it must first be allocated with
+pkey_alloc().  An application may either call pkey_set() or use
+the WRPKRU instruction directly to change the access permissions
+of memory covered by a key.
+
+	int real_prot = PROT_READ|PROT_WRITE;
+	pkey = pkey_alloc(0, PKEY_DISABLE_WRITE);
+	ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+	ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
+	... application runs here
+
+Now, if the application needs to update the data at 'ptr', it can
+gain access, do the update, then remove its write access:
+
+	pkey_set(pkey, 0); // clear PKEY_DISABLE_WRITE
+	*ptr = foo; // assign something
+	pkey_set(pkey, PKEY_DISABLE_WRITE); // set PKEY_DISABLE_WRITE again
+
+Now when it frees the memory, it will also free the pkey since it
+is no longer in use:
+
+	munmap(ptr, PAGE_SIZE);
+	pkey_free(pkey);
+
+=========================== Behavior ===========================
+
+The kernel attempts to make protection keys consistent with the
+behavior of a plain mprotect().  For instance if you do this:
+
+	mprotect(ptr, size, PROT_NONE);
+	something(ptr);
+
+you can expect the same effects with protection keys when doing this:
+
+	pkey = pkey_alloc(0, PKEY_DISABLE_ACCESS);
+	pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey);
+	something(ptr);
+
+That should be true whether something() is a direct access to 'ptr'
+like:
+
+	*ptr = foo;
+
+or when the kernel does the access on the application's behalf like
+with a read():
+
+	read(fd, ptr, 1);
+
+The kernel will send a SIGSEGV in both cases, but si_code will be set
+to SEGV_PKUERR when violating protection keys versus SEGV_ACCERR when
+the plain mprotect() permissions are violated.
+
 =========================== Config Option ===========================
 
 This config option adds approximately 1.5kb of text and 50 bytes of
_


* [PATCH 9/9] x86, pkeys: add self-tests
  2016-06-07 20:47 [PATCH 0/9] [v2] System Calls for Memory Protection Keys Dave Hansen
                   ` (7 preceding siblings ...)
  2016-06-07 20:47 ` [PATCH 8/9] pkeys: add details of system call use to Documentation/ Dave Hansen
@ 2016-06-07 20:47 ` Dave Hansen
  2016-06-08  9:23 ` [PATCH 0/9] [v2] System Calls for Memory Protection Keys Michael Kerrisk (man-pages)
  9 siblings, 0 replies; 41+ messages in thread
From: Dave Hansen @ 2016-06-07 20:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-api, linux-arch, linux-mm, torvalds, akpm,
	Dave Hansen, dave.hansen, shuahkh


From: Dave Hansen <dave.hansen@linux.intel.com>

This code should be a good demonstration of how to use the new
system calls as well as how to use protection keys in general.

This code shows how to:
1. Manipulate the Protection Key Rights for User pages (PKRU)
   register with sys_pkey_get/set()
2. Set a protection key on memory
3. Fetch and/or modify PKRU from the signal XSAVE state
4. Read the kernel-provided protection key in the siginfo
5. Set up an execute-only mapping

There are currently 13 tests:

        test_read_of_write_disabled_region
        test_read_of_access_disabled_region
        test_write_of_write_disabled_region
        test_write_of_access_disabled_region
        test_kernel_write_of_access_disabled_region
        test_kernel_write_of_write_disabled_region
        test_kernel_gup_of_access_disabled_region
        test_kernel_gup_write_to_write_disabled_region
        test_executing_on_unreadable_memory
        test_ptrace_of_child
        test_pkey_syscalls_on_non_allocated_pkey
        test_pkey_syscalls_bad_args
        test_pkey_alloc_exhaust

Each of the tests is run with plain memory (via mmap(MAP_ANON)),
transparent huge pages, and hugetlb.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: shuahkh@osg.samsung.com
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: x86@kernel.org
Cc: torvalds@linux-foundation.org
Cc: akpm@linux-foundation.org
---

 b/tools/testing/selftests/x86/Makefile          |    3 
 b/tools/testing/selftests/x86/pkey-helpers.h    |  187 +++
 b/tools/testing/selftests/x86/protection_keys.c | 1283 ++++++++++++++++++++++++
 3 files changed, 1472 insertions(+), 1 deletion(-)

diff -puN tools/testing/selftests/x86/Makefile~pkeys-130-selftests tools/testing/selftests/x86/Makefile
--- a/tools/testing/selftests/x86/Makefile~pkeys-130-selftests	2016-06-07 13:22:35.302710652 -0700
+++ b/tools/testing/selftests/x86/Makefile	2016-06-07 13:22:35.305710791 -0700
@@ -5,7 +5,8 @@ include ../lib.mk
 .PHONY: all all_32 all_64 warn_32bit_failure clean
 
 TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt ptrace_syscall \
-			check_initial_reg_state sigreturn ldt_gdt iopl
+			check_initial_reg_state sigreturn ldt_gdt iopl	\
+			protection_keys
 TARGETS_C_32BIT_ONLY := entry_from_vm86 syscall_arg_fault test_syscall_vdso unwind_vdso \
 			test_FCMOV test_FCOMI test_FISTTP \
 			vdso_restorer
diff -puN /dev/null tools/testing/selftests/x86/pkey-helpers.h
--- /dev/null	2016-04-04 09:40:43.435149254 -0700
+++ b/tools/testing/selftests/x86/pkey-helpers.h	2016-06-07 13:22:35.306710837 -0700
@@ -0,0 +1,187 @@
+#ifndef _PKEYS_HELPER_H
+#define _PKEYS_HELPER_H
+#define _GNU_SOURCE
+#include <string.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <signal.h>
+#include <assert.h>
+#include <stdlib.h>
+#include <ucontext.h>
+#include <sys/mman.h>
+
+#define NR_PKEYS 16
+
+#ifndef DEBUG_LEVEL
+#define DEBUG_LEVEL 0
+#endif
+#define dprintf_level(level, args...) do { if (level <= DEBUG_LEVEL) printf(args); } while (0)
+#define dprintf0(args...) dprintf_level(0, args)
+#define dprintf1(args...) dprintf_level(1, args)
+#define dprintf2(args...) dprintf_level(2, args)
+#define dprintf3(args...) dprintf_level(3, args)
+#define dprintf4(args...) dprintf_level(4, args)
+
+extern unsigned int shadow_pkru;
+static inline unsigned int __rdpkru(void)
+{
+        unsigned int eax, edx;
+	unsigned int ecx = 0;
+	unsigned int pkru;
+
+        asm volatile(".byte 0x0f,0x01,0xee\n\t"
+                     : "=a" (eax), "=d" (edx)
+		     : "c" (ecx));
+	pkru = eax;
+	return pkru;
+}
+
+static inline unsigned int _rdpkru(int line)
+{
+	unsigned int pkru = __rdpkru();
+	dprintf4("rdpkru(line=%d) pkru: %x shadow: %x\n", line, pkru, shadow_pkru);
+	assert(pkru == shadow_pkru);
+	return pkru;
+}
+
+#define rdpkru() _rdpkru(__LINE__)
+
+static inline void __wrpkru(unsigned int pkru)
+{
+        unsigned int eax = pkru;
+	unsigned int ecx = 0;
+	unsigned int edx = 0;
+
+        asm volatile(".byte 0x0f,0x01,0xef\n\t"
+                     : : "a" (eax), "c" (ecx), "d" (edx));
+	assert(pkru == __rdpkru());
+}
+
+static inline void wrpkru(unsigned int pkru)
+{
+	dprintf4("%s() changing %08x to %08x\n", __func__, __rdpkru(), pkru);
+	// will do the shadow check for us:
+	rdpkru();
+	__wrpkru(pkru);
+	shadow_pkru = pkru;
+	dprintf4("%s(%08x) pkru: %08x\n", __func__, pkru, __rdpkru());
+}
+
+/*
+ * These are technically racy, since something could
+ * change PKRU between the read and the write.
+ */
+static inline void __pkey_access_allow(int pkey, int do_allow)
+{
+	unsigned int pkru = rdpkru();
+	int bit = pkey * 2;
+
+	if (do_allow)
+		pkru &= ~(1<<bit);
+	else
+		pkru |= (1<<bit);
+
+	wrpkru(pkru);
+	dprintf4("pkru now: %08x\n", rdpkru());
+}
+
+static inline void __pkey_write_allow(int pkey, int do_allow_write)
+{
+	long pkru = rdpkru();
+	int bit = pkey * 2 + 1;
+
+	if (do_allow_write)
+		pkru &= ~(1<<bit);
+	else
+		pkru |= (1<<bit);
+
+	wrpkru(pkru);
+	dprintf4("pkru now: %08x\n", rdpkru());
+}
+
+#define PROT_PKEY0     0x10            /* protection key value (bit 0) */
+#define PROT_PKEY1     0x20            /* protection key value (bit 1) */
+#define PROT_PKEY2     0x40            /* protection key value (bit 2) */
+#define PROT_PKEY3     0x80            /* protection key value (bit 3) */
+
+#define PAGE_SIZE 4096
+#define MB	(1<<20)
+
+static inline void __cpuid(unsigned int *eax, unsigned int *ebx,
+                                unsigned int *ecx, unsigned int *edx)
+{
+	/* ecx is often an input as well as an output. */
+	asm volatile(
+		"cpuid;"
+		: "=a" (*eax),
+		  "=b" (*ebx),
+		  "=c" (*ecx),
+		  "=d" (*edx)
+		: "0" (*eax), "2" (*ecx));
+}
+
+/* Intel-defined CPU features, CPUID level 0x00000007:0 (ecx) */
+#define X86_FEATURE_PKU        (1<<3) /* Protection Keys for Userspace */
+#define X86_FEATURE_OSPKE      (1<<4) /* OS Protection Keys Enable */
+
+static inline int cpu_has_pku(void)
+{
+	unsigned int eax;
+	unsigned int ebx;
+	unsigned int ecx;
+	unsigned int edx;
+	eax = 0x7;
+	ecx = 0x0;
+	__cpuid(&eax, &ebx, &ecx, &edx);
+
+	if (!(ecx & X86_FEATURE_PKU)) {
+		dprintf2("cpu does not have PKU\n");
+		return 0;
+	}
+	if (!(ecx & X86_FEATURE_OSPKE)) {
+		dprintf2("cpu does not have OSPKE\n");
+		return 0;
+	}
+	return 1;
+}
+
+#define XSTATE_PKRU_BIT	(9)
+#define XSTATE_PKRU	0x200
+
+int pkru_xstate_offset(void)
+{
+	unsigned int eax;
+	unsigned int ebx;
+	unsigned int ecx;
+	unsigned int edx;
+	int xstate_offset = 0;
+	int xstate_size = 0;
+	unsigned long XSTATE_CPUID = 0xd;
+	int leaf;
+
+	/* assume that XSTATE_PKRU is set in XCR0 */
+	leaf = XSTATE_PKRU_BIT;
+	eax = XSTATE_CPUID;
+	ecx = leaf;
+	__cpuid(&eax, &ebx, &ecx, &edx);
+
+	if (leaf == XSTATE_PKRU_BIT) {
+		xstate_offset = ebx;
+		xstate_size = eax;
+	}
+
+	if (xstate_size == 0) {
+		printf("could not find size/offset of PKRU in xsave state\n");
+		return 0;
+	}
+
+	return xstate_offset;
+}
+
+#endif /* _PKEYS_HELPER_H */
diff -puN /dev/null tools/testing/selftests/x86/protection_keys.c
--- /dev/null	2016-04-04 09:40:43.435149254 -0700
+++ b/tools/testing/selftests/x86/protection_keys.c	2016-06-07 13:22:35.307710883 -0700
@@ -0,0 +1,1283 @@
+/*
+ * Tests x86 Memory Protection Keys (see Documentation/x86/protection-keys.txt)
+ *
+ * There are examples in here of:
+ *  * how to set protection keys on memory
+ *  * how to set/clear bits in PKRU (the rights register)
+ *  * how to handle SEGV_PKRU signals and extract pkey-relevant
+ *    information from the siginfo
+ *
+ * Things to add:
+ *	make sure KSM and KSM COW breaking works
+ *	prefault pages in at malloc, or not
+ *	protect MPX bounds tables with protection keys?
+ *	make sure VMA splitting/merging is working correctly
+ *	OOMs can destroy mm->mmap (see exit_mmap()), so make sure it is immune to pkeys
+ *	look for pkey "leaks" where it is still set on a VMA but "freed" back to the kernel
+ *	do a plain mprotect() to a mprotect_pkey() area and make sure the pkey sticks
+ *
+ * Compile like this:
+ * 	gcc      -o protection_keys    -O2 -g -std=gnu99 -pthread -Wall protection_keys.c -lrt -ldl -lm
+ *	gcc -m32 -o protection_keys_32 -O2 -g -std=gnu99 -pthread -Wall protection_keys.c -lrt -ldl -lm
+ */
+#define _GNU_SOURCE
+#include <errno.h>
+#include <linux/futex.h>
+#include <sys/time.h>
+#include <sys/syscall.h>
+#include <string.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <signal.h>
+#include <assert.h>
+#include <stdlib.h>
+#include <ucontext.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/ptrace.h>
+
+#include "pkey-helpers.h"
+
+int iteration_nr = 1;
+int test_nr;
+
+unsigned int shadow_pkru;
+
+#define HPAGE_SIZE	(1UL<<21)
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
+#define ALIGN(x, align_to)    (((x) + ((align_to)-1)) & ~((align_to)-1))
+#define ALIGN_PTR(p, ptr_align_to)    ((typeof(p))ALIGN((unsigned long)(p), ptr_align_to))
+#define __stringify_1(x...)     #x
+#define __stringify(x...)       __stringify_1(x)
+
+#define PTR_ERR_ENOTSUP ((void *)-ENOTSUP)
+
+extern void abort_hooks(void);
+#define pkey_assert(condition) do {		\
+	if (!(condition)) {			\
+		printf("assert() at %s::%d test_nr: %d iteration: %d\n", \
+				__FILE__, __LINE__,	\
+				test_nr, iteration_nr);	\
+		perror("errno at assert");	\
+		abort_hooks();			\
+		assert(condition);		\
+	}					\
+} while (0)
+#define raw_assert(cond) assert(cond)
+
+void cat_into_file(char *str, char *file)
+{
+	int fd = open(file, O_RDWR);
+	int ret;
+
+	dprintf2("%s(): writing '%s' to '%s'\n", __func__, str, file);
+	/*
+	 * these need to be raw because they are called under
+	 * pkey_assert()
+	 */
+	raw_assert(fd >= 0);
+	ret = write(fd, str, strlen(str));
+	if (ret != strlen(str)) {
+		perror("write to file failed");
+		fprintf(stderr, "filename: '%s' str: '%s'\n", file, str);
+		raw_assert(0);
+	}
+	close(fd);
+}
+
+#if CONTROL_TRACING > 0
+static int warned_tracing = 0;
+int tracing_root_ok(void)
+{
+	if (geteuid() != 0) {
+		if (!warned_tracing)
+			fprintf(stderr, "WARNING: not run as root, can not do tracing control\n");
+		warned_tracing = 1;
+		return 0;
+	}
+	return 1;
+}
+#endif
+
+void tracing_on(void)
+{
+#if CONTROL_TRACING > 0
+	char pidstr[32];
+
+	if (!tracing_root_ok())
+		return;
+
+	sprintf(pidstr, "%d", getpid());
+	//cat_into_file("20000", "/sys/kernel/debug/tracing/buffer_size_kb");
+	cat_into_file("0", "/sys/kernel/debug/tracing/tracing_on");
+	cat_into_file("\n", "/sys/kernel/debug/tracing/trace");
+	if (1) {
+		cat_into_file("function_graph", "/sys/kernel/debug/tracing/current_tracer");
+		cat_into_file("1", "/sys/kernel/debug/tracing/options/funcgraph-proc");
+	} else {
+		cat_into_file("nop", "/sys/kernel/debug/tracing/current_tracer");
+	}
+	cat_into_file(pidstr, "/sys/kernel/debug/tracing/set_ftrace_pid");
+	cat_into_file("1", "/sys/kernel/debug/tracing/tracing_on");
+	dprintf1("enabled tracing\n");
+#endif
+}
+
+void tracing_off(void)
+{
+#if CONTROL_TRACING > 0
+	if (!tracing_root_ok())
+		return;
+	cat_into_file("0", "/sys/kernel/debug/tracing/tracing_on");
+#endif
+}
+
+void abort_hooks(void)
+{
+	fprintf(stderr, "running %s()...\n", __func__);
+	tracing_off();
+#ifdef SLEEP_ON_ABORT
+	sleep(SLEEP_ON_ABORT);
+#endif
+}
+
+void lots_o_noops(void)
+{
+	dprintf1("running %s()\n", __func__);
+	asm(".rept 65536 ; nopl 0x7eeeeeee(%eax) ; .endr");
+	dprintf1("%s() done\n", __func__);
+}
+
+// I'm addicted to the kernel types
+#define  u8 uint8_t
+#define u16 uint16_t
+#define u32 uint32_t
+#define u64 uint64_t
+
+#ifdef __i386__
+#define SYS_mprotect_key 380
+#define SYS_pkey_alloc	 381
+#define SYS_pkey_free	 382
+#define SYS_pkey_get	 383
+#define SYS_pkey_set	 384
+#define REG_IP_IDX REG_EIP
+#define si_pkey_offset 0x18
+#else
+#define SYS_mprotect_key 329
+#define SYS_pkey_alloc	 330
+#define SYS_pkey_free	 331
+#define SYS_pkey_get	 332
+#define SYS_pkey_set	 333
+#define REG_IP_IDX REG_RIP
+#define si_pkey_offset 0x20
+#endif
+
+void dump_mem(void *dumpme, int len_bytes)
+{
+	char *c = (void *)dumpme;
+	int i;
+	for (i = 0; i < len_bytes; i += sizeof(u64)) {
+		u64 *ptr = (u64 *)(c + i);
+		dprintf1("dump[%03d][@%p]: %016jx\n", i, ptr, (uintmax_t)*ptr);
+	}
+}
+
+#define __SI_FAULT      (3 << 16)
+#define SEGV_BNDERR     (__SI_FAULT|3)  /* failed address bound checks */
+#define SEGV_PKUERR     (__SI_FAULT|4)
+
+static char *si_code_str(int si_code)
+{
+	if (si_code == SEGV_MAPERR)
+		return "SEGV_MAPERR";
+	if (si_code == SEGV_ACCERR)
+		return "SEGV_ACCERR";
+	if (si_code == SEGV_BNDERR)
+		return "SEGV_BNDERR";
+	if (si_code == SEGV_PKUERR)
+		return "SEGV_PKUERR";
+	return "UNKNOWN";
+}
+
+int pkru_faults = 0;
+int last_si_pkey = -1;
+void signal_handler(int signum, siginfo_t* si, void* vucontext)
+{
+	ucontext_t* uctxt = vucontext;
+	int trapno;
+	unsigned long ip;
+	char *fpregs;
+	u32 *pkru_ptr;
+	u64 si_pkey;
+	u32* si_pkey_ptr;
+	int pkru_offset;
+
+	dprintf1(">>>>===============SIGSEGV============================\n");
+	dprintf1("%s()::%d, pkru: 0x%x shadow: %x\n", __func__, __LINE__, __rdpkru(), shadow_pkru);
+
+	trapno = uctxt->uc_mcontext.gregs[REG_TRAPNO];
+	ip = uctxt->uc_mcontext.gregs[REG_IP_IDX];
+	fpregset_t fpregset = uctxt->uc_mcontext.fpregs;
+	fpregs = (void *)fpregset;
+
+	dprintf2("%s() trapno: %d ip: 0x%lx info->si_code: %s/%d\n", __func__, trapno, ip,
+			si_code_str(si->si_code), si->si_code);
+#ifdef __i386__
+	/*
+	 * 32-bit has some extra padding so that userspace can tell whether
+	 * the XSTATE header is present in addition to the "legacy" FPU
+	 * state.  We just assume that it is here.
+	 */
+	fpregs += 0x70;
+#endif
+	pkru_offset = pkru_xstate_offset();
+	pkru_ptr = (void *)(&fpregs[pkru_offset]);
+
+	dprintf1("siginfo: %p\n", si);
+	dprintf1(" fpregs: %p\n", fpregs);
+	/*
+	 * If we got a PKRU fault, we *HAVE* to have at least one bit set in
+	 * here.
+	 */
+	dprintf1("pkru_xstate_offset: %d\n", pkru_xstate_offset());
+	//dump_mem(pkru_ptr - 128, 256);
+	pkey_assert(*pkru_ptr);
+
+	si_pkey_ptr = (u32 *)(((u8 *)si) + si_pkey_offset);
+	dprintf1("si_pkey_ptr: %p\n", si_pkey_ptr);
+	dump_mem(si_pkey_ptr - 8, 24);
+	si_pkey = *si_pkey_ptr;
+	pkey_assert(si_pkey < NR_PKEYS);
+	last_si_pkey = si_pkey;
+
+	if ((si->si_code == SEGV_MAPERR) ||
+	    (si->si_code == SEGV_ACCERR) ||
+	    (si->si_code == SEGV_BNDERR)) {
+		printf("non-PK si_code, exiting...\n");
+		exit(4);
+	}
+
+	dprintf1("signal pkru from xsave: %08x\n", *pkru_ptr);
+	// need __ version so we do not do shadow_pkru checking
+	dprintf1("signal pkru from  pkru: %08x\n", __rdpkru());
+	dprintf1("si_pkey from siginfo: %jx\n", si_pkey);
+	*(u64 *)pkru_ptr = 0x00000000;
+	dprintf1("WARNING: set PKRU=0 to allow faulting instruction to continue\n");
+	pkru_faults++;
+	dprintf1("<<<<==================================================\n");
+	return;
+
+	/* unreachable diagnostic code, kept around for hand-debugging: */
+#if 0
+	if (trapno == 14) {
+		fprintf(stderr,
+			"ERROR: In signal handler, page fault, trapno = %d, ip = %016lx\n",
+			trapno, ip);
+		fprintf(stderr, "si_addr %p\n", si->si_addr);
+		fprintf(stderr, "REG_ERR: %lx\n", (unsigned long)uctxt->uc_mcontext.gregs[REG_ERR]);
+		exit(1);
+	} else {
+		fprintf(stderr, "unexpected trap %d! at 0x%lx\n", trapno, ip);
+		fprintf(stderr, "si_addr %p\n", si->si_addr);
+		fprintf(stderr, "REG_ERR: %lx\n", (unsigned long)uctxt->uc_mcontext.gregs[REG_ERR]);
+		exit(2);
+	}
+#endif
+}
+
+int wait_all_children(void)
+{
+	int status;
+
+	return waitpid(-1, &status, 0);
+}
+
+void sig_chld(int x)
+{
+	dprintf2("[%d] SIGCHLD: %d\n", getpid(), x);
+}
+
+void setup_sigsegv_handler(void)
+{
+	int r, rs;
+	struct sigaction newact;
+	struct sigaction oldact;
+
+	/* #PF is mapped to sigsegv */
+	int signum  = SIGSEGV;
+
+	newact.sa_handler = 0;   /* void(*)(int)*/
+	newact.sa_sigaction = signal_handler; /* void (*)(int, siginfo_t*, void*) */
+
+	/*sigset_t - signals to block while in the handler */
+	/* get the old signal mask. */
+	rs = sigprocmask(SIG_SETMASK, 0, &newact.sa_mask);
+	pkey_assert(rs == 0);
+
+	/* call sa_sigaction, not sa_handler*/
+	newact.sa_flags = SA_SIGINFO;
+
+	newact.sa_restorer = 0;  /* void(*)(), obsolete */
+	r = sigaction(signum, &newact, &oldact);
+	pkey_assert(r == 0);
+	r = sigaction(SIGALRM, &newact, &oldact);
+	pkey_assert(r == 0);
+}
+
+void setup_handlers(void)
+{
+	signal(SIGCHLD, &sig_chld);
+	setup_sigsegv_handler();
+}
+
+pid_t fork_lazy_child(void)
+{
+	pid_t forkret;
+
+	forkret = fork();
+	pkey_assert(forkret >= 0);
+	dprintf3("[%d] fork() ret: %d\n", getpid(), forkret);
+
+	if (!forkret) {
+		/* in the child */
+		while (1) {
+			dprintf1("child sleeping...\n");
+			sleep(30);
+		}
+	}
+	return forkret;
+}
+
+void davecmp(void *_a, void *_b, int len)
+{
+	int i;
+	unsigned long *a = _a;
+	unsigned long *b = _b;
+	for (i = 0; i < len / sizeof(*a); i++) {
+		if (a[i] == b[i])
+			continue;
+
+		dprintf3("[%3d]: a: %016lx b: %016lx\n", i, a[i], b[i]);
+	}
+}
+
+void dumpit(char *f)
+{
+	int fd = open(f, O_RDONLY);
+	char buf[100];
+	int nr_read;
+
+	dprintf2("maps fd: %d\n", fd);
+	do {
+		nr_read = read(fd, &buf[0], sizeof(buf));
+		write(1, buf, nr_read);
+	} while (nr_read > 0);
+	close(fd);
+}
+
+#define PKEY_DISABLE_ACCESS    0x1
+#define PKEY_DISABLE_WRITE     0x2
+
+int sys_pkey_get(int pkey, unsigned long flags)
+{
+	int ret = syscall(SYS_pkey_get, pkey, flags);
+	dprintf1("%s(pkey=%d, flags=%lx) = %x / %d\n", __func__, pkey, flags, ret, ret);
+	return ret;
+}
+
+int sys_pkey_set(int pkey, unsigned long rights, unsigned long flags)
+{
+	int ret = syscall(SYS_pkey_set, pkey, rights, flags);
+	dprintf3("%s(pkey=%d, rights=%lx, flags=%lx) = %x pkru now: %x\n", __func__,
+			pkey, rights, flags, ret, __rdpkru());
+	return ret;
+}
+
+void pkey_disable_set(int pkey, int flags)
+{
+	unsigned long syscall_flags = 0;
+	int ret;
+	int pkey_rights = sys_pkey_get(pkey, syscall_flags);
+	u32 orig_pkru = rdpkru();
+
+	pkey_assert(flags & (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE));
+
+	dprintf1("%s(%d) sys_pkey_get(%d): %x\n", __func__, pkey, pkey, pkey_rights);
+	pkey_assert(pkey_rights >= 0);
+
+	pkey_rights |= flags;
+
+	ret = sys_pkey_set(pkey, pkey_rights, syscall_flags);
+	/* pkru and flags have the same format */
+	shadow_pkru |= flags << (pkey * 2);
+	dprintf1("%s(%d) shadow: 0x%x\n", __func__, pkey, shadow_pkru);
+
+	pkey_assert(ret >= 0);
+
+	pkey_rights = sys_pkey_get(pkey, syscall_flags);
+	dprintf1("%s(%d) sys_pkey_get(%d): %x\n", __func__, pkey, pkey, pkey_rights);
+
+	dprintf1("%s(%d) pkru: 0x%x\n", __func__, pkey, rdpkru());
+	if (flags)
+		pkey_assert(rdpkru() > orig_pkru);
+}
+
+void pkey_disable_clear(int pkey, int flags)
+{
+	unsigned long syscall_flags = 0;
+	int ret;
+	int pkey_rights = sys_pkey_get(pkey, syscall_flags);
+	u32 orig_pkru = rdpkru();
+
+	pkey_assert(flags & (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE));
+
+	dprintf1("%s(%d) sys_pkey_get(%d): %x\n", __func__, pkey, pkey, pkey_rights);
+	pkey_assert(pkey_rights >= 0);
+
+	pkey_rights &= ~flags;
+
+	ret = sys_pkey_set(pkey, pkey_rights, syscall_flags);
+	/* pkru and flags have the same format */
+	shadow_pkru &= ~(flags << (pkey * 2));
+	pkey_assert(ret >= 0);
+
+	pkey_rights = sys_pkey_get(pkey, syscall_flags);
+	dprintf1("%s(%d) sys_pkey_get(%d): %x\n", __func__, pkey, pkey, pkey_rights);
+
+	dprintf1("%s(%d) pkru: 0x%x\n", __func__, pkey, rdpkru());
+	if (flags)
+		pkey_assert(rdpkru() < orig_pkru);
+}
+
+void pkey_write_allow(int pkey)
+{
+	pkey_disable_clear(pkey, PKEY_DISABLE_WRITE);
+}
+void pkey_write_deny(int pkey)
+{
+	pkey_disable_set(pkey, PKEY_DISABLE_WRITE);
+}
+void pkey_access_allow(int pkey)
+{
+	pkey_disable_clear(pkey, PKEY_DISABLE_ACCESS);
+}
+void pkey_access_deny(int pkey)
+{
+	pkey_disable_set(pkey, PKEY_DISABLE_ACCESS);
+}
+
+int sys_mprotect_pkey(void *ptr, size_t size, unsigned long orig_prot, unsigned long pkey)
+{
+	int sret;
+
+	dprintf2("%s(0x%p, %zx, prot=%lx, pkey=%lx)\n", __func__,
+			ptr, size, orig_prot, pkey);
+
+	errno = 0;
+	sret = syscall(SYS_mprotect_key, ptr, size, orig_prot, pkey);
+	if (errno) {
+		dprintf2("SYS_mprotect_key sret: %d\n", sret);
+		dprintf2("SYS_mprotect_key prot: 0x%lx\n", orig_prot);
+		dprintf2("SYS_mprotect_key failed, errno: %d\n", errno);
+		if (DEBUG_LEVEL >= 2)
+			perror("SYS_mprotect_pkey");
+	}
+	return sret;
+}
+
+int sys_pkey_alloc(unsigned long flags, unsigned long init_val)
+{
+	int ret = syscall(SYS_pkey_alloc, flags, init_val);
+	dprintf1("%s(flags=%lx, init_val=%lx) syscall ret: %d errno: %d\n", __func__,
+			flags, init_val, ret, errno);
+	return ret;
+}
+
+int alloc_pkey(void)
+{
+	int ret;
+	unsigned long init_val = 0x0;
+
+	dprintf1("alloc_pkey()::%d, pkru: 0x%x shadow: %x\n", __LINE__, __rdpkru(), shadow_pkru);
+	ret = sys_pkey_alloc(0, init_val);
+	/*
+	 * pkey_alloc() sets PKRU, so we need to reflect it in
+	 * shadow_pkru:
+	 */
+	if (ret > 0) {
+		/* clear both the bits: */
+		shadow_pkru &= ~(0x3      << (ret * 2));
+		/*
+		 * move the new state in from init_val
+		 * (remember, we cheated and init_val == pkru format)
+		 */
+		shadow_pkru |=  (init_val << (ret * 2));
+	}
+	dprintf1("alloc_pkey()::%d, pkru: 0x%x shadow: %x\n", __LINE__, __rdpkru(), shadow_pkru);
+	dprintf1("alloc_pkey()::%d errno: %d\n", __LINE__, errno);
+	rdpkru(); /* for shadow checking */
+	return ret;
+}
+
+int sys_pkey_free(unsigned long pkey)
+{
+	int ret = syscall(SYS_pkey_free, pkey);
+	dprintf1("%s(pkey=%ld) syscall ret: %d\n", __func__, pkey, ret);
+	return ret;
+}
+
+/*
+ * I had a bug where pkey bits could be set by mprotect() but
+ * not cleared.  This ensures we get lots of random bit sets
+ * and clears on the vma and pte pkey bits.
+ */
+int alloc_random_pkey(void)
+{
+	int max_nr_pkey_allocs;
+	int ret;
+	int i;
+	int alloced_pkeys[NR_PKEYS];
+	int nr_alloced = 0;
+	int random_index;
+	memset(alloced_pkeys, 0, sizeof(alloced_pkeys));
+
+	/* allocate every possible key and make a note of which ones we got */
+	max_nr_pkey_allocs = NR_PKEYS;
+	max_nr_pkey_allocs = 1;
+	for (i = 0; i < max_nr_pkey_allocs; i++) {
+		int new_pkey = alloc_pkey();
+		if (new_pkey < 0)
+			break;
+		alloced_pkeys[nr_alloced++] = new_pkey;
+	}
+
+	pkey_assert(nr_alloced > 0);
+	/* select a random one out of the allocated ones */
+	random_index = rand() % nr_alloced;
+	ret = alloced_pkeys[random_index];
+	/* now zero it out so we don't free it next */
+	alloced_pkeys[random_index] = 0;
+
+	/* go through the allocated ones that we did not want and free them */
+	for (i = 0; i < nr_alloced; i++) {
+		int free_ret;
+		if (!alloced_pkeys[i])
+			continue;
+		free_ret = sys_pkey_free(alloced_pkeys[i]);
+		pkey_assert(!free_ret);
+	}
+	return ret;
+}
+
+int mprotect_pkey(void *ptr, size_t size, unsigned long orig_prot, unsigned long pkey)
+{
+	int nr_iterations = 0; // random() % 100;
+	int ret;
+
+	while (0) {
+		int rpkey = alloc_random_pkey();
+		ret = sys_mprotect_pkey(ptr, size, orig_prot, pkey);
+		dprintf1("sys_mprotect_pkey(%p, %zx, prot=0x%lx, pkey=%ld) ret: %d\n",
+				ptr, size, orig_prot, pkey, ret);
+		if (nr_iterations-- < 0)
+			break;
+
+		sys_pkey_free(rpkey);
+	}
+	pkey_assert(pkey < NR_PKEYS);
+
+	ret = sys_mprotect_pkey(ptr, size, orig_prot, pkey);
+	dprintf1("mprotect_pkey(%p, %zx, prot=0x%lx, pkey=%ld) ret: %d\n", ptr, size, orig_prot, pkey, ret);
+	pkey_assert(!ret);
+	return ret;
+}
+
+struct pkey_malloc_record {
+	void *ptr;
+	long size;
+};
+struct pkey_malloc_record *pkey_malloc_records;
+long nr_pkey_malloc_records;
+void record_pkey_malloc(void *ptr, long size)
+{
+	long i;
+	struct pkey_malloc_record *rec = NULL;
+
+	for (i = 0; i < nr_pkey_malloc_records; i++) {
+		/* find a free (previously munmap()'d) record */
+		if (!pkey_malloc_records[i].ptr) {
+			rec = &pkey_malloc_records[i];
+			break;
+		}
+	}
+	if (!rec) {
+		/* every record is full */
+		size_t old_nr_records = nr_pkey_malloc_records;
+		size_t new_nr_records = (nr_pkey_malloc_records * 2 + 1);
+		size_t new_size = new_nr_records * sizeof(struct pkey_malloc_record);
+		dprintf2("new_nr_records: %zd\n", new_nr_records);
+		dprintf2("new_size: %zd\n", new_size);
+		pkey_malloc_records = realloc(pkey_malloc_records, new_size);
+		pkey_assert(pkey_malloc_records != NULL);
+		rec = &pkey_malloc_records[nr_pkey_malloc_records];
+		/*
+		 * realloc() does not initialize memory, so zero it from
+		 * the first new record all the way to the end.
+		 */
+		for (i = 0; i < new_nr_records - old_nr_records; i++)
+			memset(rec + i, 0, sizeof(*rec));
+	}
+	dprintf3("filling malloc record[%d/%p]: {%p, %ld}\n",
+		(int)(rec - pkey_malloc_records), rec, ptr, size);
+	rec->ptr = ptr;
+	rec->size = size;
+	nr_pkey_malloc_records++;
+}
+
+void free_pkey_malloc(void *ptr)
+{
+	long i;
+	int ret;
+	dprintf3("%s(%p)\n", __func__, ptr);
+	for (i = 0; i < nr_pkey_malloc_records; i++) {
+		struct pkey_malloc_record *rec = &pkey_malloc_records[i];
+		dprintf4("looking for ptr %p at record[%ld/%p]: {%p, %ld}\n",
+				ptr, i, rec, rec->ptr, rec->size);
+		if ((ptr <  rec->ptr) ||
+		    (ptr >= rec->ptr + rec->size))
+			continue;
+
+		dprintf3("found ptr %p at record[%ld/%p]: {%p, %ld}\n",
+				ptr, i, rec, rec->ptr, rec->size);
+		nr_pkey_malloc_records--;
+		ret = munmap(rec->ptr, rec->size);
+		dprintf3("munmap ret: %d\n", ret);
+		pkey_assert(!ret);
+		dprintf3("clearing rec->ptr, rec: %p\n", rec);
+		rec->ptr = NULL;
+		dprintf3("done clearing rec->ptr, rec: %p\n", rec);
+		return;
+	}
+	pkey_assert(false);
+}
+
+
+void *malloc_pkey_with_mprotect(long size, int prot, u16 pkey)
+{
+	void *ptr;
+	int ret;
+
+	rdpkru();
+	dprintf1("doing %s(size=%ld, prot=0x%x, pkey=%d)\n", __func__, size, prot, pkey);
+	pkey_assert(pkey < NR_PKEYS);
+	ptr = mmap(NULL, size, prot, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+	pkey_assert(ptr != (void *)-1);
+	ret = mprotect_pkey(ptr, size, prot, pkey);
+	pkey_assert(!ret);
+	record_pkey_malloc(ptr, size);
+	rdpkru();
+
+	dprintf1("%s() for pkey %d @ %p\n", __func__, pkey, ptr);
+	return ptr;
+}
+
+void *malloc_pkey_anon_huge(long size, int prot, u16 pkey)
+{
+	int ret;
+	void *ptr;
+
+	dprintf1("doing %s(size=%ld, prot=0x%x, pkey=%d)\n", __func__, size, prot, pkey);
+	// Guarantee we can fit at least one huge page in the resulting
+	// allocation by allocating space for 2:
+	size = ALIGN(size, HPAGE_SIZE * 2);
+	ptr = mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+	pkey_assert(ptr != (void *)-1);
+	record_pkey_malloc(ptr, size);
+	mprotect_pkey(ptr, size, prot, pkey);
+
+	dprintf1("unaligned ptr: %p\n", ptr);
+	ptr = ALIGN_PTR(ptr, HPAGE_SIZE);
+	dprintf1("  aligned ptr: %p\n", ptr);
+	ret = madvise(ptr, HPAGE_SIZE, MADV_HUGEPAGE);
+	dprintf1("MADV_HUGEPAGE ret: %d\n", ret);
+	ret = madvise(ptr, HPAGE_SIZE, MADV_WILLNEED);
+	dprintf1("MADV_WILLNEED ret: %d\n", ret);
+	memset(ptr, 0, HPAGE_SIZE);
+
+	dprintf1("mmap()'d thp for pkey %d @ %p\n", pkey, ptr);
+	return ptr;
+}
+
+int hugetlb_setup_ok = 0;
+#define GET_NR_HUGE_PAGES 10
+void setup_hugetlbfs(void)
+{
+	int err;
+	int fd;
+	int validated_nr_pages;
+	int i;
+	char buf[] = "123";
+
+	if (geteuid() != 0) {
+		fprintf(stderr, "WARNING: not run as root, can not do hugetlb test\n");
+		return;
+	}
+
+	cat_into_file(__stringify(GET_NR_HUGE_PAGES), "/proc/sys/vm/nr_hugepages");
+
+	/*
+	 * Now go make sure that we got the pages and that they
+	 * are 2M pages.  Someone might have made 1G the default.
+	 */
+	fd = open("/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages", O_RDONLY);
+	if (fd < 0) {
+		perror("opening sysfs 2M hugetlb config");
+		return;
+	}
+
+	// -1 to guarantee leaving the trailing \0
+	err = read(fd, buf, sizeof(buf)-1);
+	close(fd);
+	if (err <= 0) {
+		perror("reading sysfs 2M hugetlb config");
+		return;
+	}
+
+	if (atoi(buf) != GET_NR_HUGE_PAGES) {
+		fprintf(stderr, "could not confirm 2M pages, got: '%s' expected %d\n",
+			buf, GET_NR_HUGE_PAGES);
+		return;
+	}
+
+	hugetlb_setup_ok = 1;
+}
+
+void *malloc_pkey_hugetlb(long size, int prot, u16 pkey)
+{
+	void *ptr;
+	int flags = MAP_ANONYMOUS|MAP_PRIVATE|MAP_HUGETLB;
+
+	if (!hugetlb_setup_ok)
+		return PTR_ERR_ENOTSUP;
+
+	dprintf1("doing %s(%ld, %x, %x)\n", __func__, size, prot, pkey);
+	size = ALIGN(size, HPAGE_SIZE * 2);
+	pkey_assert(pkey < NR_PKEYS);
+	ptr = mmap(NULL, size, PROT_NONE, flags, -1, 0);
+	pkey_assert(ptr != (void *)-1);
+	mprotect_pkey(ptr, size, prot, pkey);
+
+	record_pkey_malloc(ptr, size);
+
+	dprintf1("mmap()'d hugetlbfs for pkey %d @ %p\n", pkey, ptr);
+	return ptr;
+}
+
+void *malloc_pkey_mmap_dax(long size, int prot, u16 pkey)
+{
+	void *ptr;
+	int fd;
+
+	dprintf1("doing %s(size=%ld, prot=0x%x, pkey=%d)\n", __func__, size, prot, pkey);
+	pkey_assert(pkey < NR_PKEYS);
+	fd = open("/dax/foo", O_RDWR);
+	pkey_assert(fd >= 0);
+
+	ptr = mmap(0, size, prot, MAP_SHARED, fd, 0);
+	pkey_assert(ptr != (void *)-1);
+
+	mprotect_pkey(ptr, size, prot, pkey);
+
+	record_pkey_malloc(ptr, size);
+
+	dprintf1("mmap()'d for pkey %d @ %p\n", pkey, ptr);
+	close(fd);
+	return ptr;
+}
+
+void *(*pkey_malloc[])(long size, int prot, u16 pkey) = {
+	malloc_pkey_with_mprotect,
+	malloc_pkey_anon_huge,
+	malloc_pkey_hugetlb,
+// can not do direct with the mprotect_pkey() API
+//	malloc_pkey_mmap_direct,
+//	malloc_pkey_mmap_dax,
+};
+
+void *malloc_pkey(long size, int prot, u16 pkey)
+{
+	void *ret;
+	static int malloc_type = 0;
+	int nr_malloc_types = ARRAY_SIZE(pkey_malloc);
+
+	pkey_assert(pkey < NR_PKEYS);
+
+	while (1) {
+		pkey_assert(malloc_type < nr_malloc_types);
+
+		ret = pkey_malloc[malloc_type](size, prot, pkey);
+		pkey_assert(ret != (void *)-1);
+
+		malloc_type++;
+		if (malloc_type >= nr_malloc_types)
+			malloc_type = (random()%nr_malloc_types);
+
+		/* try again if the malloc_type we tried is unsupported */
+		if (ret == PTR_ERR_ENOTSUP)
+			continue;
+
+		break;
+	}
+
+	dprintf3("%s(%ld, prot=%x, pkey=%x) returning: %p\n", __func__, size, prot, pkey, ret);
+	return ret;
+}
+
+int last_pkru_faults = 0;
+void expected_pk_fault(int pkey)
+{
+	dprintf2("%s(): last_pkru_faults: %d pkru_faults: %d\n",
+			__func__, last_pkru_faults, pkru_faults);
+	dprintf2("%s(%d): last_si_pkey: %d\n", __func__, pkey, last_si_pkey);
+	pkey_assert(last_pkru_faults + 1 == pkru_faults);
+	pkey_assert(last_si_pkey == pkey);
+	/*
+	 * The signal handler should have cleared out PKRU to let the
+	 * test program continue.  We now have to restore it.
+	 */
+	pkey_assert(__rdpkru() == 0);
+	__wrpkru(shadow_pkru);
+	dprintf1("%s() set PKRU=%x to restore state after signal nuked it\n",
+			__func__, shadow_pkru);
+	last_pkru_faults = pkru_faults;
+	last_si_pkey = -1;
+}
+
+void do_not_expect_pk_fault(void)
+{
+	pkey_assert(last_pkru_faults == pkru_faults);
+}
+
+int test_fds[10] = { -1 };
+int nr_test_fds;
+void __save_test_fd(int fd)
+{
+	pkey_assert(fd >= 0);
+	pkey_assert(nr_test_fds < ARRAY_SIZE(test_fds));
+	test_fds[nr_test_fds] = fd;
+	nr_test_fds++;
+}
+
+int get_test_read_fd(void)
+{
+	int test_fd = open("/etc/passwd", O_RDONLY);
+	__save_test_fd(test_fd);
+	return test_fd;
+}
+
+void close_test_fds(void)
+{
+	int i;
+
+	for (i = 0; i < nr_test_fds; i++) {
+		if (test_fds[i] < 0)
+			continue;
+		close(test_fds[i]);
+		test_fds[i] = -1;
+	}
+	nr_test_fds = 0;
+}
+
+#define barrier() __asm__ __volatile__("": : :"memory")
+__attribute__((noinline)) int read_ptr(int *ptr)
+{
+	/*
+	 * Keep GCC from optimizing this away somehow
+	 */
+	barrier();
+	return *ptr;
+}
+
+void test_read_of_write_disabled_region(int *ptr, u16 pkey)
+{
+	int ptr_contents;
+	dprintf1("disabling write access to PKEY[1], doing read\n");
+	pkey_write_deny(pkey);
+	ptr_contents = read_ptr(ptr);
+	dprintf1("*ptr: %d\n", ptr_contents);
+	dprintf1("\n");
+}
+void test_read_of_access_disabled_region(int *ptr, u16 pkey)
+{
+	int ptr_contents;
+	rdpkru();
+	dprintf1("disabling access to PKEY[%02d], doing read @ %p\n", pkey, ptr);
+	pkey_access_deny(pkey);
+	ptr_contents = read_ptr(ptr);
+	dprintf1("*ptr: %d\n", ptr_contents);
+	expected_pk_fault(pkey);
+}
+void test_write_of_write_disabled_region(int *ptr, u16 pkey)
+{
+	dprintf1("disabling write access to PKEY[%02d], doing write\n", pkey);
+	pkey_write_deny(pkey);
+	*ptr = __LINE__;
+	expected_pk_fault(pkey);
+}
+void test_write_of_access_disabled_region(int *ptr, u16 pkey)
+{
+	dprintf1("disabling access to PKEY[%02d], doing write\n", pkey);
+	pkey_access_deny(pkey);
+	*ptr = __LINE__;
+	expected_pk_fault(pkey);
+}
+void test_kernel_write_of_access_disabled_region(int *ptr, u16 pkey)
+{
+	int ret;
+	int test_fd = get_test_read_fd();
+
+	dprintf1("disabling access to PKEY[%02d], having kernel read() to buffer\n", pkey);
+	pkey_access_deny(pkey);
+	ret = read(test_fd, ptr, 1);
+	dprintf1("read ret: %d\n", ret);
+	pkey_assert(ret);
+}
+void test_kernel_write_of_write_disabled_region(int *ptr, u16 pkey)
+{
+	int ret;
+	int test_fd = get_test_read_fd();
+
+	pkey_write_deny(pkey);
+	ret = read(test_fd, ptr, 100);
+	dprintf1("read ret: %d\n", ret);
+	if (ret < 0 && (DEBUG_LEVEL > 0))
+		perror("read");
+	pkey_assert(ret);
+}
+
+void test_kernel_gup_of_access_disabled_region(int *ptr, u16 pkey)
+{
+	int pipe_ret, vmsplice_ret;
+	struct iovec iov;
+	int pipe_fds[2];
+
+	pipe_ret = pipe(pipe_fds);
+
+	pkey_assert(pipe_ret == 0);
+	dprintf1("disabling access to PKEY[%02d], having kernel vmsplice from buffer\n", pkey);
+	pkey_access_deny(pkey);
+	iov.iov_base = ptr;
+	iov.iov_len = PAGE_SIZE;
+	vmsplice_ret = vmsplice(pipe_fds[1], &iov, 1, SPLICE_F_GIFT);
+	dprintf1("vmsplice() ret: %d\n", vmsplice_ret);
+	pkey_assert(vmsplice_ret == -1);
+
+	close(pipe_fds[0]);
+	close(pipe_fds[1]);
+}
+
+void test_kernel_gup_write_to_write_disabled_region(int *ptr, u16 pkey)
+{
+	int ignored = 0xdada;
+	int futex_ret;
+	int some_int = __LINE__;
+
+	dprintf1("disabling write to PKEY[%02d], doing futex gunk in buffer\n", pkey);
+	*ptr = some_int;
+	pkey_write_deny(pkey);
+	futex_ret = syscall(SYS_futex, ptr, FUTEX_WAIT, some_int-1, NULL, &ignored, ignored);
+	if (DEBUG_LEVEL > 0)
+		perror("futex");
+	dprintf1("futex() ret: %d\n", futex_ret);
+	//pkey_assert(vmsplice_ret == -1);
+}
+
+/* Assumes that all pkeys other than 'pkey' are unallocated */
+void test_pkey_syscalls_on_non_allocated_pkey(int *ptr, u16 pkey)
+{
+	int err;
+	int i;
+
+	/* Note: 0 is the default pkey, so don't mess with it */
+	for (i = 1; i < NR_PKEYS; i++) {
+		if (pkey == i)
+			continue;
+
+		dprintf1("trying get/set/free to non-allocated pkey: %2d\n", i);
+		err = sys_pkey_free(i);
+		pkey_assert(err);
+
+		err = sys_pkey_get(i, 0);
+		pkey_assert(err < 0);
+
+		err = sys_pkey_free(i);
+		pkey_assert(err);
+
+		err = sys_mprotect_pkey(ptr, PAGE_SIZE, PROT_READ, i);
+		pkey_assert(err);
+	}
+}
+
+/* Assumes that all pkeys other than 'pkey' are unallocated */
+void test_pkey_syscalls_bad_args(int *ptr, u16 pkey)
+{
+	int err;
+	int bad_flag = (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE) + 1;
+	int bad_pkey = NR_PKEYS+99;
+
+	err = sys_pkey_get(bad_pkey, bad_flag);
+	pkey_assert(err < 0);
+
+	/* pass a known-invalid pkey in: */
+	err = sys_mprotect_pkey(ptr, PAGE_SIZE, PROT_READ, bad_pkey);
+	pkey_assert(err);
+}
+
+/* Assumes that all pkeys other than 'pkey' are unallocated */
+void test_pkey_alloc_exhaust(int *ptr, u16 pkey)
+{
+	unsigned long flags;
+	unsigned long init_val;
+	int err;
+	int allocated_pkeys[NR_PKEYS] = {0};
+	int nr_allocated_pkeys = 0;
+	int i;
+
+	for (i = 0; i < NR_PKEYS*2; i++) {
+		int new_pkey;
+		dprintf1("%s() alloc loop: %d\n", __func__, i);
+		flags = 0;
+		init_val = 0;
+		pkey_assert(!flags);
+		pkey_assert(!init_val);
+		new_pkey = sys_pkey_alloc(flags, init_val);
+		dprintf2("%s() errno: %d ENOSPC: %d\n", __func__, errno, ENOSPC);
+		if ((new_pkey == -1) && (errno == ENOSPC)) {
+			dprintf2("%s() failed to allocate pkey after %d tries with ENOSPC\n",
+				__func__, nr_allocated_pkeys);
+			break;
+		}
+		pkey_assert(nr_allocated_pkeys < NR_PKEYS);
+		allocated_pkeys[nr_allocated_pkeys++] = new_pkey;
+	}
+
+	dprintf1("%s()::%d\n", __func__, __LINE__);
+
+	/*
+	 * ensure it did not reach the end of the loop without
+	 * failure:
+	 */
+	pkey_assert(i < NR_PKEYS*2);
+
+	/*
+	 * There are 16 pkeys supported in hardware.  One is taken
+	 * up for the default (0) and another can be taken up by
+	 * an execute-only mapping.  Ensure that we can allocate
+	 * at least 14 (16-2).
+	 */
+	pkey_assert(i >= NR_PKEYS-2);
+
+	for (i = 0; i < nr_allocated_pkeys; i++) {
+		err = sys_pkey_free(allocated_pkeys[i]);
+		pkey_assert(!err);
+	}
+}
+
+void test_ptrace_of_child(int *ptr, u16 pkey)
+{
+	__attribute__((__unused__)) int peek_result;
+	pid_t child_pid;
+	void *ignored = 0;
+	long ret;
+	int status;
+	/*
+	 * This is the "control" for our little experiment.  Make sure
+	 * we can always access it when ptracing.
+	 */
+	int *plain_ptr_unaligned = malloc(HPAGE_SIZE);
+	int *plain_ptr = ALIGN_PTR(plain_ptr_unaligned, PAGE_SIZE);
+
+	/*
+	 * Fork a child which is an exact copy of this process, of course.
+	 * That means we can do all of our tests via ptrace() and then plain
+	 * memory access and ensure they work differently.
+	 */
+	child_pid = fork_lazy_child();
+	dprintf1("[%d] child pid: %d\n", getpid(), child_pid);
+
+	ret = ptrace(PTRACE_ATTACH, child_pid, ignored, ignored);
+	if (ret)
+		perror("attach");
+	dprintf1("[%d] attach ret: %ld %d\n", getpid(), ret, __LINE__);
+	pkey_assert(ret != -1);
+	ret = waitpid(child_pid, &status, WUNTRACED);
+	if ((ret != child_pid) || !WIFSTOPPED(status)) {
+		fprintf(stderr, "weird waitpid result %ld stat %x\n", ret, status);
+		pkey_assert(0);
+	}
+	dprintf2("waitpid ret: %ld\n", ret);
+	dprintf2("waitpid status: %d\n", status);
+
+	pkey_access_deny(pkey);
+	pkey_write_deny(pkey);
+
+	/* Write access, untested for now:
+	ret = ptrace(PTRACE_POKEDATA, child_pid, peek_at, data);
+	pkey_assert(ret != -1);
+	dprintf1("poke at %p: %ld\n", peek_at, ret);
+	*/
+
+	/*
+	 * Try to access the pkey-protected "ptr" via ptrace:
+	 */
+	ret = ptrace(PTRACE_PEEKDATA, child_pid, ptr, ignored);
+	/* expect it to work, without an error: */
+	pkey_assert(ret != -1);
+	/* Now access from the current task, and expect an exception: */
+	peek_result = read_ptr(ptr);
+	expected_pk_fault(pkey);
+
+	/*
+	 * Try to access the NON-pkey-protected "ptr" via ptrace:
+	 */
+	ret = ptrace(PTRACE_PEEKDATA, child_pid, plain_ptr, ignored);
+	/* expect it to work, without an error: */
+	pkey_assert(ret != -1);
+	/* Now access from the current task, and expect NO exception: */
+	peek_result = read_ptr(plain_ptr);
+	do_not_expect_pk_fault();
+
+	ret = ptrace(PTRACE_DETACH, child_pid, ignored, 0);
+	pkey_assert(ret != -1);
+
+	ret = kill(child_pid, SIGKILL);
+	pkey_assert(ret != -1);
+
+	free(plain_ptr_unaligned);
+}
+
+void test_executing_on_unreadable_memory(int *ptr, u16 pkey)
+{
+	void *p1;
+	int ptr_contents;
+	int ret;
+
+	p1 = ALIGN_PTR(&lots_o_noops, PAGE_SIZE);
+
+	madvise(p1, PAGE_SIZE, MADV_DONTNEED);
+	lots_o_noops();
+	ptr_contents = read_ptr(p1);
+	dprintf2("ptr (%p) contents@%d: %x\n", p1, __LINE__, ptr_contents);
+
+	ret = mprotect_pkey(p1, PAGE_SIZE, PROT_EXEC, (u64)pkey);
+	pkey_assert(!ret);
+	pkey_access_deny(pkey);
+
+	dprintf2("pkru: %x\n", rdpkru());
+
+	/*
+	 * Make sure this is an *instruction* fault
+	 */
+	madvise(p1, PAGE_SIZE, MADV_DONTNEED);
+	lots_o_noops();
+	do_not_expect_pk_fault();
+	ptr_contents = read_ptr(p1);
+	dprintf2("ptr (%p) contents@%d: %x\n", p1, __LINE__, ptr_contents);
+	expected_pk_fault(pkey);
+}
+
+void test_mprotect_pkey_on_unsupported_cpu(int *ptr, u16 pkey)
+{
+	int size = PAGE_SIZE;
+	int sret;
+
+	if (cpu_has_pku()) {
+		dprintf1("SKIP: %s: CPU has pkey support\n", __func__);
+		return;
+	}
+
+	sret = syscall(SYS_mprotect_key, ptr, size, PROT_READ, pkey);
+	pkey_assert(sret < 0);
+}
+
+void (*pkey_tests[])(int *ptr, u16 pkey) = {
+	test_read_of_write_disabled_region,
+	test_read_of_access_disabled_region,
+	test_write_of_write_disabled_region,
+	test_write_of_access_disabled_region,
+	test_kernel_write_of_access_disabled_region,
+	test_kernel_write_of_write_disabled_region,
+	test_kernel_gup_of_access_disabled_region,
+	test_kernel_gup_write_to_write_disabled_region,
+	test_executing_on_unreadable_memory,
+	test_ptrace_of_child,
+	test_pkey_syscalls_on_non_allocated_pkey,
+	test_pkey_syscalls_bad_args,
+	test_pkey_alloc_exhaust,
+};
+
+void run_tests_once(void)
+{
+	int *ptr;
+	int prot = PROT_READ|PROT_WRITE;
+
+	for (test_nr = 0; test_nr < ARRAY_SIZE(pkey_tests); test_nr++) {
+		int pkey;
+		int orig_pkru_faults = pkru_faults;
+
+		dprintf1("======================\n");
+		dprintf1("test %d starting...\n", test_nr);
+
+		tracing_on();
+		pkey = alloc_random_pkey();
+		dprintf1("test %d starting with pkey: %d\n", test_nr, pkey);
+		ptr = malloc_pkey(PAGE_SIZE, prot, pkey);
+		//dumpit("/proc/self/maps");
+		pkey_tests[test_nr](ptr, pkey);
+		//sleep(999);
+		dprintf1("freeing test memory: %p\n", ptr);
+		free_pkey_malloc(ptr);
+		sys_pkey_free(pkey);
+
+		dprintf1("pkru_faults: %d\n", pkru_faults);
+		dprintf1("orig_pkru_faults: %d\n", orig_pkru_faults);
+
+		tracing_off();
+		close_test_fds();
+
+		printf("test %2d PASSED (iteration %d)\n", test_nr, iteration_nr);
+		dprintf1("======================\n\n");
+	}
+	iteration_nr++;
+}
+
+int main()
+{
+	int nr_iterations = 22;
+	setup_handlers();
+
+	printf("has pku: %d\n", cpu_has_pku());
+
+	if (!cpu_has_pku()) {
+		int size = PAGE_SIZE;
+		int *ptr;
+
+		printf("running PKEY tests for unsupported CPU/OS\n");
+
+		ptr  = mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+		assert(ptr != (void *)-1);
+		test_mprotect_pkey_on_unsupported_cpu(ptr, 1);
+		exit(0);
+	}
+
+	printf("pkru: %x\n", rdpkru());
+	pkey_assert(!rdpkru());
+	setup_hugetlbfs();
+
+	while (nr_iterations-- > 0)
+		run_tests_once();
+
+	printf("done (all tests OK)\n");
+	return 0;
+}
+
_

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/9] generic syscalls: wire up memory protection keys syscalls
  2016-06-07 20:47 ` [PATCH 7/9] generic syscalls: wire up memory protection keys syscalls Dave Hansen
@ 2016-06-07 21:25   ` Arnd Bergmann
  0 siblings, 0 replies; 41+ messages in thread
From: Arnd Bergmann @ 2016-06-07 21:25 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, x86, linux-api, linux-arch, linux-mm, torvalds,
	akpm, dave.hansen

On Tuesday, June 7, 2016 1:47:25 PM CEST Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> These new syscalls are implemented as generic code, so enable
> them for architectures like arm64 which use the generic syscall
> table.
> 
> According to Arnd:
> 
>         Even if the support is x86 specific for the foreseeable
>         future, it may be good to reserve the number just in
>         case.  The other architecture-specific syscall lists are
>         usually left to the individual arch maintainers, but a
>         lot of the newer architectures share this table.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Arnd Bergmann <arnd@arndb.de>

Acked-by: Arnd Bergmann <arnd@arndb.de>

Thanks!


* Re: [PATCH 0/9] [v2] System Calls for Memory Protection Keys
  2016-06-07 20:47 [PATCH 0/9] [v2] System Calls for Memory Protection Keys Dave Hansen
                   ` (8 preceding siblings ...)
  2016-06-07 20:47 ` [PATCH 9/9] x86, pkeys: add self-tests Dave Hansen
@ 2016-06-08  9:23 ` Michael Kerrisk (man-pages)
  2016-06-08 17:35   ` Dave Hansen
  9 siblings, 1 reply; 41+ messages in thread
From: Michael Kerrisk (man-pages) @ 2016-06-08  9:23 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel
  Cc: mtk.manpages, x86, linux-api, linux-arch, linux-mm, torvalds, akpm, arnd

On 06/07/2016 10:47 PM, Dave Hansen wrote:
> Are there any concerns with merging these into the x86 tree so
> that they go upstream for 4.8?  

I believe we still don't have up-to-date man pages, right?
Best from my POV to send them out in parallel with the 
implementation.

Cheers,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: [PATCH 0/9] [v2] System Calls for Memory Protection Keys
  2016-06-08  9:23 ` [PATCH 0/9] [v2] System Calls for Memory Protection Keys Michael Kerrisk (man-pages)
@ 2016-06-08 17:35   ` Dave Hansen
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Hansen @ 2016-06-08 17:35 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages), linux-kernel
  Cc: x86, linux-api, linux-arch, linux-mm, torvalds, akpm, arnd

On 06/08/2016 02:23 AM, Michael Kerrisk (man-pages) wrote:
> On 06/07/2016 10:47 PM, Dave Hansen wrote:
>> > Are there any concerns with merging these into the x86 tree so
>> > that they go upstream for 4.8?  
> I believe we still don't have up-to-date man pages, right?
> Best from my POV to send them out in parallel with the 
> implementation.

I performed all the fixups and sent two dry-runs of the emails last week
instead of sending them out for real.  I will send them immediately.
Sorry for the delay.


* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-09  8:37           ` Ingo Molnar
  2016-07-11  4:25             ` Andy Lutomirski
  2016-07-18 18:02             ` Dave Hansen
@ 2016-07-18 20:12             ` Dave Hansen
  2 siblings, 0 replies; 41+ messages in thread
From: Dave Hansen @ 2016-07-18 20:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mel Gorman, linux-kernel, x86, linux-api, linux-arch, linux-mm,
	torvalds, akpm, dave.hansen, arnd, hughd, viro, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra

On 07/09/2016 01:37 AM, Ingo Molnar wrote:
>  - There are 16 pkey indices on x86 currently. We already use index 15 for the 
>    true PROT_EXEC implementation. Let's set aside another pkey index for the 
>    kernel's potential future use (index 14), and clear it explicitly in the 
>    FPU context on every context switch if CONFIG_X86_DEBUG_FPU is enabled to make 
>    sure it remains unallocated.

After mulling over this for a week: I really don't think we want to
pre-reserve any keys.

The one bit of consistent feedback I've heard from the folks that are
going to use this is that they want more keys.  The current code does
not reserve any keys (except 0 of course).

Virtually every feature that we've talked about adding on top of this
_requires_ having this allocation mechanism and kernel knowledge of
which keys are in use.  Even having kernel-reserved keys requires
telling userspace about allocation, whether we use pkey_mprotect() for
it or not.

I'd like to resubmit this set with pkey_get/set() removed, but with
pkey_alloc/free() still in place.

Why are folks so sensitive to the number of keys?  There are a few modes
this will get used in when folks want more pkeys than hardware provides.
Both of them are harmed in a meaningful way if we take some of their
keys away.  Here's how they will use pkeys:
1. As an accelerator for existing mprotect()-provided guarantees.
   Let's say you want 100 pieces of data, but you only get 15 pkeys in
   hardware.  The app picks the 15 most-frequently accessed pieces, and
   uses pkeys to control access to them, and uses mprotect() for the
   other 85.  Each pkey you remove means a smaller "working set"
   covered by pkeys, more mprotect()s and lower performance.
2. To provide stronger access guarantees.  Let's say you have 100
   pieces of data.  To access a given piece of data, you hash its key
   into a pkey: hash(92)%NR_PKEYS.  Accessing a random bit of data,
   you have a 1/NR_PKEYS chance of a hash collision and access to data
   you should not have access to.  Fewer keys means more data
   corruption.  Losing 2/16 keys means a 15% greater chance of a
   collision on a given access.


* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-09  8:37           ` Ingo Molnar
  2016-07-11  4:25             ` Andy Lutomirski
@ 2016-07-18 18:02             ` Dave Hansen
  2016-07-18 20:12             ` Dave Hansen
  2 siblings, 0 replies; 41+ messages in thread
From: Dave Hansen @ 2016-07-18 18:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mel Gorman, linux-kernel, x86, linux-api, linux-arch, linux-mm,
	torvalds, akpm, dave.hansen, arnd, hughd, viro, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra

On 07/09/2016 01:37 AM, Ingo Molnar wrote:
>    I.e. this pattern:
> 
>      ret = pkey_mprotect(NULL, PAGE_SIZE, real_prot, pkey);
> 
>    ... would validate the pkey and we'd return -EOPNOTSUPP for pkey that is not 
>    available? This would allow maximum future flexibility as it would not define 
>    kernel allocated pkeys as a 'range'.

Isn't this multiplexing an otherwise straightforward system call?  In
addition to providing pkey assignment to memory, it would also be
used to pass pkey allocation information independently of any use for
memory assignment.

The complexity of the ABI comes from its behavior, not from the raw
number of system calls that are needed to implement it.  IOW, this makes
the ABI *more* complicated.


* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-18  4:43                             ` Andy Lutomirski
@ 2016-07-18  9:56                               ` Ingo Molnar
  0 siblings, 0 replies; 41+ messages in thread
From: Ingo Molnar @ 2016-07-18  9:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Thomas Gleixner, Dave Hansen, Al Viro, X86 ML,
	Hugh Dickins, Andrew Morton, Linux API, Mel Gorman,
	Linus Torvalds, linux-arch, linux-mm, Arnd Bergmann,
	Peter Zijlstra, linux-kernel, H. Peter Anvin


* Andy Lutomirski <luto@amacapital.net> wrote:

> On Thu, Jul 14, 2016 at 1:07 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Andy Lutomirski <luto@amacapital.net> wrote:
> >
> >> On Wed, Jul 13, 2016 at 12:56 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >> >
> >> > * Andy Lutomirski <luto@amacapital.net> wrote:
> >> >
> >> >> > If we push a PKRU value into a thread between the rdpkru() and wrpkru(), we'll
> >> >> > lose the content of that "push".  I'm not sure there's any way to guarantee
> >> >> > this with a user-controlled register.
> >> >>
> >> >> We could try to insist that user code uses some vsyscall helper that tracks
> >> >> which bits are as-yet-unassigned.  That's quite messy, though.
> >> >
> >> > Actually, if we turned the vDSO into something more like a minimal user-space
> >> > library with the ability to run at process startup as well to prepare stuff
> >> > then it's painful to get right only *once*, and there will be tons of other
> >> > areas where a proper per thread data storage on the user-space side would be
> >> > immensely useful!
> >>
> >> Doing this could be tricky: how exactly is the vDSO supposed to find per-thread
> >> data without breaking existing glibc?
> >
> > So I think the way this could be done is by allocating it itself. The vDSO vma
> > itself is 'external' to glibc as well to begin with - this would be a small
> > extension to that concept.
> 
> But how does the vdso code find it?  FS and GS are both spoken for by existing 
> userspace.

Minimally relinking itself on a per mm basis?

Thanks,

	Ingo


* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-14  8:07                           ` Ingo Molnar
@ 2016-07-18  4:43                             ` Andy Lutomirski
  2016-07-18  9:56                               ` Ingo Molnar
  0 siblings, 1 reply; 41+ messages in thread
From: Andy Lutomirski @ 2016-07-18  4:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dave Hansen, Thomas Gleixner, Dave Hansen, Al Viro, X86 ML,
	Hugh Dickins, Andrew Morton, Linux API, Mel Gorman,
	Linus Torvalds, linux-arch, linux-mm, Arnd Bergmann,
	Peter Zijlstra, linux-kernel, H. Peter Anvin

On Thu, Jul 14, 2016 at 1:07 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
>> On Wed, Jul 13, 2016 at 12:56 AM, Ingo Molnar <mingo@kernel.org> wrote:
>> >
>> > * Andy Lutomirski <luto@amacapital.net> wrote:
>> >
>> >> > If we push a PKRU value into a thread between the rdpkru() and wrpkru(), we'll
>> >> > lose the content of that "push".  I'm not sure there's any way to guarantee
>> >> > this with a user-controlled register.
>> >>
>> >> We could try to insist that user code uses some vsyscall helper that tracks
>> >> which bits are as-yet-unassigned.  That's quite messy, though.
>> >
>> > Actually, if we turned the vDSO into something more like a minimal user-space
>> > library with the ability to run at process startup as well to prepare stuff
>> > then it's painful to get right only *once*, and there will be tons of other
>> > areas where a proper per thread data storage on the user-space side would be
>> > immensely useful!
>>
>> Doing this could be tricky: how exactly is the vDSO supposed to find per-thread
>> data without breaking existing glibc?
>
> So I think the way this could be done is by allocating it itself. The vDSO vma
> itself is 'external' to glibc as well to begin with - this would be a small
> extension to that concept.

But how does the vdso code find it?  FS and GS are both spoken for by
existing userspace.

--Andy


* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-13 18:43                         ` Andy Lutomirski
@ 2016-07-14  8:07                           ` Ingo Molnar
  2016-07-18  4:43                             ` Andy Lutomirski
  0 siblings, 1 reply; 41+ messages in thread
From: Ingo Molnar @ 2016-07-14  8:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Thomas Gleixner, Dave Hansen, Al Viro, X86 ML,
	Hugh Dickins, Andrew Morton, Linux API, Mel Gorman,
	Linus Torvalds, linux-arch, linux-mm, Arnd Bergmann,
	Peter Zijlstra, linux-kernel, H. Peter Anvin


* Andy Lutomirski <luto@amacapital.net> wrote:

> On Wed, Jul 13, 2016 at 12:56 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Andy Lutomirski <luto@amacapital.net> wrote:
> >
> >> > If we push a PKRU value into a thread between the rdpkru() and wrpkru(), we'll
> >> > lose the content of that "push".  I'm not sure there's any way to guarantee
> >> > this with a user-controlled register.
> >>
> >> We could try to insist that user code uses some vsyscall helper that tracks
> >> which bits are as-yet-unassigned.  That's quite messy, though.
> >
> > Actually, if we turned the vDSO into something more like a minimal user-space 
> > library with the ability to run at process startup as well to prepare stuff 
> > then it's painful to get right only *once*, and there will be tons of other 
> > areas where a proper per thread data storage on the user-space side would be 
> > immensely useful!
> 
> Doing this could be tricky: how exactly is the vDSO supposed to find per-thread 
> data without breaking existing glibc?

So I think the way this could be done is by allocating it itself. The vDSO vma 
itself is 'external' to glibc as well to begin with - this would be a small 
extension to that concept.

Thanks,

	Ingo


* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-13  7:56                       ` Ingo Molnar
@ 2016-07-13 18:43                         ` Andy Lutomirski
  2016-07-14  8:07                           ` Ingo Molnar
  0 siblings, 1 reply; 41+ messages in thread
From: Andy Lutomirski @ 2016-07-13 18:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dave Hansen, Thomas Gleixner, Dave Hansen, Al Viro, X86 ML,
	Hugh Dickins, Andrew Morton, Linux API, Mel Gorman,
	Linus Torvalds, linux-arch, linux-mm, Arnd Bergmann,
	Peter Zijlstra, linux-kernel, H. Peter Anvin

On Wed, Jul 13, 2016 at 12:56 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
>> > If we push a PKRU value into a thread between the rdpkru() and wrpkru(), we'll
>> > lose the content of that "push".  I'm not sure there's any way to guarantee
>> > this with a user-controlled register.
>>
>> We could try to insist that user code uses some vsyscall helper that tracks
>> which bits are as-yet-unassigned.  That's quite messy, though.
>
> Actually, if we turned the vDSO into something more like a minimal user-space
> library with the ability to run at process startup as well to prepare stuff then
> it's painful to get right only *once*, and there will be tons of other areas where
> a proper per thread data storage on the user-space side would be immensely useful!

Doing this could be tricky: how exactly is the vDSO supposed to find
per-thread data without breaking existing glibc?


* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-12 16:32                     ` Andy Lutomirski
  2016-07-12 17:12                       ` Dave Hansen
@ 2016-07-13  7:56                       ` Ingo Molnar
  2016-07-13 18:43                         ` Andy Lutomirski
  1 sibling, 1 reply; 41+ messages in thread
From: Ingo Molnar @ 2016-07-13  7:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Thomas Gleixner, Dave Hansen, Al Viro, X86 ML,
	Hugh Dickins, Andrew Morton, Linux API, Mel Gorman,
	Linus Torvalds, linux-arch, linux-mm, Arnd Bergmann,
	Peter Zijlstra, linux-kernel, H. Peter Anvin


* Andy Lutomirski <luto@amacapital.net> wrote:

> > If we push a PKRU value into a thread between the rdpkru() and wrpkru(), we'll 
> > lose the content of that "push".  I'm not sure there's any way to guarantee 
> > this with a user-controlled register.
> 
> We could try to insist that user code uses some vsyscall helper that tracks 
> which bits are as-yet-unassigned.  That's quite messy, though.

Actually, if we turned the vDSO into something more like a minimal user-space 
library with the ability to run at process startup as well to prepare stuff then 
it's painful to get right only *once*, and there will be tons of other areas where 
a proper per thread data storage on the user-space side would be immensely useful!

Thanks,

	Ingo


* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-12 17:12                       ` Dave Hansen
@ 2016-07-12 22:55                         ` Andy Lutomirski
  0 siblings, 0 replies; 41+ messages in thread
From: Andy Lutomirski @ 2016-07-12 22:55 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Dave Hansen, Al Viro, X86 ML, Hugh Dickins,
	Andrew Morton, Linux API, Ingo Molnar, Mel Gorman,
	Linus Torvalds, linux-arch, linux-mm, Arnd Bergmann,
	Peter Zijlstra, linux-kernel, H. Peter Anvin

On Tue, Jul 12, 2016 at 10:12 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 07/12/2016 09:32 AM, Andy Lutomirski wrote:
>> I think it's more or less impossible to get sensible behavior passing
>> pkey != 0 data to legacy functions.  If you call:
>>
>> void frob(struct foo *p);
>>
>> If frob in turn passes p to a thread, what PKRU is it supposed to use?
>
> The thread inheritance of PKRU can be nice.  It actually gives things a
> good chance of working if you can control PKRU before clone().  I'd
> describe the semantics like this:
>
>         PKRU values are inherited at the time of a clone() system
>         call.  Threads unaware of protection keys may work on
>         protection-key-protected data as long as PKRU is set up in
>         advance of the clone() and never needs to be changed inside the
>         thread.
>
>         If a thread is created before PKRU is set appropriately, the
>         thread may not be able to act on protection-key-protected data.

Given the apparent need for seccomp's TSYNC, I'm a bit nervous that
this will be restrictive to a problematic degree.

>
> Otherwise, the semantics are simpler, but they basically give threads no
> chance of ever working:
>
>         Threads unaware of protection keys and which can not manage
>         PKRU may not operate on data where a non-zero key has been
>         passed to pkey_mprotect().
>
> It isn't clear to me that one of these is substantially better than the
> other.  It's fairly easy in either case for an app that cares to get the
> behavior of the other.
>
> But, one is clearly easier to implement in the kernel. :)
>
>>>> So how is user code supposed to lock down all of its threads?
>>>>
>>>> seccomp has TSYNC for this, but I don't think that PKRU allows
>>>> something like that.
>>>
>>> I'm not sure this is possible for PKRU.  Think of a simple PKRU
>>> manipulation in userspace:
>>>
>>>         pkru = rdpkru();
>>>         pkru |= PKEY_DENY_ACCESS<<key*2;
>>>         wrpkru(pkru);
>>>
>>> If we push a PKRU value into a thread between the rdpkru() and wrpkru(),
>>> we'll lose the content of that "push".  I'm not sure there's any way to
>>> guarantee this with a user-controlled register.
>>
>> We could try to insist that user code uses some vsyscall helper that
>> tracks which bits are as-yet-unassigned.  That's quite messy, though.
>
> Yeah, doable, but not without some new data going out to userspace, plus
> the vsyscall code itself.
>
>> We could also arbitrarily partition the key space into
>> initially-wide-open, initially-read-only, and initially-no-access and
>> let pkey_alloc say which kind it wants.
>
> The point is still that wrpkru destroyed the 'push' operation.  You
> always end up with a PKRU that (at least temporarily) ignored the 'push'.
>

Not with my partitioning proposal.  We'd never asynchronously modify
another thread's state -- we'd start with a mask that gives us a
good chance of having the initial state always be useful.  To be
completely precise, the initial state would be something like:

0 = all access, 1 (PROT_EXEC) = deny read and write, 2-11: deny read
and write, 12-21: deny write, 22-31: all access

Then pkru_alloc would take a parameter giving the requested initial
state, and it would only work if a key with that initial state is
available.

If we went with the vdso approach, the API could look like:

pkru_state_t prev = pkru_push(mask, value);

...

pkru_pop(prev); // or pkru_pop(mask, prev)?

This doesn't fundamentally require the vdso, except that implementing
bitwise operations on PKRU can't be done atomically with RDPKRU /
WRPKRU.  Grr.  This also falls apart pretty badly when sigreturn
happens, so I don't think I like this approach.

--Andy


* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-12 16:32                     ` Andy Lutomirski
@ 2016-07-12 17:12                       ` Dave Hansen
  2016-07-12 22:55                         ` Andy Lutomirski
  2016-07-13  7:56                       ` Ingo Molnar
  1 sibling, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2016-07-12 17:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Dave Hansen, Al Viro, X86 ML, Hugh Dickins,
	Andrew Morton, Linux API, Ingo Molnar, Mel Gorman,
	Linus Torvalds, linux-arch, linux-mm, Arnd Bergmann,
	Peter Zijlstra, linux-kernel, H. Peter Anvin

On 07/12/2016 09:32 AM, Andy Lutomirski wrote:
> I think it's more or less impossible to get sensible behavior passing
> pkey != 0 data to legacy functions.  If you call:
> 
> void frob(struct foo *p);
> 
> If frob in turn passes p to a thread, what PKRU is it supposed to use?

The thread inheritance of PKRU can be nice.  It actually gives things a
good chance of working if you can control PKRU before clone().  I'd
describe the semantics like this:

	PKRU values are inherited at the time of a clone() system
	call.  Threads unaware of protection keys may work on
	protection-key-protected data as long as PKRU is set up in
	advance of the clone() and never needs to be changed inside the
	thread.

	If a thread is created before PKRU is set appropriately, the
	thread may not be able to act on protection-key-protected data.

Otherwise, the semantics are simpler, but they basically give threads no
chance of ever working:

	Threads unaware of protection keys and which can not manage
	PKRU may not operate on data where a non-zero key has been
	passed to pkey_mprotect().

It isn't clear to me that one of these is substantially better than the
other.  It's fairly easy in either case for an app that cares to get the
behavior of the other.

But, one is clearly easier to implement in the kernel. :)

>>> So how is user code supposed to lock down all of its threads?
>>>
>>> seccomp has TSYNC for this, but I don't think that PKRU allows
>>> something like that.
>>
>> I'm not sure this is possible for PKRU.  Think of a simple PKRU
>> manipulation in userspace:
>>
>>         pkru = rdpkru();
>>         pkru |= PKEY_DENY_ACCESS<<key*2;
>>         wrpkru(pkru);
>>
>> If we push a PKRU value into a thread between the rdpkru() and wrpkru(),
>> we'll lose the content of that "push".  I'm not sure there's any way to
>> guarantee this with a user-controlled register.
> 
> We could try to insist that user code uses some vsyscall helper that
> tracks which bits are as-yet-unassigned.  That's quite messy, though.

Yeah, doable, but not without some new data going out to userspace, plus
the vsyscall code itself.

> We could also arbitrarily partition the key space into
> initially-wide-open, initially-read-only, and initially-no-access and
> let pkey_alloc say which kind it wants.

The point is still that wrpkru destroyed the 'push' operation.  You
always end up with a PKRU that (at least temporarily) ignored the 'push'.


* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-11 15:48                   ` Dave Hansen
@ 2016-07-12 16:32                     ` Andy Lutomirski
  2016-07-12 17:12                       ` Dave Hansen
  2016-07-13  7:56                       ` Ingo Molnar
  0 siblings, 2 replies; 41+ messages in thread
From: Andy Lutomirski @ 2016-07-12 16:32 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Dave Hansen, Al Viro, X86 ML, Hugh Dickins,
	Andrew Morton, Linux API, Ingo Molnar, Mel Gorman,
	Linus Torvalds, linux-arch, linux-mm, Arnd Bergmann,
	Peter Zijlstra, linux-kernel, H. Peter Anvin

On Jul 11, 2016 8:48 AM, "Dave Hansen" <dave.hansen@intel.com> wrote:
>
> On 07/11/2016 07:45 AM, Andy Lutomirski wrote:
> > On Mon, Jul 11, 2016 at 7:34 AM, Dave Hansen <dave@sr71.net> wrote:
> >> Should we instead just recommend to userspace that they lock down access
> >> to keys by default in all threads as a best practice?
> >
> > Is that really better than doing it in-kernel?  My concern is that
> > we'll find library code that creates a thread, and that code could run
> > before the pkey-aware part of the program even starts running.
>
> Yeah, so let's assume we have some pkey-unaware thread.  The upside of a
> scheme where the kernel preemptively (and transparently to the thread)
> locks down PKRU is that the thread can't go corrupting any non-zero-pkey
> structures that came from other threads.
>
> But, the downside is that the thread can not access any non-zero-pkey
> structures without taking some kind of action with PKRU.  That obviously
> won't happen since the thread is pkeys-unaware to begin with.  Would
> that break these libraries unless everything using pkeys knows to only
> share pkey=0 data with those threads?
>

Yes, but at least for the cases I can think of, that's probably a good
thing.  OTOH, I can see cases where you want everyone to be able to
read but only specific code paths to be able to write.

I think it's more or less impossible to get sensible behavior passing
pkey != 0 data to legacy functions.  If you call:

void frob(struct foo *p);

If frob in turn passes p to a thread, what PKRU is it supposed to use?

> > So how is user code supposed to lock down all of its threads?
> >
> > seccomp has TSYNC for this, but I don't think that PKRU allows
> > something like that.
>
> I'm not sure this is possible for PKRU.  Think of a simple PKRU
> manipulation in userspace:
>
>         pkru = rdpkru();
>         pkru |= PKEY_DENY_ACCESS<<key*2;
>         wrpkru(pkru);
>
> If we push a PKRU value into a thread between the rdpkru() and wrpkru(),
> we'll lose the content of that "push".  I'm not sure there's any way to
> guarantee this with a user-controlled register.

We could try to insist that user code uses some vsyscall helper that
tracks which bits are as-yet-unassigned.  That's quite messy, though.

We could also arbitrarily partition the key space into
initially-wide-open, initially-read-only, and initially-no-access and
let pkey_alloc say which kind it wants.

--Andy


* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-12  7:13                   ` Ingo Molnar
@ 2016-07-12 15:39                     ` Dave Hansen
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Hansen @ 2016-07-12 15:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, linux-arch, Thomas Gleixner, linux-mm,
	Mel Gorman, Linus Torvalds, Andrew Morton, Linux API,
	Arnd Bergmann, linux-kernel, Al Viro, Peter Zijlstra,
	Hugh Dickins, H. Peter Anvin, X86 ML, Dave Hansen

On 07/12/2016 12:13 AM, Ingo Molnar wrote:
>> > Remember, PKRU is just a *bitmap*.  The only place keys are stored is in the 
>> > page tables.
> A pkey is an index *and* a protection mask. So by representing it as a bitmask we 
> lose per thread information. This is what I meant by 'incomplete shadowing' - for 
> example the debug code couldn't work: if we cleared a pkey in a task we wouldn't 
> know what to restore it to with the current data structures, right?

Right.  I actually have some code to do the shadowing that I wrote to
explore how to do different PKRU values in signal handlers.  The code
only shadowed the keys that were currently allocated, and used the
(mm-wide) allocation map to figure that out.  It did not have a separate
per-thread concept of which parts of PKRU need to be shadowed.

It essentially populated the shadow value on all pkru_set() calls.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-11 14:28                 ` Dave Hansen
@ 2016-07-12  7:13                   ` Ingo Molnar
  2016-07-12 15:39                     ` Dave Hansen
  0 siblings, 1 reply; 41+ messages in thread
From: Ingo Molnar @ 2016-07-12  7:13 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, linux-arch, Thomas Gleixner, linux-mm,
	Mel Gorman, Linus Torvalds, Andrew Morton, Linux API,
	Arnd Bergmann, linux-kernel, Al Viro, Peter Zijlstra,
	Hugh Dickins, H. Peter Anvin, X86 ML, Dave Hansen


* Dave Hansen <dave@sr71.net> wrote:

> On 07/11/2016 12:35 AM, Ingo Molnar wrote:
> > * Andy Lutomirski <luto@amacapital.net> wrote:
> > mprotect_pkey()'s effects are per MM, but the system calls related to managing the 
> > keys (alloc/free/get/set) are fundamentally per CPU.
> > 
> > Here's an example of how this could matter to applications:
> > 
> >  - 'writer thread' gets a RW- key into index 1 to a specific data area
> >  - a pool of 'reader threads' may get the same pkey index 1 R-- to read the data 
> >    area.
> > 
> > Same page tables, same index, two protections and two purposes.
> > 
> > With a global, per MM allocation of keys we'd have to use two indices: index 1 and 2.
> 
> I'm not sure how this would work.  A piece of data mapped at only one virtual 
> address can have only one key associated with it.

Yeah, indeed, got myself confused there - but the actual protection bits are per 
CPU (per task).

> Remember, PKRU is just a *bitmap*.  The only place keys are stored is in the 
> page tables.

A pkey is an index *and* a protection mask. So by representing it as a bitmask we 
lose per thread information. This is what I meant by 'incomplete shadowing' - for 
example the debug code couldn't work: if we cleared a pkey in a task we wouldn't 
know what to restore it to with the current data structures, right?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-11 14:45                 ` Andy Lutomirski
@ 2016-07-11 15:48                   ` Dave Hansen
  2016-07-12 16:32                     ` Andy Lutomirski
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2016-07-11 15:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, linux-arch, Thomas Gleixner, linux-mm, Mel Gorman,
	Linus Torvalds, Andrew Morton, Linux API, Arnd Bergmann,
	linux-kernel, Al Viro, Peter Zijlstra, Hugh Dickins,
	H. Peter Anvin, X86 ML, Dave Hansen

On 07/11/2016 07:45 AM, Andy Lutomirski wrote:
> On Mon, Jul 11, 2016 at 7:34 AM, Dave Hansen <dave@sr71.net> wrote:
>> Should we instead just recommend to userspace that they lock down access
>> to keys by default in all threads as a best practice?
> 
> Is that really better than doing it in-kernel?  My concern is that
> we'll find library code that creates a thread, and that code could run
> before the pkey-aware part of the program even starts running. 

Yeah, so let's assume we have some pkey-unaware thread.  The upside of a
scheme where the kernel preemptively (and transparently to the thread)
locks down PKRU is that the thread can't go corrupting any non-zero-pkey
structures that came from other threads.

But, the downside is that the thread can not access any non-zero-pkey
structures without taking some kind of action with PKRU.  That obviously
won't happen since the thread is pkeys-unaware to begin with.  Would
that break these libraries unless everything using pkeys knows to only
share pkey=0 data with those threads?

> So how is user code supposed to lock down all of its threads?
> 
> seccomp has TSYNC for this, but I don't think that PKRU allows 
> something like that.

I'm not sure this is possible for PKRU.  Think of a simple PKRU
manipulation in userspace:

	pkru = rdpkru();
	pkru |= PKEY_DENY_ACCESS<<key*2;
	wrpkru(pkru);

If we push a PKRU value into a thread between the rdpkru() and wrpkru(),
we'll lose the content of that "push".  I'm not sure there's any way to
guarantee this with a user-controlled register.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-11  7:35               ` Ingo Molnar
  2016-07-11 14:28                 ` Dave Hansen
@ 2016-07-11 14:50                 ` Andy Lutomirski
  1 sibling, 0 replies; 41+ messages in thread
From: Andy Lutomirski @ 2016-07-11 14:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-arch, Thomas Gleixner, linux-mm, Mel Gorman,
	Linus Torvalds, Andrew Morton, Linux API, Dave Hansen,
	Arnd Bergmann, linux-kernel, Al Viro, Peter Zijlstra,
	Hugh Dickins, H. Peter Anvin, X86 ML, Dave Hansen

On Mon, Jul 11, 2016 at 12:35 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
>> On Jul 9, 2016 1:37 AM, "Ingo Molnar" <mingo@kernel.org> wrote:
>> >
>> >
>> > * Dave Hansen <dave@sr71.net> wrote:
>> >
>> > > On 07/08/2016 12:18 AM, Ingo Molnar wrote:
>> > >
>> > > > So the question is, what is user-space going to do? Do any glibc patches
>> > > > exist? How are the user-space library side APIs going to look?
>> > >
>> > > My goal at the moment is to get folks enabled to the point that they can start
>> > > modifying apps to use pkeys without having to patch their kernels.
>> > >  I don't have confidence that we can design good high-level userspace interfaces
>> > > without seeing some real apps try to use the low-level ones and seeing how they
>> > > struggle.
>> > >
>> > > I had some glibc code to do the pkey alloc/free operations, but those aren't
>> > > necessary if we're doing it in the kernel.  Other than getting the syscall
>> > > wrappers in place, I don't have any immediate plans to do anything in glibc.
>> > >
>> > > Was there something you were expecting to see?
>> >
>> > Yeah, so (as you probably guessed!) I'm starting to have second thoughts about the
>> > complexity of the alloc/free/set/get interface I suggested, and Mel's review
>> > certainly strengthened that feeling.
>> >
>> > I have two worries:
>> >
>> > 1)
>> >
>> > A technical worry I have is that the 'pkey allocation interface' does not seem to
>> > be taking the per thread property of pkeys into account - while that property
>> > would be useful for apps. That is a limitation that seems unjustified.
>> >
>> > The reason for this is that we are storing the key allocation bitmap in struct mm_struct,
>> > in mm->context.pkey_allocation_map - while we should be storing it in task_struct
>> > or thread_info.
>>
>> Huh?  Doesn't this have to be per mm?  Sure, PKRU is per thread, but
>> the page tables are shared.
>
> But the keys are not shared, and they carry meaningful per thread information.
>
> mprotect_pkey()'s effects are per MM, but the system calls related to managing the
> keys (alloc/free/get/set) are fundamentally per CPU.
>
> Here's an example of how this could matter to applications:
>
>  - 'writer thread' gets a RW- key into index 1 to a specific data area
>  - a pool of 'reader threads' may get the same pkey index 1 R-- to read the data
>    area.

Sure, but this means you allocate index 1 once and then use it in both
threads.  If you allocate separately in each thread, nothing
guarantees you'll get the same index both times, and if you don't then
the code doesn't work.

>
>> There are still two issues that I think we need to address, though:
>>
>> 1. Signal delivery shouldn't unconditionally clear PKRU.  That's what
>> the current patches do, and it's unsafe.  I'd rather set PKRU to the
>> maximally locked down state on signal delivery (except for the
>> PROT_EXEC key), although that might cause its own set of problems.
>
> Right now the historic pattern for signal handlers is that they safely and
> transparently stack on top of existing FPU related resources and do a save/restore
> of them. In that sense saving+clearing+restoring the pkeys state would be the
> correct approach that follows that pattern. There are two extra considerations:
>
> - If we think of pkeys as a temporary register that can be used to access/unaccess
>   normally unaccessible memory regions then this makes sense, in fact it's more
>   secure: signal handlers cannot accidentally stomp on an encryption key or on a
>   database area, unless they intentionally gain access to them.
>

That's how I would think of them, but for this to be fully safe,
we'd want to lock them down in signal handlers by default, which is
what I'm suggesting.

--Andy

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-11 14:34               ` Dave Hansen
@ 2016-07-11 14:45                 ` Andy Lutomirski
  2016-07-11 15:48                   ` Dave Hansen
  0 siblings, 1 reply; 41+ messages in thread
From: Andy Lutomirski @ 2016-07-11 14:45 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ingo Molnar, linux-arch, Thomas Gleixner, linux-mm, Mel Gorman,
	Linus Torvalds, Andrew Morton, Linux API, Arnd Bergmann,
	linux-kernel, Al Viro, Peter Zijlstra, Hugh Dickins,
	H. Peter Anvin, X86 ML, Dave Hansen

On Mon, Jul 11, 2016 at 7:34 AM, Dave Hansen <dave@sr71.net> wrote:
> On 07/10/2016 09:25 PM, Andy Lutomirski wrote:
>> 2. When thread A allocates a pkey, how does it lock down thread B?
>>
>> #2 could be addressed by using fully-locked-down as the initial state
>> post-exec() and copying the state on clone().  Dave, are there any
>> cases in practice where one thread would allocate a pkey and want
>> other threads to immediately have access to the memory with that key?
>
> The only one I can think of is a model where pkeys are used more in a
> "denial" mode rather than an "allow" mode.
>
> For instance, perhaps you don't want to modify your app to use pkeys,
> except for a small routine where you handle untrusted user data.  You
> would, in that routine, deny access to a bunch of keys, but otherwise
> allow access to all so you didn't have to change any other parts of the app.
>
> Should we instead just recommend to userspace that they lock down access
> to keys by default in all threads as a best practice?

Is that really better than doing it in-kernel?  My concern is that
we'll find library code that creates a thread, and that code could run
before the pkey-aware part of the program even starts running.  So how
is user code supposed to lock down all of its threads?

seccomp has TSYNC for this, but I don't think that PKRU allows
something like that.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-11  4:25             ` Andy Lutomirski
  2016-07-11  7:35               ` Ingo Molnar
@ 2016-07-11 14:34               ` Dave Hansen
  2016-07-11 14:45                 ` Andy Lutomirski
  1 sibling, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2016-07-11 14:34 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: linux-arch, Thomas Gleixner, linux-mm, Mel Gorman,
	Linus Torvalds, Andrew Morton, Linux API, Arnd Bergmann,
	linux-kernel, Al Viro, Peter Zijlstra, Hugh Dickins,
	H. Peter Anvin, X86 ML, Dave Hansen

On 07/10/2016 09:25 PM, Andy Lutomirski wrote:
> 2. When thread A allocates a pkey, how does it lock down thread B?
> 
> #2 could be addressed by using fully-locked-down as the initial state
> post-exec() and copying the state on clone().  Dave, are there any
> cases in practice where one thread would allocate a pkey and want
> other threads to immediately have access to the memory with that key?

The only one I can think of is a model where pkeys are used more in a
"denial" mode rather than an "allow" mode.

For instance, perhaps you don't want to modify your app to use pkeys,
except for a small routine where you handle untrusted user data.  You
would, in that routine, deny access to a bunch of keys, but otherwise
allow access to all so you didn't have to change any other parts of the app.

Should we instead just recommend to userspace that they lock down access
to keys by default in all threads as a best practice?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-11  7:35               ` Ingo Molnar
@ 2016-07-11 14:28                 ` Dave Hansen
  2016-07-12  7:13                   ` Ingo Molnar
  2016-07-11 14:50                 ` Andy Lutomirski
  1 sibling, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2016-07-11 14:28 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski
  Cc: linux-arch, Thomas Gleixner, linux-mm, Mel Gorman,
	Linus Torvalds, Andrew Morton, Linux API, Arnd Bergmann,
	linux-kernel, Al Viro, Peter Zijlstra, Hugh Dickins,
	H. Peter Anvin, X86 ML, Dave Hansen

On 07/11/2016 12:35 AM, Ingo Molnar wrote:
> * Andy Lutomirski <luto@amacapital.net> wrote:
> mprotect_pkey()'s effects are per MM, but the system calls related to managing the 
> keys (alloc/free/get/set) are fundamentally per CPU.
> 
> Here's an example of how this could matter to applications:
> 
>  - 'writer thread' gets a RW- key into index 1 to a specific data area
>  - a pool of 'reader threads' may get the same pkey index 1 R-- to read the data 
>    area.
> 
> Same page tables, same index, two protections and two purposes.
> 
> With a global, per MM allocation of keys we'd have to use two indices: index 1 and 2.

I'm not sure how this would work.  A piece of data mapped at only one
virtual address can have only one key associated with it.  For a data
area, you would need to indicate between threads which key they needed
in order to access the data.  Both threads need to agree on the virtual
address *and* the key used for access.

Remember, PKRU is just a *bitmap*.  The only place keys are stored is in
the page tables.

Here's how this ends up looking in practice when we have an initializer,
a reader and a writer:

	/* allocator: */
	pkey = pkey_alloc();
	data = mmap(PAGE_SIZE, PROT_NONE, ...);
	pkey_mprotect(data, PROT_WRITE|PROT_READ, pkey);
	metadata[data].pkey = pkey;

	/* reader */
	pkey_set(metadata[data].pkey, PKEY_DENY_WRITE);
	readerfoo = *data;
	pkey_set(metadata[data].pkey, PKEY_DENY_WRITE|PKEY_DENY_ACCESS);

	/* writer */
	pkey_set(metadata[data].pkey, 0); /* 0 == deny nothing */
	*data = bar;
	pkey_set(metadata[data].pkey, PKEY_DENY_WRITE|PKEY_DENY_ACCESS);


I'm also not sure what the indexes are that you're referring to.

> Depending on how scarce the index space turns out to be, making the key indices per 
> thread is probably the right model.

Yeah, I'm totally confused about what you mean by indexes.

>> There are still two issues that I think we need to address, though:
>>
>> 1. Signal delivery shouldn't unconditionally clear PKRU.  That's what
>> the current patches do, and it's unsafe.  I'd rather set PKRU to the
>> maximally locked down state on signal delivery (except for the
>> PROT_EXEC key), although that might cause its own set of problems.
> 
> Right now the historic pattern for signal handlers is that they safely and 
> transparently stack on top of existing FPU related resources and do a save/restore 
> of them. In that sense saving+clearing+restoring the pkeys state would be the 
> correct approach that follows that pattern. There are two extra considerations:
> 
> - If we think of pkeys as a temporary register that can be used to access/unaccess 
>   normally unaccessible memory regions then this makes sense, in fact it's more 
>   secure: signal handlers cannot accidentally stomp on an encryption key or on a
>   database area, unless they intentionally gain access to them.
> 
> - If we think of pkeys as permanent memory mappings that enhance existing MM
>   permissions then it would be correct to let them leak into signal handler state. 
>   The globl true-PROT_EXEC key would fall into this category.
>   [globl -> global]
> 
> So I agree, mostly: the correct approach is to save+clear+restore the first 14 
> pkey indices, and to leave alone the two 'global' indices.

The current scheme is the most permissive, but it has an important
property: it's the most _flexible_.  You can implement almost any scheme
you want in userspace on top of it.  The first userspace instruction of
the handler could easily be a WRPKRU to fully lock down access in whatever
scheme a program wants.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-11  4:25             ` Andy Lutomirski
@ 2016-07-11  7:35               ` Ingo Molnar
  2016-07-11 14:28                 ` Dave Hansen
  2016-07-11 14:50                 ` Andy Lutomirski
  2016-07-11 14:34               ` Dave Hansen
  1 sibling, 2 replies; 41+ messages in thread
From: Ingo Molnar @ 2016-07-11  7:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-arch, Thomas Gleixner, linux-mm, Mel Gorman,
	Linus Torvalds, Andrew Morton, Linux API, Dave Hansen,
	Arnd Bergmann, linux-kernel, Al Viro, Peter Zijlstra,
	Hugh Dickins, H. Peter Anvin, X86 ML, Dave Hansen


* Andy Lutomirski <luto@amacapital.net> wrote:

> On Jul 9, 2016 1:37 AM, "Ingo Molnar" <mingo@kernel.org> wrote:
> >
> >
> > * Dave Hansen <dave@sr71.net> wrote:
> >
> > > On 07/08/2016 12:18 AM, Ingo Molnar wrote:
> > >
> > > > So the question is, what is user-space going to do? Do any glibc patches
> > > > exist? How are the user-space library side APIs going to look?
> > >
> > > My goal at the moment is to get folks enabled to the point that they can start
> > > modifying apps to use pkeys without having to patch their kernels.
> > >  I don't have confidence that we can design good high-level userspace interfaces
> > > without seeing some real apps try to use the low-level ones and seeing how they
> > > struggle.
> > >
> > > I had some glibc code to do the pkey alloc/free operations, but those aren't
> > > necessary if we're doing it in the kernel.  Other than getting the syscall
> > > wrappers in place, I don't have any immediate plans to do anything in glibc.
> > >
> > > Was there something you were expecting to see?
> >
> > Yeah, so (as you probably guessed!) I'm starting to have second thoughts about the
> > complexity of the alloc/free/set/get interface I suggested, and Mel's review
> > certainly strengthened that feeling.
> >
> > I have two worries:
> >
> > 1)
> >
> > A technical worry I have is that the 'pkey allocation interface' does not seem to
> > be taking the per thread property of pkeys into account - while that property
> > would be useful for apps. That is a limitation that seems unjustified.
> >
> > The reason for this is that we are storing the key allocation bitmap in struct mm_struct,
> > in mm->context.pkey_allocation_map - while we should be storing it in task_struct
> > or thread_info.
> 
> Huh?  Doesn't this have to be per mm?  Sure, PKRU is per thread, but
> the page tables are shared.

But the keys are not shared, and they carry meaningful per thread information.

mprotect_pkey()'s effects are per MM, but the system calls related to managing the 
keys (alloc/free/get/set) are fundamentally per CPU.

Here's an example of how this could matter to applications:

 - 'writer thread' gets a RW- key into index 1 to a specific data area
 - a pool of 'reader threads' may get the same pkey index 1 R-- to read the data 
   area.

Same page tables, same index, two protections and two purposes.

With a global, per MM allocation of keys we'd have to use two indices: index 1 and 2.

Depending on how scarce the index space turns out to be, making the key indices per 
thread is probably the right model.

> There are still two issues that I think we need to address, though:
> 
> 1. Signal delivery shouldn't unconditionally clear PKRU.  That's what
> the current patches do, and it's unsafe.  I'd rather set PKRU to the
> maximally locked down state on signal delivery (except for the
> PROT_EXEC key), although that might cause its own set of problems.

Right now the historic pattern for signal handlers is that they safely and 
transparently stack on top of existing FPU related resources and do a save/restore 
of them. In that sense saving+clearing+restoring the pkeys state would be the 
correct approach that follows that pattern. There are two extra considerations:

- If we think of pkeys as a temporary register that can be used to access/unaccess 
  normally unaccessible memory regions then this makes sense, in fact it's more 
  secure: signal handlers cannot accidentally stomp on an encryption key or on a
  database area, unless they intentionally gain access to them.

- If we think of pkeys as permanent memory mappings that enhance existing MM
  permissions then it would be correct to let them leak into signal handler state. 
  The global true-PROT_EXEC key would fall into this category.

So I agree, mostly: the correct approach is to save+clear+restore the first 14 
pkey indices, and to leave alone the two 'global' indices.

> 2. When thread A allocates a pkey, how does it lock down thread B?

So see above, I think the temporary key space should be per thread, so there would 
be no inter-thread interactions: each thread is responsible for its own key 
management (via per-thread management data in the library that implements it).

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-09  8:37           ` Ingo Molnar
@ 2016-07-11  4:25             ` Andy Lutomirski
  2016-07-11  7:35               ` Ingo Molnar
  2016-07-11 14:34               ` Dave Hansen
  2016-07-18 18:02             ` Dave Hansen
  2016-07-18 20:12             ` Dave Hansen
  2 siblings, 2 replies; 41+ messages in thread
From: Andy Lutomirski @ 2016-07-11  4:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-arch, Thomas Gleixner, linux-mm, Mel Gorman,
	Linus Torvalds, Andrew Morton, Linux API, Dave Hansen,
	Arnd Bergmann, linux-kernel, Al Viro, Peter Zijlstra,
	Hugh Dickins, H. Peter Anvin, X86 ML, Dave Hansen

On Jul 9, 2016 1:37 AM, "Ingo Molnar" <mingo@kernel.org> wrote:
>
>
> * Dave Hansen <dave@sr71.net> wrote:
>
> > On 07/08/2016 12:18 AM, Ingo Molnar wrote:
> >
> > > So the question is, what is user-space going to do? Do any glibc patches
> > > exist? How are the user-space library side APIs going to look?
> >
> > My goal at the moment is to get folks enabled to the point that they can start
> > modifying apps to use pkeys without having to patch their kernels.
> >  I don't have confidence that we can design good high-level userspace interfaces
> > without seeing some real apps try to use the low-level ones and seeing how they
> > struggle.
> >
> > I had some glibc code to do the pkey alloc/free operations, but those aren't
> > necessary if we're doing it in the kernel.  Other than getting the syscall
> > wrappers in place, I don't have any immediate plans to do anything in glibc.
> >
> > Was there something you were expecting to see?
>
> Yeah, so (as you probably guessed!) I'm starting to have second thoughts about the
> complexity of the alloc/free/set/get interface I suggested, and Mel's review
> certainly strengthened that feeling.
>
> I have two worries:
>
> 1)
>
> A technical worry I have is that the 'pkey allocation interface' does not seem to
> be taking the per thread property of pkeys into account - while that property
> would be useful for apps. That is a limitation that seems unjustified.
>
> The reason for this is that we are storing the key allocation bitmap in struct mm_struct,
> in mm->context.pkey_allocation_map - while we should be storing it in task_struct
> or thread_info.

Huh?  Doesn't this have to be per mm?  Sure, PKRU is per thread, but
the page tables are shared.

> 2)
>
> My main worry is that it appears at this stage that we are still pretty far away
> from completely shadowing the hardware pkey state in the kernel - and without that
> we cannot really force user-space to use the 'proper' APIs. They can just use the
> raw instructions, condition them on a CPUID and be done with it: everything can be
> organized in user-space.
>

My vote would be to keep the allocation mechanism but get rid of pkey_set.

Also, I think the debug poisoning feature is overcomplicated.  Let's
just forbid mprotect_key with an unallocated key.

There are still two issues that I think we need to address, though:

1. Signal delivery shouldn't unconditionally clear PKRU.  That's what
the current patches do, and it's unsafe.  I'd rather set PKRU to the
maximally locked down state on signal delivery (except for the
PROT_EXEC key), although that might cause its own set of problems.

2. When thread A allocates a pkey, how does it lock down thread B?

#2 could be addressed by using fully-locked-down as the initial state
post-exec() and copying the state on clone().  Dave, are there any
cases in practice where one thread would allocate a pkey and want
other threads to immediately have access to the memory with that key?

I find myself wondering whether we should stop using XSAVE for PKRU
sooner rather than later.  If we do anything like the above, we
completely lose the init optimization, and the code would be a good
deal simpler if we switched PKRU directly in switch_to and could
therefore treat it like a normal register everywhere else.

--Andy

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-08 16:32         ` Dave Hansen
@ 2016-07-09  8:37           ` Ingo Molnar
  2016-07-11  4:25             ` Andy Lutomirski
                               ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Ingo Molnar @ 2016-07-09  8:37 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mel Gorman, linux-kernel, x86, linux-api, linux-arch, linux-mm,
	torvalds, akpm, dave.hansen, arnd, hughd, viro, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra


* Dave Hansen <dave@sr71.net> wrote:

> On 07/08/2016 12:18 AM, Ingo Molnar wrote:
>
> > So the question is, what is user-space going to do? Do any glibc patches 
> > exist? How are the user-space library side APIs going to look?
> 
> My goal at the moment is to get folks enabled to the point that they can start 
> modifying apps to use pkeys without having to patch their kernels.
>  I don't have confidence that we can design good high-level userspace interfaces 
> without seeing some real apps try to use the low-level ones and seeing how they 
> struggle.
> 
> I had some glibc code to do the pkey alloc/free operations, but those aren't 
> necessary if we're doing it in the kernel.  Other than getting the syscall 
> wrappers in place, I don't have any immediate plans to do anything in glibc.
> 
> Was there something you were expecting to see?

Yeah, so (as you probably guessed!) I'm starting to have second thoughts about the 
complexity of the alloc/free/set/get interface I suggested, and Mel's review 
certainly strengthened that feeling.

I have two worries:

1)

A technical worry I have is that the 'pkey allocation interface' does not seem to 
be taking the per thread property of pkeys into account - while that property 
would be useful for apps. That is a limitation that seems unjustified.

The reason for this is that we are storing the key allocation bitmap in struct mm_struct, 
in mm->context.pkey_allocation_map - while we should be storing it in task_struct 
or thread_info.

We could solve this by moving the allocation bitmap to the task struct, but:

2)

My main worry is that it appears at this stage that we are still pretty far away 
from completely shadowing the hardware pkey state in the kernel - and without that 
we cannot really force user-space to use the 'proper' APIs. They can just use the 
raw instructions, condition them on a CPUID and be done with it: everything can be 
organized in user-space.

Furthermore, implementing it in a high performance fashion would be pretty complex 
- at minimum we'd have to register a per thread read-write user-space data area 
where the kernel could store pkeys management data so that vsyscalls can access it 
... None of that facility exists today.

And without vsyscall optimizations user-space might legitimately use its own 
implementation for performance reasons and we'd end up with twice the complexity 
and a largely unused piece of kernel infrastructure ...

So how about the following minimalistic approach instead, to get the ball rolling 
without making ABI decisions we might regret:

 - There are 16 pkey indices on x86 currently. We already use index 15 for the 
   true PROT_EXEC implementation. Let's set aside another pkey index for the 
   kernel's potential future use (index 14), and clear it explicitly in the 
   FPU context on every context switch if CONFIG_X86_DEBUG_FPU is enabled to make 
   sure it remains unallocated.

 - Expose just the new mprotect_pkey() system call to install a pkey index into 
   the page tables - but we let user-space organize its key allocations.

 - Give user-space an idea about limits:

     "ALL THESE WORLDS ARE YOURS—EXCEPT EUROPA. ATTEMPT NO LANDING THERE."

   Oops, wrong one. Let's try this instead:

     Expose the current maximum user-space usable pkeys index in some
     programmatically accessible fashion. Maybe mprotect_pkey() could reject a 
     permanently allocated kernel pkey index via a distinctive error code?

   I.e. this pattern:

     ret = pkey_mprotect(NULL, PAGE_SIZE, real_prot, pkey);

   ... would validate the pkey and we'd return -EOPNOTSUPP for a pkey that is not 
   available? This would allow maximum future flexibility as it would not define 
   kernel allocated pkeys as a 'range'.

 - ... and otherwise leave the remaining 14 pkey indices for user-space to manage.

If in the future user-space pkeys usage grows to such a level that kernel 
arbitration becomes desirable then we can still implement the get/set/alloc/free 
system calls as well: the first use of those system calls would switch on the 
kernel's pkey management facilities and from that point on user-space is supposed 
to use the published system calls only. Applications using pkey instructions 
directly would still work just fine: they'd never use the new system calls.

I.e. we can actually keep a bigger ABI flexibility by introducing the simplest 
possible ABI at this stage. Maybe user-space usage of this hardware feature will 
never grow beyond that simple ABI - in which case we've saved quite a bit of 
ongoing maintenance complexity...

And yes, I realize that we've come a full round since the very first version of 
this patch set, but I think the extra hoops were still worth it, because the 
true-PROT_EXEC feature came out of it which is very useful IMHO. But my more 
complex pkey management syscall ideas don't seem to be all that useful anymore.

So what do you think about this direction? This would simplify the patch set quite 
a bit and would touch very little MM code beyond the mprotect_pkey() bits.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-08  7:18       ` Ingo Molnar
  2016-07-08 16:32         ` Dave Hansen
@ 2016-07-08 19:26         ` Dave Hansen
  1 sibling, 0 replies; 41+ messages in thread
From: Dave Hansen @ 2016-07-08 19:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mel Gorman, linux-kernel, x86, linux-api, linux-arch, linux-mm,
	torvalds, akpm, dave.hansen, arnd, hughd, viro

On 07/08/2016 12:18 AM, Ingo Molnar wrote:
> So my hope was that we'd also grow some debugging features: such as a periodic 
> watchdog timer clearing all non-allocated pkeys of a task and re-setting them to 
> their (kernel-)known values and thus forcing user-space to coordinate key 
> allocation/freeing.

I'm glad you mentioned this.  I've explicitly called out this behavior
in the manpages now, or at least called out the fact that the kernel
will not preserve PKRU contents for unallocated keys.


* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-08  7:18       ` Ingo Molnar
@ 2016-07-08 16:32         ` Dave Hansen
  2016-07-09  8:37           ` Ingo Molnar
  2016-07-08 19:26         ` Dave Hansen
  1 sibling, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2016-07-08 16:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mel Gorman, linux-kernel, x86, linux-api, linux-arch, linux-mm,
	torvalds, akpm, dave.hansen, arnd, hughd, viro

On 07/08/2016 12:18 AM, Ingo Molnar wrote:
> So the question is, what is user-space going to do? Do any glibc
> patches exist? What are the user-space library-side APIs going to look
> like?

My goal at the moment is to get folks enabled to the point that they can
start modifying apps to use pkeys without having to patch their kernels.
 I don't have confidence that we can design good high-level userspace
interfaces without seeing some real apps try to use the low-level ones
and seeing how they struggle.

I had some glibc code to do the pkey alloc/free operations, but those
aren't necessary if we're doing it in the kernel.  Other than getting
the syscall wrappers in place, I don't have any immediate plans to do
anything in glibc.

Was there something you were expecting to see?


* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-07 17:33     ` Dave Hansen
  2016-07-08  7:18       ` Ingo Molnar
@ 2016-07-08 10:22       ` Mel Gorman
  1 sibling, 0 replies; 41+ messages in thread
From: Mel Gorman @ 2016-07-08 10:22 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, x86, linux-api, linux-arch, linux-mm, torvalds,
	akpm, dave.hansen, arnd, hughd, viro

On Thu, Jul 07, 2016 at 10:33:00AM -0700, Dave Hansen wrote:
> On 07/07/2016 07:45 AM, Mel Gorman wrote:
> > On Thu, Jul 07, 2016 at 05:47:28AM -0700, Dave Hansen wrote:
> >> > 
> >> > From: Dave Hansen <dave.hansen@linux.intel.com>
> >> > 
> >> > This establishes two more system calls for protection key management:
> >> > 
> >> > 	unsigned long pkey_get(int pkey);
> >> > 	int pkey_set(int pkey, unsigned long access_rights);
> >> > 
> >> > The return value from pkey_get() and the 'access_rights' passed
> >> > to pkey_set() are the same format: a bitmask containing
> >> > PKEY_DISABLE_WRITE and/or PKEY_DISABLE_ACCESS, or nothing set at all.
> >> > 
> >> > These can replace userspace's direct use of the new rdpkru/wrpkru
> >> > instructions.
> ...
> > This one feels like something that can or should be implemented in
> > glibc.
> 
> I generally agree, except that glibc doesn't have any visibility into
> whether a pkey is currently valid or not.
> 

Well, it could if it tracked the pkey_alloc/pkey_free calls too. I accept
that's not perfect as nothing prevents the syscalls being used directly.

> > Applications that frequently get
> > called will get hammered into the ground with serialisation on mmap_sem
> > not to mention the cost of the syscall entry/exit.
> 
> I think we can do both of them without mmap_sem, as long as we resign
> ourselves to this just being fundamentally racy (which it is already, I
> think).  But, is it worth performance-tuning things that we don't expect
> performance-sensitive apps to be using in the first place?  They'll just
> use the RDPKRU/WRPKRU instructions directly.
> 

I accept the premature-optimisation argument, but I think it'll eventually
bite us. The reason this red-flagged for me is that so many people have
complained about plain system-call overhead when using particular types of
hardware -- DAX springs to mind with the MAP_PMEM_AWARE discussions. Using
mmap_sem means that pkey operations stop parallel faults, mmaps and so on. If
the applications that care are trying to minimise page table operations,
TLB flushes and so on, they might not be that happy if parallel faults
are stalled.

I think whether you serialise pkey_get/pkey_set operations or not, it's
going to be inherently racy, just with different-sized windows. A sequence
counter would be sufficient to prevent partial reads. If userspace
cares about the race, then userspace is going to have to serialise its
threads access to the keys anyway.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-07 17:33     ` Dave Hansen
@ 2016-07-08  7:18       ` Ingo Molnar
  2016-07-08 16:32         ` Dave Hansen
  2016-07-08 19:26         ` Dave Hansen
  2016-07-08 10:22       ` Mel Gorman
  1 sibling, 2 replies; 41+ messages in thread
From: Ingo Molnar @ 2016-07-08  7:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mel Gorman, linux-kernel, x86, linux-api, linux-arch, linux-mm,
	torvalds, akpm, dave.hansen, arnd, hughd, viro


* Dave Hansen <dave@sr71.net> wrote:

> > Applications that frequently get called will get hammered into the ground with 
> > serialisation on mmap_sem not to mention the cost of the syscall entry/exit.
>
> I think we can do both of them without mmap_sem, as long as we resign ourselves 
> to this just being fundamentally racy (which it is already, I think).  But, is 
> it worth performance-tuning things that we don't expect performance-sensitive 
> apps to be using in the first place?  They'll just use the RDPKRU/WRPKRU 
> instructions directly.
>
> Ingo, do you still feel strongly that these syscalls (pkey_set/get()) should be 
> included?  Of the 5, they're definitely the two with the weakest justification.

Firstly, I'd like to thank Mel for the review, having this kind of high level 
interface discussion was exactly what I was hoping for before we merged any ABI 
patches.

So my hope was that we'd also grow some debugging features: such as a periodic 
watchdog timer clearing all non-allocated pkeys of a task and re-setting them to 
their (kernel-)known values and thus forcing user-space to coordinate key 
allocation/freeing.

While allocation/freeing of keys is very likely a slow path in any reasonable 
workload, _setting_ the values of pkeys could easily be a fast path. The whole 
point of pkeys is to allow both thread local and dynamic mapping and unmapping of 
memory ranges, without having to touch any page table attributes in the hot path.

Now another, more fundamental question is that pkeys are per-CPU (per thread) on 
the hardware side - so why do we even care about the mmap_sem in the syscalls in 
the first place? If we want any serialization wouldn't a pair of 
get_cpu()/put_cpu() preemption control calls be enough? Those would also be much 
cheaper.

The idea behind my suggestion to manage all pkey details via a kernel interface 
and 'shadow' the pkeys state in the kernel was that if we don't even offer a 
complete system call interface then user-space is _forced_ into using the raw 
instructions and into implementing a (potentially crappy) uncoordinated API or not 
implementing any API at all but randomly using fixed pkey indices.

My hope was to avoid that, especially since currently _all_ memory mapping details 
on x86 are controlled via system calls. If we don't offer that kind of facility 
then we force user-space into using the raw instructions and will likely end up 
with a poor user-space interface.

So the question is, what is user-space going to do? Do any glibc patches exist? 
What are the user-space library-side APIs going to look like?

Thanks,

	Ingo


* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-07 14:45   ` Mel Gorman
@ 2016-07-07 17:33     ` Dave Hansen
  2016-07-08  7:18       ` Ingo Molnar
  2016-07-08 10:22       ` Mel Gorman
  0 siblings, 2 replies; 41+ messages in thread
From: Dave Hansen @ 2016-07-07 17:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, x86, linux-api, linux-arch, linux-mm, torvalds,
	akpm, dave.hansen, arnd, hughd, viro

On 07/07/2016 07:45 AM, Mel Gorman wrote:
> On Thu, Jul 07, 2016 at 05:47:28AM -0700, Dave Hansen wrote:
>> > 
>> > From: Dave Hansen <dave.hansen@linux.intel.com>
>> > 
>> > This establishes two more system calls for protection key management:
>> > 
>> > 	unsigned long pkey_get(int pkey);
>> > 	int pkey_set(int pkey, unsigned long access_rights);
>> > 
>> > The return value from pkey_get() and the 'access_rights' passed
>> > to pkey_set() are the same format: a bitmask containing
>> > PKEY_DISABLE_WRITE and/or PKEY_DISABLE_ACCESS, or nothing set at all.
>> > 
>> > These can replace userspace's direct use of the new rdpkru/wrpkru
>> > instructions.
...
> This one feels like something that can or should be implemented in
> glibc.

I generally agree, except that glibc doesn't have any visibility into
whether a pkey is currently valid or not.

> There is no real enforcement of the values, yet looking them up or
> setting them takes mmap_sem for write.

There are checks for mm_pkey_is_allocated().  That's the main thing
these syscalls add on top of the raw instructions.

> Applications that frequently get
> called will get hammered into the ground with serialisation on mmap_sem
> not to mention the cost of the syscall entry/exit.

I think we can do both of them without mmap_sem, as long as we resign
ourselves to this just being fundamentally racy (which it is already, I
think).  But, is it worth performance-tuning things that we don't expect
performance-sensitive apps to be using in the first place?  They'll just
use the RDPKRU/WRPKRU instructions directly.

Ingo, do you still feel strongly that these syscalls (pkey_set/get())
should be included?  Of the 5, they're definitely the two with the
weakest justification.


* Re: [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-07 12:47 ` [PATCH 6/9] x86, pkeys: add pkey set/get syscalls Dave Hansen
@ 2016-07-07 14:45   ` Mel Gorman
  2016-07-07 17:33     ` Dave Hansen
  0 siblings, 1 reply; 41+ messages in thread
From: Mel Gorman @ 2016-07-07 14:45 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, x86, linux-api, linux-arch, linux-mm, torvalds,
	akpm, dave.hansen, arnd, hughd, viro

On Thu, Jul 07, 2016 at 05:47:28AM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> This establishes two more system calls for protection key management:
> 
> 	unsigned long pkey_get(int pkey);
> 	int pkey_set(int pkey, unsigned long access_rights);
> 
> The return value from pkey_get() and the 'access_rights' passed
> to pkey_set() are the same format: a bitmask containing
> PKEY_DISABLE_WRITE and/or PKEY_DISABLE_ACCESS, or nothing set at all.
> 
> These can replace userspace's direct use of the new rdpkru/wrpkru
> instructions.
> 
> With current hardware, the kernel can not enforce that it has
> control over a given key.  But, this at least allows the kernel
> to indicate to userspace that userspace does not control a given
> protection key.  This makes it more likely that situations like
> using a pkey after sys_pkey_free() can be detected.
> 
> The kernel does _not_ enforce that this interface must be used for
> changes to PKRU, whether or not a key has been "allocated".
> 
> This syscall interface could also theoretically be replaced with a
> pair of vsyscalls.  The vsyscalls would just call WRPKRU/RDPKRU
> directly in situations where they are drop-in equivalents for
> what the kernel would be doing.
> 

This one feels like something that can or should be implemented in
glibc.

There is no real enforcement of the values, yet looking them up or
setting them takes mmap_sem for write. Applications that frequently get
called will get hammered into the ground with serialisation on mmap_sem
not to mention the cost of the syscall entry/exit.

Right now, I'm seeing a lot of cost and not much benefit with this
specific patch.

-- 
Mel Gorman
SUSE Labs


* [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-07-07 12:47 [PATCH 0/9] [REVIEW-REQUEST] [v4] System Calls for Memory Protection Keys Dave Hansen
@ 2016-07-07 12:47 ` Dave Hansen
  2016-07-07 14:45   ` Mel Gorman
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2016-07-07 12:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-api, linux-arch, linux-mm, torvalds, akpm,
	Dave Hansen, dave.hansen, arnd, mgorman, hughd, viro


From: Dave Hansen <dave.hansen@linux.intel.com>

This establishes two more system calls for protection key management:

	unsigned long pkey_get(int pkey);
	int pkey_set(int pkey, unsigned long access_rights);

The return value from pkey_get() and the 'access_rights' passed
to pkey_set() are the same format: a bitmask containing
PKEY_DISABLE_WRITE and/or PKEY_DISABLE_ACCESS, or nothing set at all.

These can replace userspace's direct use of the new rdpkru/wrpkru
instructions.

With current hardware, the kernel can not enforce that it has
control over a given key.  But, this at least allows the kernel
to indicate to userspace that userspace does not control a given
protection key.  This makes it more likely that situations like
using a pkey after sys_pkey_free() can be detected.

The kernel does _not_ enforce that this interface must be used for
changes to PKRU, whether or not a key has been "allocated".

This syscall interface could also theoretically be replaced with a
pair of vsyscalls.  The vsyscalls would just call WRPKRU/RDPKRU
directly in situations where they are drop-in equivalents for
what the kernel would be doing.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: x86@kernel.org
Cc: torvalds@linux-foundation.org
Cc: akpm@linux-foundation.org
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: mgorman@techsingularity.net
Cc: hughd@google.com
Cc: viro@zeniv.linux.org.uk
---

 b/arch/x86/entry/syscalls/syscall_32.tbl |    2 +
 b/arch/x86/entry/syscalls/syscall_64.tbl |    2 +
 b/arch/x86/include/asm/pkeys.h           |    4 +-
 b/arch/x86/kernel/fpu/xstate.c           |   55 +++++++++++++++++++++++++++++--
 b/include/linux/pkeys.h                  |    8 ++++
 b/mm/mprotect.c                          |   41 +++++++++++++++++++++++
 6 files changed, 109 insertions(+), 3 deletions(-)

diff -puN arch/x86/entry/syscalls/syscall_32.tbl~pkeys-118-syscalls-set-get arch/x86/entry/syscalls/syscall_32.tbl
--- a/arch/x86/entry/syscalls/syscall_32.tbl~pkeys-118-syscalls-set-get	2016-07-07 05:47:02.197865624 -0700
+++ b/arch/x86/entry/syscalls/syscall_32.tbl	2016-07-07 05:47:02.209866169 -0700
@@ -389,3 +389,5 @@
 380	i386	pkey_mprotect		sys_pkey_mprotect
 381	i386	pkey_alloc		sys_pkey_alloc
 382	i386	pkey_free		sys_pkey_free
+383	i386	pkey_get		sys_pkey_get
+384	i386	pkey_set		sys_pkey_set
diff -puN arch/x86/entry/syscalls/syscall_64.tbl~pkeys-118-syscalls-set-get arch/x86/entry/syscalls/syscall_64.tbl
--- a/arch/x86/entry/syscalls/syscall_64.tbl~pkeys-118-syscalls-set-get	2016-07-07 05:47:02.199865715 -0700
+++ b/arch/x86/entry/syscalls/syscall_64.tbl	2016-07-07 05:47:02.211866259 -0700
@@ -338,6 +338,8 @@
 329	common	pkey_mprotect		sys_pkey_mprotect
 330	common	pkey_alloc		sys_pkey_alloc
 331	common	pkey_free		sys_pkey_free
+332	common	pkey_get		sys_pkey_get
+333	common	pkey_set		sys_pkey_set
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff -puN arch/x86/include/asm/pkeys.h~pkeys-118-syscalls-set-get arch/x86/include/asm/pkeys.h
--- a/arch/x86/include/asm/pkeys.h~pkeys-118-syscalls-set-get	2016-07-07 05:47:02.201865805 -0700
+++ b/arch/x86/include/asm/pkeys.h	2016-07-07 05:47:02.211866259 -0700
@@ -56,7 +56,7 @@ static inline bool validate_pkey(int pke
 }
 
 static inline
-bool mm_pkey_is_allocated(struct mm_struct *mm, unsigned long pkey)
+bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
 {
 	if (!validate_pkey(pkey))
 		return true;
@@ -107,4 +107,6 @@ extern int arch_set_user_pkey_access(str
 extern int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 		unsigned long init_val);
 
+extern unsigned long arch_get_user_pkey_access(struct task_struct *tsk,
+		int pkey);
 #endif /*_ASM_X86_PKEYS_H */
diff -puN arch/x86/kernel/fpu/xstate.c~pkeys-118-syscalls-set-get arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~pkeys-118-syscalls-set-get	2016-07-07 05:47:02.203865896 -0700
+++ b/arch/x86/kernel/fpu/xstate.c	2016-07-07 05:47:02.211866259 -0700
@@ -708,7 +708,7 @@ void fpu__resume_cpu(void)
  *
  * Note: does not work for compacted buffers.
  */
-void *__raw_xsave_addr(struct xregs_state *xsave, int xstate_feature_mask)
+static void *__raw_xsave_addr(struct xregs_state *xsave, int xstate_feature_mask)
 {
 	int feature_nr = fls64(xstate_feature_mask) - 1;
 
@@ -882,6 +882,7 @@ out:
 
 #define NR_VALID_PKRU_BITS (CONFIG_NR_PROTECTION_KEYS * 2)
 #define PKRU_VALID_MASK (NR_VALID_PKRU_BITS - 1)
+#define PKRU_INIT_STATE	0
 
 /*
  * This will go out and modify the XSAVE buffer so that PKRU is
@@ -900,6 +901,9 @@ int __arch_set_user_pkey_access(struct t
 	int pkey_shift = (pkey * PKRU_BITS_PER_PKEY);
 	u32 new_pkru_bits = 0;
 
+	/* Only support manipulating current task for now */
+	if (tsk != current)
+		return -EINVAL;
 	/*
 	 * This check implies XSAVE support.  OSPKE only gets
 	 * set if we enable XSAVE and we enable PKU in XCR0.
@@ -925,7 +929,7 @@ int __arch_set_user_pkey_access(struct t
 	 * state.
 	 */
 	if (!old_pkru_state)
-		new_pkru_state.pkru = 0;
+		new_pkru_state.pkru = PKRU_INIT_STATE;
 	else
 		new_pkru_state.pkru = old_pkru_state->pkru;
 
@@ -963,4 +967,51 @@ int arch_set_user_pkey_access(struct tas
 		return -EINVAL;
 	return __arch_set_user_pkey_access(tsk, pkey, init_val);
 }
+
+/*
+ * Figures out what the rights are currently for 'pkey'.
+ * Converts from PKRU's format to the user-visible PKEY_DISABLE_*
+ * format.
+ */
+unsigned long arch_get_user_pkey_access(struct task_struct *tsk, int pkey)
+{
+	struct fpu *fpu = &current->thread.fpu;
+	u32 pkru_reg;
+	int ret = 0;
+
+	/* Only support manipulating current task for now */
+	if (tsk != current)
+		return -1;
+	if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
+		return -1;
+	/*
+	 * The contents of PKRU itself are invalid.  Consult the
+	 * task's XSAVE buffer for PKRU contents.  This is much
+	 * more expensive than reading PKRU directly, but should
+	 * be rare or impossible with eagerfpu mode.
+	 */
+	if (!fpu->fpregs_active) {
+		struct xregs_state *xsave = &fpu->state.xsave;
+		struct pkru_state *pkru_state =
+			get_xsave_addr(xsave, XFEATURE_MASK_PKRU);
+		/*
+		 * PKRU is in its init state and not present in
+		 * the buffer in a saved form.
+		 */
+		if (!pkru_state)
+			return PKRU_INIT_STATE;
+
+		return pkru_state->pkru;
+	}
+	/*
+	 * Consult the user register directly.
+	 */
+	pkru_reg = read_pkru();
+	if (!__pkru_allows_read(pkru_reg, pkey))
+		ret |= PKEY_DISABLE_ACCESS;
+	if (!__pkru_allows_write(pkru_reg, pkey))
+		ret |= PKEY_DISABLE_WRITE;
+
+	return ret;
+}
 #endif /* CONFIG_ARCH_HAS_PKEYS */
diff -puN include/linux/pkeys.h~pkeys-118-syscalls-set-get include/linux/pkeys.h
--- a/include/linux/pkeys.h~pkeys-118-syscalls-set-get	2016-07-07 05:47:02.204865942 -0700
+++ b/include/linux/pkeys.h	2016-07-07 05:47:02.212866305 -0700
@@ -44,6 +44,14 @@ static inline int mm_pkey_free(struct mm
 static inline int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 			unsigned long init_val)
 {
+	return -EINVAL;
+}
+
+static inline
+unsigned long arch_get_user_pkey_access(struct task_struct *tsk, int pkey)
+{
+	if (pkey)
+		return -1;
 	return 0;
 }
 
diff -puN mm/mprotect.c~pkeys-118-syscalls-set-get mm/mprotect.c
--- a/mm/mprotect.c~pkeys-118-syscalls-set-get	2016-07-07 05:47:02.206866032 -0700
+++ b/mm/mprotect.c	2016-07-07 05:47:02.212866305 -0700
@@ -537,3 +537,44 @@ SYSCALL_DEFINE1(pkey_free, int, pkey)
 	 */
 	return ret;
 }
+
+SYSCALL_DEFINE2(pkey_get, int, pkey, unsigned long, flags)
+{
+	unsigned long ret = 0;
+
+	if (flags)
+		return -EINVAL;
+
+	down_write(&current->mm->mmap_sem);
+	if (!mm_pkey_is_allocated(current->mm, pkey))
+		ret = -EBADF;
+	up_write(&current->mm->mmap_sem);
+
+	if (ret)
+		return ret;
+
+	ret = arch_get_user_pkey_access(current, pkey);
+
+	return ret;
+}
+
+SYSCALL_DEFINE3(pkey_set, int, pkey, unsigned long, access_rights,
+		unsigned long, flags)
+{
+	unsigned long ret = 0;
+
+	if (flags)
+		return -EINVAL;
+
+	down_write(&current->mm->mmap_sem);
+	if (!mm_pkey_is_allocated(current->mm, pkey))
+		ret = -EBADF;
+	up_write(&current->mm->mmap_sem);
+
+	if (ret)
+		return ret;
+
+	ret = arch_set_user_pkey_access(current, pkey, access_rights);
+
+	return ret;
+}
_


* [PATCH 6/9] x86, pkeys: add pkey set/get syscalls
  2016-06-09  0:01 [PATCH 0/9] [v3] " Dave Hansen
@ 2016-06-09  0:01 ` Dave Hansen
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Hansen @ 2016-06-09  0:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, linux-api, linux-arch, linux-mm, torvalds, akpm,
	Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

This establishes two more system calls for protection key management:

	unsigned long pkey_get(int pkey);
	int pkey_set(int pkey, unsigned long access_rights);

The return value from pkey_get() and the 'access_rights' passed
to pkey_set() are the same format: a bitmask containing
PKEY_DISABLE_WRITE and/or PKEY_DISABLE_ACCESS, or nothing set at all.

These can replace userspace's direct use of the new rdpkru/wrpkru
instructions.

With current hardware, the kernel can not enforce that it has
control over a given key.  But, this at least allows the kernel
to indicate to userspace that userspace does not control a given
protection key.  This makes it more likely that situations like
using a pkey after sys_pkey_free() can be detected.

The kernel does _not_ enforce that this interface must be used for
changes to PKRU, whether or not a key has been "allocated".

This syscall interface could also theoretically be replaced with a
pair of vsyscalls.  The vsyscalls would just call WRPKRU/RDPKRU
directly in situations where they are drop-in equivalents for
what the kernel would be doing.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: x86@kernel.org
Cc: torvalds@linux-foundation.org
Cc: akpm@linux-foundation.org
---

 b/arch/x86/entry/syscalls/syscall_32.tbl |    2 +
 b/arch/x86/entry/syscalls/syscall_64.tbl |    2 +
 b/arch/x86/include/asm/pkeys.h           |    4 +-
 b/arch/x86/kernel/fpu/xstate.c           |   55 +++++++++++++++++++++++++++++--
 b/include/linux/pkeys.h                  |    8 ++++
 b/mm/mprotect.c                          |   41 +++++++++++++++++++++++
 6 files changed, 109 insertions(+), 3 deletions(-)

diff -puN arch/x86/entry/syscalls/syscall_32.tbl~pkeys-118-syscalls-set-get arch/x86/entry/syscalls/syscall_32.tbl
--- a/arch/x86/entry/syscalls/syscall_32.tbl~pkeys-118-syscalls-set-get	2016-06-08 16:26:35.894970089 -0700
+++ b/arch/x86/entry/syscalls/syscall_32.tbl	2016-06-08 16:26:35.906970634 -0700
@@ -389,3 +389,5 @@
 380	i386	pkey_mprotect		sys_pkey_mprotect
 381	i386	pkey_alloc		sys_pkey_alloc
 382	i386	pkey_free		sys_pkey_free
+383	i386	pkey_get		sys_pkey_get
+384	i386	pkey_set		sys_pkey_set
diff -puN arch/x86/entry/syscalls/syscall_64.tbl~pkeys-118-syscalls-set-get arch/x86/entry/syscalls/syscall_64.tbl
--- a/arch/x86/entry/syscalls/syscall_64.tbl~pkeys-118-syscalls-set-get	2016-06-08 16:26:35.896970180 -0700
+++ b/arch/x86/entry/syscalls/syscall_64.tbl	2016-06-08 16:26:35.906970634 -0700
@@ -338,6 +338,8 @@
 329	common	pkey_mprotect		sys_pkey_mprotect
 330	common	pkey_alloc		sys_pkey_alloc
 331	common	pkey_free		sys_pkey_free
+332	common	pkey_get		sys_pkey_get
+333	common	pkey_set		sys_pkey_set
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff -puN arch/x86/include/asm/pkeys.h~pkeys-118-syscalls-set-get arch/x86/include/asm/pkeys.h
--- a/arch/x86/include/asm/pkeys.h~pkeys-118-syscalls-set-get	2016-06-08 16:26:35.898970271 -0700
+++ b/arch/x86/include/asm/pkeys.h	2016-06-08 16:26:35.908970724 -0700
@@ -56,7 +56,7 @@ static inline bool validate_pkey(int pke
 }
 
 static inline
-bool mm_pkey_is_allocated(struct mm_struct *mm, unsigned long pkey)
+bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
 {
 	if (!validate_pkey(pkey))
 		return true;
@@ -107,4 +107,6 @@ extern int arch_set_user_pkey_access(str
 extern int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 		unsigned long init_val);
 
+extern unsigned long arch_get_user_pkey_access(struct task_struct *tsk,
+		int pkey);
 #endif /*_ASM_X86_PKEYS_H */
diff -puN arch/x86/kernel/fpu/xstate.c~pkeys-118-syscalls-set-get arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~pkeys-118-syscalls-set-get	2016-06-08 16:26:35.899970316 -0700
+++ b/arch/x86/kernel/fpu/xstate.c	2016-06-08 16:26:35.908970724 -0700
@@ -690,7 +690,7 @@ void fpu__resume_cpu(void)
  *
  * Note: does not work for compacted buffers.
  */
-void *__raw_xsave_addr(struct xregs_state *xsave, int xstate_feature_mask)
+static void *__raw_xsave_addr(struct xregs_state *xsave, int xstate_feature_mask)
 {
 	int feature_nr = fls64(xstate_feature_mask) - 1;
 
@@ -864,6 +864,7 @@ out:
 
 #define NR_VALID_PKRU_BITS (CONFIG_NR_PROTECTION_KEYS * 2)
 #define PKRU_VALID_MASK (NR_VALID_PKRU_BITS - 1)
+#define PKRU_INIT_STATE	0
 
 /*
  * This will go out and modify the XSAVE buffer so that PKRU is
@@ -882,6 +883,9 @@ int __arch_set_user_pkey_access(struct t
 	int pkey_shift = (pkey * PKRU_BITS_PER_PKEY);
 	u32 new_pkru_bits = 0;
 
+	/* Only support manipulating current task for now */
+	if (tsk != current)
+		return -EINVAL;
 	/*
 	 * This check implies XSAVE support.  OSPKE only gets
 	 * set if we enable XSAVE and we enable PKU in XCR0.
@@ -907,7 +911,7 @@ int __arch_set_user_pkey_access(struct t
 	 * state.
 	 */
 	if (!old_pkru_state)
-		new_pkru_state.pkru = 0;
+		new_pkru_state.pkru = PKRU_INIT_STATE;
 	else
 		new_pkru_state.pkru = old_pkru_state->pkru;
 
@@ -945,4 +949,51 @@ int arch_set_user_pkey_access(struct tas
 		return -EINVAL;
 	return __arch_set_user_pkey_access(tsk, pkey, init_val);
 }
+
+/*
+ * Figures out what the rights are currently for 'pkey'.
+ * Converts from PKRU's format to the user-visible PKEY_DISABLE_*
+ * format.
+ */
+unsigned long arch_get_user_pkey_access(struct task_struct *tsk, int pkey)
+{
+	struct fpu *fpu = &current->thread.fpu;
+	u32 pkru_reg;
+	int ret = 0;
+
+	/* Only support manipulating current task for now */
+	if (tsk != current)
+		return -1;
+	if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
+		return -1;
+	/*
+	 * The contents of PKRU itself are invalid.  Consult the
+	 * task's XSAVE buffer for PKRU contents.  This is much
+	 * more expensive than reading PKRU directly, but should
+	 * be rare or impossible with eagerfpu mode.
+	 */
+	if (!fpu->fpregs_active) {
+		struct xregs_state *xsave = &fpu->state.xsave;
+		struct pkru_state *pkru_state =
+			get_xsave_addr(xsave, XFEATURE_MASK_PKRU);
+		/*
+		 * PKRU is in its init state and not present in
+		 * the buffer in a saved form.
+		 */
+		if (!pkru_state)
+			return PKRU_INIT_STATE;
+
+		return pkru_state->pkru;
+	}
+	/*
+	 * Consult the user register directly.
+	 */
+	pkru_reg = read_pkru();
+	if (!__pkru_allows_read(pkru_reg, pkey))
+		ret |= PKEY_DISABLE_ACCESS;
+	if (!__pkru_allows_write(pkru_reg, pkey))
+		ret |= PKEY_DISABLE_WRITE;
+
+	return ret;
+}
 #endif /* CONFIG_ARCH_HAS_PKEYS */
diff -puN include/linux/pkeys.h~pkeys-118-syscalls-set-get include/linux/pkeys.h
--- a/include/linux/pkeys.h~pkeys-118-syscalls-set-get	2016-06-08 16:26:35.901970407 -0700
+++ b/include/linux/pkeys.h	2016-06-08 16:26:35.909970770 -0700
@@ -44,6 +44,14 @@ static inline int mm_pkey_free(struct mm
 static inline int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 			unsigned long init_val)
 {
+	return -EINVAL;
+}
+
+static inline
+unsigned long arch_get_user_pkey_access(struct task_struct *tsk, int pkey)
+{
+	if (pkey)
+		return -1;
 	return 0;
 }
 
diff -puN mm/mprotect.c~pkeys-118-syscalls-set-get mm/mprotect.c
--- a/mm/mprotect.c~pkeys-118-syscalls-set-get	2016-06-08 16:26:35.903970497 -0700
+++ b/mm/mprotect.c	2016-06-08 16:26:35.909970770 -0700
@@ -537,3 +537,44 @@ SYSCALL_DEFINE1(pkey_free, int, pkey)
 	 */
 	return ret;
 }
+
+SYSCALL_DEFINE2(pkey_get, int, pkey, unsigned long, flags)
+{
+	unsigned long ret = 0;
+
+	if (flags)
+		return -EINVAL;
+
+	down_write(&current->mm->mmap_sem);
+	if (!mm_pkey_is_allocated(current->mm, pkey))
+		ret = -EBADF;
+	up_write(&current->mm->mmap_sem);
+
+	if (ret)
+		return ret;
+
+	ret = arch_get_user_pkey_access(current, pkey);
+
+	return ret;
+}
+
+SYSCALL_DEFINE3(pkey_set, int, pkey, unsigned long, access_rights,
+		unsigned long, flags)
+{
+	unsigned long ret = 0;
+
+	if (flags)
+		return -EINVAL;
+
+	down_write(&current->mm->mmap_sem);
+	if (!mm_pkey_is_allocated(current->mm, pkey))
+		ret = -EBADF;
+	up_write(&current->mm->mmap_sem);
+
+	if (ret)
+		return ret;
+
+	ret = arch_set_user_pkey_access(current, pkey, access_rights);
+
+	return ret;
+}
_


end of thread, other threads:[~2016-07-18 20:12 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-07 20:47 [PATCH 0/9] [v2] System Calls for Memory Protection Keys Dave Hansen
2016-06-07 20:47 ` [PATCH 1/9] x86, pkeys: add fault handling for PF_PK page fault bit Dave Hansen
2016-06-07 20:47 ` [PATCH 2/9] mm: implement new pkey_mprotect() system call Dave Hansen
2016-06-07 20:47 ` [PATCH 3/9] x86, pkeys: make mprotect_key() mask off additional vm_flags Dave Hansen
2016-06-07 20:47 ` [PATCH 4/9] x86: wire up mprotect_key() system call Dave Hansen
2016-06-07 20:47 ` [PATCH 5/9] x86, pkeys: allocation/free syscalls Dave Hansen
2016-06-07 20:47 ` [PATCH 6/9] x86, pkeys: add pkey set/get syscalls Dave Hansen
2016-06-07 20:47 ` [PATCH 7/9] generic syscalls: wire up memory protection keys syscalls Dave Hansen
2016-06-07 21:25   ` Arnd Bergmann
2016-06-07 20:47 ` [PATCH 8/9] pkeys: add details of system call use to Documentation/ Dave Hansen
2016-06-07 20:47 ` [PATCH 9/9] x86, pkeys: add self-tests Dave Hansen
2016-06-08  9:23 ` [PATCH 0/9] [v2] System Calls for Memory Protection Keys Michael Kerrisk (man-pages)
2016-06-08 17:35   ` Dave Hansen
2016-06-09  0:01 [PATCH 0/9] [v3] " Dave Hansen
2016-06-09  0:01 ` [PATCH 6/9] x86, pkeys: add pkey set/get syscalls Dave Hansen
2016-07-07 12:47 [PATCH 0/9] [REVIEW-REQUEST] [v4] System Calls for Memory Protection Keys Dave Hansen
2016-07-07 12:47 ` [PATCH 6/9] x86, pkeys: add pkey set/get syscalls Dave Hansen
2016-07-07 14:45   ` Mel Gorman
2016-07-07 17:33     ` Dave Hansen
2016-07-08  7:18       ` Ingo Molnar
2016-07-08 16:32         ` Dave Hansen
2016-07-09  8:37           ` Ingo Molnar
2016-07-11  4:25             ` Andy Lutomirski
2016-07-11  7:35               ` Ingo Molnar
2016-07-11 14:28                 ` Dave Hansen
2016-07-12  7:13                   ` Ingo Molnar
2016-07-12 15:39                     ` Dave Hansen
2016-07-11 14:50                 ` Andy Lutomirski
2016-07-11 14:34               ` Dave Hansen
2016-07-11 14:45                 ` Andy Lutomirski
2016-07-11 15:48                   ` Dave Hansen
2016-07-12 16:32                     ` Andy Lutomirski
2016-07-12 17:12                       ` Dave Hansen
2016-07-12 22:55                         ` Andy Lutomirski
2016-07-13  7:56                       ` Ingo Molnar
2016-07-13 18:43                         ` Andy Lutomirski
2016-07-14  8:07                           ` Ingo Molnar
2016-07-18  4:43                             ` Andy Lutomirski
2016-07-18  9:56                               ` Ingo Molnar
2016-07-18 18:02             ` Dave Hansen
2016-07-18 20:12             ` Dave Hansen
2016-07-08 19:26         ` Dave Hansen
2016-07-08 10:22       ` Mel Gorman
