linux-kernel.vger.kernel.org archive mirror
* [RFC][PATCH 0/7] System Calls for Memory Protection Keys
@ 2016-02-23  1:11 Dave Hansen
  2016-02-23  1:11 ` [RFC][PATCH 1/7] x86, pkeys: Documentation Dave Hansen
                   ` (7 more replies)
  0 siblings, 8 replies; 12+ messages in thread
From: Dave Hansen @ 2016-02-23  1:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: Dave Hansen, linux-api, linux-mm, x86, torvalds, akpm

As promised, here are the proposed new Memory Protection Keys
interfaces.  These interfaces make it possible to do something
with pkeys other than execute-only support.

There are 5 syscalls here.  I'm hoping for reviews of this set
which can help nail down what the final interfaces will be.

You can find a high-level overview of the feature and the new
syscalls here:

	https://www.sr71.net/~dave/intel/pkeys.txt

===============================================================

To use memory protection keys (pkeys), an application absolutely
needs to be able to set the pkey field in the PTE (obviously has
to be done in-kernel) and make changes to the "rights" register
(using unprivileged instructions).

An application also needs to have an allocator for the keys
themselves.  If two different parts of an application both want
to protect their data with pkeys, they first need to know which
key to use for their individual purposes.

This set introduces 5 system calls, in 3 logical groups:

1. PTE pkey setting (sys_pkey_mprotect(), patches #2-4)
2. Key allocation (sys_pkey_alloc() / sys_pkey_free(), patch #5)
3. Rights register manipulation (sys_pkey_set/get(), patch #6)

These patches build on top of "core" support already in the tip tree,
specifically 62b5f7d013, which can currently be found at:

	http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/log/?h=mm/pkeys

I have manpages written for some of these syscalls, and I will
submit a full set of manpages once we've reached some consensus
on what the interfaces should be.

This set is also available here:

	git://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-pkeys.git pkeys-v024

I've written a set of unit tests for these interfaces, which is
available here:

	https://www.sr71.net/~dave/intel/pkeys-test-2016-02-22/

I will submit that code for inclusion with the final version of
these patches.

=== diffstat ===

Dave Hansen (7):
      x86, pkeys: Documentation
      mm: implement new pkey_mprotect() system call
      x86, pkeys: make mprotect_key() mask off additional vm_flags
      x86: wire up mprotect_key() system call
      x86, pkeys: allocation/free syscalls
      x86, pkeys: add pkey set/get syscalls
      pkeys: add details of system call use to Documentation/

 Documentation/x86/protection-keys.txt  |  91 +++++++++++++++++
 arch/x86/entry/syscalls/syscall_32.tbl |   5 +
 arch/x86/entry/syscalls/syscall_64.tbl |   5 +
 arch/x86/include/asm/mmu.h             |   8 ++
 arch/x86/include/asm/mmu_context.h     |  25 +++--
 arch/x86/include/asm/pkeys.h           |  83 ++++++++++++++-
 arch/x86/kernel/fpu/xstate.c           |  73 +++++++++++++-
 arch/x86/mm/pkeys.c                    |  40 ++++++--
 include/linux/pkeys.h                  |  39 ++++++--
 include/uapi/asm-generic/mman-common.h |   5 +
 mm/mprotect.c                          | 133 ++++++++++++++++++++++++-
 11 files changed, 476 insertions(+), 31 deletions(-)

Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: x86@kernel.org
Cc: torvalds@linux-foundation.org
Cc: akpm@linux-foundation.org


* [RFC][PATCH 1/7] x86, pkeys: Documentation
  2016-02-23  1:11 [RFC][PATCH 0/7] System Calls for Memory Protection Keys Dave Hansen
@ 2016-02-23  1:11 ` Dave Hansen
  2016-02-23  1:11 ` [RFC][PATCH 2/7] mm: implement new pkey_mprotect() system call Dave Hansen
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Dave Hansen @ 2016-02-23  1:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Give a high-level overview of Protection Keys from a hardware
perspective, plus a description of the config option, since the
Kconfig text refers to this file.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/Documentation/x86/protection-keys.txt |   28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff -puN /dev/null Documentation/x86/protection-keys.txt
--- /dev/null	2015-12-10 15:28:13.322405854 -0800
+++ b/Documentation/x86/protection-keys.txt	2016-02-22 17:09:22.811272277 -0800
@@ -0,0 +1,28 @@
+Memory Protection Keys for User pages is a CPU feature which will
+first appear on Skylake Servers, but will also be supported on
+future non-server parts.
+
+Memory Protection Keys provides a mechanism for enforcing page-based
+protections, but without requiring modification of the page tables
+when an application changes protection domains.  It works by
+dedicating 4 previously ignored bits in each page table entry to a
+"protection key", giving 16 possible keys.
+
+There is also a new user-accessible register (PKRU) with two separate
+bits (Access Disable and Write Disable) for each key.  Being a CPU
+register, PKRU is inherently thread-local, potentially giving each
+thread a different set of protections from every other thread.
+
+There are two new instructions (RDPKRU/WRPKRU) for reading and writing
+to the new register.  The feature is only available in 64-bit mode,
+even though there is theoretically space in the PAE PTEs.  These
+permissions are enforced on data access only and have no effect on
+instruction fetches.
+
+=========================== Config Option ===========================
+
+This config option adds approximately 1.5kb of text and 50 bytes of
+data to the executable.  A workload which does large O_DIRECT reads
+of holes in XFS files was run to exercise get_user_pages_fast().  No
+performance delta was observed with the config option
+enabled or disabled.
_


* [RFC][PATCH 2/7] mm: implement new pkey_mprotect() system call
  2016-02-23  1:11 [RFC][PATCH 0/7] System Calls for Memory Protection Keys Dave Hansen
  2016-02-23  1:11 ` [RFC][PATCH 1/7] x86, pkeys: Documentation Dave Hansen
@ 2016-02-23  1:11 ` Dave Hansen
  2016-02-23  1:11 ` [RFC][PATCH 3/7] x86, pkeys: make mprotect_key() mask off additional vm_flags Dave Hansen
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Dave Hansen @ 2016-02-23  1:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, dave.hansen, linux-api, linux-mm, x86, torvalds, akpm


From: Dave Hansen <dave.hansen@linux.intel.com>

pkey_mprotect() is just like mprotect, except it also takes a
protection key as an argument.  On systems that do not support
protection keys, it still works, but requires that key=0.
Otherwise it does exactly what mprotect does.

I expect it to get used like this, if you want to guarantee that
any mapping you create can *never* be accessed without the right
protection keys set up.

	int real_prot = PROT_READ|PROT_WRITE;
	pkey = pkey_alloc(0, PKEY_DISABLE_ACCESS);
	ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);

This way, there is *no* window where the mapping is accessible
since it was always either PROT_NONE or had a protection key set.

We settled on 'unsigned long' for the type of the key here.  We
only need 4 bits on x86 today, but I figured that other
architectures might need some more space.

Semantically, we have a bit of a problem if we combine this
syscall with our previously-introduced execute-only support:
What do we do when we mix execute-only pkey use with
pkey_mprotect() use?  For instance:

	pkey_mprotect(ptr, PAGE_SIZE, PROT_WRITE, 6); // set pkey=6
	mprotect(ptr, PAGE_SIZE, PROT_EXEC);  // set pkey=X_ONLY_PKEY?
	mprotect(ptr, PAGE_SIZE, PROT_WRITE); // is pkey=6 again?

To solve that, we make the plain-mprotect()-initiated execute-only
support only apply to VMAs that have the default protection key (0)
set on them.

Proposed semantics:
1. protection key 0 is special and represents the default,
   unassigned protection key.  It is always allocated.
2. mprotect() never affects a mapping's pkey_mprotect()-assigned
   protection key. A protection key of 0 (even if set explicitly)
   represents an unassigned protection key.
   2a. mprotect(PROT_EXEC) on a mapping with an assigned protection
       key may or may not result in a mapping with execute-only
       properties.  pkey_mprotect() plus pkey_set() on all threads
       should be used to _guarantee_ execute-only semantics.
3. mprotect(PROT_EXEC) may result in an "execute-only" mapping. The
   kernel will internally attempt to allocate and dedicate a
   protection key for the purpose of execute-only mappings.  This
   may not be possible in cases where there are no free protection
   keys available.
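
As an illustration only, one reading of the rules above can be written
as a small decision function.  This is not the kernel's
arch_override_mprotect_pkey(); the function name, its parameters, and
the -1 "no key" convention for the execute-only key are assumptions
made for this sketch.  It also leaves out what should happen to a
mapping that already carries the kernel's own execute-only key, since
the rules above do not spell that case out.

```c
#include <sys/mman.h>	/* PROT_EXEC */

/* Illustrative stand-in for "no key requested / no key available". */
#define SKETCH_NO_PKEY (-1)

/*
 * Hypothetical helper: pick the key a mapping ends up with after an
 * mprotect()-style call.
 *
 *   requested_pkey: key passed to pkey_mprotect(), or -1 for mprotect()
 *   prot:           the PROT_* bits passed by the caller
 *   vma_pkey:       key currently stored on the mapping (0 = unassigned)
 *   exec_only_pkey: key dedicated to execute-only use, or -1 if none
 */
static int sketch_choose_pkey(int requested_pkey, int prot,
			      int vma_pkey, int exec_only_pkey)
{
	/* pkey_mprotect(): the caller's key is used as-is. */
	if (requested_pkey != SKETCH_NO_PKEY)
		return requested_pkey;

	/* Rule 3: plain mprotect(PROT_EXEC) on an unassigned mapping. */
	if (prot == PROT_EXEC && vma_pkey == 0 &&
	    exec_only_pkey != SKETCH_NO_PKEY)
		return exec_only_pkey;

	/* Rule 2: plain mprotect() otherwise leaves the key alone. */
	return vma_pkey;
}
```

With this reading, the three-call example earlier in this message keeps
pkey=6 throughout: sketch_choose_pkey(-1, PROT_EXEC, 6, 15) and
sketch_choose_pkey(-1, PROT_WRITE, 6, 15) both return 6.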

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: x86@kernel.org
Cc: torvalds@linux-foundation.org
Cc: akpm@linux-foundation.org
---

 b/arch/x86/include/asm/mmu_context.h |   15 ++++++++++-----
 b/arch/x86/include/asm/pkeys.h       |   11 +++++++++--
 b/arch/x86/kernel/fpu/xstate.c       |   15 ++++++++++++++-
 b/arch/x86/mm/pkeys.c                |    2 +-
 b/mm/mprotect.c                      |   27 +++++++++++++++++++++++----
 5 files changed, 57 insertions(+), 13 deletions(-)

diff -puN arch/x86/include/asm/mmu_context.h~pkeys-85-syscalls-mprotect_pkey arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~pkeys-85-syscalls-mprotect_pkey	2016-02-22 17:09:23.217290781 -0800
+++ b/arch/x86/include/asm/mmu_context.h	2016-02-22 17:09:23.228291282 -0800
@@ -4,6 +4,7 @@
 #include <asm/desc.h>
 #include <linux/atomic.h>
 #include <linux/mm_types.h>
+#include <linux/pkeys.h>
 
 #include <trace/events/tlb.h>
 
@@ -286,16 +287,20 @@ static inline void arch_unmap(struct mm_
 		mpx_notify_unmap(mm, vma, start, end);
 }
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 static inline int vma_pkey(struct vm_area_struct *vma)
 {
-	u16 pkey = 0;
-#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 	unsigned long vma_pkey_mask = VM_PKEY_BIT0 | VM_PKEY_BIT1 |
 				      VM_PKEY_BIT2 | VM_PKEY_BIT3;
-	pkey = (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
-#endif
-	return pkey;
+
+	return (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
+}
+#else
+static inline int vma_pkey(struct vm_area_struct *vma)
+{
+	return 0;
 }
+#endif
 
 static inline bool __pkru_allows_pkey(u16 pkey, bool write)
 {
diff -puN arch/x86/include/asm/pkeys.h~pkeys-85-syscalls-mprotect_pkey arch/x86/include/asm/pkeys.h
--- a/arch/x86/include/asm/pkeys.h~pkeys-85-syscalls-mprotect_pkey	2016-02-22 17:09:23.219290872 -0800
+++ b/arch/x86/include/asm/pkeys.h	2016-02-22 17:09:23.228291282 -0800
@@ -1,7 +1,12 @@
 #ifndef _ASM_X86_PKEYS_H
 #define _ASM_X86_PKEYS_H
 
-#define arch_max_pkey() (boot_cpu_has(X86_FEATURE_OSPKE) ? 16 : 1)
+#define PKEY_DEDICATED_EXECUTE_ONLY 15
+/*
+ * Consider the PKEY_DEDICATED_EXECUTE_ONLY key unavailable.
+ */
+#define arch_max_pkey() (boot_cpu_has(X86_FEATURE_OSPKE) ? \
+		PKEY_DEDICATED_EXECUTE_ONLY : 1)
 
 extern int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 		unsigned long init_val);
@@ -10,7 +15,6 @@ extern int arch_set_user_pkey_access(str
  * Try to dedicate one of the protection keys to be used as an
  * execute-only protection key.
  */
-#define PKEY_DEDICATED_EXECUTE_ONLY 15
 extern int __execute_only_pkey(struct mm_struct *mm);
 static inline int execute_only_pkey(struct mm_struct *mm)
 {
@@ -31,4 +35,7 @@ static inline int arch_override_mprotect
 	return __arch_override_mprotect_pkey(vma, prot, pkey);
 }
 
+extern int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+		unsigned long init_val);
+
 #endif /*_ASM_X86_PKEYS_H */
diff -puN arch/x86/kernel/fpu/xstate.c~pkeys-85-syscalls-mprotect_pkey arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~pkeys-85-syscalls-mprotect_pkey	2016-02-22 17:09:23.221290963 -0800
+++ b/arch/x86/kernel/fpu/xstate.c	2016-02-22 17:09:23.228291282 -0800
@@ -868,7 +868,7 @@ out:
  * not modfiy PKRU *itself* here, only the XSAVE state that will
  * be restored in to PKRU when we return back to userspace.
  */
-int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 		unsigned long init_val)
 {
 	struct xregs_state *xsave = &tsk->thread.fpu.state.xsave;
@@ -927,3 +927,16 @@ int arch_set_user_pkey_access(struct tas
 
 	return 0;
 }
+
+/*
+ * When setting a userspace-provided value, we need to ensure
+ * that it is valid.  The __ version can get used by
+ * kernel-internal uses like the execute-only support.
+ */
+int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+		unsigned long init_val)
+{
+	if (!validate_pkey(pkey))
+		return -EINVAL;
+	return __arch_set_user_pkey_access(tsk, pkey, init_val);
+}
diff -puN arch/x86/mm/pkeys.c~pkeys-85-syscalls-mprotect_pkey arch/x86/mm/pkeys.c
--- a/arch/x86/mm/pkeys.c~pkeys-85-syscalls-mprotect_pkey	2016-02-22 17:09:23.222291008 -0800
+++ b/arch/x86/mm/pkeys.c	2016-02-22 17:09:23.229291327 -0800
@@ -38,7 +38,7 @@ int __execute_only_pkey(struct mm_struct
 		return PKEY_DEDICATED_EXECUTE_ONLY;
 	}
 	preempt_enable();
-	ret = arch_set_user_pkey_access(current, PKEY_DEDICATED_EXECUTE_ONLY,
+	ret = __arch_set_user_pkey_access(current, PKEY_DEDICATED_EXECUTE_ONLY,
 			PKEY_DISABLE_ACCESS);
 	/*
 	 * If the PKRU-set operation failed somehow, just return
diff -puN mm/mprotect.c~pkeys-85-syscalls-mprotect_pkey mm/mprotect.c
--- a/mm/mprotect.c~pkeys-85-syscalls-mprotect_pkey	2016-02-22 17:09:23.224291100 -0800
+++ b/mm/mprotect.c	2016-02-22 17:09:23.229291327 -0800
@@ -352,8 +352,11 @@ fail:
 	return error;
 }
 
-SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
-		unsigned long, prot)
+/*
+ * pkey==-1 when doing a legacy mprotect()
+ */
+static int do_mprotect_pkey(unsigned long start, size_t len,
+		unsigned long prot, int pkey)
 {
 	unsigned long nstart, end, tmp, reqprot;
 	struct vm_area_struct *vma, *prev;
@@ -410,11 +413,12 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
 
 	for (nstart = start ; ; ) {
 		unsigned long newflags;
-		int pkey = arch_override_mprotect_pkey(vma, prot, -1);
+		int vma_pkey;
 
 		/* Here we know that vma->vm_start <= nstart < vma->vm_end. */
 
-		newflags = calc_vm_prot_bits(prot, pkey);
+		vma_pkey = arch_override_mprotect_pkey(vma, prot, pkey);
+		newflags = calc_vm_prot_bits(prot, vma_pkey);
 		newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
 
 		/* newflags >> 4 shift VM_MAY% in place of VM_% */
@@ -450,3 +454,18 @@ out:
 	up_write(&current->mm->mmap_sem);
 	return error;
 }
+
+SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
+		unsigned long, prot)
+{
+	return do_mprotect_pkey(start, len, prot, -1);
+}
+
+SYSCALL_DEFINE4(pkey_mprotect, unsigned long, start, size_t, len,
+		unsigned long, prot, int, pkey)
+{
+	if (!validate_pkey(pkey))
+		return -EINVAL;
+
+	return do_mprotect_pkey(start, len, prot, pkey);
+}
_


* [RFC][PATCH 3/7] x86, pkeys: make mprotect_key() mask off additional vm_flags
  2016-02-23  1:11 [RFC][PATCH 0/7] System Calls for Memory Protection Keys Dave Hansen
  2016-02-23  1:11 ` [RFC][PATCH 1/7] x86, pkeys: Documentation Dave Hansen
  2016-02-23  1:11 ` [RFC][PATCH 2/7] mm: implement new pkey_mprotect() system call Dave Hansen
@ 2016-02-23  1:11 ` Dave Hansen
  2016-02-23  1:11 ` [RFC][PATCH 4/7] x86: wire up mprotect_key() system call Dave Hansen
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Dave Hansen @ 2016-02-23  1:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: Dave Hansen, dave.hansen, linux-mm, x86, torvalds, akpm


From: Dave Hansen <dave.hansen@linux.intel.com>

Today, mprotect() takes 4 bits of data: PROT_READ/WRITE/EXEC/NONE.
Three of those bits (READ/WRITE/EXEC) get translated directly into
vma->vm_flags by calc_vm_prot_bits().  If a bit is unset in
mprotect()'s 'prot' argument then it must be cleared in vma->vm_flags
during the mprotect() call.

We do this clearing today by first calculating the VMA flags we
want set, then clearing the ones we do not want to inherit from
the original VMA:

	vm_flags = calc_vm_prot_bits(prot, key);
	...
	newflags = vm_flags;
	newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));

However, we *also* want to mask off the original VMA's vm_flags in
which we store the protection key.

To do that, this patch adds a new macro:

	ARCH_VM_PKEY_FLAGS

which allows the architecture to specify additional bits that it would
like cleared.  We use that to ensure that the VM_PKEY_BIT* bits get
cleared.
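
To see why the extra masking matters, here is a small standalone sketch
of the flag arithmetic.  The SK_* constants and helper names are made
up for this illustration (the real bit positions live in the kernel
headers), and SK_VM_SHARED just stands in for any flag that must be
inherited from the old VMA.

```c
#include <stdint.h>

/* Illustrative bit layout, not the kernel's definitions. */
#define SK_VM_READ	UINT64_C(0x1)
#define SK_VM_WRITE	UINT64_C(0x2)
#define SK_VM_EXEC	UINT64_C(0x4)
#define SK_VM_SHARED	UINT64_C(0x8)	/* any flag that must be inherited */
#define SK_PKEY_SHIFT	32
#define SK_PKEY_FLAGS	(UINT64_C(0xf) << SK_PKEY_SHIFT)
#define SK_RWX		(SK_VM_READ | SK_VM_WRITE | SK_VM_EXEC)

/*
 * Build the new flags the same way the loop in mprotect does: start
 * from the requested permission and key bits, then OR in whatever
 * survives of the old flags after 'mask_off' is removed.
 */
static uint64_t sk_newflags(uint64_t old_flags, uint64_t prot_bits,
			    unsigned int pkey, uint64_t mask_off)
{
	uint64_t newflags;

	newflags = prot_bits | ((uint64_t)(pkey & 0xf) << SK_PKEY_SHIFT);
	newflags |= old_flags & ~mask_off;
	return newflags;
}

/* Extract the 4-bit key stored in the flags. */
static unsigned int sk_pkey_of(uint64_t flags)
{
	return (unsigned int)((flags & SK_PKEY_FLAGS) >> SK_PKEY_SHIFT);
}
```

Take an old VMA with pkey 6 and request pkey 9.  Masking off only
SK_RWX lets the old key bits leak through, so the result carries
6|9 = 15.  Masking off SK_RWX | SK_PKEY_FLAGS yields pkey 9 as intended.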

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: x86@kernel.org
Cc: torvalds@linux-foundation.org
Cc: akpm@linux-foundation.org
---

 b/arch/x86/include/asm/pkeys.h |    2 ++
 b/include/linux/pkeys.h        |    1 +
 b/mm/mprotect.c                |   10 +++++++++-
 3 files changed, 12 insertions(+), 1 deletion(-)

diff -puN arch/x86/include/asm/pkeys.h~pkeys-85a-mask-off-correct-vm_flags arch/x86/include/asm/pkeys.h
--- a/arch/x86/include/asm/pkeys.h~pkeys-85a-mask-off-correct-vm_flags	2016-02-22 17:09:23.727314024 -0800
+++ b/arch/x86/include/asm/pkeys.h	2016-02-22 17:09:23.733314297 -0800
@@ -38,4 +38,6 @@ static inline int arch_override_mprotect
 extern int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 		unsigned long init_val);
 
+#define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | VM_PKEY_BIT3)
+
 #endif /*_ASM_X86_PKEYS_H */
diff -puN include/linux/pkeys.h~pkeys-85a-mask-off-correct-vm_flags include/linux/pkeys.h
--- a/include/linux/pkeys.h~pkeys-85a-mask-off-correct-vm_flags	2016-02-22 17:09:23.728314069 -0800
+++ b/include/linux/pkeys.h	2016-02-22 17:09:23.733314297 -0800
@@ -16,6 +16,7 @@
 #define execute_only_pkey(mm) (0)
 #define arch_override_mprotect_pkey(vma, prot, pkey) (0)
 #define PKEY_DEDICATED_EXECUTE_ONLY 0
+#define ARCH_VM_PKEY_FLAGS 0
 #endif /* ! CONFIG_ARCH_HAS_PKEYS */
 
 /*
diff -puN mm/mprotect.c~pkeys-85a-mask-off-correct-vm_flags mm/mprotect.c
--- a/mm/mprotect.c~pkeys-85a-mask-off-correct-vm_flags	2016-02-22 17:09:23.730314160 -0800
+++ b/mm/mprotect.c	2016-02-22 17:09:23.733314297 -0800
@@ -417,9 +417,17 @@ static int do_mprotect_pkey(unsigned lon
 
 		/* Here we know that vma->vm_start <= nstart < vma->vm_end. */
 
+		/*
+		 * Each mprotect() call explicitly passes r/w/x permissions.
+		 * If a permission is not passed to mprotect(), it must be
+		 * cleared from the VMA.
+		 */
+		unsigned long mask_off_old_flags = VM_READ | VM_WRITE | VM_EXEC;
+		mask_off_old_flags |= ARCH_VM_PKEY_FLAGS;
+
 		vma_pkey = arch_override_mprotect_pkey(vma, prot, pkey);
 		newflags = calc_vm_prot_bits(prot, vma_pkey);
-		newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
+		newflags |= (vma->vm_flags & ~mask_off_old_flags);
 
 		/* newflags >> 4 shift VM_MAY% in place of VM_% */
 		if ((newflags & ~(newflags >> 4)) & (VM_READ | VM_WRITE | VM_EXEC)) {
_


* [RFC][PATCH 4/7] x86: wire up mprotect_key() system call
  2016-02-23  1:11 [RFC][PATCH 0/7] System Calls for Memory Protection Keys Dave Hansen
                   ` (2 preceding siblings ...)
  2016-02-23  1:11 ` [RFC][PATCH 3/7] x86, pkeys: make mprotect_key() mask off additional vm_flags Dave Hansen
@ 2016-02-23  1:11 ` Dave Hansen
  2016-02-23  1:11 ` [RFC][PATCH 5/7] x86, pkeys: allocation/free syscalls Dave Hansen
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Dave Hansen @ 2016-02-23  1:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, dave.hansen, linux-api, linux-mm, x86, torvalds, akpm


From: Dave Hansen <dave.hansen@linux.intel.com>

This is all that we need to get the new system call itself
working on x86.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: x86@kernel.org
Cc: torvalds@linux-foundation.org
Cc: akpm@linux-foundation.org
---

 b/arch/x86/entry/syscalls/syscall_32.tbl |    1 +
 b/arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 2 files changed, 2 insertions(+)

diff -puN arch/x86/entry/syscalls/syscall_32.tbl~pkeys-85b-x86-mprotect_key arch/x86/entry/syscalls/syscall_32.tbl
--- a/arch/x86/entry/syscalls/syscall_32.tbl~pkeys-85b-x86-mprotect_key	2016-02-22 17:09:24.183334806 -0800
+++ b/arch/x86/entry/syscalls/syscall_32.tbl	2016-02-22 17:09:24.188335034 -0800
@@ -384,3 +384,4 @@
 375	i386	membarrier		sys_membarrier
 376	i386	mlock2			sys_mlock2
 377	i386	copy_file_range		sys_copy_file_range
+378	i386	pkey_mprotect		sys_pkey_mprotect
diff -puN arch/x86/entry/syscalls/syscall_64.tbl~pkeys-85b-x86-mprotect_key arch/x86/entry/syscalls/syscall_64.tbl
--- a/arch/x86/entry/syscalls/syscall_64.tbl~pkeys-85b-x86-mprotect_key	2016-02-22 17:09:24.185334897 -0800
+++ b/arch/x86/entry/syscalls/syscall_64.tbl	2016-02-22 17:09:24.188335034 -0800
@@ -333,6 +333,7 @@
 324	common	membarrier		sys_membarrier
 325	common	mlock2			sys_mlock2
 326	common	copy_file_range		sys_copy_file_range
+327	common	pkey_mprotect		sys_pkey_mprotect
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
_


* [RFC][PATCH 5/7] x86, pkeys: allocation/free syscalls
  2016-02-23  1:11 [RFC][PATCH 0/7] System Calls for Memory Protection Keys Dave Hansen
                   ` (3 preceding siblings ...)
  2016-02-23  1:11 ` [RFC][PATCH 4/7] x86: wire up mprotect_key() system call Dave Hansen
@ 2016-02-23  1:11 ` Dave Hansen
  2016-02-23  1:11 ` [RFC][PATCH 6/7] x86, pkeys: add pkey set/get syscalls Dave Hansen
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Dave Hansen @ 2016-02-23  1:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, dave.hansen, linux-api, linux-mm, x86, torvalds, akpm


From: Dave Hansen <dave.hansen@linux.intel.com>

This patch adds two new system calls:

	int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
	int pkey_free(int pkey);

These implement an "allocator" for the protection keys
themselves, which can be thought of as analogous to the allocator
that the kernel has for file descriptors.  The kernel tracks
which numbers are in use, and only allows operations on keys that
are valid.  A key which was not obtained by pkey_alloc() may not,
for instance, be passed to pkey_mprotect() (or the forthcoming
get/set syscalls).
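
The bookkeeping behind such an allocator is small.  The following is a
userspace sketch of the idea, not the code from this patch: the sk_*
names and the struct are invented here, and it assumes 16 keys with
key 0 permanently allocated.

```c
#include <stdint.h>

#define SK_MAX_PKEYS 16

/* Hypothetical per-process state: one bit per key, set = allocated. */
struct sk_pkey_state {
	uint16_t allocation_map;
};

static void sk_pkey_init(struct sk_pkey_state *s)
{
	/* Key 0 is the default key and starts out allocated. */
	s->allocation_map = 0x1;
}

static int sk_pkey_is_allocated(const struct sk_pkey_state *s, int pkey)
{
	if (pkey < 0 || pkey >= SK_MAX_PKEYS)
		return 0;
	return (s->allocation_map >> pkey) & 1;
}

/* Returns the lowest free key, or -1 if every key is in use. */
static int sk_pkey_alloc(struct sk_pkey_state *s)
{
	for (int pkey = 0; pkey < SK_MAX_PKEYS; pkey++) {
		if (!sk_pkey_is_allocated(s, pkey)) {
			s->allocation_map |= (uint16_t)(1u << pkey);
			return pkey;
		}
	}
	return -1;
}

/* Returns 0 on success, -1 for key 0, a bad key, or an unallocated key. */
static int sk_pkey_free(struct sk_pkey_state *s, int pkey)
{
	if (pkey <= 0 || pkey >= SK_MAX_PKEYS)
		return -1;
	if (!sk_pkey_is_allocated(s, pkey))
		return -1;
	s->allocation_map &= (uint16_t)~(1u << pkey);
	return 0;
}
```

As with file descriptors, the lowest free number is handed out first,
freeing a key makes it available again, and freeing a key that was
never allocated fails.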

These system calls are also very important given the kernel's use
of pkeys to implement execute-only support.  These help ensure
that userspace can never assume that it has control of a key
unless it first asks the kernel.

The 'init_access_rights' argument to pkey_alloc() specifies the
rights that will be established for the returned pkey.  For
instance:

	pkey = pkey_alloc(flags, PKEY_DISABLE_WRITE);

will allocate 'pkey' and also set the bits in PKRU[1] such that
writing to 'pkey' is already denied.  This keeps userspace from
needing to have knowledge about manipulating PKRU with the
RDPKRU/WRPKRU instructions.  Userspace is still free to use these
instructions as it wishes, but this facility ensures it is no
longer required.

The kernel does _not_ enforce that this interface must be used for
changes to PKRU, even for keys it does not control.

This allocation mechanism could be implemented in userspace.
Even if we did it in userspace, we would still need additional
user/kernel interfaces to tell userspace which keys are being
used by the kernel internally (such as for execute-only
mappings).  Having the kernel provide this facility completely
removes the need for these additional interfaces, or having an
implementation of this in userspace at all.

1. PKRU is the Protection Key Rights User register.  It is a
   usermode-accessible register that controls, for each individual
   pkey, whether access and/or writes are allowed or denied.
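
For readers who have not looked at the register layout: PKRU holds two
bits per key, Access Disable at bit 2*pkey and Write Disable at bit
2*pkey+1.  The helpers below are a standalone sketch of that
arithmetic on a plain 32-bit value.  The sk_* names are invented for
this illustration, and nothing here touches the real register.

```c
#include <stdint.h>

#define SK_PKRU_AD_BIT		0x1u	/* Access Disable */
#define SK_PKRU_WD_BIT		0x2u	/* Write Disable */
#define SK_PKRU_BITS_PER_PKEY	2

/* Return a copy of 'pkru' with the two bits for 'pkey' set to 'rights'. */
static uint32_t sk_pkru_set_rights(uint32_t pkru, unsigned int pkey,
				   uint32_t rights)
{
	unsigned int shift = pkey * SK_PKRU_BITS_PER_PKEY;

	pkru &= ~(UINT32_C(0x3) << shift);	/* clear the old AD/WD bits */
	pkru |= (rights & 0x3u) << shift;	/* install the new ones */
	return pkru;
}

/* Reads are allowed unless Access Disable is set for the key. */
static int sk_pkru_allows_read(uint32_t pkru, unsigned int pkey)
{
	unsigned int shift = pkey * SK_PKRU_BITS_PER_PKEY;

	return !((pkru >> shift) & SK_PKRU_AD_BIT);
}

/* Writes are blocked if either disable bit is set for the key. */
static int sk_pkru_allows_write(uint32_t pkru, unsigned int pkey)
{
	unsigned int shift = pkey * SK_PKRU_BITS_PER_PKEY;

	return !((pkru >> shift) & (SK_PKRU_AD_BIT | SK_PKRU_WD_BIT));
}
```

For example, denying writes on key 1 in an otherwise empty register
gives sk_pkru_set_rights(0, 1, SK_PKRU_WD_BIT) == 0x8; reads through
key 1 are still allowed, writes are not.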

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: x86@kernel.org
Cc: torvalds@linux-foundation.org
Cc: akpm@linux-foundation.org
---

 b/arch/x86/entry/syscalls/syscall_32.tbl |    2 
 b/arch/x86/entry/syscalls/syscall_64.tbl |    2 
 b/arch/x86/include/asm/mmu.h             |    8 +++
 b/arch/x86/include/asm/mmu_context.h     |   10 +++
 b/arch/x86/include/asm/pkeys.h           |   78 +++++++++++++++++++++++++++++--
 b/arch/x86/kernel/fpu/xstate.c           |    3 +
 b/arch/x86/mm/pkeys.c                    |   40 ++++++++++++---
 b/include/linux/pkeys.h                  |   30 +++++++++--
 b/include/uapi/asm-generic/mman-common.h |    5 +
 b/mm/mprotect.c                          |   56 ++++++++++++++++++++++
 10 files changed, 213 insertions(+), 21 deletions(-)

diff -puN arch/x86/entry/syscalls/syscall_32.tbl~pkeys-86-syscalls-allocation arch/x86/entry/syscalls/syscall_32.tbl
--- a/arch/x86/entry/syscalls/syscall_32.tbl~pkeys-86-syscalls-allocation	2016-02-22 17:09:24.623354858 -0800
+++ b/arch/x86/entry/syscalls/syscall_32.tbl	2016-02-22 17:09:24.640355633 -0800
@@ -385,3 +385,5 @@
 376	i386	mlock2			sys_mlock2
 377	i386	copy_file_range		sys_copy_file_range
 378	i386	pkey_mprotect		sys_pkey_mprotect
+379	i386	pkey_alloc		sys_pkey_alloc
+380	i386	pkey_free		sys_pkey_free
diff -puN arch/x86/entry/syscalls/syscall_64.tbl~pkeys-86-syscalls-allocation arch/x86/entry/syscalls/syscall_64.tbl
--- a/arch/x86/entry/syscalls/syscall_64.tbl~pkeys-86-syscalls-allocation	2016-02-22 17:09:24.624354904 -0800
+++ b/arch/x86/entry/syscalls/syscall_64.tbl	2016-02-22 17:09:24.641355679 -0800
@@ -334,6 +334,8 @@
 325	common	mlock2			sys_mlock2
 326	common	copy_file_range		sys_copy_file_range
 327	common	pkey_mprotect		sys_pkey_mprotect
+328	common	pkey_alloc		sys_pkey_alloc
+329	common	pkey_free		sys_pkey_free
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff -puN arch/x86/include/asm/mmu_context.h~pkeys-86-syscalls-allocation arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~pkeys-86-syscalls-allocation	2016-02-22 17:09:24.626354995 -0800
+++ b/arch/x86/include/asm/mmu_context.h	2016-02-22 17:09:24.641355679 -0800
@@ -108,7 +108,16 @@ static inline void enter_lazy_tlb(struct
 static inline int init_new_context(struct task_struct *tsk,
 				   struct mm_struct *mm)
 {
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+	if (boot_cpu_has(X86_FEATURE_OSPKE)) {
+		/* pkey 0 is the default and always allocated */
+		mm->context.pkey_allocation_map = 0x1;
+		/* -1 means unallocated or invalid */
+		mm->context.execute_only_pkey = -1;
+	}
+#endif
 	init_new_context_ldt(tsk, mm);
+
 	return 0;
 }
 static inline void destroy_context(struct mm_struct *mm)
@@ -354,5 +363,4 @@ static inline bool arch_pte_access_permi
 {
 	return __pkru_allows_pkey(pte_flags_pkey(pte_flags(pte)), write);
 }
-
 #endif /* _ASM_X86_MMU_CONTEXT_H */
diff -puN arch/x86/include/asm/mmu.h~pkeys-86-syscalls-allocation arch/x86/include/asm/mmu.h
--- a/arch/x86/include/asm/mmu.h~pkeys-86-syscalls-allocation	2016-02-22 17:09:24.628355086 -0800
+++ b/arch/x86/include/asm/mmu.h	2016-02-22 17:09:24.641355679 -0800
@@ -23,6 +23,14 @@ typedef struct {
 	const struct vdso_image *vdso_image;	/* vdso image in use */
 
 	atomic_t perf_rdpmc_allowed;	/* nonzero if rdpmc is allowed */
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+	/*
+	 * One bit per protection key says whether userspace can
+	 * use it or not.  protected by mmap_sem.
+	 */
+	u16 pkey_allocation_map;
+	s16 execute_only_pkey;
+#endif
 } mm_context_t;
 
 #ifdef CONFIG_SMP
diff -puN arch/x86/include/asm/pkeys.h~pkeys-86-syscalls-allocation arch/x86/include/asm/pkeys.h
--- a/arch/x86/include/asm/pkeys.h~pkeys-86-syscalls-allocation	2016-02-22 17:09:24.629355132 -0800
+++ b/arch/x86/include/asm/pkeys.h	2016-02-22 17:09:24.642355725 -0800
@@ -1,12 +1,8 @@
 #ifndef _ASM_X86_PKEYS_H
 #define _ASM_X86_PKEYS_H
 
-#define PKEY_DEDICATED_EXECUTE_ONLY 15
-/*
- * Consider the PKEY_DEDICATED_EXECUTE_ONLY key unavailable.
- */
 #define arch_max_pkey() (boot_cpu_has(X86_FEATURE_OSPKE) ? \
-		PKEY_DEDICATED_EXECUTE_ONLY : 1)
+		16 : 1)
 
 extern int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 		unsigned long init_val);
@@ -40,4 +36,76 @@ extern int __arch_set_user_pkey_access(s
 
 #define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | VM_PKEY_BIT3)
 
+#define mm_pkey_allocation_map(mm)	(mm->context.pkey_allocation_map)
+#define mm_set_pkey_allocated(mm, pkey) do {		\
+	mm_pkey_allocation_map(mm) |= (1 << pkey);	\
+} while (0)
+#define mm_set_pkey_free(mm, pkey) do {			\
+	mm_pkey_allocation_map(mm) &= ~(1 << pkey);	\
+} while (0)
+
+/*
+ * This is called from pkey_mprotect().
+ *
+ * Returns true if the protection key is valid.
+ */
+static inline bool validate_pkey(int pkey)
+{
+	if (pkey < 0)
+		return false;
+	return (pkey < arch_max_pkey());
+}
+
+static inline
+bool mm_pkey_is_allocated(struct mm_struct *mm, unsigned long pkey)
+{
+	if (!validate_pkey(pkey))
+		return true;
+
+	return mm_pkey_allocation_map(mm) & (1 << pkey);
+}
+
+static inline
+int mm_pkey_alloc(struct mm_struct *mm)
+{
+	int all_pkeys_mask = ((1 << arch_max_pkey()) - 1);
+	int ret;
+
+	/*
+	 * Are we out of pkeys?  We must handle this specially
+	 * because ffz() behavior is undefined if there are no
+	 * zeros.
+	 */
+	if (mm_pkey_allocation_map(mm) == all_pkeys_mask)
+		return -1;
+
+	ret = ffz(mm_pkey_allocation_map(mm));
+
+	mm_set_pkey_allocated(mm, ret);
+
+	return ret;
+}
+
+static inline
+int mm_pkey_free(struct mm_struct *mm, int pkey)
+{
+	/*
+	 * pkey 0 is special, always allocated and can never
+	 * be freed.
+	 */
+	if (!pkey || !validate_pkey(pkey))
+		return -EINVAL;
+	if (!mm_pkey_is_allocated(mm, pkey))
+		return -EINVAL;
+
+	mm_set_pkey_free(mm, pkey);
+
+	return 0;
+}
+
+extern int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+		unsigned long init_val);
+extern int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+		unsigned long init_val);
+
 #endif /*_ASM_X86_PKEYS_H */
diff -puN arch/x86/kernel/fpu/xstate.c~pkeys-86-syscalls-allocation arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~pkeys-86-syscalls-allocation	2016-02-22 17:09:24.631355223 -0800
+++ b/arch/x86/kernel/fpu/xstate.c	2016-02-22 17:09:24.642355725 -0800
@@ -5,6 +5,7 @@
  */
 #include <linux/compat.h>
 #include <linux/cpu.h>
+#include <linux/mman.h>
 #include <linux/pkeys.h>
 
 #include <asm/fpu/api.h>
@@ -775,6 +776,7 @@ const void *get_xsave_field_ptr(int xsav
 	return get_xsave_addr(&fpu->state.xsave, xsave_state);
 }
 
+#ifdef CONFIG_ARCH_HAS_PKEYS
 
 /*
  * Set xfeatures (aka XSTATE_BV) bit for a feature that we want
@@ -940,3 +942,4 @@ int arch_set_user_pkey_access(struct tas
 		return -EINVAL;
 	return __arch_set_user_pkey_access(tsk, pkey, init_val);
 }
+#endif /* CONFIG_ARCH_HAS_PKEYS */
diff -puN arch/x86/mm/pkeys.c~pkeys-86-syscalls-allocation arch/x86/mm/pkeys.c
--- a/arch/x86/mm/pkeys.c~pkeys-86-syscalls-allocation	2016-02-22 17:09:24.632355269 -0800
+++ b/arch/x86/mm/pkeys.c	2016-02-22 17:09:24.643355770 -0800
@@ -21,8 +21,19 @@
 
 int __execute_only_pkey(struct mm_struct *mm)
 {
+	bool need_to_set_mm_pkey = false;
+	int execute_only_pkey = mm->context.execute_only_pkey;
 	int ret;
 
+	/* Do we need to assign a pkey for mm's execute-only maps? */
+	if (execute_only_pkey == -1) {
+		/* Go allocate one to use, which might fail */
+		execute_only_pkey = mm_pkey_alloc(mm);
+		if (!validate_pkey(execute_only_pkey))
+			return -1;
+		need_to_set_mm_pkey = true;
+	}
+
 	/*
 	 * We do not want to go through the relatively costly
 	 * dance to set PKRU if we do not need to.  Check it
@@ -32,22 +43,33 @@ int __execute_only_pkey(struct mm_struct
 	 * can make fpregs inactive.
 	 */
 	preempt_disable();
-	if (fpregs_active() &&
-	    !__pkru_allows_read(read_pkru(), PKEY_DEDICATED_EXECUTE_ONLY)) {
+	if (!need_to_set_mm_pkey &&
+	    fpregs_active() &&
+	    !__pkru_allows_read(read_pkru(), execute_only_pkey)) {
 		preempt_enable();
-		return PKEY_DEDICATED_EXECUTE_ONLY;
+		return execute_only_pkey;
 	}
 	preempt_enable();
-	ret = __arch_set_user_pkey_access(current, PKEY_DEDICATED_EXECUTE_ONLY,
-			PKEY_DISABLE_ACCESS);
+
 	/*
+	 * Set up PKRU so that it denies access for everything
+	 * other than execution.
+	 */
+	ret = __arch_set_user_pkey_access(current, execute_only_pkey,
+			PKEY_DISABLE_ACCESS);
+	/*
 	 * If the PKRU-set operation failed somehow, just return
 	 * 0 and effectively disable execute-only support.
 	 */
-	if (ret)
-		return 0;
+	if (ret) {
+		mm_set_pkey_free(mm, execute_only_pkey);
+		return -1;
+	}
 
-	return PKEY_DEDICATED_EXECUTE_ONLY;
+	/* We got one, store it and use it from here on out */
+	if (need_to_set_mm_pkey)
+		mm->context.execute_only_pkey = execute_only_pkey;
+	return execute_only_pkey;
 }
 
 static inline bool vma_is_pkey_exec_only(struct vm_area_struct *vma)
@@ -55,7 +77,7 @@ static inline bool vma_is_pkey_exec_only
 	/* Do this check first since the vm_flags should be hot */
 	if ((vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) != VM_EXEC)
 		return false;
-	if (vma_pkey(vma) != PKEY_DEDICATED_EXECUTE_ONLY)
+	if (vma_pkey(vma) != vma->vm_mm->context.execute_only_pkey)
 		return false;
 
 	return true;
diff -puN include/linux/pkeys.h~pkeys-86-syscalls-allocation include/linux/pkeys.h
--- a/include/linux/pkeys.h~pkeys-86-syscalls-allocation	2016-02-22 17:09:24.634355360 -0800
+++ b/include/linux/pkeys.h	2016-02-22 17:09:24.643355770 -0800
@@ -4,11 +4,6 @@
 #include <linux/mm_types.h>
 #include <asm/mmu_context.h>
 
-#define PKEY_DISABLE_ACCESS	0x1
-#define PKEY_DISABLE_WRITE	0x2
-#define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
-				 PKEY_DISABLE_WRITE)
-
 #ifdef CONFIG_ARCH_HAS_PKEYS
 #include <asm/pkeys.h>
 #else /* ! CONFIG_ARCH_HAS_PKEYS */
@@ -17,7 +12,6 @@
 #define arch_override_mprotect_pkey(vma, prot, pkey) (0)
 #define PKEY_DEDICATED_EXECUTE_ONLY 0
 #define ARCH_VM_PKEY_FLAGS 0
-#endif /* ! CONFIG_ARCH_HAS_PKEYS */
 
 /*
  * This is called from mprotect_pkey().
@@ -31,4 +25,28 @@ static inline bool validate_pkey(int pke
 	return (pkey < arch_max_pkey());
 }
 
+static inline bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
+{
+	return (pkey == 0);
+}
+
+static inline int mm_pkey_alloc(struct mm_struct *mm)
+{
+	return -1;
+}
+
+static inline int mm_pkey_free(struct mm_struct *mm, int pkey)
+{
+	WARN_ONCE(1, "free of protection key when disabled");
+	return -EINVAL;
+}
+
+static inline int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+			unsigned long init_val)
+{
+	return 0;
+}
+
+#endif /* ! CONFIG_ARCH_HAS_PKEYS */
+
 #endif /* _LINUX_PKEYS_H */
diff -puN include/uapi/asm-generic/mman-common.h~pkeys-86-syscalls-allocation include/uapi/asm-generic/mman-common.h
--- a/include/uapi/asm-generic/mman-common.h~pkeys-86-syscalls-allocation	2016-02-22 17:09:24.636355451 -0800
+++ b/include/uapi/asm-generic/mman-common.h	2016-02-22 17:09:24.643355770 -0800
@@ -72,4 +72,9 @@
 #define MAP_HUGE_SHIFT	26
 #define MAP_HUGE_MASK	0x3f
 
+#define PKEY_DISABLE_ACCESS	0x1
+#define PKEY_DISABLE_WRITE	0x2
+#define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
+				 PKEY_DISABLE_WRITE)
+
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff -puN mm/mprotect.c~pkeys-86-syscalls-allocation mm/mprotect.c
--- a/mm/mprotect.c~pkeys-86-syscalls-allocation	2016-02-22 17:09:24.637355497 -0800
+++ b/mm/mprotect.c	2016-02-22 17:09:24.644355816 -0800
@@ -23,12 +23,15 @@
 #include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
 #include <linux/perf_event.h>
+#include <linux/pkeys.h>
 #include <linux/ksm.h>
 #include <linux/pkeys.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
+#include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
+#include <asm-generic/mman-common.h>
 
 #include "internal.h"
 
@@ -386,6 +389,14 @@ static int do_mprotect_pkey(unsigned lon
 
 	down_write(&current->mm->mmap_sem);
 
+	/*
+	 * If userspace did not allocate the pkey, do not let
+	 * them use it here.
+	 */
+	error = -EINVAL;
+	if ((pkey != -1) && !mm_pkey_is_allocated(current->mm, pkey))
+		goto out;
+
 	vma = find_vma(current->mm, start);
 	error = -ENOMEM;
 	if (!vma)
@@ -477,3 +488,48 @@ SYSCALL_DEFINE4(pkey_mprotect, unsigned
 
 	return do_mprotect_pkey(start, len, prot, pkey);
 }
+
+SYSCALL_DEFINE2(pkey_alloc, unsigned long, flags, unsigned long, init_val)
+{
+	int pkey;
+	int ret;
+
+	/* No flags supported yet. */
+	if (flags)
+		return -EINVAL;
+	/* check for unsupported init values */
+	if (init_val & ~PKEY_ACCESS_MASK)
+		return -EINVAL;
+
+	down_write(&current->mm->mmap_sem);
+	pkey = mm_pkey_alloc(current->mm);
+
+	ret = -ENOSPC;
+	if (pkey == -1)
+		goto out;
+
+	ret = arch_set_user_pkey_access(current, pkey, init_val);
+	if (ret) {
+		mm_pkey_free(current->mm, pkey);
+		goto out;
+	}
+	ret = pkey;
+out:
+	up_write(&current->mm->mmap_sem);
+	return ret;
+}
+
+SYSCALL_DEFINE1(pkey_free, int, pkey)
+{
+	int ret;
+
+	down_write(&current->mm->mmap_sem);
+	ret = mm_pkey_free(current->mm, pkey);
+	up_write(&current->mm->mmap_sem);
+
+	/*
+	 * We could provide warnings or errors if any VMA still
+	 * has the pkey set here.
+	 */
+	return ret;
+}
_

* [RFC][PATCH 6/7] x86, pkeys: add pkey set/get syscalls
  2016-02-23  1:11 [RFC][PATCH 0/7] System Calls for Memory Protection Keys Dave Hansen
                   ` (4 preceding siblings ...)
  2016-02-23  1:11 ` [RFC][PATCH 5/7] x86, pkeys: allocation/free syscalls Dave Hansen
@ 2016-02-23  1:11 ` Dave Hansen
  2016-02-23  6:45   ` Ingo Molnar
  2016-02-23  1:11 ` [RFC][PATCH 7/7] pkeys: add details of system call use to Documentation/ Dave Hansen
  2016-03-03  8:05 ` [RFC][PATCH 0/7] System Calls for Memory Protection Keys Michael Kerrisk (man-pages)
  7 siblings, 1 reply; 12+ messages in thread
From: Dave Hansen @ 2016-02-23  1:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, dave.hansen, linux-api, linux-mm, x86, torvalds, akpm


From: Dave Hansen <dave.hansen@linux.intel.com>

This establishes two more system calls for protection key management:

	unsigned long pkey_get(int pkey, unsigned long flags);
	int pkey_set(int pkey, unsigned long access_rights, unsigned long flags);

The return value from pkey_get() and the 'access_rights' passed
to pkey_set() use the same format: a bitmask containing
PKEY_DISABLE_WRITE and/or PKEY_DISABLE_ACCESS, or nothing set at all.
No 'flags' are defined yet, so 'flags' must be 0.

These can replace userspace's direct use of the new rdpkru/wrpkru
instructions.

With current hardware, the kernel cannot enforce that it has
control over a given key.  But, this at least allows the kernel
to indicate to userspace that userspace does not control a given
protection key.  This makes it more likely that situations like
using a pkey after sys_pkey_free() can be detected.
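
As an illustration only, the bookkeeping behind that check can be
modeled in userspace as a small bitmap.  This is a sketch, not the
kernel code in this series: struct pkey_map and the pkey_map_*()
helpers are hypothetical names, NR_PKEYS of 16 and lowest-free-key
allocation are assumptions, and the error codes simply mirror what
the syscalls here return (-ENOSPC when no key is left, -EINVAL for
a bad free, -EBADF for get/set on a key that is not allocated).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define NR_PKEYS 16	/* assumption: 16 keys, key 0 reserved */

/* Hypothetical per-process state: bit N set means pkey N is allocated. */
struct pkey_map {
	uint32_t allocated;
};

static void pkey_map_init(struct pkey_map *m)
{
	assert(m != NULL);
	m->allocated = 1u;	/* pkey 0 is always allocated */
}

/* Hand out the lowest free key, or -ENOSPC if none is left. */
static int pkey_map_alloc(struct pkey_map *m)
{
	int pkey;

	for (pkey = 1; pkey < NR_PKEYS; pkey++) {
		if (!(m->allocated & (1u << pkey))) {
			m->allocated |= 1u << pkey;
			return pkey;
		}
	}
	return -ENOSPC;
}

/* pkey 0 can never be freed; freeing an unallocated key is an error. */
static int pkey_map_free(struct pkey_map *m, int pkey)
{
	if (pkey <= 0 || pkey >= NR_PKEYS)
		return -EINVAL;
	if (!(m->allocated & (1u << pkey)))
		return -EINVAL;
	m->allocated &= ~(1u << pkey);
	return 0;
}

/* What a get/set style call would check before touching the rights. */
static int pkey_map_check(const struct pkey_map *m, int pkey)
{
	if (pkey < 0 || pkey >= NR_PKEYS)
		return -EINVAL;
	return (m->allocated & (1u << pkey)) ? 0 : -EBADF;
}
```

In this model, a pkey_map_check() on a key that was already passed
to pkey_map_free() fails with -EBADF, which is the "use after free"
case described above.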

The kernel does _not_ enforce that this interface must be used for
changes to PKRU, whether or not a key has been "allocated".

This syscall interface could also theoretically be replaced with a
pair of vsyscalls.  The vsyscalls would just call WRPKRU/RDPKRU
directly in situations where they are drop-in equivalents for
what the kernel would be doing.
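
For reference, a minimal sketch of the arithmetic such a vsyscall
would have to do.  PKRU holds two bits per key: access-disable in
the low bit and write-disable in the high bit.  The helper names
below are hypothetical and the code only computes values; it does
not execute RDPKRU/WRPKRU, and it assumes 16 keys.

```c
#include <assert.h>
#include <stdint.h>

/* User-visible rights bits, as defined by this series. */
#define PKEY_DISABLE_ACCESS	0x1
#define PKEY_DISABLE_WRITE	0x2

/* PKRU layout: two bits per key, AD (bit 0) then WD (bit 1). */
#define PKRU_BITS_PER_PKEY	2
#define PKRU_AD_BIT		0x1u
#define PKRU_WD_BIT		0x2u
#define NR_PKEYS		16	/* assumption */

/* Hypothetical helper: 'pkru' with the rights for 'pkey' replaced. */
static uint32_t pkru_with_rights(uint32_t pkru, int pkey,
				 unsigned long rights)
{
	uint32_t bits = 0;
	int shift;

	assert(pkey >= 0 && pkey < NR_PKEYS);
	shift = pkey * PKRU_BITS_PER_PKEY;

	if (rights & PKEY_DISABLE_ACCESS)
		bits |= PKRU_AD_BIT;
	if (rights & PKEY_DISABLE_WRITE)
		bits |= PKRU_WD_BIT;

	/* Clear the old two bits for this key, then set the new ones. */
	pkru &= ~((PKRU_AD_BIT | PKRU_WD_BIT) << shift);
	return pkru | (bits << shift);
}

/* Hypothetical helper: PKEY_DISABLE_* rights of 'pkey' in 'pkru'. */
static unsigned long pkru_rights(uint32_t pkru, int pkey)
{
	unsigned long rights = 0;
	int shift;

	assert(pkey >= 0 && pkey < NR_PKEYS);
	shift = pkey * PKRU_BITS_PER_PKEY;

	if (pkru & (PKRU_AD_BIT << shift))
		rights |= PKEY_DISABLE_ACCESS;
	if (pkru & (PKRU_WD_BIT << shift))
		rights |= PKEY_DISABLE_WRITE;
	return rights;
}
```

A userspace pkey_set() replacement would read PKRU, pass it through
something like pkru_with_rights(), and write the result back.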

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: x86@kernel.org
Cc: torvalds@linux-foundation.org
Cc: akpm@linux-foundation.org
---

 b/arch/x86/entry/syscalls/syscall_32.tbl |    2 +
 b/arch/x86/entry/syscalls/syscall_64.tbl |    2 +
 b/arch/x86/include/asm/pkeys.h           |    4 +-
 b/arch/x86/kernel/fpu/xstate.c           |   55 +++++++++++++++++++++++++++++--
 b/include/linux/pkeys.h                  |    8 ++++
 b/mm/mprotect.c                          |   41 +++++++++++++++++++++++
 6 files changed, 109 insertions(+), 3 deletions(-)

diff -puN arch/x86/entry/syscalls/syscall_32.tbl~pkeys-87-syscalls-set-get arch/x86/entry/syscalls/syscall_32.tbl
--- a/arch/x86/entry/syscalls/syscall_32.tbl~pkeys-87-syscalls-set-get	2016-02-22 17:09:25.275384573 -0800
+++ b/arch/x86/entry/syscalls/syscall_32.tbl	2016-02-22 17:09:25.286385075 -0800
@@ -387,3 +387,5 @@
 378	i386	pkey_mprotect		sys_pkey_mprotect
 379	i386	pkey_alloc		sys_pkey_alloc
 380	i386	pkey_free		sys_pkey_free
+381	i386	pkey_get		sys_pkey_get
+382	i386	pkey_set		sys_pkey_set
diff -puN arch/x86/entry/syscalls/syscall_64.tbl~pkeys-87-syscalls-set-get arch/x86/entry/syscalls/syscall_64.tbl
--- a/arch/x86/entry/syscalls/syscall_64.tbl~pkeys-87-syscalls-set-get	2016-02-22 17:09:25.276384619 -0800
+++ b/arch/x86/entry/syscalls/syscall_64.tbl	2016-02-22 17:09:25.287385120 -0800
@@ -336,6 +336,8 @@
 327	common	pkey_mprotect		sys_pkey_mprotect
 328	common	pkey_alloc		sys_pkey_alloc
 329	common	pkey_free		sys_pkey_free
+330	common	pkey_get		sys_pkey_get
+331	common	pkey_set		sys_pkey_set
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff -puN arch/x86/include/asm/pkeys.h~pkeys-87-syscalls-set-get arch/x86/include/asm/pkeys.h
--- a/arch/x86/include/asm/pkeys.h~pkeys-87-syscalls-set-get	2016-02-22 17:09:25.278384710 -0800
+++ b/arch/x86/include/asm/pkeys.h	2016-02-22 17:09:25.287385120 -0800
@@ -57,7 +57,7 @@ static inline bool validate_pkey(int pke
 }
 
 static inline
-bool mm_pkey_is_allocated(struct mm_struct *mm, unsigned long pkey)
+bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
 {
 	if (!validate_pkey(pkey))
 		return true;
@@ -108,4 +108,6 @@ extern int arch_set_user_pkey_access(str
 extern int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 		unsigned long init_val);
 
+extern unsigned long arch_get_user_pkey_access(struct task_struct *tsk,
+		int pkey);
 #endif /*_ASM_X86_PKEYS_H */
diff -puN arch/x86/kernel/fpu/xstate.c~pkeys-87-syscalls-set-get arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~pkeys-87-syscalls-set-get	2016-02-22 17:09:25.280384801 -0800
+++ b/arch/x86/kernel/fpu/xstate.c	2016-02-22 17:09:25.287385120 -0800
@@ -687,7 +687,7 @@ void fpu__resume_cpu(void)
  *
  * Note: does not work for compacted buffers.
  */
-void *__raw_xsave_addr(struct xregs_state *xsave, int xstate_feature_mask)
+static void *__raw_xsave_addr(struct xregs_state *xsave, int xstate_feature_mask)
 {
 	int feature_nr = fls64(xstate_feature_mask) - 1;
 
@@ -861,6 +861,7 @@ out:
 
 #define NR_VALID_PKRU_BITS (CONFIG_NR_PROTECTION_KEYS * 2)
 #define PKRU_VALID_MASK (NR_VALID_PKRU_BITS - 1)
+#define PKRU_INIT_STATE	0
 
 /*
  * This will go out and modify the XSAVE buffer so that PKRU is
@@ -879,6 +880,9 @@ int __arch_set_user_pkey_access(struct t
 	int pkey_shift = (pkey * PKRU_BITS_PER_PKEY);
 	u32 new_pkru_bits = 0;
 
+	/* Only support manipulating current task for now */
+	if (tsk != current)
+		return -EINVAL;
 	/*
 	 * This check implies XSAVE support.  OSPKE only gets
 	 * set if we enable XSAVE and we enable PKU in XCR0.
@@ -904,7 +908,7 @@ int __arch_set_user_pkey_access(struct t
 	 * state.
 	 */
 	if (!old_pkru_state)
-		new_pkru_state.pkru = 0;
+		new_pkru_state.pkru = PKRU_INIT_STATE;
 	else
 		new_pkru_state.pkru = old_pkru_state->pkru;
 
@@ -942,4 +946,51 @@ int arch_set_user_pkey_access(struct tas
 		return -EINVAL;
 	return __arch_set_user_pkey_access(tsk, pkey, init_val);
 }
+
+/*
+ * Figures out what the rights are currently for 'pkey'.
+ * Converts from PKRU's format to the user-visible PKEY_DISABLE_*
+ * format.
+ */
+unsigned long arch_get_user_pkey_access(struct task_struct *tsk, int pkey)
+{
+	struct fpu *fpu = &current->thread.fpu;
+	u32 pkru_reg;
+	int ret = 0;
+
+	/* Only support manipulating current task for now */
+	if (tsk != current)
+		return -1;
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return -1;
+	/*
+	 * The contents of PKRU itself are invalid.  Consult the
+	 * task's XSAVE buffer for PKRU contents.  This is much
+	 * more expensive than reading PKRU directly, but should
+	 * be rare or impossible with eagerfpu mode.
+	 */
+	if (!fpu->fpregs_active) {
+		struct xregs_state *xsave = &fpu->state.xsave;
+		struct pkru_state *pkru_state =
+			get_xsave_addr(xsave, XFEATURE_MASK_PKRU);
+		/*
+		 * PKRU is in its init state and not present in
+		 * the buffer in a saved form.
+		 */
+		if (!pkru_state)
+			pkru_reg = PKRU_INIT_STATE;
+		else
+			pkru_reg = pkru_state->pkru;
+	} else {
+		/* Consult the user register directly. */
+		pkru_reg = read_pkru();
+	}
+
+	if (!__pkru_allows_read(pkru_reg, pkey))
+		ret |= PKEY_DISABLE_ACCESS;
+	if (!__pkru_allows_write(pkru_reg, pkey))
+		ret |= PKEY_DISABLE_WRITE;
+
+	return ret;
+}
 #endif /* CONFIG_ARCH_HAS_PKEYS */
diff -puN include/linux/pkeys.h~pkeys-87-syscalls-set-get include/linux/pkeys.h
--- a/include/linux/pkeys.h~pkeys-87-syscalls-set-get	2016-02-22 17:09:25.281384847 -0800
+++ b/include/linux/pkeys.h	2016-02-22 17:09:25.288385166 -0800
@@ -44,6 +44,14 @@ static inline int mm_pkey_free(struct mm
 static inline int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 			unsigned long init_val)
 {
+	return -EINVAL;
+}
+
+static inline
+unsigned long arch_get_user_pkey_access(struct task_struct *tsk, int pkey)
+{
+	if (pkey)
+		return -1;
 	return 0;
 }
 
diff -puN mm/mprotect.c~pkeys-87-syscalls-set-get mm/mprotect.c
--- a/mm/mprotect.c~pkeys-87-syscalls-set-get	2016-02-22 17:09:25.283384938 -0800
+++ b/mm/mprotect.c	2016-02-22 17:09:25.288385166 -0800
@@ -533,3 +533,44 @@ SYSCALL_DEFINE1(pkey_free, int, pkey)
 	 */
 	return ret;
 }
+
+SYSCALL_DEFINE2(pkey_get, int, pkey, unsigned long, flags)
+{
+	unsigned long ret = 0;
+
+	if (flags)
+		return -EINVAL;
+
+	down_write(&current->mm->mmap_sem);
+	if (!mm_pkey_is_allocated(current->mm, pkey))
+		ret = -EBADF;
+	up_write(&current->mm->mmap_sem);
+
+	if (ret)
+		return ret;
+
+	ret = arch_get_user_pkey_access(current, pkey);
+
+	return ret;
+}
+
+SYSCALL_DEFINE3(pkey_set, int, pkey, unsigned long, access_rights,
+		unsigned long, flags)
+{
+	unsigned long ret = 0;
+
+	if (flags)
+		return -EINVAL;
+
+	down_write(&current->mm->mmap_sem);
+	if (!mm_pkey_is_allocated(current->mm, pkey))
+		ret = -EBADF;
+	up_write(&current->mm->mmap_sem);
+
+	if (ret)
+		return ret;
+
+	ret = arch_set_user_pkey_access(current, pkey, access_rights);
+
+	return ret;
+}
_

* [RFC][PATCH 7/7] pkeys: add details of system call use to Documentation/
  2016-02-23  1:11 [RFC][PATCH 0/7] System Calls for Memory Protection Keys Dave Hansen
                   ` (5 preceding siblings ...)
  2016-02-23  1:11 ` [RFC][PATCH 6/7] x86, pkeys: add pkey set/get syscalls Dave Hansen
@ 2016-02-23  1:11 ` Dave Hansen
  2016-02-23  6:38   ` Ingo Molnar
  2016-03-03  8:05 ` [RFC][PATCH 0/7] System Calls for Memory Protection Keys Michael Kerrisk (man-pages)
  7 siblings, 1 reply; 12+ messages in thread
From: Dave Hansen @ 2016-02-23  1:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, dave.hansen, linux-api, linux-mm, x86, torvalds, akpm


From: Dave Hansen <dave.hansen@linux.intel.com>

This spells out all of the pkey-related system calls that we have
and provides some example code fragments to demonstrate how we
expect them to be used.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: x86@kernel.org
Cc: torvalds@linux-foundation.org
Cc: akpm@linux-foundation.org
---

 b/Documentation/x86/protection-keys.txt |   63 ++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff -puN Documentation/x86/protection-keys.txt~pkeys-98-syscall-docs Documentation/x86/protection-keys.txt
--- a/Documentation/x86/protection-keys.txt~pkeys-98-syscall-docs	2016-02-22 17:09:25.814409138 -0800
+++ b/Documentation/x86/protection-keys.txt	2016-02-22 17:09:25.818409320 -0800
@@ -19,6 +19,69 @@ even though there is theoretically space
 permissions are enforced on data access only and have no effect on
 instruction fetches.
 
+=========================== Syscalls ===========================
+
+There are 5 system calls which directly interact with pkeys:
+
+	int pkey_alloc(unsigned long flags, unsigned long init_access_rights);
+	int pkey_free(int pkey);
+	int pkey_mprotect(unsigned long start, size_t len,
+			  unsigned long prot, int pkey);
+	unsigned long pkey_get(int pkey, unsigned long flags);
+	int pkey_set(int pkey, unsigned long access_rights, unsigned long flags);
+
+Before a pkey can be used, it must first be allocated with
+pkey_alloc().  An application may either call pkey_set() or use the
+WRPKRU instruction directly in order to change access permissions
+to memory covered by a key.
+
+	int real_prot = PROT_READ|PROT_WRITE;
+	pkey = pkey_alloc(0, PKEY_DISABLE_WRITE);
+	ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+	ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
+	... application runs here
+
+Now, if the application needs to update the data at 'ptr', it can
+gain access, do the update, then remove its write access:
+
+	pkey_set(pkey, 0, 0); // clear PKEY_DISABLE_WRITE
+	*ptr = foo; // assign something
+	pkey_set(pkey, PKEY_DISABLE_WRITE, 0); // set PKEY_DISABLE_WRITE again
+
+Now when it frees the memory, it will also free the pkey since it
+is no longer in use:
+
+	munmap(ptr, PAGE_SIZE);
+	pkey_free(pkey);
+
+=========================== Behavior ===========================
+
+The kernel attempts to make protection keys consistent with the
+behavior of a plain mprotect().  For instance if you do this:
+
+	mprotect(ptr, size, PROT_NONE);
+	something(ptr);
+
+you can expect the same effects with protection keys when doing this:
+
+	pkey = pkey_alloc(0, PKEY_DISABLE_WRITE | PKEY_DISABLE_ACCESS);
+	pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey);
+	something(ptr);
+
+That should be true whether something() is a direct access to 'ptr'
+like:
+
+	*ptr = foo;
+
+or when the kernel does the access on the application's behalf like
+with a read():
+
+	read(fd, ptr, 1);
+
+The kernel will send a SIGSEGV in both cases, but si_code will be set
+to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
+the plain mprotect() permissions are violated.
+
 =========================== Config Option ===========================
 
 This config option adds approximately 1.5kb of text. and 50 bytes of
_

* Re: [RFC][PATCH 7/7] pkeys: add details of system call use to Documentation/
  2016-02-23  1:11 ` [RFC][PATCH 7/7] pkeys: add details of system call use to Documentation/ Dave Hansen
@ 2016-02-23  6:38   ` Ingo Molnar
  0 siblings, 0 replies; 12+ messages in thread
From: Ingo Molnar @ 2016-02-23  6:38 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, dave.hansen, linux-api, linux-mm, x86, torvalds, akpm


* Dave Hansen <dave@sr71.net> wrote:

> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> This spells out all of the pkey-related system calls that we have
> and provides some example code fragments to demonstrate how we
> expect them to be used.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: linux-api@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: x86@kernel.org
> Cc: torvalds@linux-foundation.org
> Cc: akpm@linux-foundation.org
> ---
> 
>  b/Documentation/x86/protection-keys.txt |   63 ++++++++++++++++++++++++++++++++
>  1 file changed, 63 insertions(+)

Please also add pkeys testcases to tools/testing/selftests.

Thanks,

	Ingo

* Re: [RFC][PATCH 6/7] x86, pkeys: add pkey set/get syscalls
  2016-02-23  1:11 ` [RFC][PATCH 6/7] x86, pkeys: add pkey set/get syscalls Dave Hansen
@ 2016-02-23  6:45   ` Ingo Molnar
  0 siblings, 0 replies; 12+ messages in thread
From: Ingo Molnar @ 2016-02-23  6:45 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, dave.hansen, linux-api, linux-mm, x86, torvalds, akpm


* Dave Hansen <dave@sr71.net> wrote:

> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> This establishes two more system calls for protection key management:
> 
> 	unsigned long pkey_get(int pkey);
> 	int pkey_set(int pkey, unsigned long access_rights);
> 
> The return value from pkey_get() and the 'access_rights' passed
> to pkey_set() are the same format: a bitmask containing
> PKEY_DENY_WRITE and/or PKEY_DENY_ACCESS, or nothing set at all.
> 
> These can replace userspace's direct use of the new rdpkru/wrpkru
> instructions.
> 
> With current hardware, the kernel can not enforce that it has
> control over a given key.  But, this at least allows the kernel
> to indicate to userspace that userspace does not control a given
> protection key.  This makes it more likely that situations like
> using a pkey after sys_pkey_free() can be detected.

So it's analogous to file descriptor open()/close() syscalls: the kernel does not 
enforce that different libraries of the same process do not interfere with each 
other's file descriptors - but in practice it's not a problem because everyone 
uses open()/close().

Resources that a process uses don't per se 'need' kernel level isolation to be 
useful.

> The kernel does _not_ enforce that this interface must be used for
> changes to PKRU, whether or not a key has been "allocated".

Nor does the kernel enforce that open() must be used to get a file descriptor, so 
code can do the following:

	close(100);

and can interfere with a library that is holding a file open - but it's generally 
not a problem and the above is considered poor code that will cause problems.

One thing that is different is that file descriptors are generally plentiful, 
while there are at most 16 pkeys - but I think that is still "large enough" to not 
be an issue in practice.

We'll see ...

> This syscall interface could also theoretically be replaced with a pair of 
> vsyscalls.  The vsyscalls would just call WRPKRU/RDPKRU directly in situations 
> where they are drop-in equivalents for what the kernel would be doing.

Indeed.

Thanks,

	Ingo

* Re: [RFC][PATCH 0/7] System Calls for Memory Protection Keys
  2016-02-23  1:11 [RFC][PATCH 0/7] System Calls for Memory Protection Keys Dave Hansen
                   ` (6 preceding siblings ...)
  2016-02-23  1:11 ` [RFC][PATCH 7/7] pkeys: add details of system call use to Documentation/ Dave Hansen
@ 2016-03-03  8:05 ` Michael Kerrisk (man-pages)
  2016-03-03 23:49   ` Dave Hansen
  7 siblings, 1 reply; 12+ messages in thread
From: Michael Kerrisk (man-pages) @ 2016-03-03  8:05 UTC (permalink / raw)
  To: Dave Hansen; +Cc: lkml, Linux API, linux-mm, x86, Linus Torvalds, Andrew Morton

Hi Dave,

On 23 February 2016 at 02:11, Dave Hansen <dave@sr71.net> wrote:
> As promised, here are the proposed new Memory Protection Keys
> interfaces.  These interfaces make it possible to do something
> with pkeys other than execute-only support.
>
> There are 5 syscalls here.  I'm hoping for reviews of this set
> which can help nail down what the final interfaces will be.
>
> You can find a high-level overview of the feature and the new
> syscalls here:
>
>         https://www.sr71.net/~dave/intel/pkeys.txt

(That's pretty thin...)

> ===============================================================
>
> To use memory protection keys (pkeys), an application absolutely
> needs to be able to set the pkey field in the PTE (obviously has
> to be done in-kernel) and make changes to the "rights" register
> (using unprivileged instructions).
>
> An application also needs to have an allocator for the keys
> themselves.  If two different parts of an application both want
> to protect their data with pkeys, they first need to know which
> key to use for their individual purposes.
>
> This set introduces 5 system calls, in 3 logical groups:
>
> 1. PTE pkey setting (sys_pkey_mprotect(), patches #1-3)
> 2. Key allocation (sys_pkey_alloc() / sys_pkey_free(), patch #4)
> 3. Rights register manipulation (sys_pkey_set/get(), patch #5)
>
> These patches build on top of "core" support already in the tip tree,
> specifically 62b5f7d013, which can currently be found at:
>
>         http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/log/?h=mm/pkeys
>
> I have manpages written for some of these syscalls, and I will
> submit a full set of manpages once we've reached some consensus
> on what the interfaces should be.

Please don't do things in this order. Providing man pages up front
makes it easier for people to understand, review, and critique the API.
Submitting man pages should be a foundational part of submitting a new
set of interfaces and discussing their design.

Thanks,

Michael

* Re: [RFC][PATCH 0/7] System Calls for Memory Protection Keys
  2016-03-03  8:05 ` [RFC][PATCH 0/7] System Calls for Memory Protection Keys Michael Kerrisk (man-pages)
@ 2016-03-03 23:49   ` Dave Hansen
  0 siblings, 0 replies; 12+ messages in thread
From: Dave Hansen @ 2016-03-03 23:49 UTC (permalink / raw)
  To: mtk.manpages
  Cc: lkml, Linux API, linux-mm, x86, Linus Torvalds, Andrew Morton

On 03/03/2016 12:05 AM, Michael Kerrisk (man-pages) wrote:
>> > I have manpages written for some of these syscalls, and I will
>> > submit a full set of manpages once we've reached some consensus
>> > on what the interfaces should be.
> Please don't do things in this order. Providing man pages up front
> makes it easier for people to understand, review, and critique the API.
> Submitting man pages should be a foundational part of submitting a new
> set of interfaces and discussing their design.

Michael, thanks for taking a look, plus the very detailed previous
review you did of the first batch of man-pages that I posted.

I've posted a newer version including all of the new system calls, and
I've attempted to address all the earlier review comments you made.

Thread overview: 12+ messages
2016-02-23  1:11 [RFC][PATCH 0/7] System Calls for Memory Protection Keys Dave Hansen
2016-02-23  1:11 ` [RFC][PATCH 1/7] x86, pkeys: Documentation Dave Hansen
2016-02-23  1:11 ` [RFC][PATCH 2/7] mm: implement new pkey_mprotect() system call Dave Hansen
2016-02-23  1:11 ` [RFC][PATCH 3/7] x86, pkeys: make mprotect_key() mask off additional vm_flags Dave Hansen
2016-02-23  1:11 ` [RFC][PATCH 4/7] x86: wire up mprotect_key() system call Dave Hansen
2016-02-23  1:11 ` [RFC][PATCH 5/7] x86, pkeys: allocation/free syscalls Dave Hansen
2016-02-23  1:11 ` [RFC][PATCH 6/7] x86, pkeys: add pkey set/get syscalls Dave Hansen
2016-02-23  6:45   ` Ingo Molnar
2016-02-23  1:11 ` [RFC][PATCH 7/7] pkeys: add details of system call use to Documentation/ Dave Hansen
2016-02-23  6:38   ` Ingo Molnar
2016-03-03  8:05 ` [RFC][PATCH 0/7] System Calls for Memory Protection Keys Michael Kerrisk (man-pages)
2016-03-03 23:49   ` Dave Hansen