* [PATCH 00/12] [RFC] x86: Memory Protection Keys
@ 2015-05-07 17:41 Dave Hansen
  2015-05-07 17:41 ` [PATCH 02/12] x86, pku: define new CR4 bit Dave Hansen
                   ` (12 more replies)
  0 siblings, 13 replies; 37+ messages in thread
From: Dave Hansen @ 2015-05-07 17:41 UTC (permalink / raw)
  To: dave; +Cc: linux-kernel, x86

This is a big, fat RFC.  This code is going to be unrunnable for
anyone outside of Intel.  But, this patch set has user interface
implications because we need to pass the protection key into
the kernel somehow.

At this point, I would especially appreciate feedback on how
we should do that.  I've taken the most expedient approach for
this first attempt, especially since we piggyback on existing
syscalls here.

There is a lot of work left to do here.  Mainly, we need to
ensure that when we walk the page tables in software, we obey
protection keys whenever possible.  This is going to mean a lot
of audits of the page-table-walking code, although some of it,
like access_process_vm(), we can probably safely ignore.

This set is also available here:

	git://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-pkeys.git pkeys-v001

== FEATURE OVERVIEW ==

Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU
feature which will be found in future Intel CPUs.  The work here
was done with the aid of simulators.

Memory Protection Keys provides a mechanism for enforcing
page-based protections, but without requiring modification of the
page tables when an application changes protection domains.  It
works by dedicating 4 previously ignored bits in each page table
entry to a "protection key", giving 16 possible keys.

There is also a new user-accessible register (PKRU) with two
separate bits (Access Disable and Write Disable) for each key.
Being a CPU register, PKRU is inherently thread-local,
potentially giving each thread a different set of protections
from every other thread.

There are two new instructions (RDPKRU/WRPKRU) for reading and
writing to the new register.  The feature is only available in
64-bit mode, even though there is theoretically space in the PAE
PTEs.  These permissions are enforced on data access only and
have no effect on instruction fetches.
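
As a rough illustration (not part of this patch set), here is how a
userspace thread could flip its own access rights, assuming the PKRU
layout from the SDM (two bits per key: Access Disable in the even bit,
Write Disable in the odd bit).  Raw opcode bytes are used because
assemblers of this era do not know the RDPKRU/WRPKRU mnemonics:

	#include <stdint.h>

	static inline uint32_t rdpkru(void)
	{
		uint32_t pkru, edx;

		/* RDPKRU requires ECX=0; returns PKRU in EAX, 0 in EDX */
		asm volatile(".byte 0x0f,0x01,0xee"
			     : "=a" (pkru), "=d" (edx) : "c" (0));
		return pkru;
	}

	static inline void wrpkru(uint32_t pkru)
	{
		/* WRPKRU requires ECX=0 and EDX=0 */
		asm volatile(".byte 0x0f,0x01,0xef"
			     : : "a" (pkru), "c" (0), "d" (0));
	}

	#define PKRU_AD(key)	(1u << (2 * (key)))	/* Access Disable */
	#define PKRU_WD(key)	(1u << (2 * (key) + 1))	/* Write Disable */

	/* Example: make pages tagged with key 1 read-only for this thread: */
	static inline void deny_writes_to_key1(void)
	{
		wrpkru(rdpkru() | PKRU_WD(1));
	}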



* [PATCH 02/12] x86, pku: define new CR4 bit
  2015-05-07 17:41 [PATCH 00/12] [RFC] x86: Memory Protection Keys Dave Hansen
@ 2015-05-07 17:41 ` Dave Hansen
  2015-05-07 17:41 ` [PATCH 01/12] x86, pkeys: cpuid bit definition Dave Hansen
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2015-05-07 17:41 UTC (permalink / raw)
  To: dave; +Cc: linux-kernel, x86


There is a new bit in CR4 for enabling protection keys.

---

 b/arch/x86/include/uapi/asm/processor-flags.h |    2 ++
 1 file changed, 2 insertions(+)

diff -puN arch/x86/include/uapi/asm/processor-flags.h~pkeys-1-cr4 arch/x86/include/uapi/asm/processor-flags.h
--- a/arch/x86/include/uapi/asm/processor-flags.h~pkeys-1-cr4	2015-05-07 10:31:41.384187278 -0700
+++ b/arch/x86/include/uapi/asm/processor-flags.h	2015-05-07 10:31:41.387187413 -0700
@@ -120,6 +120,8 @@
 #define X86_CR4_SMEP		_BITUL(X86_CR4_SMEP_BIT)
 #define X86_CR4_SMAP_BIT	21 /* enable SMAP support */
 #define X86_CR4_SMAP		_BITUL(X86_CR4_SMAP_BIT)
+#define X86_CR4_PKE_BIT		22 /* enable Protection Keys support */
+#define X86_CR4_PKE		_BITUL(X86_CR4_PKE_BIT)
 
 /*
  * x86-64 Task Priority Register, CR8
_


* [PATCH 01/12] x86, pkeys: cpuid bit definition
  2015-05-07 17:41 [PATCH 00/12] [RFC] x86: Memory Protection Keys Dave Hansen
  2015-05-07 17:41 ` [PATCH 02/12] x86, pku: define new CR4 bit Dave Hansen
@ 2015-05-07 17:41 ` Dave Hansen
  2015-05-07 17:41 ` [PATCH 03/12] x86, pkey: pkru xsave fields and data structure Dave Hansen
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2015-05-07 17:41 UTC (permalink / raw)
  To: dave; +Cc: linux-kernel, x86


There are two CPUID bits for protection keys.  One is for whether
the CPU contains the feature, and the other will appear set once
the OS enables protection keys.  Specifically:

	Bit 04: OSPKE. If 1, OS has set CR4.PKE to enable
	Protection keys (and the RDPKRU/WRPKRU instructions)

This is because userspace cannot see CR4 contents, but it can
see CPUID contents.

X86_FEATURE_PKU is referred to as "PKU" in the hardware documentation:

	CPUID.(EAX=07H,ECX=0H):ECX.PKU [bit 3]

X86_FEATURE_OSPKE is "OSPKE":

	CPUID.(EAX=07H,ECX=0H):ECX.OSPKE [bit 4]

These are the first CPU features which need to look at the
ECX word in CPUID leaf 0x7, so this patch also includes
fetching that word into the cpuinfo->x86_capability[] array.
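
As a sketch of how userspace might probe these bits (using GCC's
<cpuid.h>; the bit positions are the ones quoted above):

	#include <cpuid.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		/* CPUID.(EAX=07H,ECX=0H), per the definitions above */
		__cpuid_count(0x7, 0x0, eax, ebx, ecx, edx);
		printf("PKU:   %u\n", (ecx >> 3) & 1);
		printf("OSPKE: %u\n", (ecx >> 4) & 1);
		return 0;
	}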

---

 b/arch/x86/include/asm/cpufeature.h |    6 +++++-
 b/arch/x86/kernel/cpu/common.c      |    1 +
 2 files changed, 6 insertions(+), 1 deletion(-)

diff -puN arch/x86/include/asm/cpufeature.h~pkeys-0-cpuid arch/x86/include/asm/cpufeature.h
--- a/arch/x86/include/asm/cpufeature.h~pkeys-0-cpuid	2015-05-07 10:31:40.985169281 -0700
+++ b/arch/x86/include/asm/cpufeature.h	2015-05-07 10:31:40.991169552 -0700
@@ -12,7 +12,7 @@
 #include <asm/disabled-features.h>
 #endif
 
-#define NCAPINTS	13	/* N 32-bit words worth of info */
+#define NCAPINTS	14	/* N 32-bit words worth of info */
 #define NBUGINTS	1	/* N 32-bit bug flags */
 
 /*
@@ -252,6 +252,10 @@
 /* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:1 (edx), word 12 */
 #define X86_FEATURE_CQM_OCCUP_LLC (12*32+ 0) /* LLC occupancy monitoring if 1 */
 
+/* Intel-defined CPU features, CPUID level 0x00000007:0 (ecx), word 13 */
+#define X86_FEATURE_PKU		(13*32+ 3) /* Protection Keys for Userspace */
+#define X86_FEATURE_OSPKE	(13*32+ 4) /* OS Protection Keys Enable */
+
 /*
  * BUG word(s)
  */
diff -puN arch/x86/kernel/cpu/common.c~pkeys-0-cpuid arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~pkeys-0-cpuid	2015-05-07 10:31:40.987169371 -0700
+++ b/arch/x86/kernel/cpu/common.c	2015-05-07 10:31:40.991169552 -0700
@@ -635,6 +635,7 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
 		cpuid_count(0x00000007, 0, &eax, &ebx, &ecx, &edx);
 
 		c->x86_capability[9] = ebx;
+		c->x86_capability[13] = ecx;
 	}
 
 	/* Extended state features: level 0x0000000d */
_


* [PATCH 05/12] x86, pkeys: new page fault error code bit: PF_PK
  2015-05-07 17:41 [PATCH 00/12] [RFC] x86: Memory Protection Keys Dave Hansen
                   ` (4 preceding siblings ...)
  2015-05-07 17:41 ` [PATCH 06/12] x86, pkeys: store protection in high VMA flags Dave Hansen
@ 2015-05-07 17:41 ` Dave Hansen
  2015-05-07 17:41 ` [PATCH 08/12] x86, pkeys: arch-specific protection bits Dave Hansen
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2015-05-07 17:41 UTC (permalink / raw)
  To: dave; +Cc: linux-kernel, x86


Note: "PK" is how the Intel SDM refers to this bit, so we also
use that nomenclature.

This only defines the bit, it does not plumb it anywhere to be
handled.

---

 b/arch/x86/mm/fault.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff -puN arch/x86/mm/fault.c~pkeys-4-pfec arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-4-pfec	2015-05-07 10:31:42.568240681 -0700
+++ b/arch/x86/mm/fault.c	2015-05-07 10:31:42.571240816 -0700
@@ -31,6 +31,7 @@
  *   bit 2 ==	 0: kernel-mode access	1: user-mode access
  *   bit 3 ==				1: use of reserved bit detected
  *   bit 4 ==				1: fault was an instruction fetch
+ *   bit 5 ==				1: protection keys block access
  */
 enum x86_pf_error_code {
 
@@ -39,6 +40,7 @@ enum x86_pf_error_code {
 	PF_USER		=		1 << 2,
 	PF_RSVD		=		1 << 3,
 	PF_INSTR	=		1 << 4,
+	PF_PK		=		1 << 5,
 };
 
 /*
@@ -912,7 +914,10 @@ static int spurious_fault_check(unsigned
 
 	if ((error_code & PF_INSTR) && !pte_exec(*pte))
 		return 0;
-
+	/*
+	 * Note: We do not do lazy flushing on protection key
+	 * changes, so no spurious fault will ever set PF_PK.
+	 */
 	return 1;
 }
 
_


* [PATCH 06/12] x86, pkeys: store protection in high VMA flags
  2015-05-07 17:41 [PATCH 00/12] [RFC] x86: Memory Protection Keys Dave Hansen
                   ` (3 preceding siblings ...)
  2015-05-07 17:41 ` [PATCH 04/12] x86, pkeys: PTE bits Dave Hansen
@ 2015-05-07 17:41 ` Dave Hansen
  2015-05-15 21:10   ` Thomas Gleixner
  2015-05-07 17:41 ` [PATCH 05/12] x86, pkeys: new page fault error code bit: PF_PK Dave Hansen
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 37+ messages in thread
From: Dave Hansen @ 2015-05-07 17:41 UTC (permalink / raw)
  To: dave; +Cc: linux-kernel, x86


vma->vm_flags is an 'unsigned long', so it has space for only 32
flags on 32-bit architectures.  The high 32 bits are unused on
64-bit platforms.  We've steered away from using the unused high
VMA bits for things because we would have difficulty supporting
them on 32-bit.

Protection Keys are not available in 32-bit mode, so there is
no concern about supporting this feature in 32-bit mode or on
32-bit CPUs.

This patch carves out 4 bits from the high half of
vma->vm_flags and adds a config option that architectures
can select to make them available.

---

 b/arch/x86/Kconfig   |    1 +
 b/include/linux/mm.h |    7 +++++++
 b/mm/Kconfig         |    3 +++
 3 files changed, 11 insertions(+)

diff -puN arch/x86/Kconfig~pkeys-7-eat-high-vma-flags arch/x86/Kconfig
--- a/arch/x86/Kconfig~pkeys-7-eat-high-vma-flags	2015-05-07 10:31:42.943257595 -0700
+++ b/arch/x86/Kconfig	2015-05-07 10:31:42.951257956 -0700
@@ -142,6 +142,7 @@ config X86
 	select ACPI_LEGACY_TABLES_LOOKUP if ACPI
 	select X86_FEATURE_NAMES if PROC_FS
 	select SRCU
+	select ARCH_USES_HIGH_VMA_FLAGS if X86_64
 
 config INSTRUCTION_DECODER
 	def_bool y
diff -puN include/linux/mm.h~pkeys-7-eat-high-vma-flags include/linux/mm.h
--- a/include/linux/mm.h~pkeys-7-eat-high-vma-flags	2015-05-07 10:31:42.945257685 -0700
+++ b/include/linux/mm.h	2015-05-07 10:31:42.951257956 -0700
@@ -153,6 +153,13 @@ extern unsigned int kobjsize(const void
 #define VM_NOHUGEPAGE	0x40000000	/* MADV_NOHUGEPAGE marked this vma */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
 
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+#define VM_HIGH_ARCH_1  0x100000000	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_2  0x200000000	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_3  0x400000000	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_4  0x800000000	/* bit only usable on 64-bit architectures */
+#endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
+
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
 #elif defined(CONFIG_PPC)
diff -puN mm/Kconfig~pkeys-7-eat-high-vma-flags mm/Kconfig
--- a/mm/Kconfig~pkeys-7-eat-high-vma-flags	2015-05-07 10:31:42.947257775 -0700
+++ b/mm/Kconfig	2015-05-07 10:31:42.952258001 -0700
@@ -635,3 +635,6 @@ config MAX_STACK_SIZE_MB
 	  changed to a smaller value in which case that is used.
 
 	  A sane initial value is 80 MB.
+
+config ARCH_USES_HIGH_VMA_FLAGS
+	bool
_


* [PATCH 03/12] x86, pkey: pkru xsave fields and data structure
  2015-05-07 17:41 [PATCH 00/12] [RFC] x86: Memory Protection Keys Dave Hansen
  2015-05-07 17:41 ` [PATCH 02/12] x86, pku: define new CR4 bit Dave Hansen
  2015-05-07 17:41 ` [PATCH 01/12] x86, pkeys: cpuid bit definition Dave Hansen
@ 2015-05-07 17:41 ` Dave Hansen
  2015-05-07 17:41 ` [PATCH 04/12] x86, pkeys: PTE bits Dave Hansen
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2015-05-07 17:41 UTC (permalink / raw)
  To: dave; +Cc: linux-kernel, x86


The protection keys register (PKRU) is saved and restored using
xsave.  Define the data structure that we will use to access it
inside the xsave buffer, and also double-check that the new
structure matches the size that comes out of the CPU.
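
For reference, the size that "comes out of the CPU" is enumerated in
CPUID leaf 0xD: PKRU is xstate component 9, matching the 0x200 (1<<9)
XSTATE_PKRU bit defined in this patch.  A hedged userspace sketch of
the same enumeration, using GCC's <cpuid.h>:

	#include <cpuid.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned int size, offset, ecx, edx;

		/* CPUID.(EAX=0DH,ECX=9): EAX=size, EBX=offset of PKRU state */
		__cpuid_count(0xd, 9, size, offset, ecx, edx);
		printf("PKRU xsave component: %u bytes at offset %u\n",
		       size, offset);
		return 0;
	}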

---

 b/arch/x86/include/asm/processor.h |    9 +++++++++
 b/arch/x86/include/asm/xsave.h     |    3 ++-
 b/arch/x86/kernel/xsave.c          |    7 +++++++
 3 files changed, 18 insertions(+), 1 deletion(-)

diff -puN arch/x86/include/asm/processor.h~pkeys-2-xsave arch/x86/include/asm/processor.h
--- a/arch/x86/include/asm/processor.h~pkeys-2-xsave	2015-05-07 10:31:41.756204056 -0700
+++ b/arch/x86/include/asm/processor.h	2015-05-07 10:31:41.763204372 -0700
@@ -406,6 +406,15 @@ struct bndcsr {
 	u64 bndstatus;
 } __packed;
 
+/*
+ * "The size of XSAVE state component for PKRU is 8 bytes,
+ *  of which only the first four bytes are used...".
+ */
+struct pkru {
+	u32 pkru;
+	u32 pkru_unused;
+} __packed;
+
 struct xsave_hdr_struct {
 	u64 xstate_bv;
 	u64 xcomp_bv;
diff -puN arch/x86/include/asm/xsave.h~pkeys-2-xsave arch/x86/include/asm/xsave.h
--- a/arch/x86/include/asm/xsave.h~pkeys-2-xsave	2015-05-07 10:31:41.758204147 -0700
+++ b/arch/x86/include/asm/xsave.h	2015-05-07 10:31:41.764204417 -0700
@@ -14,6 +14,7 @@
 #define XSTATE_OPMASK		0x20
 #define XSTATE_ZMM_Hi256	0x40
 #define XSTATE_Hi16_ZMM		0x80
+#define XSTATE_PKRU		0x200
 
 #define XSTATE_FPSSE	(XSTATE_FP | XSTATE_SSE)
 #define XSTATE_AVX512	(XSTATE_OPMASK | XSTATE_ZMM_Hi256 | XSTATE_Hi16_ZMM)
@@ -33,7 +34,7 @@
 			| XSTATE_OPMASK | XSTATE_ZMM_Hi256 | XSTATE_Hi16_ZMM)
 
 /* Supported features which require eager state saving */
-#define XSTATE_EAGER	(XSTATE_BNDREGS | XSTATE_BNDCSR)
+#define XSTATE_EAGER	(XSTATE_BNDREGS | XSTATE_BNDCSR | XSTATE_PKRU)
 
 /* All currently supported features */
 #define XCNTXT_MASK	(XSTATE_LAZY | XSTATE_EAGER)
diff -puN arch/x86/kernel/xsave.c~pkeys-2-xsave arch/x86/kernel/xsave.c
--- a/arch/x86/kernel/xsave.c~pkeys-2-xsave	2015-05-07 10:31:41.760204237 -0700
+++ b/arch/x86/kernel/xsave.c	2015-05-07 10:31:41.764204417 -0700
@@ -528,6 +528,13 @@ void setup_xstate_comp(void)
 					+ xstate_comp_sizes[i-1];
 
 	}
+	/*
+	 * Check that the size of the "PKRU" xsave area
+	 * which the CPU knows about matches the kernel
+	 * data structure that we have defined.
+	 */
+	if ((xstate_features >= XSTATE_PKRU) && xstate_comp_sizes[XSTATE_PKRU])
+		WARN_ON(xstate_comp_sizes[XSTATE_PKRU] != sizeof(struct pkru));
 }
 
 /*
_


* [PATCH 04/12] x86, pkeys: PTE bits
  2015-05-07 17:41 [PATCH 00/12] [RFC] x86: Memory Protection Keys Dave Hansen
                   ` (2 preceding siblings ...)
  2015-05-07 17:41 ` [PATCH 03/12] x86, pkey: pkru xsave fields and data structure Dave Hansen
@ 2015-05-07 17:41 ` Dave Hansen
  2015-05-07 17:41 ` [PATCH 06/12] x86, pkeys: store protection in high VMA flags Dave Hansen
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2015-05-07 17:41 UTC (permalink / raw)
  To: dave; +Cc: linux-kernel, x86


Previous documentation has referred to these 4 bits as "ignored".
That means that software could have made use of them.  But, as
far as I know, the kernel never used them.

They are still ignored when protection keys are not enabled, so
they could theoretically still get used for software purposes.
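
For illustration, a hypothetical helper (not part of this patch) that
extracts the 4-bit key from a pte value using the bit positions defined
below (bits 62:59):

	static inline int pte_pkey(pteval_t pte)
	{
		/* the four key bits are contiguous, starting at bit 59 */
		return (pte >> _PAGE_BIT_PKEY_BIT0) & 0xf;
	}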

---

 b/arch/x86/include/asm/pgtable_types.h |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff -puN arch/x86/include/asm/pgtable_types.h~pkeys-3-ptebits arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~pkeys-3-ptebits	2015-05-07 10:31:42.194223812 -0700
+++ b/arch/x86/include/asm/pgtable_types.h	2015-05-07 10:31:42.198223992 -0700
@@ -25,7 +25,11 @@
 #define _PAGE_BIT_SPLITTING	_PAGE_BIT_SOFTW2 /* only valid on a PSE pmd */
 #define _PAGE_BIT_HIDDEN	_PAGE_BIT_SOFTW3 /* hidden by kmemcheck */
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
-#define _PAGE_BIT_NX           63       /* No execute: only valid after cpuid check */
+#define _PAGE_BIT_PKEY_BIT0	59       /* Protection Keys, bit 1/4 */
+#define _PAGE_BIT_PKEY_BIT1	60       /* Protection Keys, bit 2/4 */
+#define _PAGE_BIT_PKEY_BIT2	61       /* Protection Keys, bit 3/4 */
+#define _PAGE_BIT_PKEY_BIT3	62       /* Protection Keys, bit 4/4 */
+#define _PAGE_BIT_NX		63       /* No execute: only valid after cpuid check */
 
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
 /* - if the user mapped it with PROT_NONE; pte_present gives true */
@@ -47,6 +51,10 @@
 #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
 #define _PAGE_CPA_TEST	(_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
 #define _PAGE_SPLITTING	(_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
+#define _PAGE_PKEY_BIT0	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT0)
+#define _PAGE_PKEY_BIT1	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT1)
+#define _PAGE_PKEY_BIT2	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT2)
+#define _PAGE_PKEY_BIT3	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT3)
 #define __HAVE_ARCH_PTE_SPECIAL
 
 #ifdef CONFIG_KMEMCHECK
_


* [PATCH 08/12] x86, pkeys: arch-specific protection bits
  2015-05-07 17:41 [PATCH 00/12] [RFC] x86: Memory Protection Keys Dave Hansen
                   ` (5 preceding siblings ...)
  2015-05-07 17:41 ` [PATCH 05/12] x86, pkeys: new page fault error code bit: PF_PK Dave Hansen
@ 2015-05-07 17:41 ` Dave Hansen
  2015-05-07 17:41 ` [PATCH 07/12] mm: Pass the 4-bit protection key in via PROT_ bits to syscalls Dave Hansen
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2015-05-07 17:41 UTC (permalink / raw)
  To: dave; +Cc: linux-kernel, x86


Lots of things seem to do:

        vma->vm_page_prot = vm_get_page_prot(flags);

and the ptes get created right from things we pull out
of ->vm_page_prot.  So it is very convenient if we can
store the protection key in flags and vm_page_prot, just
like the existing permission bits (_PAGE_RW/PRESENT).  It
greatly reduces the amount of plumbing and arch-specific
hacking we have to do in generic code.

This also takes the new PROT_PKEY{0,1,2,3} flags and
turns *those* in to VM_ flags for vma->vm_flags.

The protection key values are stored in 4 places:
	1. "prot" argument to system calls
	2. vma->vm_flags, filled from the mmap "prot"
	3. vma->vm_page_prot, filled from vma->vm_flags
	4. the PTE itself.

The pseudocode for these four steps is as follows:

	mmap(PROT_PKEY*)
	vma->vm_flags 	  = ... | arch_calc_vm_prot_bits(mmap_prot);
	vma->vm_page_prot = ... | arch_vm_get_page_prot(vma->vm_flags);
	pte = pfn | vma->vm_page_prot

Note that these are new definitions for x86:

	arch_vm_get_page_prot()
	arch_calc_vm_prot_bits()

---

 b/arch/x86/include/asm/pgtable_types.h |   12 ++++++++++--
 b/arch/x86/include/uapi/asm/mman.h     |   17 +++++++++++++++++
 b/include/linux/mm.h                   |    4 ++++
 3 files changed, 31 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/pgtable_types.h~pkeys-7-store-pkey-in-vma arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~pkeys-7-store-pkey-in-vma	2015-05-07 10:31:43.740293543 -0700
+++ b/arch/x86/include/asm/pgtable_types.h	2015-05-07 10:31:43.747293858 -0700
@@ -104,7 +104,12 @@
 #define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED |	\
 			 _PAGE_DIRTY)
 
-/* Set of bits not changed in pte_modify */
+/*
+ * Set of bits not changed in pte_modify.  The pte's
+ * protection key is treated like _PAGE_RW, for
+ * instance, and is *not* included in this mask since
+ * pte_modify() does modify it.
+ */
 #define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
 			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
 			 _PAGE_SOFT_DIRTY)
@@ -220,7 +225,10 @@ enum page_cache_mode {
 /* PTE_PFN_MASK extracts the PFN from a (pte|pmd|pud|pgd)val_t */
 #define PTE_PFN_MASK		((pteval_t)PHYSICAL_PAGE_MASK)
 
-/* PTE_FLAGS_MASK extracts the flags from a (pte|pmd|pud|pgd)val_t */
+/*
+ *  PTE_FLAGS_MASK extracts the flags from a (pte|pmd|pud|pgd)val_t
+ *  This includes the protection key value.
+ */
 #define PTE_FLAGS_MASK		(~PTE_PFN_MASK)
 
 typedef struct pgprot { pgprotval_t pgprot; } pgprot_t;
diff -puN arch/x86/include/uapi/asm/mman.h~pkeys-7-store-pkey-in-vma arch/x86/include/uapi/asm/mman.h
--- a/arch/x86/include/uapi/asm/mman.h~pkeys-7-store-pkey-in-vma	2015-05-07 10:31:43.742293633 -0700
+++ b/arch/x86/include/uapi/asm/mman.h	2015-05-07 10:31:43.747293858 -0700
@@ -6,6 +6,23 @@
 #define MAP_HUGE_2MB    (21 << MAP_HUGE_SHIFT)
 #define MAP_HUGE_1GB    (30 << MAP_HUGE_SHIFT)
 
+/*
+ * Take the 4 protection key bits out of the vma->vm_flags
+ * value and turn them in to the bits that we can put in
+ * to a pte.
+ */
+#define arch_vm_get_page_prot(vm_flags)	__pgprot(	\
+		((vm_flags) & VM_PKEY_BIT0 ? _PAGE_PKEY_BIT0 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT1 ? _PAGE_PKEY_BIT1 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT2 ? _PAGE_PKEY_BIT2 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))
+
+#define arch_calc_vm_prot_bits(prot) (	\
+		((prot) & PROT_PKEY0 ? VM_PKEY_BIT0 : 0) |	\
+		((prot) & PROT_PKEY1 ? VM_PKEY_BIT1 : 0) |	\
+		((prot) & PROT_PKEY2 ? VM_PKEY_BIT2 : 0) |	\
+		((prot) & PROT_PKEY3 ? VM_PKEY_BIT3 : 0))
+
 #include <asm-generic/mman.h>
 
 #endif /* _ASM_X86_MMAN_H */
diff -puN include/linux/mm.h~pkeys-7-store-pkey-in-vma include/linux/mm.h
--- a/include/linux/mm.h~pkeys-7-store-pkey-in-vma	2015-05-07 10:31:43.744293723 -0700
+++ b/include/linux/mm.h	2015-05-07 10:31:43.748293904 -0700
@@ -162,6 +162,10 @@ extern unsigned int kobjsize(const void
 
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
+# define VM_PKEY_BIT0	VM_HIGH_ARCH_1	/* A protection key is a 4-bit value */
+# define VM_PKEY_BIT1	VM_HIGH_ARCH_2
+# define VM_PKEY_BIT2	VM_HIGH_ARCH_3
+# define VM_PKEY_BIT3	VM_HIGH_ARCH_4
 #elif defined(CONFIG_PPC)
 # define VM_SAO		VM_ARCH_1	/* Strong Access Ordering (powerpc) */
 #elif defined(CONFIG_PARISC)
_


* [PATCH 07/12] mm: Pass the 4-bit protection key in via PROT_ bits to syscalls
  2015-05-07 17:41 [PATCH 00/12] [RFC] x86: Memory Protection Keys Dave Hansen
                   ` (6 preceding siblings ...)
  2015-05-07 17:41 ` [PATCH 08/12] x86, pkeys: arch-specific protection bits Dave Hansen
@ 2015-05-07 17:41 ` Dave Hansen
  2015-05-07 19:11   ` One Thousand Gnomes
  2015-05-07 17:41 ` [PATCH 09/12] x86, pkeys: notify userspace about protection key faults Dave Hansen
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 37+ messages in thread
From: Dave Hansen @ 2015-05-07 17:41 UTC (permalink / raw)
  To: dave; +Cc: linux-kernel, x86


If a system call takes a PROT_{NONE,EXEC,WRITE,...} argument,
this patch adds support for passing a protection key in via
that argument.  The affected interfaces are:

	mmap()
	mprotect()
	drivers/char/agp/frontend.c's ioctl(AGPIOC_RESERVE)

This does not include direct support for shmat() since it uses
a different set of permission bits.  You can use mprotect()
after the attach to assign an attached SHM segment a protection
key.
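
For illustration only: with the PROT_PKEY* bits defined below, a 4-bit
key could be encoded into an mmap() call along these lines.  The
PROT_PKEY() macro here is a hypothetical convenience, not part of this
series:

	#include <sys/mman.h>

	#define PROT_PKEY(key) (			\
		((key) & 0x1 ? PROT_PKEY0 : 0) |	\
		((key) & 0x2 ? PROT_PKEY1 : 0) |	\
		((key) & 0x4 ? PROT_PKEY2 : 0) |	\
		((key) & 0x8 ? PROT_PKEY3 : 0))

	/* Map an anonymous region tagged with protection key 11 (0xb): */
	void *map_with_key(size_t size)
	{
		return mmap(NULL, size,
			    PROT_READ | PROT_WRITE | PROT_PKEY(11),
			    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	}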

---

 b/include/uapi/asm-generic/mman-common.h |    4 ++++
 1 file changed, 4 insertions(+)

diff -puN include/uapi/asm-generic/mman-common.h~pkeys-11-user-abi-bits include/uapi/asm-generic/mman-common.h
--- a/include/uapi/asm-generic/mman-common.h~pkeys-11-user-abi-bits	2015-05-07 10:31:43.367276719 -0700
+++ b/include/uapi/asm-generic/mman-common.h	2015-05-07 10:31:43.370276855 -0700
@@ -10,6 +10,10 @@
 #define PROT_WRITE	0x2		/* page can be written */
 #define PROT_EXEC	0x4		/* page can be executed */
 #define PROT_SEM	0x8		/* page may be used for atomic ops */
+#define PROT_PKEY0	0x10		/* protection key value (bit 0) */
+#define PROT_PKEY1	0x20		/* protection key value (bit 1) */
+#define PROT_PKEY2	0x40		/* protection key value (bit 2) */
+#define PROT_PKEY3	0x80		/* protection key value (bit 3) */
 #define PROT_NONE	0x0		/* page can not be accessed */
 #define PROT_GROWSDOWN	0x01000000	/* mprotect flag: extend change to start of growsdown vma */
 #define PROT_GROWSUP	0x02000000	/* mprotect flag: extend change to end of growsup vma */
_


* [PATCH 10/12] x86, pkeys: differentiate Protection Key faults from normal
  2015-05-07 17:41 [PATCH 00/12] [RFC] x86: Memory Protection Keys Dave Hansen
                   ` (9 preceding siblings ...)
  2015-05-07 17:41 ` [PATCH 11/12] x86, pkeys: actually enable Memory Protection Keys in CPU Dave Hansen
@ 2015-05-07 17:41 ` Dave Hansen
  2015-05-07 17:41 ` [PATCH 12/12] x86, pkeys: Documentation Dave Hansen
  2015-05-07 17:57 ` [PATCH 00/12] [RFC] x86: Memory Protection Keys Ingo Molnar
  12 siblings, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2015-05-07 17:41 UTC (permalink / raw)
  To: dave; +Cc: linux-kernel, x86



---

 b/arch/x86/mm/fault.c |    9 +++++++++
 1 file changed, 9 insertions(+)

diff -puN arch/x86/mm/fault.c~pkeys-12-fault-differentiation arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-12-fault-differentiation	2015-05-07 10:31:44.570330979 -0700
+++ b/arch/x86/mm/fault.c	2015-05-07 10:31:44.573331114 -0700
@@ -1009,6 +1009,15 @@ int show_unhandled_signals = 1;
 static inline int
 access_error(unsigned long error_code, struct vm_area_struct *vma)
 {
+	/*
+	 * Access or read was blocked by protection keys. We do
+	 * this check before any others because we do not want
+	 * to, for instance, confuse a protection-key-denied
+	 * write with one for which we should do a COW.
+	 */
+	if (error_code & PF_PK)
+		return 1;
+
 	if (error_code & PF_WRITE) {
 		/* write, present and write, not present: */
 		if (unlikely(!(vma->vm_flags & VM_WRITE)))
_


* [PATCH 11/12] x86, pkeys: actually enable Memory Protection Keys in CPU
  2015-05-07 17:41 [PATCH 00/12] [RFC] x86: Memory Protection Keys Dave Hansen
                   ` (8 preceding siblings ...)
  2015-05-07 17:41 ` [PATCH 09/12] x86, pkeys: notify userspace about protection key faults Dave Hansen
@ 2015-05-07 17:41 ` Dave Hansen
  2015-05-07 17:41 ` [PATCH 10/12] x86, pkeys: differentiate Protection Key faults from normal Dave Hansen
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2015-05-07 17:41 UTC (permalink / raw)
  To: dave; +Cc: linux-kernel, x86


This sets the bit in 'cr4' to actually enable the protection
keys feature.  We also include a boot-time disable for the
feature: "nopku".

Setting X86_CR4_PKE will cause the X86_FEATURE_OSPKE cpuid
bit to appear set.  At this point in boot, identify_cpu()
has already run the actual CPUID instructions and populated
the "cpu features" structures.  We need to go back and
re-run identify_cpu() to make sure it gets updated values.

We *could* simply re-populate word 13 of the cpuid
data, but re-running identify_cpu() is probably quick enough.


---

 b/Documentation/kernel-parameters.txt |    3 +++
 b/arch/x86/kernel/cpu/common.c        |   27 +++++++++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff -puN arch/x86/kernel/cpu/common.c~pkeys-5-should-be-last-patch arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~pkeys-5-should-be-last-patch	2015-05-07 10:31:44.946347938 -0700
+++ b/arch/x86/kernel/cpu/common.c	2015-05-07 10:31:44.952348209 -0700
@@ -306,6 +306,32 @@ static __always_inline void setup_smap(s
 	}
 }
 
+#ifdef CONFIG_X86_64
+/*
+ * Protection Keys are not available in 32-bit mode.
+ */
+static __always_inline void setup_pku(struct cpuinfo_x86 *c)
+{
+	if (!cpu_has(c, X86_FEATURE_PKU))
+		return;
+
+	cr4_set_bits(X86_CR4_PKE);
+	/*
+	 * Setting X86_CR4_PKE will cause the X86_FEATURE_OSPKE
+	 * cpuid bit to be set.  We need to ensure that we
+	 * update that bit in this CPU's "cpu_info".
+	 */
+	get_cpu_cap(&boot_cpu_data);
+}
+
+static __init int setup_disable_pku(char *arg)
+{
+	setup_clear_cpu_cap(X86_FEATURE_PKU);
+	return 1;
+}
+__setup("nopku", setup_disable_pku);
+#endif /* CONFIG_X86_64 */
+
 /*
  * Some CPU features depend on higher CPUID levels, which may not always
  * be available due to CPUID level capping or broken virtualization
@@ -957,6 +983,7 @@ static void identify_cpu(struct cpuinfo_
 	}
 
 #ifdef CONFIG_X86_64
+	setup_pku(c);
 	detect_ht(c);
 #endif
 
diff -puN Documentation/kernel-parameters.txt~pkeys-5-should-be-last-patch Documentation/kernel-parameters.txt
--- a/Documentation/kernel-parameters.txt~pkeys-5-should-be-last-patch	2015-05-07 10:31:44.948348028 -0700
+++ b/Documentation/kernel-parameters.txt	2015-05-07 10:31:44.953348254 -0700
@@ -936,6 +936,9 @@ bytes respectively. Such letter suffixes
 			Enable debug messages at boot time.  See
 			Documentation/dynamic-debug-howto.txt for details.
 
+	nopku		[X86] Disable Memory Protection Keys CPU feature found
+			in some Intel CPUs.
+
 	eagerfpu=	[X86]
 			on	enable eager fpu restore
 			off	disable eager fpu restore
_


* [PATCH 12/12] x86, pkeys: Documentation
  2015-05-07 17:41 [PATCH 00/12] [RFC] x86: Memory Protection Keys Dave Hansen
                   ` (10 preceding siblings ...)
  2015-05-07 17:41 ` [PATCH 10/12] x86, pkeys: differentiate Protection Key faults from normal Dave Hansen
@ 2015-05-07 17:41 ` Dave Hansen
  2015-05-07 17:57 ` [PATCH 00/12] [RFC] x86: Memory Protection Keys Ingo Molnar
  12 siblings, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2015-05-07 17:41 UTC (permalink / raw)
  To: dave; +Cc: linux-kernel, x86




---

 b/Documentation/x86/protection-keys.txt |   22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff -puN /dev/null Documentation/x86/protection-keys.txt
--- /dev/null	2015-05-06 22:34:35.845652580 -0700
+++ b/Documentation/x86/protection-keys.txt	2015-05-07 10:31:45.360366611 -0700
@@ -0,0 +1,22 @@
+Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU
+feature which will be found in future Intel CPUs.  The work here
+was done with the aid of simulators.
+
+Memory Protection Keys provides a mechanism for enforcing
+page-based protections, but without requiring modification of the
+page tables when an application changes protection domains.  It
+works by dedicating 4 previously ignored bits in each page table
+entry to a "protection key", giving 16 possible keys.
+
+There is also a new user-accessible register (PKRU) with two
+separate bits (Access Disable and Write Disable) for each key.
+Being a CPU register, PKRU is inherently thread-local,
+potentially giving each thread a different set of protections
+from every other thread.
+
+There are two new instructions (RDPKRU/WRPKRU) for reading and
+writing to the new register.  The feature is only available in
+64-bit mode, even though there is theoretically space in the PAE
+PTEs.  These permissions are enforced on data access only and
+have no effect on instruction fetches.
+
_


* [PATCH 09/12] x86, pkeys: notify userspace about protection key faults
  2015-05-07 17:41 [PATCH 00/12] [RFC] x86: Memory Protection Keys Dave Hansen
                   ` (7 preceding siblings ...)
  2015-05-07 17:41 ` [PATCH 07/12] mm: Pass the 4-bit protection key in via PROT_ bits to syscalls Dave Hansen
@ 2015-05-07 17:41 ` Dave Hansen
  2015-05-07 17:41 ` [PATCH 11/12] x86, pkeys: actually enable Memory Protection Keys in CPU Dave Hansen
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2015-05-07 17:41 UTC (permalink / raw)
  To: dave; +Cc: linux-kernel, x86


A protection key fault is very similar to any other access
error.  There must be a VMA, etc...  We even want to take
the same action (SIGSEGV) that we do with a normal access
fault.

However, we do need to let userspace know that something
is different.  We do this the same way we did with
SEGV_BNDERR for Memory Protection eXtensions (MPX):
define a new SEGV code: SEGV_PKUERR.

We will, at some point, need to allow userspace a way to
figure out which protection key covers the address that
we faulted on.  We can either do that with a separate
interface, or we could pass it up in the siginfo like
MPX did.

Suggestions welcome. :)
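
A minimal sketch of what a consumer might look like, assuming the
SEGV_PKUERR value from the patch below (4, i.e. __SI_FAULT|4) and
nothing beyond standard POSIX signal handling:

	#include <signal.h>
	#include <string.h>
	#include <unistd.h>

	#ifndef SEGV_PKUERR
	#define SEGV_PKUERR 4	/* matches (__SI_FAULT|4) below */
	#endif

	static void segv_handler(int sig, siginfo_t *si, void *uctx)
	{
		if (si->si_code == SEGV_PKUERR) {
			static const char msg[] = "protection key fault\n";

			/* write() is async-signal-safe; printf() is not */
			write(2, msg, sizeof(msg) - 1);
		}
		_exit(1);
	}

	int main(void)
	{
		struct sigaction sa;

		memset(&sa, 0, sizeof(sa));
		sa.sa_sigaction = segv_handler;
		sa.sa_flags = SA_SIGINFO;
		sigaction(SIGSEGV, &sa, NULL);
		/* ... touch a key-protected page here ... */
		return 0;
	}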

---

 b/arch/x86/mm/fault.c                |    5 ++++-
 b/include/uapi/asm-generic/siginfo.h |   10 +++++++++-
 2 files changed, 13 insertions(+), 2 deletions(-)

diff -puN arch/x86/mm/fault.c~pkeys-13-siginfo arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-13-siginfo	2015-05-07 10:31:44.169312893 -0700
+++ b/arch/x86/mm/fault.c	2015-05-07 10:31:44.174313118 -0700
@@ -838,7 +838,10 @@ static noinline void
 bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
 		      unsigned long address)
 {
-	__bad_area(regs, error_code, address, SEGV_ACCERR);
+	if (error_code & PF_PK)
+		__bad_area(regs, error_code, address, SEGV_PKUERR);
+	else
+		__bad_area(regs, error_code, address, SEGV_ACCERR);
 }
 
 static void
diff -puN include/uapi/asm-generic/siginfo.h~pkeys-13-siginfo include/uapi/asm-generic/siginfo.h
--- a/include/uapi/asm-generic/siginfo.h~pkeys-13-siginfo	2015-05-07 10:31:44.170312938 -0700
+++ b/include/uapi/asm-generic/siginfo.h	2015-05-07 10:31:44.174313118 -0700
@@ -95,6 +95,13 @@ typedef struct siginfo {
 				void __user *_lower;
 				void __user *_upper;
 			} _addr_bnd;
+			int protection_key; /* FIXME: protection key value??
+					     * Do we really need this in here?
+					     * userspace can get the PKRU value in
+					     * the signal handler, but they do not
+					     * easily have access to the PKEY value
+					     * from the PTE.
+					     */
 		} _sigfault;
 
 		/* SIGPOLL */
@@ -206,7 +213,8 @@ typedef struct siginfo {
 #define SEGV_MAPERR	(__SI_FAULT|1)	/* address not mapped to object */
 #define SEGV_ACCERR	(__SI_FAULT|2)	/* invalid permissions for mapped object */
 #define SEGV_BNDERR	(__SI_FAULT|3)  /* failed address bound checks */
-#define NSIGSEGV	3
+#define SEGV_PKUERR	(__SI_FAULT|4)  /* failed protection key checks */
+#define NSIGSEGV	4
 
 /*
  * SIGBUS si_codes
_


* Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys
  2015-05-07 17:41 [PATCH 00/12] [RFC] x86: Memory Protection Keys Dave Hansen
                   ` (11 preceding siblings ...)
  2015-05-07 17:41 ` [PATCH 12/12] x86, pkeys: Documentation Dave Hansen
@ 2015-05-07 17:57 ` Ingo Molnar
  2015-05-07 18:09   ` Dave Hansen
  12 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2015-05-07 17:57 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, x86


* Dave Hansen <dave@sr71.net> wrote:

> == FEATURE OVERVIEW ==
> 
> Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU 
> feature which will be found in future Intel CPUs.  The work here was 
> done with the aid of simulators.
> 
> Memory Protection Keys provides a mechanism for enforcing page-based 
> protections, but without requiring modification of the page tables 
> when an application changes protection domains.  It works by 
> dedicating 4 previously ignored bits in each page table entry to a 
> "protection key", giving 16 possible keys.
> 
> There is also a new user-accessible register (PKRU) with two 
> separate bits (Access Disable and Write Disable) for each key. Being 
> a CPU register, PKRU is inherently thread-local, potentially giving 
> each thread a different set of protections from every other thread.
> 
> There are two new instructions (RDPKRU/WRPKRU) for reading and 
> writing to the new register.  The feature is only available in 
> 64-bit mode, even though there is theoretically space in the PAE 
> PTEs.  These permissions are enforced on data access only and have 
> no effect on instruction fetches.

So I'm wondering what the primary usecases are for this feature?
Could you outline applications/workloads/libraries that would
benefit from this?

Thanks,

	Ingo


* Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys
  2015-05-07 17:57 ` [PATCH 00/12] [RFC] x86: Memory Protection Keys Ingo Molnar
@ 2015-05-07 18:09   ` Dave Hansen
  2015-05-07 18:48     ` Vlastimil Babka
                       ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Dave Hansen @ 2015-05-07 18:09 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, x86

On 05/07/2015 10:57 AM, Ingo Molnar wrote:
>> > There are two new instructions (RDPKRU/WRPKRU) for reading and 
>> > writing to the new register.  The feature is only available in 
>> > 64-bit mode, even though there is theoretically space in the PAE 
>> > PTEs.  These permissions are enforced on data access only and have 
>> > no effect on instruction fetches.
> So I'm wondering what the primary usecases are for this feature?
> Could you outline applications/workloads/libraries that would
> benefit from this?

There are lots of things that folks would _like_ to mprotect(), but end
up not being feasible because of the overhead of going and mucking with
thousands of PTEs and shooting down remote TLBs every time you want to
change protections.

Data structures like logs or journals that are only written to in very
limited code paths, but that you want to protect from "stray" writes.

Maybe even a database where a query operation will never need to write
to memory, but an insert would.  You could keep the data R/O during the
entire operation except when an insert is actually in progress.  It
narrows the window where data might be corrupted.  This becomes even
more valuable if a stray write to memory is guaranteed to hit storage...
like with persistent memory.

Someone mentioned to me that valgrind does lots of mprotect()s and might
benefit from this.

We could keep heap metadata as R/O and only make it R/W inside of
malloc() itself to catch corruption more quickly.

More crazy ideas welcome. :)
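
To make the malloc() idea above concrete, a minimal sketch, assuming
the rdpkru()/wrpkru() helpers and PKRU_WD() macro from the sketch in
the cover letter; alloc_from_heap() is a stand-in for the real
allocator internals:

	#include <stddef.h>

	void *alloc_from_heap(size_t size);	/* hypothetical internals */

	static void metadata_unlock(void) { wrpkru(rdpkru() & ~PKRU_WD(1)); }
	static void metadata_lock(void)   { wrpkru(rdpkru() |  PKRU_WD(1)); }

	void *my_malloc(size_t size)
	{
		void *p;

		metadata_unlock();	/* metadata writable, this thread only */
		p = alloc_from_heap(size);
		metadata_lock();	/* back to write-protected */
		return p;
	}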


* Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys
  2015-05-07 18:09   ` Dave Hansen
@ 2015-05-07 18:48     ` Vlastimil Babka
  2015-05-07 21:45       ` Dave Hansen
  2015-05-09 19:09       ` Dr. David Alan Gilbert
  2015-05-07 19:18     ` One Thousand Gnomes
  2015-05-07 19:22     ` Christian Borntraeger
  2 siblings, 2 replies; 37+ messages in thread
From: Vlastimil Babka @ 2015-05-07 18:48 UTC (permalink / raw)
  To: Dave Hansen, Ingo Molnar; +Cc: linux-kernel, x86

On 05/07/2015 08:09 PM, Dave Hansen wrote:
> On 05/07/2015 10:57 AM, Ingo Molnar wrote:
>>>> There are two new instructions (RDPKRU/WRPKRU) for reading and
>>>> writing to the new register.  The feature is only available in
>>>> 64-bit mode, even though there is theoretically space in the PAE
>>>> PTEs.  These permissions are enforced on data access only and have
>>>> no effect on instruction fetches.
>> So I'm wondering what the primary usecases are for this feature?
>> Could you outline applications/workloads/libraries that would
>> benefit from this?
>
> There are lots of things that folks would _like_ to mprotect(), but end
> up not being feasible because of the overhead of going and mucking with
> thousands of PTEs and shooting down remote TLBs every time you want to
> change protections.
>
> Data structures like logs or journals that are only written to in very
> limited code paths, but that you want to protect from "stray" writes.
>
> Maybe even a database where a query operation will never need to write
> to memory, but an insert would.  You could keep the data R/O during the
> entire operation except when an insert is actually in progress.  It
> narrows the window where data might be corrupted.  This becomes even
> more valuable if a stray write to memory is guaranteed to hit storage...
> like with persistent memory.
>
> Someone mentioned to me that valgrind does lots of mprotect()s and might
> benefit from this.
>
> We could keep heap metadata as R/O and only make it R/W inside of
> malloc() itself to catch corruption more quickly.

But that metadata is typically within the same page as the data itself 
(for small objects at least), no?

> More crazy ideas welcome. :)

Since you asked :) I wonder if the usefulness could be extended by 
making it possible for a thread to revoke its access to WRPKRU (it's not 
privileged, right?). Then I could imagine some extra security for 
sandbox/bytecode/JIT code so it doesn't interfere with the runtime. But 
since it doesn't block instruction fetches, then maybe it wouldn't make 
much difference...




* Re: [PATCH 07/12] mm: Pass the 4-bit protection key in via PROT_ bits to syscalls
  2015-05-07 17:41 ` [PATCH 07/12] mm: Pass the 4-bit protection key in via PROT_ bits to syscalls Dave Hansen
@ 2015-05-07 19:11   ` One Thousand Gnomes
  2015-05-07 19:19     ` Dave Hansen
  0 siblings, 1 reply; 37+ messages in thread
From: One Thousand Gnomes @ 2015-05-07 19:11 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, x86

> diff -puN include/uapi/asm-generic/mman-common.h~pkeys-11-user-abi-bits include/uapi/asm-generic/mman-common.h
> --- a/include/uapi/asm-generic/mman-common.h~pkeys-11-user-abi-bits	2015-05-07 10:31:43.367276719 -0700
> +++ b/include/uapi/asm-generic/mman-common.h	2015-05-07 10:31:43.370276855 -0700
> @@ -10,6 +10,10 @@
>  #define PROT_WRITE	0x2		/* page can be written */
>  #define PROT_EXEC	0x4		/* page can be executed */
>  #define PROT_SEM	0x8		/* page may be used for atomic ops */
> +#define PROT_PKEY0	0x10		/* protection key value (bit 0) */
> +#define PROT_PKEY1	0x20		/* protection key value (bit 1) */
> +#define PROT_PKEY2	0x40		/* protection key value (bit 2) */
> +#define PROT_PKEY3	0x80		/* protection key value (bit 3) */

That's leaking deep Intelisms into asm-generic, which makes me very
uncomfortable.  Whether we need to reserve some bits for "arch specific"
is one question; what we do with them ought not to be leaking out.

To start with, people trying to port code will want to do

#define PROT_PKEY0	0
#define PROT_PKEY1	0
.. 

etc



* Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys
  2015-05-07 18:09   ` Dave Hansen
  2015-05-07 18:48     ` Vlastimil Babka
@ 2015-05-07 19:18     ` One Thousand Gnomes
  2015-05-07 19:26       ` Ingo Molnar
  2015-05-08  6:09       ` Kevin Easton
  2015-05-07 19:22     ` Christian Borntraeger
  2 siblings, 2 replies; 37+ messages in thread
From: One Thousand Gnomes @ 2015-05-07 19:18 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Ingo Molnar, linux-kernel, x86

> Data structures like logs or journals that are only written to in very
> limited code paths, but that you want to protect from "stray" writes.

Anything with lots of data where you want to minimise the risk of stray
accesses even if just as a debug aid (consider things like memcached).
> 
> Maybe even a database where a query operation will never need to write
> to memory, but an insert would.  You could keep the data R/O during the
> entire operation except when an insert is actually in progress.  It
> narrows the window where data might be corrupted.  This becomes even
> more valuable if a stray write to memory is guaranteed to hit storage...
> like with persistent memory.
> 
> Someone mentioned to me that valgrind does lots of mprotect()s and might
> benefit from this.

You can also use it for certain types of emulator trickery, and I suspect
even for things like interpreters and controlling access to "tainted"
values.

Other obvious uses are making it a shade harder for SSL or ssh type
errors to leak things like key data by reducing the damage done by out of
bound accesses.

> We could keep heap metadata as R/O and only make it R/W inside of
> malloc() itself to catch corruption more quickly.

If you implement multiple malloc pools you can chop up lots of stuff.

In library land it isn't just stuff like malloc, you can use it as
a debug weapon to protect library private data from naughty application
code.

There are some other debug uses when catching faults - fast ways to do
range access breakpoints for example.

Alan


* Re: [PATCH 07/12] mm: Pass the 4-bit protection key in via PROT_ bits to syscalls
  2015-05-07 19:11   ` One Thousand Gnomes
@ 2015-05-07 19:19     ` Dave Hansen
  2015-09-04 20:13       ` Florian Weimer
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Hansen @ 2015-05-07 19:19 UTC (permalink / raw)
  To: One Thousand Gnomes; +Cc: linux-kernel, x86

On 05/07/2015 12:11 PM, One Thousand Gnomes wrote:
>> diff -puN include/uapi/asm-generic/mman-common.h~pkeys-11-user-abi-bits include/uapi/asm-generic/mman-common.h
>> --- a/include/uapi/asm-generic/mman-common.h~pkeys-11-user-abi-bits	2015-05-07 10:31:43.367276719 -0700
>> +++ b/include/uapi/asm-generic/mman-common.h	2015-05-07 10:31:43.370276855 -0700
>> @@ -10,6 +10,10 @@
>>  #define PROT_WRITE	0x2		/* page can be written */
>>  #define PROT_EXEC	0x4		/* page can be executed */
>>  #define PROT_SEM	0x8		/* page may be used for atomic ops */
>> +#define PROT_PKEY0	0x10		/* protection key value (bit 0) */
>> +#define PROT_PKEY1	0x20		/* protection key value (bit 1) */
>> +#define PROT_PKEY2	0x40		/* protection key value (bit 2) */
>> +#define PROT_PKEY3	0x80		/* protection key value (bit 3) */
> 
> Thats leaking deep Intelisms into asm-generic which makes me very
> uncomfortable. Whether we need to reserve some bits for "arch specific"
> is one question, what we do with them ought not to be leaking out.
> 
> To start with trying to port code people will want to do
> 
> #define PROT_PKEY0	0
> #define PROT_PKEY1	0

Yeah, I feel pretty uncomfortable with it as well.  I really don't
expect these to live like this in asm-generic when I submit this.

Powerpc and ia64 have _something_ resembling protection keys, so the
concept isn't entirely x86 or Intel-specific.  My hope would be that we
do this in a way that other architectures can use.




* Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys
  2015-05-07 18:09   ` Dave Hansen
  2015-05-07 18:48     ` Vlastimil Babka
  2015-05-07 19:18     ` One Thousand Gnomes
@ 2015-05-07 19:22     ` Christian Borntraeger
  2015-05-07 19:29       ` Dave Hansen
  2 siblings, 1 reply; 37+ messages in thread
From: Christian Borntraeger @ 2015-05-07 19:22 UTC (permalink / raw)
  To: Dave Hansen, Ingo Molnar; +Cc: linux-kernel, x86, linux-s390

Am 07.05.2015 um 20:09 schrieb Dave Hansen:
> On 05/07/2015 10:57 AM, Ingo Molnar wrote:
>>>> There are two new instructions (RDPKRU/WRPKRU) for reading and 
>>>> writing to the new register.  The feature is only available in 
>>>> 64-bit mode, even though there is theoretically space in the PAE 
>>>> PTEs.  These permissions are enforced on data access only and have 
>>>> no effect on instruction fetches.
>> So I'm wondering what the primary usecases are for this feature?
>> Could you outline applications/workloads/libraries that would
>> benefit from this?
> 
> There are lots of things that folks would _like_ to mprotect(), but end
> up not being feasible because of the overhead of going and mucking with
> thousands of PTEs and shooting down remote TLBs every time you want to
> change protections.

These protection bits would need to be cached in TLBs as well, no?
So the saving would come by switching the PKRU instead of the page bits.

This all looks like s390 storage keys (with the key in page tables
instead of a dedicated place).  There we also have 16 values for the
key and 4 bits in the PSW that describe the thread-local key; the two
are matched.  There is an additional field F (fetch protection) that
decides whether the key value is used for stores or for stores+fetches.




* Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys
  2015-05-07 19:18     ` One Thousand Gnomes
@ 2015-05-07 19:26       ` Ingo Molnar
  2015-05-07 19:40         ` Dave Hansen
  2015-05-07 20:11         ` One Thousand Gnomes
  2015-05-08  6:09       ` Kevin Easton
  1 sibling, 2 replies; 37+ messages in thread
From: Ingo Molnar @ 2015-05-07 19:26 UTC (permalink / raw)
  To: One Thousand Gnomes; +Cc: Dave Hansen, linux-kernel, x86


* One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk> wrote:

> > We could keep heap metadata as R/O and only make it R/W inside of 
> > malloc() itself to catch corruption more quickly.
> 
> If you implement multiple malloc pools you can chop up lots of 
> stuff.

I'd say that a 64-bit address space is large enough to hide buffers in 
from accidental corruption, without any runtime page protection 
flipping overhead?

> In library land it isn't just stuff like malloc, you can use it as a 
> debug weapon to protect library private data from naughty 
> application code.
> 
> There are some other debug uses when catching faults - fast ways to 
> do range access breakpoints for example.

I think libraries are happy enough to work without bugs - apps digging 
around in library data are in a "you keep all the broken pieces" 
situation, so why would a library want to slow every good citizen
down with extra protection flipping/unflipping accesses?

The Valgrind usecase looks somewhat legit, albeit not necessarily for 
multithreaded apps: there you generally really want protection changes 
to be globally visible, such as publishing the effects of free() or 
malloc().

Also, will apps/libraries bother if it's not a standard API and if it 
only runs on very fresh CPUs?

Thanks,

	Ingo


* Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys
  2015-05-07 19:22     ` Christian Borntraeger
@ 2015-05-07 19:29       ` Dave Hansen
  2015-05-07 19:45         ` Christian Borntraeger
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Hansen @ 2015-05-07 19:29 UTC (permalink / raw)
  To: Christian Borntraeger, Ingo Molnar; +Cc: linux-kernel, x86, linux-s390

On 05/07/2015 12:22 PM, Christian Borntraeger wrote:
> Am 07.05.2015 um 20:09 schrieb Dave Hansen:
>> On 05/07/2015 10:57 AM, Ingo Molnar wrote:
>>>>> There are two new instructions (RDPKRU/WRPKRU) for reading and 
>>>>> writing to the new register.  The feature is only available in 
>>>>> 64-bit mode, even though there is theoretically space in the PAE 
>>>>> PTEs.  These permissions are enforced on data access only and have 
>>>>> no effect on instruction fetches.
>>> So I'm wondering what the primary usecases are for this feature?
>>> Could you outline applications/workloads/libraries that would
>>> benefit from this?
>>
>> There are lots of things that folks would _like_ to mprotect(), but end
>> up not being feasible because of the overhead of going and mucking with
>> thousands of PTEs and shooting down remote TLBs every time you want to
>> change protections.
> 
> These protection bits would need to be cached in TLBs as well, no?

Yes, they are cached in the TLBs.  It's actually explicitly called out
in the documentation.

> So the saving would come by switching the PKRU instead of the page bits.

Right.

> This all looks like s390 storage keys (with the key in pagetables instead
> of a dedicated place). There we also have 16 values for the key and 4 bits 
> in the PSW that describe the thread local key both are matched.
> There is an additional field F (fetch protection) that decides, if the
> key value is used for stores or for stores+fetches.

OK, so a thread can only be in one domain at a time?

That's a bit different than x86 where each page can be in one protection
domain, but each CPU thread can independently enable/disable access to
each of the 16 protection domains.



* Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys
  2015-05-07 19:26       ` Ingo Molnar
@ 2015-05-07 19:40         ` Dave Hansen
  2015-05-07 20:11         ` One Thousand Gnomes
  1 sibling, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2015-05-07 19:40 UTC (permalink / raw)
  To: Ingo Molnar, One Thousand Gnomes; +Cc: linux-kernel, x86

On 05/07/2015 12:26 PM, Ingo Molnar wrote:
> The Valgrind usecase looks somewhat legit, albeit not necessarily for 
> multithreaded apps: there you generally really want protection changes 
> to be globally visible, such as publishing the effects of free() or 
> malloc().

I guess we could theoretically have an IPC of some kind that voluntarily
broadcasts changes so that we can be guaranteed that other threads see it.

> Also, will apps/libraries bother if it's not a standard API and if it 
> only runs on very fresh CPUs?

It's always a problem with new CPU features.

I've thought a bit about trying to "emulate" the feature on older CPUs
using good ol' mprotect() so that we could have an API that folks can
use _today_, but that would get magically fast on future CPUs.  But, the
problem with that is the thread-local aspect.

mprotect() is fundamentally process-wide and protection keys are
fundamentally thread-local.  Those things are going to be hard to
reconcile unless we do something slightly extreme like having per-thread
page tables.


* Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys
  2015-05-07 19:29       ` Dave Hansen
@ 2015-05-07 19:45         ` Christian Borntraeger
  2015-05-07 19:49           ` Dave Hansen
  0 siblings, 1 reply; 37+ messages in thread
From: Christian Borntraeger @ 2015-05-07 19:45 UTC (permalink / raw)
  To: Dave Hansen, Ingo Molnar; +Cc: linux-kernel, x86, linux-s390

Am 07.05.2015 um 21:29 schrieb Dave Hansen:
> On 05/07/2015 12:22 PM, Christian Borntraeger wrote:
>> Am 07.05.2015 um 20:09 schrieb Dave Hansen:
>>> On 05/07/2015 10:57 AM, Ingo Molnar wrote:
>>>>>> There are two new instructions (RDPKRU/WRPKRU) for reading and 
>>>>>> writing to the new register.  The feature is only available in 
>>>>>> 64-bit mode, even though there is theoretically space in the PAE 
>>>>>> PTEs.  These permissions are enforced on data access only and have 
>>>>>> no effect on instruction fetches.
>>>> So I'm wondering what the primary usecases are for this feature?
>>>> Could you outline applications/workloads/libraries that would
>>>> benefit from this?
>>>
>>> There are lots of things that folks would _like_ to mprotect(), but end
>>> up not being feasible because of the overhead of going and mucking with
>>> thousands of PTEs and shooting down remote TLBs every time you want to
>>> change protections.
>>
>> These protection bits would need to be cached in TLBs as well, no?
> 
> Yes, they are cached in the TLBs.  It's actually explicitly called out
> in the documentation.
> 
>> So the saving would come by switching the PKRU instead of the page bits.
> 
> Right.
> 
>> This all looks like s390 storage keys (with the key in pagetables instead
>> of a dedicated place). There we also have 16 values for the key and 4 bits 
>> in the PSW that describe the thread local key both are matched.
>> There is an additional field F (fetch protection) that decides, if the
>> key value is used for stores or for stores+fetches.
> 
> OK, so a thread can only be in one domain at a time?

Via the PSW yes.
Actually the docs talk about access key, which is usually the PSW. There are
some instructions like MOVE WITH KEY that allow to specify the key for this
specific instruction. For compiled code these insructions are not used in 
Linux and I can not really see a way to implement that properly. Furthermore
enabling these key ops has other implications which are unwanted.

 
> That's a bit different than x86 where each page can be in one protection
> domain, but each CPU thread can independently enable/disable access to
> each of the 16 protection domains.
> 



* Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys
  2015-05-07 19:45         ` Christian Borntraeger
@ 2015-05-07 19:49           ` Dave Hansen
  2015-05-07 19:57             ` Christian Borntraeger
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Hansen @ 2015-05-07 19:49 UTC (permalink / raw)
  To: Christian Borntraeger, Ingo Molnar; +Cc: linux-kernel, x86, linux-s390

On 05/07/2015 12:45 PM, Christian Borntraeger wrote:
>>> >> This all looks like s390 storage keys (with the key in pagetables instead
>>> >> of a dedicated place). There we also have 16 values for the key and 4 bits 
>>> >> in the PSW that describe the thread-local key; the two are matched.
>>> >> There is an additional field F (fetch protection) that decides whether
>>> >> the key applies to stores only or to stores+fetches.
>> > 
>> > OK, so a thread can only be in one domain at a time?
> Via the PSW, yes.
> Actually the docs talk about an access key, which usually comes from the
> PSW. There are some instructions like MOVE WITH KEY that allow specifying
> the key for that specific instruction. For compiled code these
> instructions are not used in Linux, and I cannot really see a way to
> implement that properly. Furthermore, enabling these key ops has other
> unwanted implications.

OK, so we have two basic operations that need to be done for
protection/storage/$FOO keys:

1. Assign a key (or set of keys) to a memory area
2. Have a thread request that access (read and/or write) to a set of
   areas be acquired or revoked.

For (2) on x86, we basically allow any combination of keys and r/w
permissions.  On s390, we would need to ensure that access to only one
key was allowed at a time.
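
A strawman of what those two operations could look like as an interface
(names and signatures are purely illustrative, not something this set
implements):

	/* (1) assign a key to a memory area */
	int pkey_mprotect(void *addr, size_t len, int prot, int pkey);

	/* (2) change the calling thread's access rights for a key; on
	 * x86 this could be any mask over all 16 keys (a single WRPKRU),
	 * while on s390 it would have to load the PSW key, so only one
	 * key could be allowed at a time */
	int pkey_set(int pkey, unsigned long access_rights);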

BTW, do the s390 keys affect instructions and data, or data only?

The x86 ones affect data only.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys
  2015-05-07 19:49           ` Dave Hansen
@ 2015-05-07 19:57             ` Christian Borntraeger
  0 siblings, 0 replies; 37+ messages in thread
From: Christian Borntraeger @ 2015-05-07 19:57 UTC (permalink / raw)
  To: Dave Hansen, Ingo Molnar; +Cc: linux-kernel, x86, linux-s390

Am 07.05.2015 um 21:49 schrieb Dave Hansen:
> On 05/07/2015 12:45 PM, Christian Borntraeger wrote:
>>>>>> This all looks like s390 storage keys (with the key in pagetables instead
>>>>>> of a dedicated place). There we also have 16 values for the key and 4 bits 
>>>>>> in the PSW that describe the thread-local key; the two are matched.
>>>>>> There is an additional field F (fetch protection) that decides whether
>>>>>> the key applies to stores only or to stores+fetches.
>>>>
>>>> OK, so a thread can only be in one domain at a time?
>> Via the PSW, yes.
>> Actually the docs talk about an access key, which usually comes from the
>> PSW. There are some instructions like MOVE WITH KEY that allow specifying
>> the key for that specific instruction. For compiled code these
>> instructions are not used in Linux, and I cannot really see a way to
>> implement that properly. Furthermore, enabling these key ops has other
>> unwanted implications.
> 
> OK, so we have two basic operations that need to be done for
> protection/storage/$FOO keys:
> 
> 1. Assign a key (or set of keys) to a memory area
> 2. Have a thread request that access (read and/or write) to a set of
>    areas be acquired or revoked.
> 
> For (2) on x86, we basically allow any combination of keys and r/w
> permissions.  On s390, we would need to ensure that access to only one
> key was allowed at a time.
> 
> BTW, do the s390 keys affect instructions and data, or data only?

Both. In fact it's also used for I/O. Maybe that also points out the
biggest difference: the storage key is a property of the physical page
frame (and not of the virtual page defined by the page tables).
So we cannot really use it for shared memory and then set different
protection keys in different mappings.

 
> The x86 ones affect data only.
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys
  2015-05-07 19:26       ` Ingo Molnar
  2015-05-07 19:40         ` Dave Hansen
@ 2015-05-07 20:11         ` One Thousand Gnomes
  2015-05-08  4:51           ` Ingo Molnar
  1 sibling, 1 reply; 37+ messages in thread
From: One Thousand Gnomes @ 2015-05-07 20:11 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Dave Hansen, linux-kernel, x86

On Thu, 7 May 2015 21:26:20 +0200
Ingo Molnar <mingo@kernel.org> wrote:

> 
> * One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk> wrote:
> 
> > > We could keep heap metadata as R/O and only make it R/W inside of 
> > > malloc() itself to catch corruption more quickly.
> > 
> > If you implement multiple malloc pools you can chop up lots of 
> > stuff.
> 
> I'd say that a 64-bit address space is large enough to hide buffers in 
> from accidental corruption, without any runtime page protection 
> flipping overhead?

I'd say no. And from actual real-world demand for PK the answer is also
no. It's already a problem with very large data sets. Worse still, in
many cases it's a problem that nobody is actually measuring or doing much
about (because mprotect on many gigabytes of data is expensive).

> > In library land it isn't just stuff like malloc, you can use it as a 
> > debug weapon to protect library private data from naughty 
> > application code.
> > 
> > There are some other debug uses when catching faults - fast ways to 
> > do range access breakpoints for example.
> 
> I think libraries are happy enough to work without bugs - apps digging 
> around in library data are in a "you keep all the broken pieces" 
> situation; why would a library want to slow every good citizen down
> with extra protection flipping/unflipping accesses?

For debugging, when the library-maintained data is sensitive or
something you don't want corrupted, or because the user puts security
first. Protection keys are an awful lot faster than mprotect. You've got
no synchronization or shootdowns to do, just a CPU register to load to
indicate which mask of keys you are happy with. That really changes what
it is useful for, because it's cheap. It means you can happily do stuff
like

	while(data_blocks) {
		allow_key_and_source_access();
		do_crypto_func();
		revoke_key_and_source_access();
		do_network_io();  /* Can't accidentally leak keys or
					input */
	}
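
(To make "cheap" concrete: with a WRPKRU wrapper like the one sketched
earlier in the thread, and assuming key 1 covers the key material and
input buffers, those two helpers could be as small as

	#define PKRU_DENY_KEY1	0x0c	/* AD + WD bits for key 1 */

	void allow_key_and_source_access(void)  { wrpkru(0); /* allow all */ }
	void revoke_key_and_source_access(void) { wrpkru(PKRU_DENY_KEY1); }

i.e. one register write per transition, with no syscall and no TLB work.)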


> Also, will apps/libraries bother if it's not a standard API and if it 
> only runs on very fresh CPUs?

In time I think yes.

Alan

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys
  2015-05-07 18:48     ` Vlastimil Babka
@ 2015-05-07 21:45       ` Dave Hansen
  2015-05-09 19:09       ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2015-05-07 21:45 UTC (permalink / raw)
  To: Vlastimil Babka, Ingo Molnar; +Cc: linux-kernel, x86

On 05/07/2015 11:48 AM, Vlastimil Babka wrote:
> On 05/07/2015 08:09 PM, Dave Hansen wrote:
>> On 05/07/2015 10:57 AM, Ingo Molnar wrote:
>>>>> There are two new instructions (RDPKRU/WRPKRU) for reading and
>>>>> writing to the new register.  The feature is only available in
>>>>> 64-bit mode, even though there is theoretically space in the PAE
>>>>> PTEs.  These permissions are enforced on data access only and have
>>>>> no effect on instruction fetches.
>>> So I'm wondering what the primary usecases are for this feature?
>>> Could you outline applications/workloads/libraries that would
>>> benefit from this?
>>
>> There are lots of things that folks would _like_ to mprotect(), but end
>> up not being feasible because of the overhead of going and mucking with
>> thousands of PTEs and shooting down remote TLBs every time you want to
>> change protections.
>>
>> Data structures like logs or journals that are only written to in very
>> limited code paths, but that you want to protect from "stray" writes.
>>
>> Maybe even a database where a query operation will never need to write
>> to memory, but an insert would.  You could keep the data R/O during the
>> entire operation except when an insert is actually in progress.  It
>> narrows the window where data might be corrupted.  This becomes even
>> more valuable if a stray write to memory is guaranteed to hit storage...
>> like with persistent memory.
>>
>> Someone mentioned to me that valgrind does lots of mprotect()s and might
>> benefit from this.
>>
>> We could keep heap metadata as R/O and only make it R/W inside of
>> malloc() itself to catch corruption more quickly.
> 
> But that metadata is typically within the same page as the data itself
> (for small objects at least), no?

I guess it depends on the implementation.  I honestly don't know what
glibc's malloc does specifically.

>> More crazy ideas welcome. :)
> 
> Since you asked :) I wonder if the usefulness could be extended by
> making it possible for a thread to revoke its access to WRPKRU (it's not
> privileged, right?). Then I could imagine some extra security for
> sandbox/bytecode/JIT code so it doesn't interfere with the runtime. But
> since it doesn't block instruction fetches, then maybe it wouldn't make
> much difference...

Correct, it is not privileged.  The only way to "revoke" access would be
to disable the feature in CR4, in which case the keys wouldn't be
enforced either.

PKRU is saved/restored using xsave*/xrstor*, which require having the
FPU enabled.  But you can still *use* them even if the FPU is not in
play.  So we can't use the FPU en/disable to help us, either. :(


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys
  2015-05-07 20:11         ` One Thousand Gnomes
@ 2015-05-08  4:51           ` Ingo Molnar
  0 siblings, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2015-05-08  4:51 UTC (permalink / raw)
  To: One Thousand Gnomes; +Cc: Dave Hansen, linux-kernel, x86


* One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk> wrote:

> On Thu, 7 May 2015 21:26:20 +0200
> Ingo Molnar <mingo@kernel.org> wrote:
> 
> > 
> > * One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk> wrote:
> > 
> > > > We could keep heap metadata as R/O and only make it R/W inside of 
> > > > malloc() itself to catch corruption more quickly.
> > > 
> > > If you implement multiple malloc pools you can chop up lots of 
> > > stuff.
> > 
> > I'd say that a 64-bit address space is large enough to hide 
> > buffers in from accidental corruption, without any runtime page 
> > protection flipping overhead?
> 
> I'd say no. [...]

So putting your buffers anywhere in an address range 18446744073709551616
bytes large (well, 281474976710656 bytes with current CPUs) isn't enough
to protect them from stray writes?  Could you outline the situations
where that isn't enough?

> [...] And from actual real-world demand for PK the answer is also 
> no. It's already a problem with very large data sets. [...]

So that's why I asked: what real world demand is there? Is it 
described/documented/reported anywhere public?

> [...] Worse still, in many cases it's a problem that nobody is 
> actually measuring or doing much about (because mprotect on many 
> gigabytes of data is expensive).

It's not necessarily expensive if the remote TLB shootdown guarantee 
is weakened (i.e. we could have an mprotect() flag that says "I don't 
need remote TLB shootdowns") - and nobody has asked for that yet 
AFAICS.

With 2MB or 1GB pages it would be even cheaper.

Also, the way databases usually protect themselves is by making a 
robust central engine and communicating with (complex) DB users via 
memory sharing and IPC.

> > I think libraries are happy enough to work without bugs - apps 
> > digging around in library data are in a "you keep all the broken 
> > pieces" situation, why would a library want to slow down every 
> > good citizen down with extra protection flipping/unflipping 
> > accesses?
> 
> For debugging, when the library-maintained data is sensitive or 
> something you don't want corrupted, or because the user puts security 
> first. Protection keys are an awful lot faster than mprotect.

There's no flushing of TLBs involved even locally; a PK 'flip' is just 
a handful of cycles no matter whether protections are narrowed or 
broadened, right?

> [...] You've got no synchronization or shootdowns to do, just a CPU 
> register to load to indicate which mask of keys you are happy with. 
> That really changes what it is useful for, because it's cheap. It 
> means you can happily do stuff like
> 
> 	while(data_blocks) {
> 		allow_key_and_source_access();
> 		do_crypto_func();
> 		revoke_key_and_source_access();
> 		do_network_io();  /* Can't accidentally leak keys or
> 					input */
> 	}

That looks useful if it's fast enough. I suspect a similar benefit 
could be gained if we allowed individually randomized anonymous 
mmap()s: the key wouldn't just be part of the heap, but isolated and 
randomized somewhere in a 64-bit (48-bit) address space.
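
Something along these lines, purely as a sketch (using the getrandom()
syscall for the hint; mmap() may still move the mapping):

	void *alloc_hidden(size_t size)
	{
		unsigned long hint = 0;

		syscall(SYS_getrandom, &hint, sizeof(hint), 0);
		hint &= 0x7ffffffff000UL;	/* page-aligned, below 47 bits */
		return mmap((void *)hint, size, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	}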

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys
  2015-05-07 19:18     ` One Thousand Gnomes
  2015-05-07 19:26       ` Ingo Molnar
@ 2015-05-08  6:09       ` Kevin Easton
  1 sibling, 0 replies; 37+ messages in thread
From: Kevin Easton @ 2015-05-08  6:09 UTC (permalink / raw)
  To: One Thousand Gnomes; +Cc: Dave Hansen, Ingo Molnar, linux-kernel, x86

On Thu, May 07, 2015 at 08:18:43PM +0100, One Thousand Gnomes wrote:
> > We could keep heap metadata as R/O and only make it R/W inside of
> > malloc() itself to catch corruption more quickly.
> 
> If you implement multiple malloc pools you can chop up lots of stuff.
> 
> In library land it isn't just stuff like malloc, you can use it as
> a debug weapon to protect library private data from naughty application
> code.

How could a library (or debugger, for that matter) arbitrate ownership
of the protection domains with the application?

One interesting use for it might be to provide an interface to
allocate memory and associate it with a lock that's supposed to be held
while accessing that memory.  The allocation function hashes the lock
address down to one of the 15 non-zero protection domains and applies
that key to the memory; the lock function then adds RW access to the
appropriate protection domain and the unlock function removes it.
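
A rough sketch of that, with every helper name hypothetical:

	/* hash a lock's address onto one of the 15 non-zero keys */
	static int lock_to_pkey(pthread_mutex_t *lock)
	{
		return 1 + ((unsigned long)lock >> 4) % 15;
	}

	void *alloc_locked_mem(size_t size, pthread_mutex_t *lock)
	{
		void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		set_pkey(p, size, lock_to_pkey(lock));	/* hypothetical */
		return p;
	}

	void lock_mem(pthread_mutex_t *lock)
	{
		pthread_mutex_lock(lock);
		pkey_allow_rw(lock_to_pkey(lock));	/* hypothetical */
	}

	void unlock_mem(pthread_mutex_t *lock)
	{
		pkey_deny_all(lock_to_pkey(lock));	/* hypothetical */
		pthread_mutex_unlock(lock);
	}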

    - Kevin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys
  2015-05-07 18:48     ` Vlastimil Babka
  2015-05-07 21:45       ` Dave Hansen
@ 2015-05-09 19:09       ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 37+ messages in thread
From: Dr. David Alan Gilbert @ 2015-05-09 19:09 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Dave Hansen, Ingo Molnar, linux-kernel, x86

* Vlastimil Babka (vbabka@suse.cz) wrote:
> On 05/07/2015 08:09 PM, Dave Hansen wrote:
> >On 05/07/2015 10:57 AM, Ingo Molnar wrote:
> >>>>There are two new instructions (RDPKRU/WRPKRU) for reading and
> >>>>writing to the new register.  The feature is only available in
> >>>>64-bit mode, even though there is theoretically space in the PAE
> >>>>PTEs.  These permissions are enforced on data access only and have
> >>>>no effect on instruction fetches.
> >>So I'm wondering what the primary usecases are for this feature?
> >>Could you outline applications/workloads/libraries that would
> >>benefit from this?
> >
> >There are lots of things that folks would _like_ to mprotect(), but end
> >up not being feasible because of the overhead of going and mucking with
> >thousands of PTEs and shooting down remote TLBs every time you want to
> >change protections.
> >
> >Data structures like logs or journals that are only written to in very
> >limited code paths, but that you want to protect from "stray" writes.
> >
> >Maybe even a database where a query operation will never need to write
> >to memory, but an insert would.  You could keep the data R/O during the
> >entire operation except when an insert is actually in progress.  It
> >narrows the window where data might be corrupted.  This becomes even
> >more valuable if a stray write to memory is guaranteed to hit storage...
> >like with persistent memory.
> >
> >Someone mentioned to me that valgrind does lots of mprotect()s and might
> >benefit from this.
> >
> >We could keep heap metadata as R/O and only make it R/W inside of
> >malloc() itself to catch corruption more quickly.
> 
> But that metadata is typically within the same page as the data
> itself (for small objects at least), no?
> 
> >More crazy ideas welcome. :)
> 
> Since you asked :) I wonder if the usefulness could be extended by
> making it possible for a thread to revoke its access to WRPKRU (it's
> not privileged, right?). Then I could imagine some extra security
> for sandbox/bytecode/JIT code so it doesn't interfere with the
> runtime. But since it doesn't block instruction fetches, then maybe
> it wouldn't make much difference...

Even without revoking a thread's ability to change it, it would still
be useful just to restrict what data your JITed code can get to; if a JIT
generated the code, it would know it's not generating any code that
changes the keys, so as long as it bounds the code that's accessible, it
could use this to stop the generated code from getting at JIT data
structures.  I can see it also being useful for things like NaCl that
supposedly bound what the code can contain.

Dave

-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\ gro.gilbert @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 06/12] x86, pkeys: store protection in high VMA flags
  2015-05-07 17:41 ` [PATCH 06/12] x86, pkeys: store protection in high VMA flags Dave Hansen
@ 2015-05-15 21:10   ` Thomas Gleixner
  2015-05-15 21:13     ` Dave Hansen
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2015-05-15 21:10 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, x86

On Thu, 7 May 2015, Dave Hansen wrote:
> +#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
> +#define VM_HIGH_ARCH_1  0x100000000	/* bit only usable on 64-bit architectures */

Nit. Shouldn't this start with VM_HIGH_ARCH_0 ?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 06/12] x86, pkeys: store protection in high VMA flags
  2015-05-15 21:10   ` Thomas Gleixner
@ 2015-05-15 21:13     ` Dave Hansen
  0 siblings, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2015-05-15 21:13 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: linux-kernel, x86

On 05/15/2015 02:10 PM, Thomas Gleixner wrote:
> On Thu, 7 May 2015, Dave Hansen wrote:
>> +#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
>> +#define VM_HIGH_ARCH_1  0x100000000	/* bit only usable on 64-bit architectures */
> 
> Nit. Shouldn't this start with VM_HIGH_ARCH_0 ?

Yeah, it does make the later #defines look a bit funny.  I modeled it
after the "low" VM_ARCH_ flags which start at 1:

#define VM_ARCH_1       0x01000000      /* Architecture-specific flag */
#define VM_ARCH_2       0x02000000

I can change it to be 0-based, though.
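
Presumably something like:

	#define VM_HIGH_ARCH_0	0x100000000	/* bit 32 */
	#define VM_HIGH_ARCH_1	0x200000000	/* bit 33 */
	#define VM_HIGH_ARCH_2	0x400000000	/* bit 34 */
	#define VM_HIGH_ARCH_3	0x800000000	/* bit 35 */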

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 07/12] mm: Pass the 4-bit protection key in via PROT_ bits to syscalls
  2015-05-07 19:19     ` Dave Hansen
@ 2015-09-04 20:13       ` Florian Weimer
  2015-09-04 20:18         ` Dave Hansen
  0 siblings, 1 reply; 37+ messages in thread
From: Florian Weimer @ 2015-09-04 20:13 UTC (permalink / raw)
  To: Dave Hansen; +Cc: One Thousand Gnomes, linux-kernel, x86

* Dave Hansen:

> On 05/07/2015 12:11 PM, One Thousand Gnomes wrote:
>>> diff -puN
>>> include/uapi/asm-generic/mman-common.h~pkeys-11-user-abi-bits
>>> include/uapi/asm-generic/mman-common.h
>>> --- a/include/uapi/asm-generic/mman-common.h~pkeys-11-user-abi-bits
>>> 2015-05-07 10:31:43.367276719 -0700
>>> +++ b/include/uapi/asm-generic/mman-common.h 2015-05-07
>>> 10:31:43.370276855 -0700
>>> @@ -10,6 +10,10 @@
>>>  #define PROT_WRITE	0x2		/* page can be written */
>>>  #define PROT_EXEC	0x4		/* page can be executed */
>>>  #define PROT_SEM	0x8		/* page may be used for atomic ops */
>>> +#define PROT_PKEY0	0x10		/* protection key value (bit 0) */
>>> +#define PROT_PKEY1	0x20		/* protection key value (bit 1) */
>>> +#define PROT_PKEY2	0x40		/* protection key value (bit 2) */
>>> +#define PROT_PKEY3	0x80		/* protection key value (bit 3) */
>> 
>> That's leaking deep Intelisms into asm-generic, which makes me very
>> uncomfortable. Whether we need to reserve some bits for "arch specific"
>> is one question, what we do with them ought not to be leaking out.
>> 
>> To start with trying to port code people will want to do
>> 
>> #define PROT_PKEY0	0
>> #define PROT_PKEY1	0
>
> Yeah, I feel pretty uncomfortable with it as well.  I really don't
> expect these to live like this in asm-generic when I submit this.
>
> Powerpc and ia64 have _something_ resembling protection keys, so the
> concept isn't entirely x86 or Intel-specific.  My hope would be that we
> do this in a way that other architectures can use.

It will also be very painful to add additional bits.  We went through
this with the CPU affinity mask, and it still hurts.  Please use a
more sensible interface from the start. :)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 07/12] mm: Pass the 4-bit protection key in via PROT_ bits to syscalls
  2015-09-04 20:13       ` Florian Weimer
@ 2015-09-04 20:18         ` Dave Hansen
  2015-09-04 20:34           ` Florian Weimer
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Hansen @ 2015-09-04 20:18 UTC (permalink / raw)
  To: Florian Weimer; +Cc: One Thousand Gnomes, linux-kernel, x86

On 09/04/2015 01:13 PM, Florian Weimer wrote:
...
>>>> >>>  #define PROT_WRITE	0x2		/* page can be written */
>>>> >>>  #define PROT_EXEC	0x4		/* page can be executed */
>>>> >>>  #define PROT_SEM	0x8		/* page may be used for atomic ops */
>>>> >>> +#define PROT_PKEY0	0x10		/* protection key value (bit 0) */
>>>> >>> +#define PROT_PKEY1	0x20		/* protection key value (bit 1) */
>>>> >>> +#define PROT_PKEY2	0x40		/* protection key value (bit 2) */
>>>> >>> +#define PROT_PKEY3	0x80		/* protection key value (bit 3) */
>>> >> 
>>> >> That's leaking deep Intelisms into asm-generic, which makes me very
>>> >> uncomfortable. Whether we need to reserve some bits for "arch specific"
>>> >> is one question, what we do with them ought not to be leaking out.
>>> >> 
>>> >> To start with trying to port code people will want to do
>>> >> 
>>> >> #define PROT_PKEY0	0
>>> >> #define PROT_PKEY1	0
>> >
>> > Yeah, I feel pretty uncomfortable with it as well.  I really don't
>> > expect these to live like this in asm-generic when I submit this.
>> >
>> > Powerpc and ia64 have _something_ resembling protection keys, so the
>> > concept isn't entirely x86 or Intel-specific.  My hope would be that we
>> > do this in a way that other architectures can use.
> It will also be very painful to add additional bits.  We went through
> this with the CPU affinity mask, and it still hurts.  Please use a
> more sensible interface from the start. :)

Any suggestions?

Are you thinking that we want a completely separate syscall and should
avoid using the PROT_* bits entirely?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 07/12] mm: Pass the 4-bit protection key in via PROT_ bits to syscalls
  2015-09-04 20:18         ` Dave Hansen
@ 2015-09-04 20:34           ` Florian Weimer
  2015-09-04 20:41             ` Dave Hansen
  0 siblings, 1 reply; 37+ messages in thread
From: Florian Weimer @ 2015-09-04 20:34 UTC (permalink / raw)
  To: Dave Hansen; +Cc: One Thousand Gnomes, linux-kernel, x86

* Dave Hansen:

> On 09/04/2015 01:13 PM, Florian Weimer wrote:
> ...
>>>>> >>>  #define PROT_WRITE	0x2		/* page can be written */
>>>>> >>>  #define PROT_EXEC	0x4		/* page can be executed */
>>>>> >>>  #define PROT_SEM 0x8 /* page may be used for atomic ops */
>>>>> >>> +#define PROT_PKEY0 0x10 /* protection key value (bit 0) */
>>>>> >>> +#define PROT_PKEY1 0x20 /* protection key value (bit 1) */
>>>>> >>> +#define PROT_PKEY2 0x40 /* protection key value (bit 2) */
>>>>> >>> +#define PROT_PKEY3 0x80 /* protection key value (bit 3) */
>>>> >> 
>>>> >> That's leaking deep Intelisms into asm-generic, which makes me very
>>>> >> uncomfortable. Whether we need to reserve some bits for "arch specific"
>>>> >> is one question, what we do with them ought not to be leaking out.
>>>> >> 
>>>> >> To start with trying to port code people will want to do
>>>> >> 
>>>> >> #define PROT_PKEY0	0
>>>> >> #define PROT_PKEY1	0
>>> >
>>> > Yeah, I feel pretty uncomfortable with it as well.  I really don't
>>> > expect these to live like this in asm-generic when I submit this.
>>> >
>>> > Powerpc and ia64 have _something_ resembling protection keys, so the
>>> > concept isn't entirely x86 or Intel-specific.  My hope would be that we
>>> > do this in a way that other architectures can use.
>> It will also be very painful to add additional bits.  We went through
>> this with the CPU affinity mask, and it still hurts.  Please use a
>> more sensible interface from the start. :)
>
> Any suggestions?

It's difficult.  I don't know what kind of programming model you
expect.  Could glibc use these bits for its own implementation?  Or
OpenSSL?  Or is this intended for tightly integrated language
run-times which have a very precise idea what kind of stuff runs
within the same address space?

> Are you thinking that we want a completely separate syscall and
> completely avoid using the PROT_* bits?

Yes, that would seem more future-proof.
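
For instance (only a sketch of the shape such an interface could take,
with illustrative names):

	int pkey_alloc(unsigned long flags, unsigned long init_access_rights);
	int pkey_free(int pkey);

paired with a pkey-aware mprotect() variant like the one sketched earlier
in the thread.  The kernel would hand out key numbers instead of encoding
them in PROT_* bits, so the numbering could grow without eating flag space.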

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 07/12] mm: Pass the 4-bit protection key in via PROT_ bits to syscalls
  2015-09-04 20:34           ` Florian Weimer
@ 2015-09-04 20:41             ` Dave Hansen
  0 siblings, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2015-09-04 20:41 UTC (permalink / raw)
  To: Florian Weimer; +Cc: One Thousand Gnomes, linux-kernel, x86

On 09/04/2015 01:34 PM, Florian Weimer wrote:
...>>> It will also be very painful to add additional bits.  We went through
>>> this with the CPU affinity mask, and it still hurts.  Please use a
>>> more sensible interface from the start. :)
>>
>> Any suggestions?
> 
> It's difficult.  I don't know what kind of programming model you
> expect.  Could glibc use these bits for its own implementation?  Or
> OpenSSL?

Our expectation is that there will be a central "allocator" for these
bits in mixed-use situations, like when two libraries each want to
control a portion of the address space for their own purposes.

Applications will also be completely free to implement their own, like
with a language runtime.
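
Such a central allocator over the 16 keys could be as simple as a bitmap
(sketch only; locking omitted):

	static unsigned short pkey_in_use = 0x0001;	/* key 0: default, always taken */

	int pkey_bitmap_alloc(void)
	{
		int k;

		for (k = 1; k < 16; k++) {
			if (!(pkey_in_use & (1 << k))) {
				pkey_in_use |= 1 << k;
				return k;
			}
		}
		return -1;	/* out of keys */
	}

	void pkey_bitmap_free(int k)
	{
		pkey_in_use &= ~(1 << k);
	}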

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2015-09-04 20:41 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-07 17:41 [PATCH 00/12] [RFC] x86: Memory Protection Keys Dave Hansen
2015-05-07 17:41 ` [PATCH 02/12] x86, pku: define new CR4 bit Dave Hansen
2015-05-07 17:41 ` [PATCH 01/12] x86, pkeys: cpuid bit definition Dave Hansen
2015-05-07 17:41 ` [PATCH 03/12] x86, pkey: pkru xsave fields and data structure Dave Hansen
2015-05-07 17:41 ` [PATCH 04/12] x86, pkeys: PTE bits Dave Hansen
2015-05-07 17:41 ` [PATCH 06/12] x86, pkeys: store protection in high VMA flags Dave Hansen
2015-05-15 21:10   ` Thomas Gleixner
2015-05-15 21:13     ` Dave Hansen
2015-05-07 17:41 ` [PATCH 05/12] x86, pkeys: new page fault error code bit: PF_PK Dave Hansen
2015-05-07 17:41 ` [PATCH 08/12] x86, pkeys: arch-specific protection bits Dave Hansen
2015-05-07 17:41 ` [PATCH 07/12] mm: Pass the 4-bit protection key in via PROT_ bits to syscalls Dave Hansen
2015-05-07 19:11   ` One Thousand Gnomes
2015-05-07 19:19     ` Dave Hansen
2015-09-04 20:13       ` Florian Weimer
2015-09-04 20:18         ` Dave Hansen
2015-09-04 20:34           ` Florian Weimer
2015-09-04 20:41             ` Dave Hansen
2015-05-07 17:41 ` [PATCH 09/12] x86, pkeys: notify userspace about protection key faults Dave Hansen
2015-05-07 17:41 ` [PATCH 11/12] x86, pkeys: actually enable Memory Protection Keys in CPU Dave Hansen
2015-05-07 17:41 ` [PATCH 10/12] x86, pkeys: differentiate Protection Key faults from normal Dave Hansen
2015-05-07 17:41 ` [PATCH 12/12] x86, pkeys: Documentation Dave Hansen
2015-05-07 17:57 ` [PATCH 00/12] [RFC] x86: Memory Protection Keys Ingo Molnar
2015-05-07 18:09   ` Dave Hansen
2015-05-07 18:48     ` Vlastimil Babka
2015-05-07 21:45       ` Dave Hansen
2015-05-09 19:09       ` Dr. David Alan Gilbert
2015-05-07 19:18     ` One Thousand Gnomes
2015-05-07 19:26       ` Ingo Molnar
2015-05-07 19:40         ` Dave Hansen
2015-05-07 20:11         ` One Thousand Gnomes
2015-05-08  4:51           ` Ingo Molnar
2015-05-08  6:09       ` Kevin Easton
2015-05-07 19:22     ` Christian Borntraeger
2015-05-07 19:29       ` Dave Hansen
2015-05-07 19:45         ` Christian Borntraeger
2015-05-07 19:49           ` Dave Hansen
2015-05-07 19:57             ` Christian Borntraeger
