* [PATCH 00/33] x86: Memory Protection Keys (v10)
@ 2016-02-12 21:01 Dave Hansen
  2016-02-12 21:01 ` [PATCH 01/33] mm: introduce get_user_pages_remote() Dave Hansen
                   ` (33 more replies)
  0 siblings, 34 replies; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, x86, torvalds, Dave Hansen, linux-api, linux-arch,
	aarcange, akpm, jack, kirill.shutemov, n-horiguchi, vbabka

Memory Protection Keys for User pages is a CPU feature which will
first appear on Skylake Servers, but will also be supported on
future non-server parts (there is also a QEMU implementation).  It
provides a mechanism for enforcing page-based protections, but
without requiring modification of the page tables when an
application wishes to change permissions.
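
For background on the mechanism (a sketch, not text from the
series): each page table entry carries a 4-bit protection key,
and the per-thread PKRU register holds two bits per key, an
access-disable (AD) and a write-disable (WD) bit.  Assuming that
bit layout, the permission checks reduce to:

	/*
	 * Sketch only -- the series' own helpers such as
	 * __pkru_allows_write() are the authoritative versions.
	 * PKRU keeps 2 bits per protection key:
	 *   bit 2*N   = AD (access disable) for key N
	 *   bit 2*N+1 = WD (write disable)  for key N
	 */
	static inline int pkru_allows_read(u32 pkru, int pkey)
	{
		return !(pkru & (1u << (pkey * 2)));
	}

	static inline int pkru_allows_write(u32 pkru, int pkey)
	{
		/* a write needs both AD and WD clear */
		return !(pkru & (3u << (pkey * 2)));
	}

Changing permissions for every page tagged with a given key is
then a single register write, with no page table walk.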

The support introduced by this set is limited to:
1. Allowing "execute-only" memory
2. Enabling KVM to run Protection-Key-enabled guests

This set contains the vast majority of the code, with the
small but tricky explicit user interface parts left off.  We can
have a more focused review on those at a later time in a (much
smaller) follow-on series.
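
As a rough illustration of the execute-only feature (the exact
user-visible semantics depend on the interface work deferred to
the follow-on series): on pkey-capable hardware, the kernel can
back a mapping that asks for PROT_EXEC without PROT_READ with a
dedicated protection key, making it genuinely execute-only:

	/* hypothetical usage sketch, not code from this series */
	void *p = mmap(NULL, 4096, PROT_EXEC,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	/*
	 * Jumping to code in p works.  A data read through p
	 * faults, delivering SIGSEGV with si_code SEGV_PKUERR.
	 */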

Changes from v9:
 * Added macros to allow some source-level backward compatibility
   between the old-style get_user_pages*() calls taking tsk/mm and
   the new-style which assumes the current tsk/mm.
 * renamed the new gup variant to get_user_pages_remote() instead
   of get_user_pages_foreign()
 * rebased against 4.5-rc3

Changes from v8:
 * Reorganization of get_user_pages() patch with awesome feedback
   from Vlastimil Babka and suggestions from Jan Kara.
 * fix EXPORT_SYMBOL() and use get_current_user_page() in nommu.c

Changes from v7:
 * Fixed merge issue with cpu feature bitmap definitions
 * Fixed up some comments in get_user_pages() and smaps patches
   (thanks Vlastimil!)

Changes from v6:
 * fix up ??'s showing up in smaps' VmFlags field
 * added execute-only support
 * removed all the new syscalls from this set.  We can discuss
   them in detail after this is merged.

Changes from v5:

 * make types in read_pkru() u32's, not ints
 * rework VM_* bits to avoid using __ffsl() and clean up
   vma_pkey()
 * rework pte_allows_gup() to use p??_val() instead of passing
   around p{te,md,ud}_t types.
 * Fix up some inconsistent bool vs. int usage
 * corrected name of ARCH_VM_PKEY_FLAGS in patch description
 * remove NR_PKEYS... config option.  Just define it directly

Changes from v4:

 * Made "allow setting of XSAVE state" safe if we got preempted
   between when we saved our FPU state and when we restore it.
   (I would appreciate a look from Ingo on this patch).
 * Fixed up a few things from Thomas's latest comments: split up
   siginfo into x86 and generic, removed extra 'eax' variable
   in rdpkru function, reworked vm_flags assignment, reworded
   a comment in pte_allows_gup()
 * Add missing DISABLED/REQUIRED_MASK14 in cpufeature.h
 * Added comment about compile optimization in fault path
 * Left get_user_pages_locked() alone.  Andrea thinks we need it.

Changes from RFCv3:

 * Added 'current' and 'foreign' variants of get_user_pages() to
   help indicate whether protection keys should be enforced.
   Thanks to Jerome Glisse for pointing out this issue.
 * Added "allocation" and set/get system calls so that we can do
   management of protection keys in the kernel.  This opens the
   door to use of specific protection keys for kernel use in the
   future, such as for execute-only memory.
 * Removed the kselftest code for the moment.  It will be
   submitted separately.

Changes from RFCv2 (Thanks Ingo and Thomas for most of these):

 * fixed a few minor compile warnings
 * changed 'nopku' interaction with cpuid bits.  Now, we do not
   clear the PKU cpuid bit, we just skip enabling it.
 * changed __pkru_allows_write() to also check access disable bit
 * removed the unused write_pkru()
 * made si_pkey a u64 and added some patch description details.
   Also made it share space in siginfo with MPX and clarified
   comments.
 * give some real text for the Processor Trace xsave state
 * made vma_pkey() less ugly (and much more optimized actually)
 * added SEGV_PKUERR to copy_siginfo_to_user()
 * remove page table walk when filling in si_pkey, added some
   big fat comments about it being inherently racy.
 * added self test code

This code is not runnable by anyone outside of Intel unless they
have some special hardware or a fancy simulator.  There is a QEMU
model to emulate the feature, but it is not currently implemented
fully enough to be usable.  If you are interested in running this
for real, please get in touch with me.  Hardware is available to a
very small but nonzero number of people.

This set is also available here:

	git://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-pkeys.git pkeys-v024

=== diffstat ===

Dave Hansen (33):
      mm: introduce get_user_pages_remote()
      mm: overload get_user_pages() functions
      mm, gup: switch callers of get_user_pages() to not pass tsk/mm
      x86, fpu: add placeholder for Processor Trace XSAVE state
      x86, pkeys: Add Kconfig option
      x86, pkeys: cpuid bit definition
      x86, pkeys: define new CR4 bit
      x86, pkeys: add PKRU xsave fields and data structure(s)
      x86, pkeys: PTE bits for storing protection key
      x86, pkeys: new page fault error code bit: PF_PK
      x86, pkeys: store protection in high VMA flags
      x86, pkeys: arch-specific protection bits
      x86, pkeys: pass VMA down in to fault signal generation code
      signals, pkeys: notify userspace about protection key faults
      x86, pkeys: fill in pkey field in siginfo
      x86, pkeys: add functions to fetch PKRU
      mm: factor out VMA fault permission checking
      x86, mm: simplify get_user_pages() PTE bit handling
      x86, pkeys: check VMAs and PTEs for protection keys
      mm: do not enforce PKEY permissions on "foreign" mm access
      x86, pkeys: optimize fault handling in access_error()
      x86, pkeys: differentiate instruction fetches
      x86, pkeys: dump PKRU with other kernel registers
      x86, pkeys: dump pkey from VMA in /proc/pid/smaps
      x86, pkeys: add Kconfig prompt to existing config option
      x86, pkeys: actually enable Memory Protection Keys in CPU
      mm, multi-arch: pass a protection key in to calc_vm_flag_bits()
      x86, pkeys: add arch_validate_pkey()
      x86: separate out LDT init from context init
      x86, fpu: allow setting of XSAVE state
      x86, pkeys: allow kernel to modify user pkey rights register
      x86, pkeys: create an x86 arch_calc_vm_prot_bits() for VMA flags
      x86, pkeys: execute-only support

 Documentation/kernel-parameters.txt         |   3 +
 arch/cris/arch-v32/drivers/cryptocop.c      |   8 +-
 arch/ia64/kernel/err_inject.c               |   3 +-
 arch/mips/mm/gup.c                          |   3 +-
 arch/powerpc/include/asm/mman.h             |   5 +-
 arch/powerpc/include/asm/mmu_context.h      |  12 ++
 arch/s390/include/asm/mmu_context.h         |  12 ++
 arch/s390/mm/gup.c                          |   4 +-
 arch/sh/mm/gup.c                            |   2 +-
 arch/sparc/mm/gup.c                         |   2 +-
 arch/unicore32/include/asm/mmu_context.h    |  12 ++
 arch/x86/Kconfig                            |  16 ++
 arch/x86/include/asm/cpufeature.h           |  61 ++++---
 arch/x86/include/asm/disabled-features.h    |  15 ++
 arch/x86/include/asm/fpu/internal.h         |   2 +
 arch/x86/include/asm/fpu/types.h            |  12 ++
 arch/x86/include/asm/fpu/xstate.h           |   3 +-
 arch/x86/include/asm/mmu_context.h          |  85 ++++++++-
 arch/x86/include/asm/pgtable.h              |  38 ++++
 arch/x86/include/asm/pgtable_types.h        |  39 ++++-
 arch/x86/include/asm/pkeys.h                |  34 ++++
 arch/x86/include/asm/required-features.h    |   7 +
 arch/x86/include/asm/special_insns.h        |  22 +++
 arch/x86/include/uapi/asm/mman.h            |  22 +++
 arch/x86/include/uapi/asm/processor-flags.h |   2 +
 arch/x86/kernel/cpu/common.c                |  42 +++++
 arch/x86/kernel/fpu/core.c                  |  63 +++++++
 arch/x86/kernel/fpu/xstate.c                | 185 +++++++++++++++++++-
 arch/x86/kernel/ldt.c                       |   4 +-
 arch/x86/kernel/process_64.c                |   2 +
 arch/x86/kernel/setup.c                     |   9 +
 arch/x86/mm/Makefile                        |   2 +
 arch/x86/mm/fault.c                         | 168 +++++++++++++++---
 arch/x86/mm/gup.c                           |  45 +++--
 arch/x86/mm/mpx.c                           |   4 +-
 arch/x86/mm/pkeys.c                         | 101 +++++++++++
 drivers/char/agp/frontend.c                 |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c     |   3 +-
 drivers/gpu/drm/etnaviv/etnaviv_gem.c       |   6 +-
 drivers/gpu/drm/i915/i915_gem_userptr.c     |  10 +-
 drivers/gpu/drm/radeon/radeon_ttm.c         |   3 +-
 drivers/gpu/drm/via/via_dmablit.c           |   3 +-
 drivers/infiniband/core/umem.c              |   2 +-
 drivers/infiniband/core/umem_odp.c          |   8 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |   3 +-
 drivers/infiniband/hw/qib/qib_user_pages.c  |   3 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c    |   2 +-
 drivers/iommu/amd_iommu_v2.c                |   1 +
 drivers/media/pci/ivtv/ivtv-udma.c          |   4 +-
 drivers/media/pci/ivtv/ivtv-yuv.c           |  10 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c   |   3 +-
 drivers/misc/mic/scif/scif_rma.c            |   2 -
 drivers/misc/sgi-gru/grufault.c             |   3 +-
 drivers/scsi/st.c                           |   2 -
 drivers/staging/android/ashmem.c            |   4 +-
 drivers/video/fbdev/pvr2fb.c                |   4 +-
 drivers/virt/fsl_hypervisor.c               |   5 +-
 fs/exec.c                                   |   8 +-
 fs/proc/task_mmu.c                          |  14 ++
 include/asm-generic/mm_hooks.h              |  12 ++
 include/linux/mm.h                          | 102 +++++++++--
 include/linux/mman.h                        |   6 +-
 include/linux/pkeys.h                       |  33 ++++
 include/uapi/asm-generic/siginfo.h          |  17 +-
 kernel/events/uprobes.c                     |  10 +-
 kernel/signal.c                             |   4 +
 mm/Kconfig                                  |   5 +
 mm/frame_vector.c                           |   2 +-
 mm/gup.c                                    | 127 +++++++++++---
 mm/ksm.c                                    |  12 +-
 mm/memory.c                                 |   8 +-
 mm/mempolicy.c                              |   6 +-
 mm/mmap.c                                   |  10 +-
 mm/mprotect.c                               |   8 +-
 mm/nommu.c                                  |  66 ++++---
 mm/process_vm_access.c                      |  11 +-
 mm/util.c                                   |   4 +-
 net/ceph/pagevec.c                          |   2 +-
 security/tomoyo/domain.c                    |   9 +-
 virt/kvm/async_pf.c                         |   8 +-
 virt/kvm/kvm_main.c                         |  10 +-
 81 files changed, 1393 insertions(+), 233 deletions(-)

Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: aarcange@redhat.com
Cc: akpm@linux-foundation.org
Cc: jack@suse.cz
Cc: kirill.shutemov@linux.intel.com
Cc: n-horiguchi@ah.jp.nec.com
Cc: x86@kernel.org
Cc: torvalds@linux-foundation.org
Cc: vbabka@suse.cz


* [PATCH 01/33] mm: introduce get_user_pages_remote()
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
@ 2016-02-12 21:01 ` Dave Hansen
  2016-02-15  6:09   ` Balbir Singh
                     ` (2 more replies)
  2016-02-12 21:01 ` [PATCH 02/33] mm: overload get_user_pages() functions Dave Hansen
                   ` (32 subsequent siblings)
  33 siblings, 3 replies; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen, srikar,
	vbabka, akpm, kirill.shutemov, aarcange, n-horiguchi, jack


From: Dave Hansen <dave.hansen@linux.intel.com>

For protection keys, we need to understand whether protections
should be enforced in software or not.  In general, we enforce
protections when working on our own task, but not when working
on another's.  We call these "current" and "remote" operations.

This patch introduces a new get_user_pages() variant:

        get_user_pages_remote()

which replaces get_user_pages() when it is called on a
non-current tsk/mm.

We also introduce a new gup flag, FOLL_REMOTE, which can be
passed to the "__" gup variants to get this new behavior.

The uprobes is_trap_at_addr() location holds mmap_sem and
calls get_user_pages(current->mm) on an instruction address.  This
makes it a pretty unique gup caller.  Being an instruction access
and also really originating from the kernel (vs. the app), I opted
to consider this a 'remote' access where protection keys will not
be enforced.

Without protection keys, this patch should not change any behavior.
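
To sketch the resulting calling convention (argument lists as of
this patch; a later patch drops tsk/mm from the non-remote
variant):

	/* our own address space: protection keys are enforced */
	ret = get_user_pages(current, current->mm, start, nr_pages,
			     write, force, pages, vmas);

	/* somebody else's mm (e.g. ptrace-style access): this sets
	 * FOLL_REMOTE internally, so pkeys are not enforced */
	ret = get_user_pages_remote(tsk, mm, start, nr_pages,
				    write, force, pages, vmas);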

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: jack@suse.cz
---

 b/drivers/gpu/drm/etnaviv/etnaviv_gem.c   |    6 +++---
 b/drivers/gpu/drm/i915/i915_gem_userptr.c |   10 +++++-----
 b/drivers/infiniband/core/umem_odp.c      |    8 ++++----
 b/fs/exec.c                               |    8 ++++++--
 b/include/linux/mm.h                      |    5 +++++
 b/kernel/events/uprobes.c                 |   10 ++++++++--
 b/mm/gup.c                                |   27 ++++++++++++++++++++++-----
 b/mm/memory.c                             |    2 +-
 b/mm/process_vm_access.c                  |   11 ++++++++---
 b/security/tomoyo/domain.c                |    9 ++++++++-
 b/virt/kvm/async_pf.c                     |    8 +++++++-
 11 files changed, 77 insertions(+), 27 deletions(-)

diff -puN drivers/gpu/drm/etnaviv/etnaviv_gem.c~introduce-get_user_pages_remote drivers/gpu/drm/etnaviv/etnaviv_gem.c
--- a/drivers/gpu/drm/etnaviv/etnaviv_gem.c~introduce-get_user_pages_remote	2016-02-12 10:44:13.170106660 -0800
+++ b/drivers/gpu/drm/etnaviv/etnaviv_gem.c	2016-02-12 10:44:13.190107574 -0800
@@ -753,9 +753,9 @@ static struct page **etnaviv_gem_userptr
 
 	down_read(&mm->mmap_sem);
 	while (pinned < npages) {
-		ret = get_user_pages(task, mm, ptr, npages - pinned,
-				     !etnaviv_obj->userptr.ro, 0,
-				     pvec + pinned, NULL);
+		ret = get_user_pages_remote(task, mm, ptr, npages - pinned,
+					    !etnaviv_obj->userptr.ro, 0,
+					    pvec + pinned, NULL);
 		if (ret < 0)
 			break;
 
diff -puN drivers/gpu/drm/i915/i915_gem_userptr.c~introduce-get_user_pages_remote drivers/gpu/drm/i915/i915_gem_userptr.c
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c~introduce-get_user_pages_remote	2016-02-12 10:44:13.171106705 -0800
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c	2016-02-12 10:44:13.190107574 -0800
@@ -584,11 +584,11 @@ __i915_gem_userptr_get_pages_worker(stru
 
 		down_read(&mm->mmap_sem);
 		while (pinned < npages) {
-			ret = get_user_pages(work->task, mm,
-					     obj->userptr.ptr + pinned * PAGE_SIZE,
-					     npages - pinned,
-					     !obj->userptr.read_only, 0,
-					     pvec + pinned, NULL);
+			ret = get_user_pages_remote(work->task, mm,
+					obj->userptr.ptr + pinned * PAGE_SIZE,
+					npages - pinned,
+					!obj->userptr.read_only, 0,
+					pvec + pinned, NULL);
 			if (ret < 0)
 				break;
 
diff -puN drivers/infiniband/core/umem_odp.c~introduce-get_user_pages_remote drivers/infiniband/core/umem_odp.c
--- a/drivers/infiniband/core/umem_odp.c~introduce-get_user_pages_remote	2016-02-12 10:44:13.173106797 -0800
+++ b/drivers/infiniband/core/umem_odp.c	2016-02-12 10:44:13.191107620 -0800
@@ -572,10 +572,10 @@ int ib_umem_odp_map_dma_pages(struct ib_
 		 * complex (and doesn't gain us much performance in most use
 		 * cases).
 		 */
-		npages = get_user_pages(owning_process, owning_mm, user_virt,
-					gup_num_pages,
-					access_mask & ODP_WRITE_ALLOWED_BIT, 0,
-					local_page_list, NULL);
+		npages = get_user_pages_remote(owning_process, owning_mm,
+				user_virt, gup_num_pages,
+				access_mask & ODP_WRITE_ALLOWED_BIT,
+				0, local_page_list, NULL);
 		up_read(&owning_mm->mmap_sem);
 
 		if (npages < 0)
diff -puN fs/exec.c~introduce-get_user_pages_remote fs/exec.c
--- a/fs/exec.c~introduce-get_user_pages_remote	2016-02-12 10:44:13.175106888 -0800
+++ b/fs/exec.c	2016-02-12 10:44:13.192107666 -0800
@@ -198,8 +198,12 @@ static struct page *get_arg_page(struct
 			return NULL;
 	}
 #endif
-	ret = get_user_pages(current, bprm->mm, pos,
-			1, write, 1, &page, NULL);
+	/*
+	 * We are doing an exec().  'current' is the process
+	 * doing the exec and bprm->mm is the new process's mm.
+	 */
+	ret = get_user_pages_remote(current, bprm->mm, pos, 1, write,
+			1, &page, NULL);
 	if (ret <= 0)
 		return NULL;
 
diff -puN include/linux/mm.h~introduce-get_user_pages_remote include/linux/mm.h
--- a/include/linux/mm.h~introduce-get_user_pages_remote	2016-02-12 10:44:13.176106934 -0800
+++ b/include/linux/mm.h	2016-02-12 10:44:13.192107666 -0800
@@ -1225,6 +1225,10 @@ long __get_user_pages(struct task_struct
 		      unsigned long start, unsigned long nr_pages,
 		      unsigned int foll_flags, struct page **pages,
 		      struct vm_area_struct **vmas, int *nonblocking);
+long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+			    unsigned long start, unsigned long nr_pages,
+			    int write, int force, struct page **pages,
+			    struct vm_area_struct **vmas);
 long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		    unsigned long start, unsigned long nr_pages,
 		    int write, int force, struct page **pages,
@@ -2168,6 +2172,7 @@ static inline struct page *follow_page(s
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
 #define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
 #define FOLL_MLOCK	0x1000	/* lock present pages */
+#define FOLL_REMOTE	0x2000	/* we are working on non-current tsk/mm */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff -puN kernel/events/uprobes.c~introduce-get_user_pages_remote kernel/events/uprobes.c
--- a/kernel/events/uprobes.c~introduce-get_user_pages_remote	2016-02-12 10:44:13.178107026 -0800
+++ b/kernel/events/uprobes.c	2016-02-12 10:44:13.193107711 -0800
@@ -299,7 +299,7 @@ int uprobe_write_opcode(struct mm_struct
 
 retry:
 	/* Read the page with vaddr into memory */
-	ret = get_user_pages(NULL, mm, vaddr, 1, 0, 1, &old_page, &vma);
+	ret = get_user_pages_remote(NULL, mm, vaddr, 1, 0, 1, &old_page, &vma);
 	if (ret <= 0)
 		return ret;
 
@@ -1700,7 +1700,13 @@ static int is_trap_at_addr(struct mm_str
 	if (likely(result == 0))
 		goto out;
 
-	result = get_user_pages(NULL, mm, vaddr, 1, 0, 1, &page, NULL);
+	/*
+	 * The NULL 'tsk' here ensures that any faults that occur here
+	 * will not be accounted to the task.  'mm' *is* current->mm,
+	 * but we treat this as a 'remote' access since it is
+	 * essentially a kernel access to the memory.
+	 */
+	result = get_user_pages_remote(NULL, mm, vaddr, 1, 0, 1, &page, NULL);
 	if (result < 0)
 		return result;
 
diff -puN mm/gup.c~introduce-get_user_pages_remote mm/gup.c
--- a/mm/gup.c~introduce-get_user_pages_remote	2016-02-12 10:44:13.180107117 -0800
+++ b/mm/gup.c	2016-02-12 10:44:13.194107757 -0800
@@ -870,7 +870,7 @@ long get_user_pages_unlocked(struct task
 EXPORT_SYMBOL(get_user_pages_unlocked);
 
 /*
- * get_user_pages() - pin user pages in memory
+ * get_user_pages_remote() - pin user pages in memory
  * @tsk:	the task_struct to use for page fault accounting, or
  *		NULL if faults are not to be recorded.
  * @mm:		mm_struct of target mm
@@ -924,12 +924,29 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
  * should use get_user_pages because it cannot pass
  * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault.
  */
-long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		unsigned long start, unsigned long nr_pages, int write,
-		int force, struct page **pages, struct vm_area_struct **vmas)
+long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, unsigned long nr_pages,
+		int write, int force, struct page **pages,
+		struct vm_area_struct **vmas)
 {
 	return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
-				       pages, vmas, NULL, false, FOLL_TOUCH);
+				       pages, vmas, NULL, false,
+				       FOLL_TOUCH | FOLL_REMOTE);
+}
+EXPORT_SYMBOL(get_user_pages_remote);
+
+/*
+ * This is the same as get_user_pages_remote() for the time
+ * being.
+ */
+long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, unsigned long nr_pages,
+		int write, int force, struct page **pages,
+		struct vm_area_struct **vmas)
+{
+	return __get_user_pages_locked(tsk, mm, start, nr_pages,
+				       write, force, pages, vmas, NULL, false,
+				       FOLL_TOUCH);
 }
 EXPORT_SYMBOL(get_user_pages);
 
diff -puN mm/memory.c~introduce-get_user_pages_remote mm/memory.c
--- a/mm/memory.c~introduce-get_user_pages_remote	2016-02-12 10:44:13.182107208 -0800
+++ b/mm/memory.c	2016-02-12 10:44:13.195107803 -0800
@@ -3664,7 +3664,7 @@ static int __access_remote_vm(struct tas
 		void *maddr;
 		struct page *page = NULL;
 
-		ret = get_user_pages(tsk, mm, addr, 1,
+		ret = get_user_pages_remote(tsk, mm, addr, 1,
 				write, 1, &page, &vma);
 		if (ret <= 0) {
 #ifndef CONFIG_HAVE_IOREMAP_PROT
diff -puN mm/process_vm_access.c~introduce-get_user_pages_remote mm/process_vm_access.c
--- a/mm/process_vm_access.c~introduce-get_user_pages_remote	2016-02-12 10:44:13.183107254 -0800
+++ b/mm/process_vm_access.c	2016-02-12 10:44:13.195107803 -0800
@@ -98,9 +98,14 @@ static int process_vm_rw_single_vec(unsi
 		int pages = min(nr_pages, max_pages_per_loop);
 		size_t bytes;
 
-		/* Get the pages we're interested in */
-		pages = get_user_pages_unlocked(task, mm, pa, pages,
-						vm_write, 0, process_pages);
+		/*
+		 * Get the pages we're interested in.  We must
+		 * add FOLL_REMOTE because task/mm might not be
+		 * current/current->mm
+		 */
+		pages = __get_user_pages_unlocked(task, mm, pa, pages,
+						  vm_write, 0, process_pages,
+						  FOLL_REMOTE);
 		if (pages <= 0)
 			return -EFAULT;
 
diff -puN security/tomoyo/domain.c~introduce-get_user_pages_remote security/tomoyo/domain.c
--- a/security/tomoyo/domain.c~introduce-get_user_pages_remote	2016-02-12 10:44:13.185107346 -0800
+++ b/security/tomoyo/domain.c	2016-02-12 10:44:13.196107848 -0800
@@ -874,7 +874,14 @@ bool tomoyo_dump_page(struct linux_binpr
 	}
 	/* Same with get_arg_page(bprm, pos, 0) in fs/exec.c */
 #ifdef CONFIG_MMU
-	if (get_user_pages(current, bprm->mm, pos, 1, 0, 1, &page, NULL) <= 0)
+	/*
+	 * This is called at execve() time in order to dig around
+	 * in the argv/environment of the new process
+	 * (represented by bprm).  'current' is the process doing
+	 * the execve().
+	 */
+	if (get_user_pages_remote(current, bprm->mm, pos, 1,
+				0, 1, &page, NULL) <= 0)
 		return false;
 #else
 	page = bprm->page[pos / PAGE_SIZE];
diff -puN virt/kvm/async_pf.c~introduce-get_user_pages_remote virt/kvm/async_pf.c
--- a/virt/kvm/async_pf.c~introduce-get_user_pages_remote	2016-02-12 10:44:13.187107437 -0800
+++ b/virt/kvm/async_pf.c	2016-02-12 10:44:13.196107848 -0800
@@ -79,7 +79,13 @@ static void async_pf_execute(struct work
 
 	might_sleep();
 
-	get_user_pages_unlocked(NULL, mm, addr, 1, 1, 0, NULL);
+	/*
+	 * This work is run asynchronously to the task which owns
+	 * mm and might be done in another context, so we must
+	 * use FOLL_REMOTE.
+	 */
+	__get_user_pages_unlocked(NULL, mm, addr, 1, 1, 0, NULL, FOLL_REMOTE);
+
 	kvm_async_page_present_sync(vcpu, apf);
 
 	spin_lock(&vcpu->async_pf.lock);
_


* [PATCH 02/33] mm: overload get_user_pages() functions
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
  2016-02-12 21:01 ` [PATCH 01/33] mm: introduce get_user_pages_remote() Dave Hansen
@ 2016-02-12 21:01 ` Dave Hansen
  2016-02-16  8:36   ` Ingo Molnar
  2016-02-18 20:15   ` [tip:mm/pkeys] mm/gup: Overload " tip-bot for Dave Hansen
  2016-02-12 21:01 ` [PATCH 03/33] mm, gup: switch callers of get_user_pages() to not pass tsk/mm Dave Hansen
                   ` (31 subsequent siblings)
  33 siblings, 2 replies; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen, srikar,
	vbabka, akpm, kirill.shutemov, aarcange, n-horiguchi, jack


From: Dave Hansen <dave.hansen@linux.intel.com>

The concept here was a suggestion from Ingo.  The implementation
horrors are all mine.

This allows get_user_pages(), get_user_pages_unlocked(), and
get_user_pages_locked() to be called with or without the
leading tsk/mm arguments.  We will give a compile-time warning
about the old style being __deprecated and we will also
WARN_ON() if the non-remote version is used for a remote-style
access.

Doing this, folks will get nice warnings and will not break the
build.  This should be nice for -next and will hopefully let
developers fix up their own code instead of maintainers needing
to do it at merge time.

The way we do this is hideous.  It uses the __VA_ARGS__ macro
functionality to call different functions based on the number
of arguments passed to the macro.

There's an additional hack to ensure that our EXPORT_SYMBOL()
of the deprecated symbols doesn't trigger a warning.

We should be able to remove this mess as soon as -rc1 hits in
the release after this is merged.
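
For anyone unfamiliar with the trick, the argument-counting
pattern looks like this in isolation (a standalone sketch, not
kernel code):

	#include <stdio.h>

	long f6(long a, long b, long c, long d, long e, long f)
	{
		return a;
	}

	long f8(long a, long b, long c, long d, long e, long f,
		long g, long h)
	{
		return h;
	}

	/*
	 * PICK() expands to its 9th argument.  An 8-argument call
	 * to f() puts f8 in slot 9; a 6-argument call shifts the
	 * list over so that f6 lands there instead.
	 */
	#define PICK(_1, _2, _3, _4, _5, _6, _7, _8, name, ...) name
	#define f(...) \
		PICK(__VA_ARGS__, f8, x, f6, x, x, x, x, x)(__VA_ARGS__)

	int main(void)
	{
		printf("%ld\n", f(1, 2, 3, 4, 5, 6));	    /* f6: 1 */
		printf("%ld\n", f(1, 2, 3, 4, 5, 6, 7, 8)); /* f8: 8 */
		return 0;
	}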

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: jack@suse.cz
---

 b/include/linux/mm.h |   74 ++++++++++++++++++++++++++++++++++++++++++++-------
 b/mm/gup.c           |   62 +++++++++++++++++++++++++++++++++---------
 b/mm/nommu.c         |   64 +++++++++++++++++++++++++++++++-------------
 b/mm/util.c          |    4 --
 4 files changed, 158 insertions(+), 46 deletions(-)

diff -puN include/linux/mm.h~gup-arg-helpers include/linux/mm.h
--- a/include/linux/mm.h~gup-arg-helpers	2016-02-12 10:44:13.845137517 -0800
+++ b/include/linux/mm.h	2016-02-12 10:44:13.854137929 -0800
@@ -1229,24 +1229,78 @@ long get_user_pages_remote(struct task_s
 			    unsigned long start, unsigned long nr_pages,
 			    int write, int force, struct page **pages,
 			    struct vm_area_struct **vmas);
-long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		    unsigned long start, unsigned long nr_pages,
-		    int write, int force, struct page **pages,
-		    struct vm_area_struct **vmas);
-long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
-		    unsigned long start, unsigned long nr_pages,
-		    int write, int force, struct page **pages,
-		    int *locked);
+long get_user_pages6(unsigned long start, unsigned long nr_pages,
+			    int write, int force, struct page **pages,
+			    struct vm_area_struct **vmas);
+long get_user_pages_locked6(unsigned long start, unsigned long nr_pages,
+		    int write, int force, struct page **pages, int *locked);
 long __get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
 			       unsigned long start, unsigned long nr_pages,
 			       int write, int force, struct page **pages,
 			       unsigned int gup_flags);
-long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
-		    unsigned long start, unsigned long nr_pages,
+long get_user_pages_unlocked5(unsigned long start, unsigned long nr_pages,
 		    int write, int force, struct page **pages);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
 
+/* suppress warnings from use in EXPORT_SYMBOL() */
+#ifndef __DISABLE_GUP_DEPRECATED
+#define __gup_deprecated __deprecated
+#else
+#define __gup_deprecated
+#endif
+/*
+ * These macros provide backward-compatibility with the old
+ * get_user_pages() variants which took tsk/mm.  These
+ * functions/macros provide both compile-time __deprecated so we
+ * can catch old-style use and not break the build.  The actual
+ * functions also have WARN_ON()s to let us know at runtime if
+ * the get_user_pages() should have been the "remote" variant.
+ *
+ * These are hideous, but temporary.
+ *
+ * If you run into one of these __deprecated warnings, look
+ * at how you are calling get_user_pages().  If you are calling
+ * it with current/current->mm as the first two arguments,
+ * simply remove those arguments.  The behavior will be the same
+ * as it is now.  If you are calling it on another task, use
+ * get_user_pages_remote() instead.
+ *
+ * Any questions?  Ask Dave Hansen <dave@sr71.net>
+ */
+long
+__gup_deprecated
+get_user_pages8(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, unsigned long nr_pages,
+		int write, int force, struct page **pages,
+		struct vm_area_struct **vmas);
+#define GUP_MACRO(_1, _2, _3, _4, _5, _6, _7, _8, get_user_pages, ...)	\
+	get_user_pages
+#define get_user_pages(...) GUP_MACRO(__VA_ARGS__,	\
+		get_user_pages8, x,			\
+		get_user_pages6, x, x, x, x, x)(__VA_ARGS__)
+
+__gup_deprecated
+long get_user_pages_locked8(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, unsigned long nr_pages,
+		int write, int force, struct page **pages,
+		int *locked);
+#define GUPL_MACRO(_1, _2, _3, _4, _5, _6, _7, _8, get_user_pages_locked, ...)	\
+	get_user_pages_locked
+#define get_user_pages_locked(...) GUPL_MACRO(__VA_ARGS__,	\
+		get_user_pages_locked8,	x,			\
+		get_user_pages_locked6, x, x, x, x)(__VA_ARGS__)
+
+__gup_deprecated
+long get_user_pages_unlocked7(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, unsigned long nr_pages,
+		int write, int force, struct page **pages);
+#define GUPU_MACRO(_1, _2, _3, _4, _5, _6, _7, get_user_pages_unlocked, ...)	\
+	get_user_pages_unlocked
+#define get_user_pages_unlocked(...) GUPU_MACRO(__VA_ARGS__,	\
+		get_user_pages_unlocked7, x,			\
+		get_user_pages_unlocked5, x, x, x, x)(__VA_ARGS__)
+
 /* Container for pinned pfns / pages */
 struct frame_vector {
 	unsigned int nr_allocated;	/* Number of frames we have space for */
diff -puN mm/gup.c~gup-arg-helpers mm/gup.c
--- a/mm/gup.c~gup-arg-helpers	2016-02-12 10:44:13.847137609 -0800
+++ b/mm/gup.c	2016-02-12 10:44:13.855137974 -0800
@@ -1,3 +1,4 @@
+#define __DISABLE_GUP_DEPRECATED 1
 #include <linux/kernel.h>
 #include <linux/errno.h>
 #include <linux/err.h>
@@ -807,15 +808,15 @@ static __always_inline long __get_user_p
  *      if (locked)
  *          up_read(&mm->mmap_sem);
  */
-long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
-			   unsigned long start, unsigned long nr_pages,
+long get_user_pages_locked6(unsigned long start, unsigned long nr_pages,
 			   int write, int force, struct page **pages,
 			   int *locked)
 {
-	return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
-				       pages, NULL, locked, true, FOLL_TOUCH);
+	return __get_user_pages_locked(current, current->mm, start, nr_pages,
+				       write, force, pages, NULL, locked, true,
+				       FOLL_TOUCH);
 }
-EXPORT_SYMBOL(get_user_pages_locked);
+EXPORT_SYMBOL(get_user_pages_locked6);
 
 /*
  * Same as get_user_pages_unlocked(...., FOLL_TOUCH) but it allows to
@@ -860,14 +861,13 @@ EXPORT_SYMBOL(__get_user_pages_unlocked)
  * or if "force" shall be set to 1 (get_user_pages_fast misses the
  * "force" parameter).
  */
-long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
-			     unsigned long start, unsigned long nr_pages,
+long get_user_pages_unlocked5(unsigned long start, unsigned long nr_pages,
 			     int write, int force, struct page **pages)
 {
-	return __get_user_pages_unlocked(tsk, mm, start, nr_pages, write,
-					 force, pages, FOLL_TOUCH);
+	return __get_user_pages_unlocked(current, current->mm, start, nr_pages,
+					 write, force, pages, FOLL_TOUCH);
 }
-EXPORT_SYMBOL(get_user_pages_unlocked);
+EXPORT_SYMBOL(get_user_pages_unlocked5);
 
 /*
  * get_user_pages_remote() - pin user pages in memory
@@ -939,16 +939,15 @@ EXPORT_SYMBOL(get_user_pages_remote);
  * This is the same as get_user_pages_remote() for the time
  * being.
  */
-long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		unsigned long start, unsigned long nr_pages,
+long get_user_pages6(unsigned long start, unsigned long nr_pages,
 		int write, int force, struct page **pages,
 		struct vm_area_struct **vmas)
 {
-	return __get_user_pages_locked(tsk, mm, start, nr_pages,
+	return __get_user_pages_locked(current, current->mm, start, nr_pages,
 				       write, force, pages, vmas, NULL, false,
 				       FOLL_TOUCH);
 }
-EXPORT_SYMBOL(get_user_pages);
+EXPORT_SYMBOL(get_user_pages6);
 
 /**
  * populate_vma_page_range() -  populate a range of pages in the vma.
@@ -1484,3 +1483,38 @@ int get_user_pages_fast(unsigned long st
 }
 
 #endif /* CONFIG_HAVE_GENERIC_RCU_GUP */
+
+long get_user_pages8(struct task_struct *tsk, struct mm_struct *mm,
+		     unsigned long start, unsigned long nr_pages,
+		     int write, int force, struct page **pages,
+		     struct vm_area_struct **vmas)
+{
+	WARN_ONCE(tsk != current, "get_user_pages() called on remote task");
+	WARN_ONCE(mm != current->mm, "get_user_pages() called on remote mm");
+
+	return get_user_pages6(start, nr_pages, write, force, pages, vmas);
+}
+EXPORT_SYMBOL(get_user_pages8);
+
+long get_user_pages_locked8(struct task_struct *tsk, struct mm_struct *mm,
+			    unsigned long start, unsigned long nr_pages,
+			    int write, int force, struct page **pages, int *locked)
+{
+	WARN_ONCE(tsk != current, "get_user_pages_locked() called on remote task");
+	WARN_ONCE(mm != current->mm, "get_user_pages_locked() called on remote mm");
+
+	return get_user_pages_locked6(start, nr_pages, write, force, pages, locked);
+}
+EXPORT_SYMBOL(get_user_pages_locked8);
+
+long get_user_pages_unlocked7(struct task_struct *tsk, struct mm_struct *mm,
+				  unsigned long start, unsigned long nr_pages,
+				  int write, int force, struct page **pages)
+{
+	WARN_ONCE(tsk != current, "get_user_pages_unlocked() called on remote task");
+	WARN_ONCE(mm != current->mm, "get_user_pages_unlocked() called on remote mm");
+
+	return get_user_pages_unlocked5(start, nr_pages, write, force, pages);
+}
+EXPORT_SYMBOL(get_user_pages_unlocked7);
+
diff -puN mm/nommu.c~gup-arg-helpers mm/nommu.c
--- a/mm/nommu.c~gup-arg-helpers	2016-02-12 10:44:13.848137654 -0800
+++ b/mm/nommu.c	2016-02-12 10:44:13.856138020 -0800
@@ -15,6 +15,8 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#define __DISABLE_GUP_DEPRECATED
+
 #include <linux/export.h>
 #include <linux/mm.h>
 #include <linux/vmacache.h>
@@ -182,8 +184,7 @@ finish_or_fault:
  *   slab page or a secondary page from a compound page
  * - don't permit access to VMAs that don't support it, such as I/O mappings
  */
-long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		    unsigned long start, unsigned long nr_pages,
+long get_user_pages6(unsigned long start, unsigned long nr_pages,
 		    int write, int force, struct page **pages,
 		    struct vm_area_struct **vmas)
 {
@@ -194,20 +195,18 @@ long get_user_pages(struct task_struct *
 	if (force)
 		flags |= FOLL_FORCE;
 
-	return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas,
-				NULL);
+	return __get_user_pages(current, current->mm, start, nr_pages, flags,
+				pages, vmas, NULL);
 }
-EXPORT_SYMBOL(get_user_pages);
+EXPORT_SYMBOL(get_user_pages6);
 
-long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
-			   unsigned long start, unsigned long nr_pages,
-			   int write, int force, struct page **pages,
-			   int *locked)
+long get_user_pages_locked6(unsigned long start, unsigned long nr_pages,
+			    int write, int force, struct page **pages,
+			    int *locked)
 {
-	return get_user_pages(tsk, mm, start, nr_pages, write, force,
-			      pages, NULL);
+	return get_user_pages6(start, nr_pages, write, force, pages, NULL);
 }
-EXPORT_SYMBOL(get_user_pages_locked);
+EXPORT_SYMBOL(get_user_pages_locked6);
 
 long __get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
 			       unsigned long start, unsigned long nr_pages,
@@ -216,21 +215,20 @@ long __get_user_pages_unlocked(struct ta
 {
 	long ret;
 	down_read(&mm->mmap_sem);
-	ret = get_user_pages(tsk, mm, start, nr_pages, write, force,
-			     pages, NULL);
+	ret = __get_user_pages(tsk, mm, start, nr_pages, gup_flags, pages,
+				NULL, NULL);
 	up_read(&mm->mmap_sem);
 	return ret;
 }
 EXPORT_SYMBOL(__get_user_pages_unlocked);
 
-long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
-			     unsigned long start, unsigned long nr_pages,
+long get_user_pages_unlocked5(unsigned long start, unsigned long nr_pages,
 			     int write, int force, struct page **pages)
 {
-	return __get_user_pages_unlocked(tsk, mm, start, nr_pages, write,
-					 force, pages, 0);
+	return __get_user_pages_unlocked(current, current->mm, start, nr_pages,
+					 write, force, pages, 0);
 }
-EXPORT_SYMBOL(get_user_pages_unlocked);
+EXPORT_SYMBOL(get_user_pages_unlocked5);
 
 /**
  * follow_pfn - look up PFN at a user virtual address
@@ -2108,3 +2106,31 @@ static int __meminit init_admin_reserve(
 	return 0;
 }
 subsys_initcall(init_admin_reserve);
+
+long get_user_pages8(struct task_struct *tsk, struct mm_struct *mm,
+		     unsigned long start, unsigned long nr_pages,
+		     int write, int force, struct page **pages,
+		     struct vm_area_struct **vmas)
+{
+	return get_user_pages6(start, nr_pages, write, force, pages, vmas);
+}
+EXPORT_SYMBOL(get_user_pages8);
+
+long get_user_pages_locked8(struct task_struct *tsk, struct mm_struct *mm,
+			    unsigned long start, unsigned long nr_pages,
+			    int write, int force, struct page **pages,
+			    int *locked)
+{
+	return get_user_pages_locked6(start, nr_pages, write,
+				      force, pages, locked);
+}
+EXPORT_SYMBOL(get_user_pages_locked8);
+
+long get_user_pages_unlocked7(struct task_struct *tsk, struct mm_struct *mm,
+			      unsigned long start, unsigned long nr_pages,
+			      int write, int force, struct page **pages)
+{
+	return get_user_pages_unlocked5(start, nr_pages, write, force, pages);
+}
+EXPORT_SYMBOL(get_user_pages_unlocked7);
+
diff -puN mm/util.c~gup-arg-helpers mm/util.c
--- a/mm/util.c~gup-arg-helpers	2016-02-12 10:44:13.850137746 -0800
+++ b/mm/util.c	2016-02-12 10:44:13.856138020 -0800
@@ -283,9 +283,7 @@ EXPORT_SYMBOL_GPL(__get_user_pages_fast)
 int __weak get_user_pages_fast(unsigned long start,
 				int nr_pages, int write, struct page **pages)
 {
-	struct mm_struct *mm = current->mm;
-	return get_user_pages_unlocked(current, mm, start, nr_pages,
-				       write, 0, pages);
+	return get_user_pages_unlocked(start, nr_pages, write, 0, pages);
 }
 EXPORT_SYMBOL_GPL(get_user_pages_fast);
 
_


* [PATCH 03/33] mm, gup: switch callers of get_user_pages() to not pass tsk/mm
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
  2016-02-12 21:01 ` [PATCH 01/33] mm: introduce get_user_pages_remote() Dave Hansen
  2016-02-12 21:01 ` [PATCH 02/33] mm: overload get_user_pages() functions Dave Hansen
@ 2016-02-12 21:01 ` Dave Hansen
  2016-02-18 20:16   ` [tip:mm/pkeys] mm/gup: Switch all " tip-bot for Dave Hansen
  2016-02-12 21:01 ` [PATCH 04/33] x86, fpu: add placeholder for Processor Trace XSAVE state Dave Hansen
                   ` (30 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen, srikar,
	vbabka, akpm, kirill.shutemov, aarcange, n-horiguchi, jack


From: Dave Hansen <dave.hansen@linux.intel.com>

We will soon modify the vanilla get_user_pages() so it can no
longer be used on mm/tasks other than 'current/current->mm',
which is by far the most common way it is called.  For now,
we allow the old-style calls, but warn when they are used.
(implemented in the previous patch)

This patch switches all callers of:

	get_user_pages()
	get_user_pages_unlocked()
	get_user_pages_locked()

to stop passing tsk/mm so they will no longer see the warnings.
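
Each conversion is mechanical.  For a caller already operating on
current/current->mm, it is just a matter of dropping the first two
arguments, for example:

	/* before */
	ret = get_user_pages(current, current->mm, addr, 1,
			     write, force, &page, NULL);
	/* after */
	ret = get_user_pages(addr, 1, write, force, &page, NULL);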

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: jack@suse.cz
---

 b/arch/cris/arch-v32/drivers/cryptocop.c      |    8 ++------
 b/arch/ia64/kernel/err_inject.c               |    3 +--
 b/arch/mips/mm/gup.c                          |    3 +--
 b/arch/s390/mm/gup.c                          |    4 +---
 b/arch/sh/mm/gup.c                            |    2 +-
 b/arch/sparc/mm/gup.c                         |    2 +-
 b/arch/x86/mm/gup.c                           |    2 +-
 b/arch/x86/mm/mpx.c                           |    4 ++--
 b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c     |    3 +--
 b/drivers/gpu/drm/radeon/radeon_ttm.c         |    3 +--
 b/drivers/gpu/drm/via/via_dmablit.c           |    3 +--
 b/drivers/infiniband/core/umem.c              |    2 +-
 b/drivers/infiniband/hw/mthca/mthca_memfree.c |    3 +--
 b/drivers/infiniband/hw/qib/qib_user_pages.c  |    3 +--
 b/drivers/infiniband/hw/usnic/usnic_uiom.c    |    2 +-
 b/drivers/media/pci/ivtv/ivtv-udma.c          |    4 ++--
 b/drivers/media/pci/ivtv/ivtv-yuv.c           |   10 ++++------
 b/drivers/media/v4l2-core/videobuf-dma-sg.c   |    3 +--
 b/drivers/misc/mic/scif/scif_rma.c            |    2 --
 b/drivers/misc/sgi-gru/grufault.c             |    3 +--
 b/drivers/scsi/st.c                           |    2 --
 b/drivers/video/fbdev/pvr2fb.c                |    4 ++--
 b/drivers/virt/fsl_hypervisor.c               |    5 ++---
 b/mm/frame_vector.c                           |    2 +-
 b/mm/gup.c                                    |    6 ++++--
 b/mm/ksm.c                                    |    2 +-
 b/mm/mempolicy.c                              |    6 +++---
 b/net/ceph/pagevec.c                          |    2 +-
 b/virt/kvm/kvm_main.c                         |   10 +++++-----
 29 files changed, 44 insertions(+), 64 deletions(-)

diff -puN arch/cris/arch-v32/drivers/cryptocop.c~get_current_user_pages arch/cris/arch-v32/drivers/cryptocop.c
--- a/arch/cris/arch-v32/drivers/cryptocop.c~get_current_user_pages	2016-02-12 10:44:14.365161289 -0800
+++ b/arch/cris/arch-v32/drivers/cryptocop.c	2016-02-12 10:44:14.415163575 -0800
@@ -2719,9 +2719,7 @@ static int cryptocop_ioctl_process(struc
 	/* Acquire the mm page semaphore. */
 	down_read(&current->mm->mmap_sem);
 
-	err = get_user_pages(current,
-			     current->mm,
-			     (unsigned long int)(oper.indata + prev_ix),
+	err = get_user_pages((unsigned long int)(oper.indata + prev_ix),
 			     noinpages,
 			     0,  /* read access only for in data */
 			     0, /* no force */
@@ -2736,9 +2734,7 @@ static int cryptocop_ioctl_process(struc
 	}
 	noinpages = err;
 	if (oper.do_cipher){
-		err = get_user_pages(current,
-				     current->mm,
-				     (unsigned long int)oper.cipher_outdata,
+		err = get_user_pages((unsigned long int)oper.cipher_outdata,
 				     nooutpages,
 				     1, /* write access for out data */
 				     0, /* no force */
diff -puN arch/ia64/kernel/err_inject.c~get_current_user_pages arch/ia64/kernel/err_inject.c
--- a/arch/ia64/kernel/err_inject.c~get_current_user_pages	2016-02-12 10:44:14.367161380 -0800
+++ b/arch/ia64/kernel/err_inject.c	2016-02-12 10:44:14.416163620 -0800
@@ -142,8 +142,7 @@ store_virtual_to_phys(struct device *dev
 	u64 virt_addr=simple_strtoull(buf, NULL, 16);
 	int ret;
 
-        ret = get_user_pages(current, current->mm, virt_addr,
-                        1, VM_READ, 0, NULL, NULL);
+	ret = get_user_pages(virt_addr, 1, VM_READ, 0, NULL, NULL);
 	if (ret<=0) {
 #ifdef ERR_INJ_DEBUG
 		printk("Virtual address %lx is not existing.\n",virt_addr);
diff -puN arch/mips/mm/gup.c~get_current_user_pages arch/mips/mm/gup.c
--- a/arch/mips/mm/gup.c~get_current_user_pages	2016-02-12 10:44:14.368161426 -0800
+++ b/arch/mips/mm/gup.c	2016-02-12 10:44:14.416163620 -0800
@@ -286,8 +286,7 @@ slow_irqon:
 	start += nr << PAGE_SHIFT;
 	pages += nr;
 
-	ret = get_user_pages_unlocked(current, mm, start,
-				      (end - start) >> PAGE_SHIFT,
+	ret = get_user_pages_unlocked(start, (end - start) >> PAGE_SHIFT,
 				      write, 0, pages);
 
 	/* Have to be a bit careful with return values */
diff -puN arch/s390/mm/gup.c~get_current_user_pages arch/s390/mm/gup.c
--- a/arch/s390/mm/gup.c~get_current_user_pages	2016-02-12 10:44:14.370161517 -0800
+++ b/arch/s390/mm/gup.c	2016-02-12 10:44:14.416163620 -0800
@@ -210,7 +210,6 @@ int __get_user_pages_fast(unsigned long
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages)
 {
-	struct mm_struct *mm = current->mm;
 	int nr, ret;
 
 	might_sleep();
@@ -222,8 +221,7 @@ int get_user_pages_fast(unsigned long st
 	/* Try to get the remaining pages with get_user_pages */
 	start += nr << PAGE_SHIFT;
 	pages += nr;
-	ret = get_user_pages_unlocked(current, mm, start,
-			     nr_pages - nr, write, 0, pages);
+	ret = get_user_pages_unlocked(start, nr_pages - nr, write, 0, pages);
 	/* Have to be a bit careful with return values */
 	if (nr > 0)
 		ret = (ret < 0) ? nr : ret + nr;
diff -puN arch/sh/mm/gup.c~get_current_user_pages arch/sh/mm/gup.c
--- a/arch/sh/mm/gup.c~get_current_user_pages	2016-02-12 10:44:14.371161563 -0800
+++ b/arch/sh/mm/gup.c	2016-02-12 10:44:14.417163666 -0800
@@ -257,7 +257,7 @@ slow_irqon:
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
-		ret = get_user_pages_unlocked(current, mm, start,
+		ret = get_user_pages_unlocked(start,
 			(end - start) >> PAGE_SHIFT, write, 0, pages);
 
 		/* Have to be a bit careful with return values */
diff -puN arch/sparc/mm/gup.c~get_current_user_pages arch/sparc/mm/gup.c
--- a/arch/sparc/mm/gup.c~get_current_user_pages	2016-02-12 10:44:14.373161655 -0800
+++ b/arch/sparc/mm/gup.c	2016-02-12 10:44:14.417163666 -0800
@@ -237,7 +237,7 @@ slow:
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
-		ret = get_user_pages_unlocked(current, mm, start,
+		ret = get_user_pages_unlocked(start,
 			(end - start) >> PAGE_SHIFT, write, 0, pages);
 
 		/* Have to be a bit careful with return values */
diff -puN arch/x86/mm/gup.c~get_current_user_pages arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c~get_current_user_pages	2016-02-12 10:44:14.375161746 -0800
+++ b/arch/x86/mm/gup.c	2016-02-12 10:44:14.417163666 -0800
@@ -422,7 +422,7 @@ slow_irqon:
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
-		ret = get_user_pages_unlocked(current, mm, start,
+		ret = get_user_pages_unlocked(start,
 					      (end - start) >> PAGE_SHIFT,
 					      write, 0, pages);
 
diff -puN arch/x86/mm/mpx.c~get_current_user_pages arch/x86/mm/mpx.c
--- a/arch/x86/mm/mpx.c~get_current_user_pages	2016-02-12 10:44:14.376161792 -0800
+++ b/arch/x86/mm/mpx.c	2016-02-12 10:44:14.418163712 -0800
@@ -546,8 +546,8 @@ static int mpx_resolve_fault(long __user
 	int nr_pages = 1;
 	int force = 0;
 
-	gup_ret = get_user_pages(current, current->mm, (unsigned long)addr,
-				 nr_pages, write, force, NULL, NULL);
+	gup_ret = get_user_pages((unsigned long)addr, nr_pages, write,
+			force, NULL, NULL);
 	/*
 	 * get_user_pages() returns number of pages gotten.
 	 * 0 means we failed to fault in and get anything,
diff -puN drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c~get_current_user_pages drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c~get_current_user_pages	2016-02-12 10:44:14.378161883 -0800
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c	2016-02-12 10:44:14.418163712 -0800
@@ -518,8 +518,7 @@ static int amdgpu_ttm_tt_pin_userptr(str
 		uint64_t userptr = gtt->userptr + pinned * PAGE_SIZE;
 		struct page **pages = ttm->pages + pinned;
 
-		r = get_user_pages(current, current->mm, userptr, num_pages,
-				   write, 0, pages, NULL);
+		r = get_user_pages(userptr, num_pages, write, 0, pages, NULL);
 		if (r < 0)
 			goto release_pages;
 
diff -puN drivers/gpu/drm/radeon/radeon_ttm.c~get_current_user_pages drivers/gpu/drm/radeon/radeon_ttm.c
--- a/drivers/gpu/drm/radeon/radeon_ttm.c~get_current_user_pages	2016-02-12 10:44:14.380161975 -0800
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c	2016-02-12 10:44:14.419163758 -0800
@@ -554,8 +554,7 @@ static int radeon_ttm_tt_pin_userptr(str
 		uint64_t userptr = gtt->userptr + pinned * PAGE_SIZE;
 		struct page **pages = ttm->pages + pinned;
 
-		r = get_user_pages(current, current->mm, userptr, num_pages,
-				   write, 0, pages, NULL);
+		r = get_user_pages(userptr, num_pages, write, 0, pages, NULL);
 		if (r < 0)
 			goto release_pages;
 
diff -puN drivers/gpu/drm/via/via_dmablit.c~get_current_user_pages drivers/gpu/drm/via/via_dmablit.c
--- a/drivers/gpu/drm/via/via_dmablit.c~get_current_user_pages	2016-02-12 10:44:14.381162020 -0800
+++ b/drivers/gpu/drm/via/via_dmablit.c	2016-02-12 10:44:14.419163758 -0800
@@ -239,8 +239,7 @@ via_lock_all_dma_pages(drm_via_sg_info_t
 	if (NULL == vsg->pages)
 		return -ENOMEM;
 	down_read(&current->mm->mmap_sem);
-	ret = get_user_pages(current, current->mm,
-			     (unsigned long)xfer->mem_addr,
+	ret = get_user_pages((unsigned long)xfer->mem_addr,
 			     vsg->num_pages,
 			     (vsg->direction == DMA_FROM_DEVICE),
 			     0, vsg->pages, NULL);
diff -puN drivers/infiniband/core/umem.c~get_current_user_pages drivers/infiniband/core/umem.c
--- a/drivers/infiniband/core/umem.c~get_current_user_pages	2016-02-12 10:44:14.383162112 -0800
+++ b/drivers/infiniband/core/umem.c	2016-02-12 10:44:14.420163803 -0800
@@ -188,7 +188,7 @@ struct ib_umem *ib_umem_get(struct ib_uc
 	sg_list_start = umem->sg_head.sgl;
 
 	while (npages) {
-		ret = get_user_pages(current, current->mm, cur_base,
+		ret = get_user_pages(cur_base,
 				     min_t(unsigned long, npages,
 					   PAGE_SIZE / sizeof (struct page *)),
 				     1, !umem->writable, page_list, vma_list);
diff -puN drivers/infiniband/hw/mthca/mthca_memfree.c~get_current_user_pages drivers/infiniband/hw/mthca/mthca_memfree.c
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c~get_current_user_pages	2016-02-12 10:44:14.385162203 -0800
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c	2016-02-12 10:44:14.420163803 -0800
@@ -472,8 +472,7 @@ int mthca_map_user_db(struct mthca_dev *
 		goto out;
 	}
 
-	ret = get_user_pages(current, current->mm, uaddr & PAGE_MASK, 1, 1, 0,
-			     pages, NULL);
+	ret = get_user_pages(uaddr & PAGE_MASK, 1, 1, 0, pages, NULL);
 	if (ret < 0)
 		goto out;
 
diff -puN drivers/infiniband/hw/qib/qib_user_pages.c~get_current_user_pages drivers/infiniband/hw/qib/qib_user_pages.c
--- a/drivers/infiniband/hw/qib/qib_user_pages.c~get_current_user_pages	2016-02-12 10:44:14.386162249 -0800
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c	2016-02-12 10:44:14.420163803 -0800
@@ -66,8 +66,7 @@ static int __qib_get_user_pages(unsigned
 	}
 
 	for (got = 0; got < num_pages; got += ret) {
-		ret = get_user_pages(current, current->mm,
-				     start_page + got * PAGE_SIZE,
+		ret = get_user_pages(start_page + got * PAGE_SIZE,
 				     num_pages - got, 1, 1,
 				     p + got, NULL);
 		if (ret < 0)
diff -puN drivers/infiniband/hw/usnic/usnic_uiom.c~get_current_user_pages drivers/infiniband/hw/usnic/usnic_uiom.c
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c~get_current_user_pages	2016-02-12 10:44:14.388162340 -0800
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c	2016-02-12 10:44:14.421163849 -0800
@@ -144,7 +144,7 @@ static int usnic_uiom_get_pages(unsigned
 	ret = 0;
 
 	while (npages) {
-		ret = get_user_pages(current, current->mm, cur_base,
+		ret = get_user_pages(cur_base,
 					min_t(unsigned long, npages,
 					PAGE_SIZE / sizeof(struct page *)),
 					1, !writable, page_list, NULL);
diff -puN drivers/media/pci/ivtv/ivtv-udma.c~get_current_user_pages drivers/media/pci/ivtv/ivtv-udma.c
--- a/drivers/media/pci/ivtv/ivtv-udma.c~get_current_user_pages	2016-02-12 10:44:14.389162386 -0800
+++ b/drivers/media/pci/ivtv/ivtv-udma.c	2016-02-12 10:44:14.421163849 -0800
@@ -124,8 +124,8 @@ int ivtv_udma_setup(struct ivtv *itv, un
 	}
 
 	/* Get user pages for DMA Xfer */
-	err = get_user_pages_unlocked(current, current->mm,
-			user_dma.uaddr, user_dma.page_count, 0, 1, dma->map);
+	err = get_user_pages_unlocked(user_dma.uaddr, user_dma.page_count, 0,
+			1, dma->map);
 
 	if (user_dma.page_count != err) {
 		IVTV_DEBUG_WARN("failed to map user pages, returned %d instead of %d\n",
diff -puN drivers/media/pci/ivtv/ivtv-yuv.c~get_current_user_pages drivers/media/pci/ivtv/ivtv-yuv.c
--- a/drivers/media/pci/ivtv/ivtv-yuv.c~get_current_user_pages	2016-02-12 10:44:14.391162477 -0800
+++ b/drivers/media/pci/ivtv/ivtv-yuv.c	2016-02-12 10:44:14.422163895 -0800
@@ -75,14 +75,12 @@ static int ivtv_yuv_prep_user_dma(struct
 	ivtv_udma_get_page_info (&uv_dma, (unsigned long)args->uv_source, 360 * uv_decode_height);
 
 	/* Get user pages for DMA Xfer */
-	y_pages = get_user_pages_unlocked(current, current->mm,
-				y_dma.uaddr, y_dma.page_count, 0, 1,
-				&dma->map[0]);
+	y_pages = get_user_pages_unlocked(y_dma.uaddr,
+			y_dma.page_count, 0, 1, &dma->map[0]);
 	uv_pages = 0; /* silence gcc. value is set and consumed only if: */
 	if (y_pages == y_dma.page_count) {
-		uv_pages = get_user_pages_unlocked(current, current->mm,
-					uv_dma.uaddr, uv_dma.page_count, 0, 1,
-					&dma->map[y_pages]);
+		uv_pages = get_user_pages_unlocked(uv_dma.uaddr,
+				uv_dma.page_count, 0, 1, &dma->map[y_pages]);
 	}
 
 	if (y_pages != y_dma.page_count || uv_pages != uv_dma.page_count) {
diff -puN drivers/media/v4l2-core/videobuf-dma-sg.c~get_current_user_pages drivers/media/v4l2-core/videobuf-dma-sg.c
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c~get_current_user_pages	2016-02-12 10:44:14.393162569 -0800
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c	2016-02-12 10:44:14.422163895 -0800
@@ -181,8 +181,7 @@ static int videobuf_dma_init_user_locked
 	dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
 		data, size, dma->nr_pages);
 
-	err = get_user_pages(current, current->mm,
-			     data & PAGE_MASK, dma->nr_pages,
+	err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
 			     rw == READ, 1, /* force */
 			     dma->pages, NULL);
 
diff -puN drivers/misc/mic/scif/scif_rma.c~get_current_user_pages drivers/misc/mic/scif/scif_rma.c
--- a/drivers/misc/mic/scif/scif_rma.c~get_current_user_pages	2016-02-12 10:44:14.395162660 -0800
+++ b/drivers/misc/mic/scif/scif_rma.c	2016-02-12 10:44:14.423163940 -0800
@@ -1394,8 +1394,6 @@ retry:
 		}
 
 		pinned_pages->nr_pages = get_user_pages(
-				current,
-				mm,
 				(u64)addr,
 				nr_pages,
 				!!(prot & SCIF_PROT_WRITE),
diff -puN drivers/misc/sgi-gru/grufault.c~get_current_user_pages drivers/misc/sgi-gru/grufault.c
--- a/drivers/misc/sgi-gru/grufault.c~get_current_user_pages	2016-02-12 10:44:14.396162706 -0800
+++ b/drivers/misc/sgi-gru/grufault.c	2016-02-12 10:44:14.423163940 -0800
@@ -198,8 +198,7 @@ static int non_atomic_pte_lookup(struct
 #else
 	*pageshift = PAGE_SHIFT;
 #endif
-	if (get_user_pages
-	    (current, current->mm, vaddr, 1, write, 0, &page, NULL) <= 0)
+	if (get_user_pages(vaddr, 1, write, 0, &page, NULL) <= 0)
 		return -EFAULT;
 	*paddr = page_to_phys(page);
 	put_page(page);
diff -puN drivers/scsi/st.c~get_current_user_pages drivers/scsi/st.c
--- a/drivers/scsi/st.c~get_current_user_pages	2016-02-12 10:44:14.398162797 -0800
+++ b/drivers/scsi/st.c	2016-02-12 10:44:14.424163986 -0800
@@ -4817,8 +4817,6 @@ static int sgl_map_user_pages(struct st_
         /* Try to fault in all of the necessary pages */
         /* rw==READ means read from drive, write into memory area */
 	res = get_user_pages_unlocked(
-		current,
-		current->mm,
 		uaddr,
 		nr_pages,
 		rw == READ,
diff -puN drivers/video/fbdev/pvr2fb.c~get_current_user_pages drivers/video/fbdev/pvr2fb.c
--- a/drivers/video/fbdev/pvr2fb.c~get_current_user_pages	2016-02-12 10:44:14.400162889 -0800
+++ b/drivers/video/fbdev/pvr2fb.c	2016-02-12 10:44:14.425164032 -0800
@@ -686,8 +686,8 @@ static ssize_t pvr2fb_write(struct fb_in
 	if (!pages)
 		return -ENOMEM;
 
-	ret = get_user_pages_unlocked(current, current->mm, (unsigned long)buf,
-				      nr_pages, WRITE, 0, pages);
+	ret = get_user_pages_unlocked((unsigned long)buf, nr_pages, WRITE,
+			0, pages);
 
 	if (ret < nr_pages) {
 		nr_pages = ret;
diff -puN drivers/virt/fsl_hypervisor.c~get_current_user_pages drivers/virt/fsl_hypervisor.c
--- a/drivers/virt/fsl_hypervisor.c~get_current_user_pages	2016-02-12 10:44:14.401162935 -0800
+++ b/drivers/virt/fsl_hypervisor.c	2016-02-12 10:44:14.425164032 -0800
@@ -244,9 +244,8 @@ static long ioctl_memcpy(struct fsl_hv_i
 
 	/* Get the physical addresses of the source buffer */
 	down_read(&current->mm->mmap_sem);
-	num_pinned = get_user_pages(current, current->mm,
-		param.local_vaddr - lb_offset, num_pages,
-		(param.source == -1) ? READ : WRITE,
+	num_pinned = get_user_pages(param.local_vaddr - lb_offset,
+		num_pages, (param.source == -1) ? READ : WRITE,
 		0, pages, NULL);
 	up_read(&current->mm->mmap_sem);
 
diff -puN mm/frame_vector.c~get_current_user_pages mm/frame_vector.c
--- a/mm/frame_vector.c~get_current_user_pages	2016-02-12 10:44:14.403163026 -0800
+++ b/mm/frame_vector.c	2016-02-12 10:44:14.426164078 -0800
@@ -58,7 +58,7 @@ int get_vaddr_frames(unsigned long start
 	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
 		vec->got_ref = true;
 		vec->is_pfns = false;
-		ret = get_user_pages_locked(current, mm, start, nr_frames,
+		ret = get_user_pages_locked(start, nr_frames,
 			write, force, (struct page **)(vec->ptrs), &locked);
 		goto out;
 	}
diff -puN mm/gup.c~get_current_user_pages mm/gup.c
--- a/mm/gup.c~get_current_user_pages	2016-02-12 10:44:14.405163117 -0800
+++ b/mm/gup.c	2016-02-12 10:44:14.426164078 -0800
@@ -936,8 +936,10 @@ long get_user_pages_remote(struct task_s
 EXPORT_SYMBOL(get_user_pages_remote);
 
 /*
- * This is the same as get_user_pages_remote() for the time
- * being.
+ * This is the same as get_user_pages_remote(), just with a
+ * less-flexible calling convention where we assume that the task
+ * and mm being operated on are the current task's.  We also
+ * obviously don't pass FOLL_REMOTE in here.
  */
 long get_user_pages6(unsigned long start, unsigned long nr_pages,
 		int write, int force, struct page **pages,
diff -puN mm/ksm.c~get_current_user_pages mm/ksm.c
--- a/mm/ksm.c~get_current_user_pages	2016-02-12 10:44:14.406163163 -0800
+++ b/mm/ksm.c	2016-02-12 10:44:14.427164123 -0800
@@ -352,7 +352,7 @@ static inline bool ksm_test_exit(struct
 /*
  * We use break_ksm to break COW on a ksm page: it's a stripped down
  *
- *	if (get_user_pages(current, mm, addr, 1, 1, 1, &page, NULL) == 1)
+ *	if (get_user_pages(addr, 1, 1, 1, &page, NULL) == 1)
  *		put_page(page);
  *
  * but taking great care only to touch a ksm page, in a VM_MERGEABLE vma,
diff -puN mm/mempolicy.c~get_current_user_pages mm/mempolicy.c
--- a/mm/mempolicy.c~get_current_user_pages	2016-02-12 10:44:14.408163255 -0800
+++ b/mm/mempolicy.c	2016-02-12 10:44:14.428164169 -0800
@@ -844,12 +844,12 @@ static void get_policy_nodemask(struct m
 	}
 }
 
-static int lookup_node(struct mm_struct *mm, unsigned long addr)
+static int lookup_node(unsigned long addr)
 {
 	struct page *p;
 	int err;
 
-	err = get_user_pages(current, mm, addr & PAGE_MASK, 1, 0, 0, &p, NULL);
+	err = get_user_pages(addr & PAGE_MASK, 1, 0, 0, &p, NULL);
 	if (err >= 0) {
 		err = page_to_nid(p);
 		put_page(p);
@@ -904,7 +904,7 @@ static long do_get_mempolicy(int *policy
 
 	if (flags & MPOL_F_NODE) {
 		if (flags & MPOL_F_ADDR) {
-			err = lookup_node(mm, addr);
+			err = lookup_node(addr);
 			if (err < 0)
 				goto out;
 			*policy = err;
diff -puN net/ceph/pagevec.c~get_current_user_pages net/ceph/pagevec.c
--- a/net/ceph/pagevec.c~get_current_user_pages	2016-02-12 10:44:14.410163346 -0800
+++ b/net/ceph/pagevec.c	2016-02-12 10:44:14.428164169 -0800
@@ -24,7 +24,7 @@ struct page **ceph_get_direct_page_vecto
 		return ERR_PTR(-ENOMEM);
 
 	while (got < num_pages) {
-		rc = get_user_pages_unlocked(current, current->mm,
+		rc = get_user_pages_unlocked(
 		    (unsigned long)data + ((unsigned long)got * PAGE_SIZE),
 		    num_pages - got, write_page, 0, pages + got);
 		if (rc < 0)
diff -puN virt/kvm/kvm_main.c~get_current_user_pages virt/kvm/kvm_main.c
--- a/virt/kvm/kvm_main.c~get_current_user_pages	2016-02-12 10:44:14.411163392 -0800
+++ b/virt/kvm/kvm_main.c	2016-02-12 10:44:14.429164215 -0800
@@ -1264,15 +1264,16 @@ unsigned long kvm_vcpu_gfn_to_hva_prot(s
 	return gfn_to_hva_memslot_prot(slot, gfn, writable);
 }
 
-static int get_user_page_nowait(struct task_struct *tsk, struct mm_struct *mm,
-	unsigned long start, int write, struct page **page)
+static int get_user_page_nowait(unsigned long start, int write,
+		struct page **page)
 {
 	int flags = FOLL_TOUCH | FOLL_NOWAIT | FOLL_HWPOISON | FOLL_GET;
 
 	if (write)
 		flags |= FOLL_WRITE;
 
-	return __get_user_pages(tsk, mm, start, 1, flags, page, NULL, NULL);
+	return __get_user_pages(current, current->mm, start, 1, flags, page,
+			NULL, NULL);
 }
 
 static inline int check_user_page_hwpoison(unsigned long addr)
@@ -1334,8 +1335,7 @@ static int hva_to_pfn_slow(unsigned long
 
 	if (async) {
 		down_read(&current->mm->mmap_sem);
-		npages = get_user_page_nowait(current, current->mm,
-					      addr, write_fault, page);
+		npages = get_user_page_nowait(addr, write_fault, page);
 		up_read(&current->mm->mmap_sem);
 	} else
 		npages = __get_user_pages_unlocked(current, current->mm, addr, 1,
_
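
For reference, the two calling conventions now look roughly like
this (a sketch; see the prototypes in mm/gup.c for the full
argument types):

	/* explicit tsk/mm, may pass FOLL_REMOTE: */
	get_user_pages_remote(tsk, mm, start, nr_pages, write, force,
			      pages, vmas);
	/* implicitly operates on current and current->mm: */
	get_user_pages(start, nr_pages, write, force, pages, vmas);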

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 04/33] x86, fpu: add placeholder for Processor Trace XSAVE state
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (2 preceding siblings ...)
  2016-02-12 21:01 ` [PATCH 03/33] mm, gup: switch callers of get_user_pages() to not pass tsk/mm Dave Hansen
@ 2016-02-12 21:01 ` Dave Hansen
  2016-02-18 20:16   ` [tip:mm/pkeys] x86/fpu: Add placeholder for 'Processor Trace' " tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 05/33] x86, pkeys: Add Kconfig option Dave Hansen
                   ` (29 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen, ak,
	yu-cheng.yu, fenghua.yu


From: Dave Hansen <dave.hansen@linux.intel.com>

There is an XSAVE state component for Intel Processor Trace (PT).
But, we do not currently use it.

We add a placeholder for it in the code so that it is not a mystery
and so that we do not need an explicit enum initialization for
Protection Keys in a later patch.

Why don't we use it?

We might end up using this at _some_ point in the future.  But,
this is a "system" state which requires using the currently
unsupported XSAVES feature.  Unlike all the other XSAVE states,
PT state is also not directly tied to a thread.  You might
context-switch between threads, but not want to change any of the
PT state.  Or, you might switch between threads, and *do* want to
change PT state, all depending on what is being traced.

We currently just manually set some MSRs to do this PT context
switching, and it is unclear whether replacing our direct MSR use
with XSAVE will be a net win or loss, both in code complexity and
performance.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: yu-cheng.yu@intel.com
Cc: fenghua.yu@intel.com
---

 b/arch/x86/include/asm/fpu/types.h |    1 +
 b/arch/x86/kernel/fpu/xstate.c     |   10 ++++++++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/fpu/types.h~pt-xstate-bit arch/x86/include/asm/fpu/types.h
--- a/arch/x86/include/asm/fpu/types.h~pt-xstate-bit	2016-02-12 10:44:15.475212032 -0800
+++ b/arch/x86/include/asm/fpu/types.h	2016-02-12 10:44:15.479212215 -0800
@@ -108,6 +108,7 @@ enum xfeature {
 	XFEATURE_OPMASK,
 	XFEATURE_ZMM_Hi256,
 	XFEATURE_Hi16_ZMM,
+	XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
 
 	XFEATURE_MAX,
 };
diff -puN arch/x86/kernel/fpu/xstate.c~pt-xstate-bit arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~pt-xstate-bit	2016-02-12 10:44:15.476212078 -0800
+++ b/arch/x86/kernel/fpu/xstate.c	2016-02-12 10:44:15.480212261 -0800
@@ -13,6 +13,11 @@
 
 #include <asm/tlbflush.h>
 
+/*
+ * Although we spell it out in here, the Processor Trace
+ * xfeature is completely unused.  We use other mechanisms
+ * to save/restore PT state in Linux.
+ */
 static const char *xfeature_names[] =
 {
 	"x87 floating point registers"	,
@@ -23,7 +28,7 @@ static const char *xfeature_names[] =
 	"AVX-512 opmask"		,
 	"AVX-512 Hi256"			,
 	"AVX-512 ZMM_Hi256"		,
-	"unknown xstate feature"	,
+	"Processor Trace (unused)"	,
 };
 
 /*
@@ -470,7 +475,8 @@ static void check_xstate_against_struct(
 	 * numbers.
 	 */
 	if ((nr < XFEATURE_YMM) ||
-	    (nr >= XFEATURE_MAX)) {
+	    (nr >= XFEATURE_MAX) ||
+	    (nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR)) {
 		WARN_ONCE(1, "no structure for xstate: %d\n", nr);
 		XSTATE_WARN_ON(1);
 	}
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 05/33] x86, pkeys: Add Kconfig option
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (3 preceding siblings ...)
  2016-02-12 21:01 ` [PATCH 04/33] x86, fpu: add placeholder for Processor Trace XSAVE state Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:16   ` [tip:mm/pkeys] x86/mm/pkeys: " tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 06/33] x86, pkeys: cpuid bit definition Dave Hansen
                   ` (28 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

I don't have a strong opinion on whether we need a Kconfig prompt
or not.  Protection Keys has relatively little code associated
with it, and it is not a heavyweight feature to keep enabled.
However, I can imagine that folks would still appreciate being
able to disable it.

Note that, with disabled-features.h, the checks in the code
for protection keys are always the same:

	cpu_has(c, X86_FEATURE_PKU)

With the config option disabled, this essentially turns into an
#ifdef.
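
As an example (a hypothetical caller; setup_pku() is made up),
the compiler can discard the whole block when the option is off,
since DISABLED_MASK_BIT_SET() turns cpu_has() into a constant 0:

	if (cpu_has(c, X86_FEATURE_PKU))
		setup_pku(c);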

We will hide the prompt for now.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/Kconfig |    4 ++++
 1 file changed, 4 insertions(+)

diff -puN arch/x86/Kconfig~pkeys-01-kconfig arch/x86/Kconfig
--- a/arch/x86/Kconfig~pkeys-01-kconfig	2016-02-12 10:44:15.909231872 -0800
+++ b/arch/x86/Kconfig	2016-02-12 10:44:15.913232055 -0800
@@ -1713,6 +1713,10 @@ config X86_INTEL_MPX
 
 	  If unsure, say N.
 
+config X86_INTEL_MEMORY_PROTECTION_KEYS
+	def_bool y
+	depends on CPU_SUP_INTEL && X86_64
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 06/33] x86, pkeys: cpuid bit definition
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (4 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 05/33] x86, pkeys: Add Kconfig option Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:17   ` [tip:mm/pkeys] x86/cpufeature, x86/mm/pkeys: Add protection keys related CPUID definitions tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 07/33] x86, pkeys: define new CR4 bit Dave Hansen
                   ` (27 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

There are two CPUID bits for protection keys.  One is for whether
the CPU contains the feature, and the other will appear set once
the OS enables protection keys.  Specifically:

	Bit 04: OSPKE. If 1, OS has set CR4.PKE to enable
	Protection keys (and the RDPKRU/WRPKRU instructions)

This is because userspace cannot see CR4 contents, but it can
see CPUID contents.
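
For instance, userspace could probe for it with something like
this fragment (illustrative only, not part of the patch):

	unsigned int eax = 0x7, ebx, ecx = 0, edx;

	asm volatile("cpuid"
		     : "+a" (eax), "=b" (ebx), "+c" (ecx), "=d" (edx));
	if (ecx & (1 << 4))	/* CPUID.(7,0):ECX.OSPKE */
		/* the OS has enabled protection keys */;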

X86_FEATURE_PKU is referred to as "PKU" in the hardware documentation:

	CPUID.(EAX=07H,ECX=0H):ECX.PKU [bit 3]

X86_FEATURE_OSPKE is "OSPKE":

	CPUID.(EAX=07H,ECX=0H):ECX.OSPKE [bit 4]

These are the first CPU features which need to look at the
ECX word in CPUID leaf 0x7, so this patch also includes
fetching that word into the cpuinfo->x86_capability[] array.

Add it to the disabled-features mask when its config option is
off.  Even though we are not using it here, we also extend the
REQUIRED_MASK_BIT_SET() macro to keep it mirroring the
DISABLED_MASK_BIT_SET() version.

This means that in almost all code, you should use:

	cpu_has(c, X86_FEATURE_PKU)

and *not* the CONFIG option.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/include/asm/cpufeature.h        |   61 +++++++++++++++++++----------
 b/arch/x86/include/asm/disabled-features.h |   15 +++++++
 b/arch/x86/include/asm/required-features.h |    7 +++
 b/arch/x86/kernel/cpu/common.c             |    1 
 4 files changed, 63 insertions(+), 21 deletions(-)

diff -puN arch/x86/include/asm/cpufeature.h~pkeys-01-cpuid arch/x86/include/asm/cpufeature.h
--- a/arch/x86/include/asm/cpufeature.h~pkeys-01-cpuid	2016-02-12 10:44:16.315250432 -0800
+++ b/arch/x86/include/asm/cpufeature.h	2016-02-12 10:44:16.323250798 -0800
@@ -12,7 +12,7 @@
 #include <asm/disabled-features.h>
 #endif
 
-#define NCAPINTS	16	/* N 32-bit words worth of info */
+#define NCAPINTS	17	/* N 32-bit words worth of info */
 #define NBUGINTS	1	/* N 32-bit bug flags */
 
 /*
@@ -269,6 +269,10 @@
 #define X86_FEATURE_PAUSEFILTER (15*32+10) /* filtered pause intercept */
 #define X86_FEATURE_PFTHRESHOLD (15*32+12) /* pause filter threshold */
 
+/* Intel-defined CPU features, CPUID level 0x00000007:0 (ecx), word 16 */
+#define X86_FEATURE_PKU		(16*32+ 3) /* Protection Keys for Userspace */
+#define X86_FEATURE_OSPKE	(16*32+ 4) /* OS Protection Keys Enable */
+
 /*
  * BUG word(s)
  */
@@ -307,6 +311,7 @@ enum cpuid_leafs
 	CPUID_8000_0008_EBX,
 	CPUID_6_EAX,
 	CPUID_8000_000A_EDX,
+	CPUID_7_ECX,
 };
 
 #ifdef CONFIG_X86_FEATURE_NAMES
@@ -329,28 +334,42 @@ extern const char * const x86_bug_flags[
 	 test_bit(bit, (unsigned long *)((c)->x86_capability))
 
 #define REQUIRED_MASK_BIT_SET(bit)					\
-	 ( (((bit)>>5)==0 && (1UL<<((bit)&31) & REQUIRED_MASK0)) ||	\
-	   (((bit)>>5)==1 && (1UL<<((bit)&31) & REQUIRED_MASK1)) ||	\
-	   (((bit)>>5)==2 && (1UL<<((bit)&31) & REQUIRED_MASK2)) ||	\
-	   (((bit)>>5)==3 && (1UL<<((bit)&31) & REQUIRED_MASK3)) ||	\
-	   (((bit)>>5)==4 && (1UL<<((bit)&31) & REQUIRED_MASK4)) ||	\
-	   (((bit)>>5)==5 && (1UL<<((bit)&31) & REQUIRED_MASK5)) ||	\
-	   (((bit)>>5)==6 && (1UL<<((bit)&31) & REQUIRED_MASK6)) ||	\
-	   (((bit)>>5)==7 && (1UL<<((bit)&31) & REQUIRED_MASK7)) ||	\
-	   (((bit)>>5)==8 && (1UL<<((bit)&31) & REQUIRED_MASK8)) ||	\
-	   (((bit)>>5)==9 && (1UL<<((bit)&31) & REQUIRED_MASK9)) )
+	 ( (((bit)>>5)==0  && (1UL<<((bit)&31) & REQUIRED_MASK0 )) ||	\
+	   (((bit)>>5)==1  && (1UL<<((bit)&31) & REQUIRED_MASK1 )) ||	\
+	   (((bit)>>5)==2  && (1UL<<((bit)&31) & REQUIRED_MASK2 )) ||	\
+	   (((bit)>>5)==3  && (1UL<<((bit)&31) & REQUIRED_MASK3 )) ||	\
+	   (((bit)>>5)==4  && (1UL<<((bit)&31) & REQUIRED_MASK4 )) ||	\
+	   (((bit)>>5)==5  && (1UL<<((bit)&31) & REQUIRED_MASK5 )) ||	\
+	   (((bit)>>5)==6  && (1UL<<((bit)&31) & REQUIRED_MASK6 )) ||	\
+	   (((bit)>>5)==7  && (1UL<<((bit)&31) & REQUIRED_MASK7 )) ||	\
+	   (((bit)>>5)==8  && (1UL<<((bit)&31) & REQUIRED_MASK8 )) ||	\
+	   (((bit)>>5)==9  && (1UL<<((bit)&31) & REQUIRED_MASK9 )) ||	\
+	   (((bit)>>5)==10 && (1UL<<((bit)&31) & REQUIRED_MASK10)) ||	\
+	   (((bit)>>5)==11 && (1UL<<((bit)&31) & REQUIRED_MASK11)) ||	\
+	   (((bit)>>5)==12 && (1UL<<((bit)&31) & REQUIRED_MASK12)) ||	\
+	   (((bit)>>5)==13 && (1UL<<((bit)&31) & REQUIRED_MASK13)) ||	\
+	   (((bit)>>5)==14 && (1UL<<((bit)&31) & REQUIRED_MASK14)) ||	\
+	   (((bit)>>5)==15 && (1UL<<((bit)&31) & REQUIRED_MASK15)) ||	\
+	   (((bit)>>5)==16 && (1UL<<((bit)&31) & REQUIRED_MASK16)) )
 
 #define DISABLED_MASK_BIT_SET(bit)					\
-	 ( (((bit)>>5)==0 && (1UL<<((bit)&31) & DISABLED_MASK0)) ||	\
-	   (((bit)>>5)==1 && (1UL<<((bit)&31) & DISABLED_MASK1)) ||	\
-	   (((bit)>>5)==2 && (1UL<<((bit)&31) & DISABLED_MASK2)) ||	\
-	   (((bit)>>5)==3 && (1UL<<((bit)&31) & DISABLED_MASK3)) ||	\
-	   (((bit)>>5)==4 && (1UL<<((bit)&31) & DISABLED_MASK4)) ||	\
-	   (((bit)>>5)==5 && (1UL<<((bit)&31) & DISABLED_MASK5)) ||	\
-	   (((bit)>>5)==6 && (1UL<<((bit)&31) & DISABLED_MASK6)) ||	\
-	   (((bit)>>5)==7 && (1UL<<((bit)&31) & DISABLED_MASK7)) ||	\
-	   (((bit)>>5)==8 && (1UL<<((bit)&31) & DISABLED_MASK8)) ||	\
-	   (((bit)>>5)==9 && (1UL<<((bit)&31) & DISABLED_MASK9)) )
+	 ( (((bit)>>5)==0  && (1UL<<((bit)&31) & DISABLED_MASK0 )) ||	\
+	   (((bit)>>5)==1  && (1UL<<((bit)&31) & DISABLED_MASK1 )) ||	\
+	   (((bit)>>5)==2  && (1UL<<((bit)&31) & DISABLED_MASK2 )) ||	\
+	   (((bit)>>5)==3  && (1UL<<((bit)&31) & DISABLED_MASK3 )) ||	\
+	   (((bit)>>5)==4  && (1UL<<((bit)&31) & DISABLED_MASK4 )) ||	\
+	   (((bit)>>5)==5  && (1UL<<((bit)&31) & DISABLED_MASK5 )) ||	\
+	   (((bit)>>5)==6  && (1UL<<((bit)&31) & DISABLED_MASK6 )) ||	\
+	   (((bit)>>5)==7  && (1UL<<((bit)&31) & DISABLED_MASK7 )) ||	\
+	   (((bit)>>5)==8  && (1UL<<((bit)&31) & DISABLED_MASK8 )) ||	\
+	   (((bit)>>5)==9  && (1UL<<((bit)&31) & DISABLED_MASK9 )) ||	\
+	   (((bit)>>5)==10 && (1UL<<((bit)&31) & DISABLED_MASK10)) ||	\
+	   (((bit)>>5)==11 && (1UL<<((bit)&31) & DISABLED_MASK11)) ||	\
+	   (((bit)>>5)==12 && (1UL<<((bit)&31) & DISABLED_MASK12)) ||	\
+	   (((bit)>>5)==13 && (1UL<<((bit)&31) & DISABLED_MASK13)) ||	\
+	   (((bit)>>5)==14 && (1UL<<((bit)&31) & DISABLED_MASK14)) ||	\
+	   (((bit)>>5)==15 && (1UL<<((bit)&31) & DISABLED_MASK15)) ||	\
+	   (((bit)>>5)==16 && (1UL<<((bit)&31) & DISABLED_MASK16)) )
 
 #define cpu_has(c, bit)							\
 	(__builtin_constant_p(bit) && REQUIRED_MASK_BIT_SET(bit) ? 1 :	\
diff -puN arch/x86/include/asm/disabled-features.h~pkeys-01-cpuid arch/x86/include/asm/disabled-features.h
--- a/arch/x86/include/asm/disabled-features.h~pkeys-01-cpuid	2016-02-12 10:44:16.317250524 -0800
+++ b/arch/x86/include/asm/disabled-features.h	2016-02-12 10:44:16.323250798 -0800
@@ -28,6 +28,14 @@
 # define DISABLE_CENTAUR_MCR	0
 #endif /* CONFIG_X86_64 */
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+# define DISABLE_PKU		0
+# define DISABLE_OSPKE		0
+#else
+# define DISABLE_PKU		(1<<(X86_FEATURE_PKU & 31))
+# define DISABLE_OSPKE		(1<<(X86_FEATURE_OSPKE & 31))
+#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -41,5 +49,12 @@
 #define DISABLED_MASK7	0
 #define DISABLED_MASK8	0
 #define DISABLED_MASK9	(DISABLE_MPX)
+#define DISABLED_MASK10	0
+#define DISABLED_MASK11	0
+#define DISABLED_MASK12	0
+#define DISABLED_MASK13	0
+#define DISABLED_MASK14	0
+#define DISABLED_MASK15	0
+#define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE)
 
 #endif /* _ASM_X86_DISABLED_FEATURES_H */
diff -puN arch/x86/include/asm/required-features.h~pkeys-01-cpuid arch/x86/include/asm/required-features.h
--- a/arch/x86/include/asm/required-features.h~pkeys-01-cpuid	2016-02-12 10:44:16.318250570 -0800
+++ b/arch/x86/include/asm/required-features.h	2016-02-12 10:44:16.324250844 -0800
@@ -92,5 +92,12 @@
 #define REQUIRED_MASK7	0
 #define REQUIRED_MASK8	0
 #define REQUIRED_MASK9	0
+#define REQUIRED_MASK10	0
+#define REQUIRED_MASK11	0
+#define REQUIRED_MASK12	0
+#define REQUIRED_MASK13	0
+#define REQUIRED_MASK14	0
+#define REQUIRED_MASK15	0
+#define REQUIRED_MASK16	0
 
 #endif /* _ASM_X86_REQUIRED_FEATURES_H */
diff -puN arch/x86/kernel/cpu/common.c~pkeys-01-cpuid arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~pkeys-01-cpuid	2016-02-12 10:44:16.320250661 -0800
+++ b/arch/x86/kernel/cpu/common.c	2016-02-12 10:44:16.324250844 -0800
@@ -611,6 +611,7 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
 		c->x86_capability[CPUID_7_0_EBX] = ebx;
 
 		c->x86_capability[CPUID_6_EAX] = cpuid_eax(0x00000006);
+		c->x86_capability[CPUID_7_ECX] = ecx;
 	}
 
 	/* Extended state features: level 0x0000000d */
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 07/33] x86, pkeys: define new CR4 bit
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (5 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 06/33] x86, pkeys: cpuid bit definition Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:17   ` [tip:mm/pkeys] x86/cpu, x86/mm/pkeys: Define " tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 08/33] x86, pkeys: add PKRU xsave fields and data structure(s) Dave Hansen
                   ` (26 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

There is a new bit in CR4 for enabling protection keys.  We
will actually enable it later in the series.
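
For context, the eventual enable boils down to something like
this sketch (assuming the CPUID checks have passed):

	cr4_set_bits(X86_CR4_PKE);	/* also makes CPUID.OSPKE read as 1 */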

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/include/uapi/asm/processor-flags.h |    2 ++
 1 file changed, 2 insertions(+)

diff -puN arch/x86/include/uapi/asm/processor-flags.h~pkeys-02-cr4 arch/x86/include/uapi/asm/processor-flags.h
--- a/arch/x86/include/uapi/asm/processor-flags.h~pkeys-02-cr4	2016-02-12 10:44:16.803272741 -0800
+++ b/arch/x86/include/uapi/asm/processor-flags.h	2016-02-12 10:44:16.807272924 -0800
@@ -118,6 +118,8 @@
 #define X86_CR4_SMEP		_BITUL(X86_CR4_SMEP_BIT)
 #define X86_CR4_SMAP_BIT	21 /* enable SMAP support */
 #define X86_CR4_SMAP		_BITUL(X86_CR4_SMAP_BIT)
+#define X86_CR4_PKE_BIT		22 /* enable Protection Keys support */
+#define X86_CR4_PKE		_BITUL(X86_CR4_PKE_BIT)
 
 /*
  * x86-64 Task Priority Register, CR8
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 08/33] x86, pkeys: add PKRU xsave fields and data structure(s)
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (6 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 07/33] x86, pkeys: define new CR4 bit Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:17   ` [tip:mm/pkeys] x86/fpu, x86/mm/pkeys: Add PKRU xsave fields and data structures tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 09/33] x86, pkeys: PTE bits for storing protection key Dave Hansen
                   ` (25 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The protection keys register (PKRU) is saved and restored using
xsave.  Define the data structure that we will use to access it
inside the xsave buffer.

Note that we also have to widen the printk of the xsave feature
masks since this is feature 0x200 and we only did two characters
before.
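
To show how the structure is meant to be used (a sketch only;
'xsave_buf' and 'pkru_offset' are assumed inputs, with the offset
coming from CPUID.(EAX=0xD,ECX=9):EBX for state component 9):

	struct pkru_state *pk = xsave_buf + pkru_offset;
	u32 pkru = pk->pkru;	/* the 32-bit register value */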

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/include/asm/fpu/types.h  |   11 +++++++++++
 b/arch/x86/include/asm/fpu/xstate.h |    3 ++-
 b/arch/x86/kernel/fpu/xstate.c      |    7 ++++++-
 3 files changed, 19 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/fpu/types.h~pkeys-03-xsave arch/x86/include/asm/fpu/types.h
--- a/arch/x86/include/asm/fpu/types.h~pkeys-03-xsave	2016-02-12 10:44:17.212291439 -0800
+++ b/arch/x86/include/asm/fpu/types.h	2016-02-12 10:44:17.218291713 -0800
@@ -109,6 +109,7 @@ enum xfeature {
 	XFEATURE_ZMM_Hi256,
 	XFEATURE_Hi16_ZMM,
 	XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
+	XFEATURE_PKRU,
 
 	XFEATURE_MAX,
 };
@@ -121,6 +122,7 @@ enum xfeature {
 #define XFEATURE_MASK_OPMASK		(1 << XFEATURE_OPMASK)
 #define XFEATURE_MASK_ZMM_Hi256		(1 << XFEATURE_ZMM_Hi256)
 #define XFEATURE_MASK_Hi16_ZMM		(1 << XFEATURE_Hi16_ZMM)
+#define XFEATURE_MASK_PKRU		(1 << XFEATURE_PKRU)
 
 #define XFEATURE_MASK_FPSSE		(XFEATURE_MASK_FP | XFEATURE_MASK_SSE)
 #define XFEATURE_MASK_AVX512		(XFEATURE_MASK_OPMASK \
@@ -213,6 +215,15 @@ struct avx_512_hi16_state {
 	struct reg_512_bit		hi16_zmm[16];
 } __packed;
 
+/*
+ * State component 9: 32-bit PKRU register.  The state is
+ * 8 bytes long but only 4 bytes is used currently.
+ */
+struct pkru_state {
+	u32				pkru;
+	u32				pad;
+} __packed;
+
 struct xstate_header {
 	u64				xfeatures;
 	u64				xcomp_bv;
diff -puN arch/x86/include/asm/fpu/xstate.h~pkeys-03-xsave arch/x86/include/asm/fpu/xstate.h
--- a/arch/x86/include/asm/fpu/xstate.h~pkeys-03-xsave	2016-02-12 10:44:17.213291484 -0800
+++ b/arch/x86/include/asm/fpu/xstate.h	2016-02-12 10:44:17.218291713 -0800
@@ -28,7 +28,8 @@
 				 XFEATURE_MASK_YMM | \
 				 XFEATURE_MASK_OPMASK | \
 				 XFEATURE_MASK_ZMM_Hi256 | \
-				 XFEATURE_MASK_Hi16_ZMM)
+				 XFEATURE_MASK_Hi16_ZMM	 | \
+				 XFEATURE_MASK_PKRU)
 
 /* All currently supported features */
 #define XCNTXT_MASK	(XFEATURE_MASK_LAZY | XFEATURE_MASK_EAGER)
diff -puN arch/x86/kernel/fpu/xstate.c~pkeys-03-xsave arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~pkeys-03-xsave	2016-02-12 10:44:17.215291576 -0800
+++ b/arch/x86/kernel/fpu/xstate.c	2016-02-12 10:44:17.219291759 -0800
@@ -29,6 +29,8 @@ static const char *xfeature_names[] =
 	"AVX-512 Hi256"			,
 	"AVX-512 ZMM_Hi256"		,
 	"Processor Trace (unused)"	,
+	"Protection Keys User registers",
+	"unknown xstate feature"	,
 };
 
 /*
@@ -58,6 +60,7 @@ void fpu__xstate_clear_all_cpu_caps(void
 	setup_clear_cpu_cap(X86_FEATURE_AVX512CD);
 	setup_clear_cpu_cap(X86_FEATURE_MPX);
 	setup_clear_cpu_cap(X86_FEATURE_XGETBV1);
+	setup_clear_cpu_cap(X86_FEATURE_PKU);
 }
 
 /*
@@ -236,7 +239,7 @@ static void __init print_xstate_feature(
 	const char *feature_name;
 
 	if (cpu_has_xfeatures(xstate_mask, &feature_name))
-		pr_info("x86/fpu: Supporting XSAVE feature 0x%02Lx: '%s'\n", xstate_mask, feature_name);
+		pr_info("x86/fpu: Supporting XSAVE feature 0x%03Lx: '%s'\n", xstate_mask, feature_name);
 }
 
 /*
@@ -252,6 +255,7 @@ static void __init print_xstate_features
 	print_xstate_feature(XFEATURE_MASK_OPMASK);
 	print_xstate_feature(XFEATURE_MASK_ZMM_Hi256);
 	print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
+	print_xstate_feature(XFEATURE_MASK_PKRU);
 }
 
 /*
@@ -468,6 +472,7 @@ static void check_xstate_against_struct(
 	XCHECK_SZ(sz, nr, XFEATURE_OPMASK,    struct avx_512_opmask_state);
 	XCHECK_SZ(sz, nr, XFEATURE_ZMM_Hi256, struct avx_512_zmm_uppers_state);
 	XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM,  struct avx_512_hi16_state);
+	XCHECK_SZ(sz, nr, XFEATURE_PKRU,      struct pkru_state);
 
 	/*
 	 * Make *SURE* to add any feature numbers in below if
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 09/33] x86, pkeys: PTE bits for storing protection key
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (7 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 08/33] x86, pkeys: add PKRU xsave fields and data structure(s) Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:18   ` [tip:mm/pkeys] x86/mm/pkeys: Add " tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 10/33] x86, pkeys: new page fault error code bit: PF_PK Dave Hansen
                   ` (24 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Previous documentation has referred to these 4 bits as "ignored".
That means that software could have made use of them.  But, as
far as I know, the kernel never used them.

They are still ignored when protection keys is not enabled, so
they could theoretically still get used for software purposes.

We also implement "empty" versions so that code that references
them can be optimized away by the compiler when the config
option is not enabled.
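
As an illustration of the layout (a sketch, not part of this
patch; the helper name is made up), pulling the 4-bit key back
out of a pte value would look like:

	static inline u16 pteval_pkey_sketch(pteval_t val)
	{
		/* the key occupies pte bits 59..62 */
		return (val >> _PAGE_BIT_PKEY_BIT0) & 0xf;
	}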

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/include/asm/pgtable_types.h |   22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff -puN arch/x86/include/asm/pgtable_types.h~pkeys-04-ptebits arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~pkeys-04-ptebits	2016-02-12 10:44:17.672312467 -0800
+++ b/arch/x86/include/asm/pgtable_types.h	2016-02-12 10:44:17.675312604 -0800
@@ -20,13 +20,18 @@
 #define _PAGE_BIT_SOFTW2	10	/* " */
 #define _PAGE_BIT_SOFTW3	11	/* " */
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
+#define _PAGE_BIT_SOFTW4	58	/* available for programmer */
+#define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
+#define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
+#define _PAGE_BIT_PKEY_BIT2	61	/* Protection Keys, bit 3/4 */
+#define _PAGE_BIT_PKEY_BIT3	62	/* Protection Keys, bit 4/4 */
+#define _PAGE_BIT_NX		63	/* No execute: only valid after cpuid check */
+
 #define _PAGE_BIT_SPECIAL	_PAGE_BIT_SOFTW1
 #define _PAGE_BIT_CPA_TEST	_PAGE_BIT_SOFTW1
 #define _PAGE_BIT_HIDDEN	_PAGE_BIT_SOFTW3 /* hidden by kmemcheck */
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
-#define _PAGE_BIT_SOFTW4	58	/* available for programmer */
-#define _PAGE_BIT_DEVMAP		_PAGE_BIT_SOFTW4
-#define _PAGE_BIT_NX		63	/* No execute: only valid after cpuid check */
+#define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
 
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
 /* - if the user mapped it with PROT_NONE; pte_present gives true */
@@ -47,6 +52,17 @@
 #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
 #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
 #define _PAGE_CPA_TEST	(_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+#define _PAGE_PKEY_BIT0	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT0)
+#define _PAGE_PKEY_BIT1	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT1)
+#define _PAGE_PKEY_BIT2	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT2)
+#define _PAGE_PKEY_BIT3	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT3)
+#else
+#define _PAGE_PKEY_BIT0	(_AT(pteval_t, 0))
+#define _PAGE_PKEY_BIT1	(_AT(pteval_t, 0))
+#define _PAGE_PKEY_BIT2	(_AT(pteval_t, 0))
+#define _PAGE_PKEY_BIT3	(_AT(pteval_t, 0))
+#endif
 #define __HAVE_ARCH_PTE_SPECIAL
 
 #ifdef CONFIG_KMEMCHECK
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 10/33] x86, pkeys: new page fault error code bit: PF_PK
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (8 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 09/33] x86, pkeys: PTE bits for storing protection key Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:18   ` [tip:mm/pkeys] x86/mm/pkeys: Add new 'PF_PK' page fault error code bit tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 11/33] x86, pkeys: store protection in high VMA flags Dave Hansen
                   ` (23 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Note: "PK" is how the Intel SDM refers to this bit, so we also
use that nomenclature.

This only defines the bit; it does not plumb it anywhere to be
handled.
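
A later patch consumes it in the fault handler, roughly along
these lines (a sketch; SEGV_PKUERR arrives later in the series):

	if (error_code & PF_PK)
		si_code = SEGV_PKUERR;	/* blocked by a protection key,
					 * not the normal permission bits */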

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/mm/fault.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff -puN arch/x86/mm/fault.c~pkeys-05-pfec arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-05-pfec	2016-02-12 10:44:18.080331119 -0800
+++ b/arch/x86/mm/fault.c	2016-02-12 10:44:18.084331302 -0800
@@ -33,6 +33,7 @@
  *   bit 2 ==	 0: kernel-mode access	1: user-mode access
  *   bit 3 ==				1: use of reserved bit detected
  *   bit 4 ==				1: fault was an instruction fetch
+ *   bit 5 ==				1: protection keys block access
  */
 enum x86_pf_error_code {
 
@@ -41,6 +42,7 @@ enum x86_pf_error_code {
 	PF_USER		=		1 << 2,
 	PF_RSVD		=		1 << 3,
 	PF_INSTR	=		1 << 4,
+	PF_PK		=		1 << 5,
 };
 
 /*
@@ -916,6 +918,12 @@ static int spurious_fault_check(unsigned
 
 	if ((error_code & PF_INSTR) && !pte_exec(*pte))
 		return 0;
+	/*
+	 * Note: We do not do lazy flushing on protection key
+	 * changes, so no spurious fault will ever set PF_PK.
+	 */
+	if ((error_code & PF_PK))
+		return 1;
 
 	return 1;
 }
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 11/33] x86, pkeys: store protection in high VMA flags
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (9 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 10/33] x86, pkeys: new page fault error code bit: PF_PK Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:19   ` [tip:mm/pkeys] mm/core, x86/mm/pkeys: Store protection bits " tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 12/33] x86, pkeys: arch-specific protection bits Dave Hansen
                   ` (22 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

vma->vm_flags is an 'unsigned long', so has space for 32 flags
on 32-bit architectures.  The high 32 bits are unused on 64-bit
platforms.  We've steered away from using the unused high VMA
bits for things because we would have difficulty supporting it
on 32-bit.

Protection Keys are not available in 32-bit mode, so there is
no concern about supporting this feature in 32-bit mode or on
32-bit CPUs.

This patch carves out 4 bits from the high half of
vma->vm_flags and allows architectures to set a config option
to make them available.

Sparse complains about these constants unless we explicitly
call them "UL".

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/Kconfig   |    1 +
 b/include/linux/mm.h |   11 +++++++++++
 b/mm/Kconfig         |    3 +++
 3 files changed, 15 insertions(+)

diff -puN arch/x86/Kconfig~pkeys-06-eat-high-vma-flags arch/x86/Kconfig
--- a/arch/x86/Kconfig~pkeys-06-eat-high-vma-flags	2016-02-12 10:44:18.494350045 -0800
+++ b/arch/x86/Kconfig	2016-02-12 10:44:18.502350411 -0800
@@ -155,6 +155,7 @@ config X86
 	select VIRT_TO_BUS
 	select X86_DEV_DMA_OPS			if X86_64
 	select X86_FEATURE_NAMES		if PROC_FS
+	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
 
 config INSTRUCTION_DECODER
 	def_bool y
diff -puN include/linux/mm.h~pkeys-06-eat-high-vma-flags include/linux/mm.h
--- a/include/linux/mm.h~pkeys-06-eat-high-vma-flags	2016-02-12 10:44:18.496350136 -0800
+++ b/include/linux/mm.h	2016-02-12 10:44:18.503350456 -0800
@@ -170,6 +170,17 @@ extern unsigned int kobjsize(const void
 #define VM_NOHUGEPAGE	0x40000000	/* MADV_NOHUGEPAGE marked this vma */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
 
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+#define VM_HIGH_ARCH_BIT_0	32	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_1	33	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
+#define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
+#define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
+#define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
+#endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
+
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
 #elif defined(CONFIG_PPC)
diff -puN mm/Kconfig~pkeys-06-eat-high-vma-flags mm/Kconfig
--- a/mm/Kconfig~pkeys-06-eat-high-vma-flags	2016-02-12 10:44:18.498350228 -0800
+++ b/mm/Kconfig	2016-02-12 10:44:18.503350456 -0800
@@ -669,3 +669,6 @@ config ZONE_DEVICE
 
 config FRAME_VECTOR
 	bool
+
+config ARCH_USES_HIGH_VMA_FLAGS
+	bool
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 12/33] x86, pkeys: arch-specific protection bits
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (10 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 11/33] x86, pkeys: store protection in high VMA flags Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:19   ` [tip:mm/pkeys] x86/mm/pkeys: Add arch-specific VMA " tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 13/33] x86, pkeys: pass VMA down in to fault signal generation code Dave Hansen
                   ` (21 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Lots of things seem to do:

        vma->vm_page_prot = vm_get_page_prot(flags);

and the ptes get created right from things we pull out
of ->vm_page_prot.  So it is very convenient if we can
store the protection key in flags and vm_page_prot, just
like the existing permission bits (_PAGE_RW/PRESENT).  It
greatly reduces the amount of plumbing and arch-specific
hacking we have to do in generic code.

This also takes the new PROT_PKEY{0,1,2,3} flags and
turns *those* into VM_ flags for vma->vm_flags.

The protection key values are stored in 4 places:
	1. "prot" argument to system calls
	2. vma->vm_flags, filled from the mmap "prot"
	3. vma->vm_page_prot, filled from vma->vm_flags
	4. the PTE itself.

The pseudocode for these four steps is as follows:

	mmap(PROT_PKEY*)
	vma->vm_flags 	  = ... | arch_calc_vm_prot_bits(mmap_prot);
	vma->vm_page_prot = ... | arch_vm_get_page_prot(vma->vm_flags);
	pte = pfn | vma->vm_page_prot

Note that this provides a new definition for x86:

	arch_vm_get_page_prot()
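
A worked example (illustrative, not from the patch): for pkey 6
(binary 0110), the flow above produces:

	pkey:      6 (key bits 1 and 2 set)
	vm_flags:  VM_PKEY_BIT1 | VM_PKEY_BIT2       (bits 33 and 34)
	pte:       _PAGE_PKEY_BIT1 | _PAGE_PKEY_BIT2 (bits 60 and 61)

and vma_pkey() recovers the value 6 from vma->vm_flags.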

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/include/asm/mmu_context.h   |   11 +++++++++++
 b/arch/x86/include/asm/pgtable_types.h |   12 ++++++++++--
 b/arch/x86/include/uapi/asm/mman.h     |   16 ++++++++++++++++
 b/include/linux/mm.h                   |    7 +++++++
 4 files changed, 44 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/mmu_context.h~pkeys-07-store-pkey-in-vma arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~pkeys-07-store-pkey-in-vma	2016-02-12 10:44:18.956371165 -0800
+++ b/arch/x86/include/asm/mmu_context.h	2016-02-12 10:44:18.964371531 -0800
@@ -275,4 +275,15 @@ static inline void arch_unmap(struct mm_
 		mpx_notify_unmap(mm, vma, start, end);
 }
 
+static inline int vma_pkey(struct vm_area_struct *vma)
+{
+	u16 pkey = 0;
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+	unsigned long vma_pkey_mask = VM_PKEY_BIT0 | VM_PKEY_BIT1 |
+				      VM_PKEY_BIT2 | VM_PKEY_BIT3;
+	pkey = (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
+#endif
+	return pkey;
+}
+
 #endif /* _ASM_X86_MMU_CONTEXT_H */
diff -puN arch/x86/include/asm/pgtable_types.h~pkeys-07-store-pkey-in-vma arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~pkeys-07-store-pkey-in-vma	2016-02-12 10:44:18.957371211 -0800
+++ b/arch/x86/include/asm/pgtable_types.h	2016-02-12 10:44:18.964371531 -0800
@@ -115,7 +115,12 @@
 #define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED |	\
 			 _PAGE_DIRTY)
 
-/* Set of bits not changed in pte_modify */
+/*
+ * Set of bits not changed in pte_modify.  The pte's
+ * protection key is treated like _PAGE_RW, for
+ * instance, and is *not* included in this mask since
+ * pte_modify() does modify it.
+ */
 #define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
 			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
 			 _PAGE_SOFT_DIRTY)
@@ -231,7 +236,10 @@ enum page_cache_mode {
 /* Extracts the PFN from a (pte|pmd|pud|pgd)val_t of a 4KB page */
 #define PTE_PFN_MASK		((pteval_t)PHYSICAL_PAGE_MASK)
 
-/* Extracts the flags from a (pte|pmd|pud|pgd)val_t of a 4KB page */
+/*
+ *  Extracts the flags from a (pte|pmd|pud|pgd)val_t
+ *  This includes the protection key value.
+ */
 #define PTE_FLAGS_MASK		(~PTE_PFN_MASK)
 
 typedef struct pgprot { pgprotval_t pgprot; } pgprot_t;
diff -puN arch/x86/include/uapi/asm/mman.h~pkeys-07-store-pkey-in-vma arch/x86/include/uapi/asm/mman.h
--- a/arch/x86/include/uapi/asm/mman.h~pkeys-07-store-pkey-in-vma	2016-02-12 10:44:18.959371302 -0800
+++ b/arch/x86/include/uapi/asm/mman.h	2016-02-12 10:44:18.965371577 -0800
@@ -6,6 +6,22 @@
 #define MAP_HUGE_2MB    (21 << MAP_HUGE_SHIFT)
 #define MAP_HUGE_1GB    (30 << MAP_HUGE_SHIFT)
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+/*
+ * Take the 4 protection key bits out of the vma->vm_flags
+ * value and turn them in to the bits that we can put in
+ * to a pte.
+ *
+ * Only override these if Protection Keys are available
+ * (which is only on 64-bit).
+ */
+#define arch_vm_get_page_prot(vm_flags)	__pgprot(	\
+		((vm_flags) & VM_PKEY_BIT0 ? _PAGE_PKEY_BIT0 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT1 ? _PAGE_PKEY_BIT1 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT2 ? _PAGE_PKEY_BIT2 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))
+#endif
+
 #include <asm-generic/mman.h>
 
 #endif /* _ASM_X86_MMAN_H */
diff -puN include/linux/mm.h~pkeys-07-store-pkey-in-vma include/linux/mm.h
--- a/include/linux/mm.h~pkeys-07-store-pkey-in-vma	2016-02-12 10:44:18.961371394 -0800
+++ b/include/linux/mm.h	2016-02-12 10:44:18.965371577 -0800
@@ -183,6 +183,13 @@ extern unsigned int kobjsize(const void
 
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
+#if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)
+# define VM_PKEY_SHIFT	VM_HIGH_ARCH_BIT_0
+# define VM_PKEY_BIT0	VM_HIGH_ARCH_0	/* A protection key is a 4-bit value */
+# define VM_PKEY_BIT1	VM_HIGH_ARCH_1
+# define VM_PKEY_BIT2	VM_HIGH_ARCH_2
+# define VM_PKEY_BIT3	VM_HIGH_ARCH_3
+#endif
 #elif defined(CONFIG_PPC)
 # define VM_SAO		VM_ARCH_1	/* Strong Access Ordering (powerpc) */
 #elif defined(CONFIG_PARISC)
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 13/33] x86, pkeys: pass VMA down in to fault signal generation code
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (11 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 12/33] x86, pkeys: arch-specific protection bits Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:19   ` [tip:mm/pkeys] x86/mm/pkeys: Pass " tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 14/33] signals, pkeys: notify userspace about protection key faults Dave Hansen
                   ` (20 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

During a page fault, we look up the VMA to ensure that the fault
is in a region with a valid mapping.  But, in the top-level page
fault code we don't need the VMA for much else.  Once we have
decided that an access is bad, we are going to send a signal no
matter what and do not need the VMA any more.  So we do not pass
it down into the signal generation code.

But, for protection keys, we need the VMA.  It tells us *which*
protection key we violated if we get a PF_PK.  So, we need to
pass the VMA down and fill in siginfo->si_pkey.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/mm/fault.c |   50 ++++++++++++++++++++++++++++----------------------
 1 file changed, 28 insertions(+), 22 deletions(-)

diff -puN arch/x86/mm/fault.c~pkeys-08-pass-down-vma arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-08-pass-down-vma	2016-02-12 10:44:19.441393337 -0800
+++ b/arch/x86/mm/fault.c	2016-02-12 10:44:19.445393520 -0800
@@ -171,7 +171,8 @@ is_prefetch(struct pt_regs *regs, unsign
 
 static void
 force_sig_info_fault(int si_signo, int si_code, unsigned long address,
-		     struct task_struct *tsk, int fault)
+		     struct task_struct *tsk, struct vm_area_struct *vma,
+		     int fault)
 {
 	unsigned lsb = 0;
 	siginfo_t info;
@@ -656,6 +657,8 @@ no_context(struct pt_regs *regs, unsigne
 	struct task_struct *tsk = current;
 	unsigned long flags;
 	int sig;
+	/* No context means no VMA to pass down */
+	struct vm_area_struct *vma = NULL;
 
 	/* Are we prepared to handle this kernel fault? */
 	if (fixup_exception(regs)) {
@@ -679,7 +682,8 @@ no_context(struct pt_regs *regs, unsigne
 			tsk->thread.cr2 = address;
 
 			/* XXX: hwpoison faults will set the wrong code. */
-			force_sig_info_fault(signal, si_code, address, tsk, 0);
+			force_sig_info_fault(signal, si_code, address,
+					     tsk, vma, 0);
 		}
 
 		/*
@@ -756,7 +760,8 @@ show_signal_msg(struct pt_regs *regs, un
 
 static void
 __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
-		       unsigned long address, int si_code)
+		       unsigned long address, struct vm_area_struct *vma,
+		       int si_code)
 {
 	struct task_struct *tsk = current;
 
@@ -799,7 +804,7 @@ __bad_area_nosemaphore(struct pt_regs *r
 		tsk->thread.error_code	= error_code;
 		tsk->thread.trap_nr	= X86_TRAP_PF;
 
-		force_sig_info_fault(SIGSEGV, si_code, address, tsk, 0);
+		force_sig_info_fault(SIGSEGV, si_code, address, tsk, vma, 0);
 
 		return;
 	}
@@ -812,14 +817,14 @@ __bad_area_nosemaphore(struct pt_regs *r
 
 static noinline void
 bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
-		     unsigned long address)
+		     unsigned long address, struct vm_area_struct *vma)
 {
-	__bad_area_nosemaphore(regs, error_code, address, SEGV_MAPERR);
+	__bad_area_nosemaphore(regs, error_code, address, vma, SEGV_MAPERR);
 }
 
 static void
 __bad_area(struct pt_regs *regs, unsigned long error_code,
-	   unsigned long address, int si_code)
+	   unsigned long address,  struct vm_area_struct *vma, int si_code)
 {
 	struct mm_struct *mm = current->mm;
 
@@ -829,25 +834,25 @@ __bad_area(struct pt_regs *regs, unsigne
 	 */
 	up_read(&mm->mmap_sem);
 
-	__bad_area_nosemaphore(regs, error_code, address, si_code);
+	__bad_area_nosemaphore(regs, error_code, address, vma, si_code);
 }
 
 static noinline void
 bad_area(struct pt_regs *regs, unsigned long error_code, unsigned long address)
 {
-	__bad_area(regs, error_code, address, SEGV_MAPERR);
+	__bad_area(regs, error_code, address, NULL, SEGV_MAPERR);
 }
 
 static noinline void
 bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
-		      unsigned long address)
+		      unsigned long address, struct vm_area_struct *vma)
 {
-	__bad_area(regs, error_code, address, SEGV_ACCERR);
+	__bad_area(regs, error_code, address, vma, SEGV_ACCERR);
 }
 
 static void
 do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
-	  unsigned int fault)
+	  struct vm_area_struct *vma, unsigned int fault)
 {
 	struct task_struct *tsk = current;
 	int code = BUS_ADRERR;
@@ -874,12 +879,13 @@ do_sigbus(struct pt_regs *regs, unsigned
 		code = BUS_MCEERR_AR;
 	}
 #endif
-	force_sig_info_fault(SIGBUS, code, address, tsk, fault);
+	force_sig_info_fault(SIGBUS, code, address, tsk, vma, fault);
 }
 
 static noinline void
 mm_fault_error(struct pt_regs *regs, unsigned long error_code,
-	       unsigned long address, unsigned int fault)
+	       unsigned long address, struct vm_area_struct *vma,
+	       unsigned int fault)
 {
 	if (fatal_signal_pending(current) && !(error_code & PF_USER)) {
 		no_context(regs, error_code, address, 0, 0);
@@ -903,9 +909,9 @@ mm_fault_error(struct pt_regs *regs, uns
 	} else {
 		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
 			     VM_FAULT_HWPOISON_LARGE))
-			do_sigbus(regs, error_code, address, fault);
+			do_sigbus(regs, error_code, address, vma, fault);
 		else if (fault & VM_FAULT_SIGSEGV)
-			bad_area_nosemaphore(regs, error_code, address);
+			bad_area_nosemaphore(regs, error_code, address, vma);
 		else
 			BUG();
 	}
@@ -1119,7 +1125,7 @@ __do_page_fault(struct pt_regs *regs, un
 		 * Don't take the mm semaphore here. If we fixup a prefetch
 		 * fault we could otherwise deadlock:
 		 */
-		bad_area_nosemaphore(regs, error_code, address);
+		bad_area_nosemaphore(regs, error_code, address, NULL);
 
 		return;
 	}
@@ -1132,7 +1138,7 @@ __do_page_fault(struct pt_regs *regs, un
 		pgtable_bad(regs, error_code, address);
 
 	if (unlikely(smap_violation(error_code, regs))) {
-		bad_area_nosemaphore(regs, error_code, address);
+		bad_area_nosemaphore(regs, error_code, address, NULL);
 		return;
 	}
 
@@ -1141,7 +1147,7 @@ __do_page_fault(struct pt_regs *regs, un
 	 * in a region with pagefaults disabled then we must not take the fault
 	 */
 	if (unlikely(faulthandler_disabled() || !mm)) {
-		bad_area_nosemaphore(regs, error_code, address);
+		bad_area_nosemaphore(regs, error_code, address, NULL);
 		return;
 	}
 
@@ -1185,7 +1191,7 @@ __do_page_fault(struct pt_regs *regs, un
 	if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
 		if ((error_code & PF_USER) == 0 &&
 		    !search_exception_tables(regs->ip)) {
-			bad_area_nosemaphore(regs, error_code, address);
+			bad_area_nosemaphore(regs, error_code, address, NULL);
 			return;
 		}
 retry:
@@ -1233,7 +1239,7 @@ retry:
 	 */
 good_area:
 	if (unlikely(access_error(error_code, vma))) {
-		bad_area_access_error(regs, error_code, address);
+		bad_area_access_error(regs, error_code, address, vma);
 		return;
 	}
 
@@ -1271,7 +1277,7 @@ good_area:
 
 	up_read(&mm->mmap_sem);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
-		mm_fault_error(regs, error_code, address, fault);
+		mm_fault_error(regs, error_code, address, vma, fault);
 		return;
 	}
 
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 14/33] signals, pkeys: notify userspace about protection key faults
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (12 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 13/33] x86, pkeys: pass VMA down in to fault signal generation code Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:20   ` [tip:mm/pkeys] signals, pkeys: Notify " tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 15/33] x86, pkeys: fill in pkey field in siginfo Dave Hansen
                   ` (19 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

A protection key fault is very similar to any other access error.
There must be a VMA, etc...  We even want to take the same action
(SIGSEGV) that we do with a normal access fault.

However, we do need to let userspace know that something is
different.  We do this the same way we did with SEGV_BNDERR
with Memory Protection eXtensions (MPX): define a new SEGV code:
SEGV_PKUERR.

We add a siginfo field: si_pkey that reveals to userspace which
protection key was set on the PTE that we faulted on.  There is
no other easy way for userspace to figure this out.  They could
parse smaps but that would be a bit cruel.

We share space in siginfo with _addr_bnd.  #BR faults from
MPX are completely separate from page faults (#PF) that trigger
from protection key violations, so we never need both at the same
time.

Note that _pkey is a 64-bit value.  The current hardware only
supports 4-bit protection keys.  We do this because there is
_plenty_ of space in _sigfault and it is possible that future
processors would support more than 4 bits of protection keys.

The x86 code to actually fill in the siginfo is in the next
patch.
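
For illustration, a userspace consumer might look like this
sketch (hypothetical; it assumes a libc exposing the new si_pkey
and SEGV_PKUERR definitions, and is installed with sigaction()
using SA_SIGINFO):

	static void segv_handler(int sig, siginfo_t *si, void *ctx)
	{
		if (si->si_code == SEGV_PKUERR)
			fprintf(stderr, "pkey %llu blocked access to %p\n",
				(unsigned long long)si->si_pkey, si->si_addr);
	}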

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/include/uapi/asm-generic/siginfo.h |   17 ++++++++++++-----
 b/kernel/signal.c                    |    4 ++++
 2 files changed, 16 insertions(+), 5 deletions(-)

diff -puN include/uapi/asm-generic/siginfo.h~pkeys-09-siginfo-core include/uapi/asm-generic/siginfo.h
--- a/include/uapi/asm-generic/siginfo.h~pkeys-09-siginfo-core	2016-02-12 10:44:19.856412308 -0800
+++ b/include/uapi/asm-generic/siginfo.h	2016-02-12 10:44:19.861412537 -0800
@@ -91,10 +91,15 @@ typedef struct siginfo {
 			int _trapno;	/* TRAP # which caused the signal */
 #endif
 			short _addr_lsb; /* LSB of the reported address */
-			struct {
-				void __user *_lower;
-				void __user *_upper;
-			} _addr_bnd;
+			union {
+				/* used when si_code=SEGV_BNDERR */
+				struct {
+					void __user *_lower;
+					void __user *_upper;
+				} _addr_bnd;
+				/* used when si_code=SEGV_PKUERR */
+				u64 _pkey;
+			};
 		} _sigfault;
 
 		/* SIGPOLL */
@@ -137,6 +142,7 @@ typedef struct siginfo {
 #define si_addr_lsb	_sifields._sigfault._addr_lsb
 #define si_lower	_sifields._sigfault._addr_bnd._lower
 #define si_upper	_sifields._sigfault._addr_bnd._upper
+#define si_pkey		_sifields._sigfault._pkey
 #define si_band		_sifields._sigpoll._band
 #define si_fd		_sifields._sigpoll._fd
 #ifdef __ARCH_SIGSYS
@@ -206,7 +212,8 @@ typedef struct siginfo {
 #define SEGV_MAPERR	(__SI_FAULT|1)	/* address not mapped to object */
 #define SEGV_ACCERR	(__SI_FAULT|2)	/* invalid permissions for mapped object */
 #define SEGV_BNDERR	(__SI_FAULT|3)  /* failed address bound checks */
-#define NSIGSEGV	3
+#define SEGV_PKUERR	(__SI_FAULT|4)  /* failed protection key checks */
+#define NSIGSEGV	4
 
 /*
  * SIGBUS si_codes
diff -puN kernel/signal.c~pkeys-09-siginfo-core kernel/signal.c
--- a/kernel/signal.c~pkeys-09-siginfo-core	2016-02-12 10:44:19.857412354 -0800
+++ b/kernel/signal.c	2016-02-12 10:44:19.862412583 -0800
@@ -2709,6 +2709,10 @@ int copy_siginfo_to_user(siginfo_t __use
 			err |= __put_user(from->si_upper, &to->si_upper);
 		}
 #endif
+#ifdef SEGV_PKUERR
+		if (from->si_signo == SIGSEGV && from->si_code == SEGV_PKUERR)
+			err |= __put_user(from->si_pkey, &to->si_pkey);
+#endif
 		break;
 	case __SI_CHLD:
 		err |= __put_user(from->si_pid, &to->si_pid);
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 15/33] x86, pkeys: fill in pkey field in siginfo
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (13 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 14/33] signals, pkeys: notify userspace about protection key faults Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:20   ` [tip:mm/pkeys] x86/mm/pkeys: Fill " tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 16/33] x86, pkeys: add functions to fetch PKRU Dave Hansen
                   ` (18 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

This fills in the new siginfo field: si_pkey to indicate to
userspace which protection key was set on the PTE that we faulted
on.

Note though that *ALL* protection key faults have to be generated
by a valid, present PTE at some point.  But this code does no PTE
lookups, which seems odd.  The reason is that we take advantage of
the way we generate PTEs from VMAs.  All PTEs under a VMA share
some attributes.  For instance, they are _all_ either PROT_READ
*OR* PROT_NONE.  They also always share a protection key, so we
never have to walk the page tables; we just use the VMA.

Note that _pkey is a 64-bit value.  The current hardware only
supports 4-bit protection keys.  We do this because there is
_plenty_ of space in _sigfault and it is possible that future
processors would support more than 4 bits of protection keys.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/include/asm/pgtable_types.h |    5 ++
 b/arch/x86/mm/fault.c                  |   64 ++++++++++++++++++++++++++++++++-
 2 files changed, 68 insertions(+), 1 deletion(-)

diff -puN arch/x86/include/asm/pgtable_types.h~pkeys-09-siginfo-x86 arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~pkeys-09-siginfo-x86	2016-02-12 10:44:20.290432149 -0800
+++ b/arch/x86/include/asm/pgtable_types.h	2016-02-12 10:44:20.295432377 -0800
@@ -65,6 +65,11 @@
 #endif
 #define __HAVE_ARCH_PTE_SPECIAL
 
+#define _PAGE_PKEY_MASK (_PAGE_PKEY_BIT0 | \
+			 _PAGE_PKEY_BIT1 | \
+			 _PAGE_PKEY_BIT2 | \
+			 _PAGE_PKEY_BIT3)
+
 #ifdef CONFIG_KMEMCHECK
 #define _PAGE_HIDDEN	(_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN)
 #else
diff -puN arch/x86/mm/fault.c~pkeys-09-siginfo-x86 arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-09-siginfo-x86	2016-02-12 10:44:20.292432240 -0800
+++ b/arch/x86/mm/fault.c	2016-02-12 10:44:20.296432423 -0800
@@ -15,12 +15,14 @@
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
 
+#include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
 #include <asm/pgalloc.h>		/* pgd_*(), ...			*/
 #include <asm/kmemcheck.h>		/* kmemcheck_*(), ...		*/
 #include <asm/fixmap.h>			/* VSYSCALL_ADDR		*/
 #include <asm/vsyscall.h>		/* emulate_vsyscall		*/
 #include <asm/vm86.h>			/* struct vm86			*/
+#include <asm/mmu_context.h>		/* vma_pkey()			*/
 
 #define CREATE_TRACE_POINTS
 #include <asm/trace/exceptions.h>
@@ -169,6 +171,56 @@ is_prefetch(struct pt_regs *regs, unsign
 	return prefetch;
 }
 
+/*
+ * A protection key fault means that the PKRU value did not allow
+ * access to some PTE.  Userspace can figure out what PKRU was
+ * from the XSAVE state, and this function fills out a field in
+ * siginfo so userspace can discover which protection key was set
+ * on the PTE.
+ *
+ * If we get here, we know that the hardware signaled a PF_PK
+ * fault and that there was a VMA once we got in the fault
+ * handler.  It does *not* guarantee that the VMA we find here
+ * was the one that we faulted on.
+ *
+ * 1. T1   : mprotect_key(foo, PAGE_SIZE, pkey=4);
+ * 2. T1   : set PKRU to deny access to pkey=4, touches page
+ * 3. T1   : faults...
+ * 4.    T2: mprotect_key(foo, PAGE_SIZE, pkey=5);
+ * 5. T1   : enters fault handler, takes mmap_sem, etc...
+ * 6. T1   : reaches here, sees vma_pkey(vma)=5, when we really
+ *	     faulted on a pte with its pkey=4.
+ */
+static void fill_sig_info_pkey(int si_code, siginfo_t *info,
+		struct vm_area_struct *vma)
+{
+	/* This is effectively an #ifdef */
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return;
+
+	/* Fault not from Protection Keys: nothing to do */
+	if (si_code != SEGV_PKUERR)
+		return;
+	/*
+	 * force_sig_info_fault() is called from a number of
+	 * contexts, some of which have a VMA and some of which
+	 * do not.  The PF_PK handling happens after we have a
+	 * valid VMA, so we should never reach this without a
+	 * valid VMA.
+	 */
+	if (!vma) {
+		WARN_ONCE(1, "PKU fault with no VMA passed in");
+		info->si_pkey = 0;
+		return;
+	}
+	/*
+	 * si_pkey should be thought of as a strong hint, but not
+	 * absolutely guaranteed to be 100% accurate because of
+	 * the race explained above.
+	 */
+	info->si_pkey = vma_pkey(vma);
+}
+
 static void
 force_sig_info_fault(int si_signo, int si_code, unsigned long address,
 		     struct task_struct *tsk, struct vm_area_struct *vma,
@@ -187,6 +239,8 @@ force_sig_info_fault(int si_signo, int s
 		lsb = PAGE_SHIFT;
 	info.si_addr_lsb = lsb;
 
+	fill_sig_info_pkey(si_code, &info, vma);
+
 	force_sig_info(si_signo, &info, tsk);
 }
 
@@ -847,7 +901,15 @@ static noinline void
 bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
 		      unsigned long address, struct vm_area_struct *vma)
 {
-	__bad_area(regs, error_code, address, vma, SEGV_ACCERR);
+	/*
+	 * This OSPKE check is not strictly necessary at runtime.
+	 * But, doing it this way allows compiler optimizations
+	 * if pkeys are compiled out.
+	 */
+	if (boot_cpu_has(X86_FEATURE_OSPKE) && (error_code & PF_PK))
+		__bad_area(regs, error_code, address, vma, SEGV_PKUERR);
+	else
+		__bad_area(regs, error_code, address, vma, SEGV_ACCERR);
 }
 
 static void
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 16/33] x86, pkeys: add functions to fetch PKRU
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (14 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 15/33] x86, pkeys: fill in pkey field in siginfo Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:21   ` [tip:mm/pkeys] x86/mm/pkeys: Add " tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 17/33] mm: factor out VMA fault permission checking Dave Hansen
                   ` (17 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

This adds the raw instruction to access PKRU as well as some
accessor functions that correctly handle when the CPU does not
support the instruction.  We don't use it here, but we will use
read_pkru() in the next patch.
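
As a quick usage sketch (hypothetical caller; the first real
consumer arrives in the next patch), note that read_pkru() is
safe to call unconditionally:

	/*
	 * Hypothetical caller.  read_pkru() returns 0 -- "allow
	 * everything" -- when the CPU or kernel lacks OSPKE, so no
	 * feature check is needed at the call site.
	 */
	u32 pkru = read_pkru();

	if (pkru)
		pr_debug("some pkeys are restricted: PKRU=%08x\n", pkru);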

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/include/asm/pgtable.h       |    8 ++++++++
 b/arch/x86/include/asm/special_insns.h |   22 ++++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff -puN arch/x86/include/asm/pgtable.h~pkeys-10-kernel-pkru-instructions arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h~pkeys-10-kernel-pkru-instructions	2016-02-12 10:44:20.729452217 -0800
+++ b/arch/x86/include/asm/pgtable.h	2016-02-12 10:44:20.734452446 -0800
@@ -99,6 +99,14 @@ static inline int pte_dirty(pte_t pte)
 	return pte_flags(pte) & _PAGE_DIRTY;
 }
 
+
+static inline u32 read_pkru(void)
+{
+	if (boot_cpu_has(X86_FEATURE_OSPKE))
+		return __read_pkru();
+	return 0;
+}
+
 static inline int pte_young(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_ACCESSED;
diff -puN arch/x86/include/asm/special_insns.h~pkeys-10-kernel-pkru-instructions arch/x86/include/asm/special_insns.h
--- a/arch/x86/include/asm/special_insns.h~pkeys-10-kernel-pkru-instructions	2016-02-12 10:44:20.731452309 -0800
+++ b/arch/x86/include/asm/special_insns.h	2016-02-12 10:44:20.735452492 -0800
@@ -98,6 +98,28 @@ static inline void native_write_cr8(unsi
 }
 #endif
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+static inline u32 __read_pkru(void)
+{
+	u32 ecx = 0;
+	u32 edx, pkru;
+
+	/*
+	 * "rdpkru" instruction.  Places PKRU contents in to EAX,
+	 * clears EDX and requires that ecx=0.
+	 */
+	asm volatile(".byte 0x0f,0x01,0xee\n\t"
+		     : "=a" (pkru), "=d" (edx)
+		     : "c" (ecx));
+	return pkru;
+}
+#else
+static inline u32 __read_pkru(void)
+{
+	return 0;
+}
+#endif
+
 static inline void native_wbinvd(void)
 {
 	asm volatile("wbinvd": : :"memory");
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 17/33] mm: factor out VMA fault permission checking
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (15 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 16/33] x86, pkeys: add functions to fetch PKRU Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:21   ` [tip:mm/pkeys] mm/gup: Factor " tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 18/33] x86, mm: simplify get_user_pages() PTE bit handling Dave Hansen
                   ` (16 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

This code matches a fault condition up with the VMA and ensures
that the VMA allows the fault to be handled instead of just
erroring out.

We will be extending this in a moment to comprehend protection
keys.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/mm/gup.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff -puN mm/gup.c~pkeys-10-pte-fault mm/gup.c
--- a/mm/gup.c~pkeys-10-pte-fault	2016-02-12 10:44:21.164472103 -0800
+++ b/mm/gup.c	2016-02-12 10:44:21.167472240 -0800
@@ -610,6 +610,18 @@ next_page:
 }
 EXPORT_SYMBOL(__get_user_pages);
 
+bool vma_permits_fault(struct vm_area_struct *vma, unsigned int fault_flags)
+{
+	vm_flags_t vm_flags;
+
+	vm_flags = (fault_flags & FAULT_FLAG_WRITE) ? VM_WRITE : VM_READ;
+
+	if (!(vm_flags & vma->vm_flags))
+		return false;
+
+	return true;
+}
+
 /*
  * fixup_user_fault() - manually resolve a user page fault
  * @tsk:	the task_struct to use for page fault accounting, or
@@ -645,7 +657,6 @@ int fixup_user_fault(struct task_struct
 		     bool *unlocked)
 {
 	struct vm_area_struct *vma;
-	vm_flags_t vm_flags;
 	int ret, major = 0;
 
 	if (unlocked)
@@ -656,8 +667,7 @@ retry:
 	if (!vma || address < vma->vm_start)
 		return -EFAULT;
 
-	vm_flags = (fault_flags & FAULT_FLAG_WRITE) ? VM_WRITE : VM_READ;
-	if (!(vm_flags & vma->vm_flags))
+	if (!vma_permits_fault(vma, fault_flags))
 		return -EFAULT;
 
 	ret = handle_mm_fault(mm, vma, address, fault_flags);
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 18/33] x86, mm: simplify get_user_pages() PTE bit handling
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (16 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 17/33] mm: factor out VMA fault permission checking Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:21   ` [tip:mm/pkeys] x86/mm/gup: Simplify " tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 19/33] x86, pkeys: check VMAs and PTEs for protection keys Dave Hansen
                   ` (15 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The current get_user_pages() code is a wee bit more complicated
than it needs to be for pte bit checking.  Currently, it establishes
a mask of required pte _PAGE_* bits and ensures that the pte it
goes after has all those bits.

This consolidates the three identical copies of this code.
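
The reason a single helper can serve ptes, pmds and puds is that
the bits being checked sit in the same positions at every level.
A sketch of that invariant (not part of the patch):

	/*
	 * Sketch: the checked flag bits line up at every page-table
	 * level, which is why pmd_val()/pud_val() can be fed to the
	 * same pte-style check.
	 */
	BUILD_BUG_ON(_PAGE_PRESENT != (_AT(pteval_t, 1) << 0));
	BUILD_BUG_ON(_PAGE_RW      != (_AT(pteval_t, 1) << 1));
	BUILD_BUG_ON(_PAGE_USER    != (_AT(pteval_t, 1) << 2));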

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/mm/gup.c |   38 ++++++++++++++++++++++----------------
 1 file changed, 22 insertions(+), 16 deletions(-)

diff -puN arch/x86/mm/gup.c~pkeys-12-gup-swizzle arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c~pkeys-12-gup-swizzle	2016-02-12 10:44:21.572490755 -0800
+++ b/arch/x86/mm/gup.c	2016-02-12 10:44:21.575490892 -0800
@@ -75,6 +75,24 @@ static void undo_dev_pagemap(int *nr, in
 }
 
 /*
+ * 'pteval' can come from a pte, pmd or pud.  We only check
+ * _PAGE_PRESENT, _PAGE_USER, and _PAGE_RW in here which are the
+ * same value on all 3 types.
+ */
+static inline int pte_allows_gup(unsigned long pteval, int write)
+{
+	unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;
+
+	if (write)
+		need_pte_bits |= _PAGE_RW;
+
+	if ((pteval & need_pte_bits) != need_pte_bits)
+		return 0;
+
+	return 1;
+}
+
+/*
  * The performance critical leaf functions are made noinline otherwise gcc
  * inlines everything into a single function which results in too much
  * register pressure.
@@ -83,14 +101,9 @@ static noinline int gup_pte_range(pmd_t
 		unsigned long end, int write, struct page **pages, int *nr)
 {
 	struct dev_pagemap *pgmap = NULL;
-	unsigned long mask;
 	int nr_start = *nr;
 	pte_t *ptep;
 
-	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
-
 	ptep = pte_offset_map(&pmd, addr);
 	do {
 		pte_t pte = gup_get_pte(ptep);
@@ -110,7 +123,8 @@ static noinline int gup_pte_range(pmd_t
 				pte_unmap(ptep);
 				return 0;
 			}
-		} else if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
+		} else if (!pte_allows_gup(pte_val(pte), write) ||
+			   pte_special(pte)) {
 			pte_unmap(ptep);
 			return 0;
 		}
@@ -164,14 +178,10 @@ static int __gup_device_huge_pmd(pmd_t p
 static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
-	unsigned long mask;
 	struct page *head, *page;
 	int refs;
 
-	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
-	if ((pmd_flags(pmd) & mask) != mask)
+	if (!pte_allows_gup(pmd_val(pmd), write))
 		return 0;
 
 	VM_BUG_ON(!pfn_valid(pmd_pfn(pmd)));
@@ -231,14 +241,10 @@ static int gup_pmd_range(pud_t pud, unsi
 static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
-	unsigned long mask;
 	struct page *head, *page;
 	int refs;
 
-	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
-	if ((pud_flags(pud) & mask) != mask)
+	if (!pte_allows_gup(pud_val(pud), write))
 		return 0;
 	/* hugepages are never "special" */
 	VM_BUG_ON(pud_flags(pud) & _PAGE_SPECIAL);
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 19/33] x86, pkeys: check VMAs and PTEs for protection keys
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (17 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 18/33] x86, mm: simplify get_user_pages() PTE bit handling Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:22   ` [tip:mm/pkeys] mm/gup, x86/mm/pkeys: Check " tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 20/33] mm: do not enforce PKEY permissions on "foreign" mm access Dave Hansen
                   ` (14 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Today, for normal faults and page table walks, we check the VMA
and/or PTE to ensure that it is compatible with the action.  For
instance, if we get a write fault on a non-writeable VMA, we
SIGSEGV.

We try to do the same thing for protection keys.  Basically, we
try to make sure that if a user does this:

	mprotect(ptr, size, PROT_NONE);
	*ptr = foo;

they see the same effects with protection keys when they do this:

	mprotect(ptr, size, PROT_READ|PROT_WRITE);
	set_pkey(ptr, size, 4);
	wrpkru(0xffffff3f); // access disable pkey 4
	*ptr = foo;

The state to do that checking is in the VMA, but we also
sometimes have to do it on the page tables only, like when doing
a get_user_pages_fast() where we have no VMA.
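
For reference, a sketch of the PKRU encoding these checks decode:
each key k is controlled by two bits, AD (access disable) at bit
2k and WD (write disable) at bit 2k+1.  So, for example:

	/* Sketch: build a PKRU value that access-disables only key 4 */
	u32 pkru = 0;			/* everything allowed */

	pkru |= 0x1 << (4 * 2);		/* AD for key 4: bit 8 */
	/* write-disabling key 4 instead would set bit 9 (WD) */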

We add two functions and expose them to generic code:

	arch_pte_access_permitted(pte_flags, write)
	arch_vma_access_permitted(vma, write)

These are, of course, backed up in x86 arch code with checks
against the PTE or VMA's protection key.

But, there are also cases where we do not want to respect
protection keys.  When we ptrace(), for instance, we do not want
to apply the tracer's PKRU permissions to the PTEs from the
process being traced.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/powerpc/include/asm/mmu_context.h   |   11 ++++++
 b/arch/s390/include/asm/mmu_context.h      |   11 ++++++
 b/arch/unicore32/include/asm/mmu_context.h |   11 ++++++
 b/arch/x86/include/asm/mmu_context.h       |   49 +++++++++++++++++++++++++++++
 b/arch/x86/include/asm/pgtable.h           |   29 +++++++++++++++++
 b/arch/x86/mm/fault.c                      |   21 +++++++++++-
 b/arch/x86/mm/gup.c                        |    5 ++
 b/include/asm-generic/mm_hooks.h           |   11 ++++++
 b/mm/gup.c                                 |   18 ++++++++--
 b/mm/memory.c                              |    4 ++
 10 files changed, 166 insertions(+), 4 deletions(-)

diff -puN arch/powerpc/include/asm/mmu_context.h~pkeys-13-pte-fault arch/powerpc/include/asm/mmu_context.h
--- a/arch/powerpc/include/asm/mmu_context.h~pkeys-13-pte-fault	2016-02-12 10:44:21.985509635 -0800
+++ b/arch/powerpc/include/asm/mmu_context.h	2016-02-12 10:44:22.002510412 -0800
@@ -148,5 +148,16 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
diff -puN arch/s390/include/asm/mmu_context.h~pkeys-13-pte-fault arch/s390/include/asm/mmu_context.h
--- a/arch/s390/include/asm/mmu_context.h~pkeys-13-pte-fault	2016-02-12 10:44:21.986509681 -0800
+++ b/arch/s390/include/asm/mmu_context.h	2016-02-12 10:44:22.003510458 -0800
@@ -130,4 +130,15 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
 #endif /* __S390_MMU_CONTEXT_H */
diff -puN arch/unicore32/include/asm/mmu_context.h~pkeys-13-pte-fault arch/unicore32/include/asm/mmu_context.h
--- a/arch/unicore32/include/asm/mmu_context.h~pkeys-13-pte-fault	2016-02-12 10:44:21.988509772 -0800
+++ b/arch/unicore32/include/asm/mmu_context.h	2016-02-12 10:44:22.003510458 -0800
@@ -97,4 +97,15 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
 #endif
diff -puN arch/x86/include/asm/mmu_context.h~pkeys-13-pte-fault arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~pkeys-13-pte-fault	2016-02-12 10:44:21.990509864 -0800
+++ b/arch/x86/include/asm/mmu_context.h	2016-02-12 10:44:22.003510458 -0800
@@ -286,4 +286,53 @@ static inline int vma_pkey(struct vm_are
 	return pkey;
 }
 
+static inline bool __pkru_allows_pkey(u16 pkey, bool write)
+{
+	u32 pkru = read_pkru();
+
+	if (!__pkru_allows_read(pkru, pkey))
+		return false;
+	if (write && !__pkru_allows_write(pkru, pkey))
+		return false;
+
+	return true;
+}
+
+/*
+ * We only want to enforce protection keys on the current process
+ * because we effectively have no access to PKRU for other
+ * processes or any way to tell *which* PKRU in a threaded
+ * process we could use.
+ *
+ * So do not enforce things if the VMA is not from the current
+ * mm, or if we are in a kernel thread.
+ */
+static inline bool vma_is_foreign(struct vm_area_struct *vma)
+{
+	if (!current->mm)
+		return true;
+	/*
+	 * Should PKRU be enforced on the access to this VMA?  If
+	 * the VMA is from another process, then PKRU has no
+	 * relevance and should not be enforced.
+	 */
+	if (current->mm != vma->vm_mm)
+		return true;
+
+	return false;
+}
+
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* allow access if the VMA is not one from this process */
+	if (vma_is_foreign(vma))
+		return true;
+	return __pkru_allows_pkey(vma_pkey(vma), write);
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	return __pkru_allows_pkey(pte_flags_pkey(pte_flags(pte)), write);
+}
+
 #endif /* _ASM_X86_MMU_CONTEXT_H */
diff -puN arch/x86/include/asm/pgtable.h~pkeys-13-pte-fault arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h~pkeys-13-pte-fault	2016-02-12 10:44:21.991509909 -0800
+++ b/arch/x86/include/asm/pgtable.h	2016-02-12 10:44:22.004510503 -0800
@@ -919,6 +919,35 @@ static inline pte_t pte_swp_clear_soft_d
 }
 #endif
 
+#define PKRU_AD_BIT 0x1
+#define PKRU_WD_BIT 0x2
+
+static inline bool __pkru_allows_read(u32 pkru, u16 pkey)
+{
+	int pkru_pkey_bits = pkey * 2;
+	return !(pkru & (PKRU_AD_BIT << pkru_pkey_bits));
+}
+
+static inline bool __pkru_allows_write(u32 pkru, u16 pkey)
+{
+	int pkru_pkey_bits = pkey * 2;
+	/*
+	 * Access-disable disables writes too so we need to check
+	 * both bits here.
+	 */
+	return !(pkru & ((PKRU_AD_BIT|PKRU_WD_BIT) << pkru_pkey_bits));
+}
+
+static inline u16 pte_flags_pkey(unsigned long pte_flags)
+{
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+	/* ifdef to avoid doing 59-bit shift on 32-bit values */
+	return (pte_flags & _PAGE_PKEY_MASK) >> _PAGE_BIT_PKEY_BIT0;
+#else
+	return 0;
+#endif
+}
+
 #include <asm-generic/pgtable.h>
 #endif	/* __ASSEMBLY__ */
 
diff -puN arch/x86/mm/fault.c~pkeys-13-pte-fault arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-13-pte-fault	2016-02-12 10:44:21.993510001 -0800
+++ b/arch/x86/mm/fault.c	2016-02-12 10:44:22.004510503 -0800
@@ -897,6 +897,16 @@ bad_area(struct pt_regs *regs, unsigned
 	__bad_area(regs, error_code, address, NULL, SEGV_MAPERR);
 }
 
+static inline bool bad_area_access_from_pkeys(unsigned long error_code,
+		struct vm_area_struct *vma)
+{
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return false;
+	if (error_code & PF_PK)
+		return true;
+	return false;
+}
+
 static noinline void
 bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
 		      unsigned long address, struct vm_area_struct *vma)
@@ -906,7 +916,7 @@ bad_area_access_error(struct pt_regs *re
 	 * But, doing it this way allows compiler optimizations
 	 * if pkeys are compiled out.
 	 */
-	if (boot_cpu_has(X86_FEATURE_OSPKE) && (error_code & PF_PK))
+	if (bad_area_access_from_pkeys(error_code, vma))
 		__bad_area(regs, error_code, address, vma, SEGV_PKUERR);
 	else
 		__bad_area(regs, error_code, address, vma, SEGV_ACCERR);
@@ -1081,6 +1091,15 @@ int show_unhandled_signals = 1;
 static inline int
 access_error(unsigned long error_code, struct vm_area_struct *vma)
 {
+	/*
+	 * Access or read was blocked by protection keys. We do
+	 * this check before any others because we do not want
+	 * to, for instance, confuse a protection-key-denied
+	 * write with one for which we should do a COW.
+	 */
+	if (error_code & PF_PK)
+		return 1;
+
 	if (error_code & PF_WRITE) {
 		/* write, present and write, not present: */
 		if (unlikely(!(vma->vm_flags & VM_WRITE)))
diff -puN arch/x86/mm/gup.c~pkeys-13-pte-fault arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c~pkeys-13-pte-fault	2016-02-12 10:44:21.994510046 -0800
+++ b/arch/x86/mm/gup.c	2016-02-12 10:44:22.005510549 -0800
@@ -11,6 +11,7 @@
 #include <linux/swap.h>
 #include <linux/memremap.h>
 
+#include <asm/mmu_context.h>
 #include <asm/pgtable.h>
 
 static inline pte_t gup_get_pte(pte_t *ptep)
@@ -89,6 +90,10 @@ static inline int pte_allows_gup(unsigne
 	if ((pteval & need_pte_bits) != need_pte_bits)
 		return 0;
 
+	/* Check memory protection keys permissions. */
+	if (!__pkru_allows_pkey(pte_flags_pkey(pteval), write))
+		return 0;
+
 	return 1;
 }
 
diff -puN include/asm-generic/mm_hooks.h~pkeys-13-pte-fault include/asm-generic/mm_hooks.h
--- a/include/asm-generic/mm_hooks.h~pkeys-13-pte-fault	2016-02-12 10:44:21.996510138 -0800
+++ b/include/asm-generic/mm_hooks.h	2016-02-12 10:44:22.005510549 -0800
@@ -26,4 +26,15 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
 #endif	/* _ASM_GENERIC_MM_HOOKS_H */
diff -puN mm/gup.c~pkeys-13-pte-fault mm/gup.c
--- a/mm/gup.c~pkeys-13-pte-fault	2016-02-12 10:44:21.997510184 -0800
+++ b/mm/gup.c	2016-02-12 10:44:22.006510595 -0800
@@ -15,6 +15,7 @@
 #include <linux/rwsem.h>
 #include <linux/hugetlb.h>
 
+#include <asm/mmu_context.h>
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
 
@@ -444,6 +445,8 @@ static int check_vma_flags(struct vm_are
 		if (!(vm_flags & VM_MAYREAD))
 			return -EFAULT;
 	}
+	if (!arch_vma_access_permitted(vma, (gup_flags & FOLL_WRITE)))
+		return -EFAULT;
 	return 0;
 }
 
@@ -612,13 +615,19 @@ EXPORT_SYMBOL(__get_user_pages);
 
 bool vma_permits_fault(struct vm_area_struct *vma, unsigned int fault_flags)
 {
-	vm_flags_t vm_flags;
-
-	vm_flags = (fault_flags & FAULT_FLAG_WRITE) ? VM_WRITE : VM_READ;
+	bool write = !!(fault_flags & FAULT_FLAG_WRITE);
+	vm_flags_t vm_flags = write ? VM_WRITE : VM_READ;
 
 	if (!(vm_flags & vma->vm_flags))
 		return false;
 
+	/*
+	 * The architecture might have a hardware protection
+	 * mechanism other than read/write that can deny access
+	 */
+	if (!arch_vma_access_permitted(vma, write))
+		return false;
+
 	return true;
 }
 
@@ -1172,6 +1181,9 @@ static int gup_pte_range(pmd_t pmd, unsi
 			pte_protnone(pte) || (write && !pte_write(pte)))
 			goto pte_unmap;
 
+		if (!arch_pte_access_permitted(pte, write))
+			goto pte_unmap;
+
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 		head = compound_head(page);
diff -puN mm/memory.c~pkeys-13-pte-fault mm/memory.c
--- a/mm/memory.c~pkeys-13-pte-fault	2016-02-12 10:44:21.999510275 -0800
+++ b/mm/memory.c	2016-02-12 10:44:22.007510641 -0800
@@ -65,6 +65,7 @@
 #include <linux/userfaultfd_k.h>
 
 #include <asm/io.h>
+#include <asm/mmu_context.h>
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
 #include <asm/tlb.h>
@@ -3357,6 +3358,9 @@ static int __handle_mm_fault(struct mm_s
 	pmd_t *pmd;
 	pte_t *pte;
 
+	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE))
+		return VM_FAULT_SIGSEGV;
+
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return hugetlb_fault(mm, vma, address, flags);
 
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 20/33] mm: do not enforce PKEY permissions on "foreign" mm access
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (18 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 19/33] x86, pkeys: check VMAs and PTEs for protection keys Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-12 21:02 ` [PATCH 21/33] x86, pkeys: optimize fault handling in access_error() Dave Hansen
                   ` (13 subsequent siblings)
  33 siblings, 0 replies; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen, linux-arch


From: Dave Hansen <dave.hansen@linux.intel.com>

We try to enforce protection keys in software the same way that we
do in hardware.  (See long example below).

But, we only want to do this when accessing our *own* process's
memory.  If GDB set PKRU[6].AD=1 (disable access to PKEY 6), then
tried to PTRACE_POKE a target process which just happened to have
some mprotect_pkey(pkey=6) memory, we do *not* want to deny the
debugger access to that memory.  PKRU is fundamentally a
thread-local structure and we do not want to enforce it on access
to _another_ thread's data.

This gets especially tricky when we have workqueues or other
delayed-work mechanisms that might run in a random process's context.
We can check that we only enforce pkeys when operating on our *own* mm,
but delayed work gets performed when a random user context is active.
We might end up with a situation where a delayed-work gup fails when
running randomly under its "own" task but succeeds when running under
another process.  We want to avoid that.

To that end, we use the new GUP flag: FOLL_REMOTE and add a
fault flag: FAULT_FLAG_REMOTE.  They indicate that we are
walking an mm which is not guaranteed to be the same as
current->mm and should not be subject to protection key
enforcement.

Thanks to Jerome Glisse for pointing out this scenario.
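
Concretely, a remote-access path opts out of enforcement by
tagging the fault, as in this sketch (mirroring the ksm change
below):

	/* Sketch: fault on behalf of another mm; pkeys not enforced */
	ret = handle_mm_fault(mm, vma, addr,
			      FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE);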

*** Why do we enforce protection keys in software?? ***

Imagine that we disabled access to the memory pointed to by 'buf'.
Then, we implemented sys_read() like this:

	sys_read(fd, buf, len...)
	{
		struct page *page = follow_page(buf);
		void *buf_mapped = kmap(page);
		memcpy(buf_mapped, fd_data, len);
		...
	}

This writes to 'buf' via a *kernel* mapping, without a protection
key.  Meanwhile, this alternative implementation:

	sys_read(fd, buf, len...)
	{
		copy_to_user(buf, fd_data, len);
		...
	}

would hit a protection key fault, because the userspace 'buf'
mapping has a protection key set.

To provide consistency, and to make key-protected memory work
as much like mprotect()ed memory as possible, we try to enforce
the same protections as the hardware would when the *kernel* walks
the page tables (and other mm structures).

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-arch@vger.kernel.org
---

 b/arch/powerpc/include/asm/mmu_context.h   |    3 ++-
 b/arch/s390/include/asm/mmu_context.h      |    3 ++-
 b/arch/unicore32/include/asm/mmu_context.h |    3 ++-
 b/arch/x86/include/asm/mmu_context.h       |    5 +++--
 b/drivers/iommu/amd_iommu_v2.c             |    1 +
 b/include/asm-generic/mm_hooks.h           |    3 ++-
 b/include/linux/mm.h                       |    1 +
 b/mm/gup.c                                 |   15 ++++++++++-----
 b/mm/ksm.c                                 |   10 ++++++++--
 b/mm/memory.c                              |    3 ++-
 10 files changed, 33 insertions(+), 14 deletions(-)

diff -puN arch/powerpc/include/asm/mmu_context.h~pkeys-14-gup-fault-foreign-flag arch/powerpc/include/asm/mmu_context.h
--- a/arch/powerpc/include/asm/mmu_context.h~pkeys-14-gup-fault-foreign-flag	2016-02-12 10:44:22.650540035 -0800
+++ b/arch/powerpc/include/asm/mmu_context.h	2016-02-12 10:44:22.668540858 -0800
@@ -148,7 +148,8 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
-static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
+		bool write, bool foreign)
 {
 	/* by default, allow everything */
 	return true;
diff -puN arch/s390/include/asm/mmu_context.h~pkeys-14-gup-fault-foreign-flag arch/s390/include/asm/mmu_context.h
--- a/arch/s390/include/asm/mmu_context.h~pkeys-14-gup-fault-foreign-flag	2016-02-12 10:44:22.652540127 -0800
+++ b/arch/s390/include/asm/mmu_context.h	2016-02-12 10:44:22.669540904 -0800
@@ -130,7 +130,8 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
-static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
+		bool write, bool foreign)
 {
 	/* by default, allow everything */
 	return true;
diff -puN arch/unicore32/include/asm/mmu_context.h~pkeys-14-gup-fault-foreign-flag arch/unicore32/include/asm/mmu_context.h
--- a/arch/unicore32/include/asm/mmu_context.h~pkeys-14-gup-fault-foreign-flag	2016-02-12 10:44:22.654540218 -0800
+++ b/arch/unicore32/include/asm/mmu_context.h	2016-02-12 10:44:22.669540904 -0800
@@ -97,7 +97,8 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
-static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
+		bool write, bool foreign)
 {
 	/* by default, allow everything */
 	return true;
diff -puN arch/x86/include/asm/mmu_context.h~pkeys-14-gup-fault-foreign-flag arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~pkeys-14-gup-fault-foreign-flag	2016-02-12 10:44:22.655540264 -0800
+++ b/arch/x86/include/asm/mmu_context.h	2016-02-12 10:44:22.669540904 -0800
@@ -322,10 +322,11 @@ static inline bool vma_is_foreign(struct
 	return false;
 }
 
-static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
+		bool write, bool foreign)
 {
 	/* allow access if the VMA is not one from this process */
-	if (vma_is_foreign(vma))
+	if (foreign || vma_is_foreign(vma))
 		return true;
 	return __pkru_allows_pkey(vma_pkey(vma), write);
 }
diff -puN drivers/iommu/amd_iommu_v2.c~pkeys-14-gup-fault-foreign-flag drivers/iommu/amd_iommu_v2.c
--- a/drivers/iommu/amd_iommu_v2.c~pkeys-14-gup-fault-foreign-flag	2016-02-12 10:44:22.657540355 -0800
+++ b/drivers/iommu/amd_iommu_v2.c	2016-02-12 10:44:22.670540949 -0800
@@ -526,6 +526,7 @@ static void do_fault(struct work_struct
 		flags |= FAULT_FLAG_USER;
 	if (fault->flags & PPR_FAULT_WRITE)
 		flags |= FAULT_FLAG_WRITE;
+	flags |= FAULT_FLAG_REMOTE;
 
 	down_read(&mm->mmap_sem);
 	vma = find_extend_vma(mm, address);
diff -puN include/asm-generic/mm_hooks.h~pkeys-14-gup-fault-foreign-flag include/asm-generic/mm_hooks.h
--- a/include/asm-generic/mm_hooks.h~pkeys-14-gup-fault-foreign-flag	2016-02-12 10:44:22.658540401 -0800
+++ b/include/asm-generic/mm_hooks.h	2016-02-12 10:44:22.670540949 -0800
@@ -26,7 +26,8 @@ static inline void arch_bprm_mm_init(str
 {
 }
 
-static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
+		bool write, bool foreign)
 {
 	/* by default, allow everything */
 	return true;
diff -puN include/linux/mm.h~pkeys-14-gup-fault-foreign-flag include/linux/mm.h
--- a/include/linux/mm.h~pkeys-14-gup-fault-foreign-flag	2016-02-12 10:44:22.660540492 -0800
+++ b/include/linux/mm.h	2016-02-12 10:44:22.671540995 -0800
@@ -251,6 +251,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_KILLABLE	0x10	/* The fault task is in SIGKILL killable region */
 #define FAULT_FLAG_TRIED	0x20	/* Second try */
 #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
+#define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
 
 /*
  * vm_fault is filled by the the pagefault handler and passed to the vma's
diff -puN mm/gup.c~pkeys-14-gup-fault-foreign-flag mm/gup.c
--- a/mm/gup.c~pkeys-14-gup-fault-foreign-flag	2016-02-12 10:44:22.662540584 -0800
+++ b/mm/gup.c	2016-02-12 10:44:22.672541041 -0800
@@ -365,6 +365,8 @@ static int faultin_page(struct task_stru
 		return -ENOENT;
 	if (*flags & FOLL_WRITE)
 		fault_flags |= FAULT_FLAG_WRITE;
+	if (*flags & FOLL_REMOTE)
+		fault_flags |= FAULT_FLAG_REMOTE;
 	if (nonblocking)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
 	if (*flags & FOLL_NOWAIT)
@@ -415,11 +417,13 @@ static int faultin_page(struct task_stru
 static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 {
 	vm_flags_t vm_flags = vma->vm_flags;
+	int write = (gup_flags & FOLL_WRITE);
+	int foreign = (gup_flags & FOLL_REMOTE);
 
 	if (vm_flags & (VM_IO | VM_PFNMAP))
 		return -EFAULT;
 
-	if (gup_flags & FOLL_WRITE) {
+	if (write) {
 		if (!(vm_flags & VM_WRITE)) {
 			if (!(gup_flags & FOLL_FORCE))
 				return -EFAULT;
@@ -445,7 +449,7 @@ static int check_vma_flags(struct vm_are
 		if (!(vm_flags & VM_MAYREAD))
 			return -EFAULT;
 	}
-	if (!arch_vma_access_permitted(vma, (gup_flags & FOLL_WRITE)))
+	if (!arch_vma_access_permitted(vma, write, foreign))
 		return -EFAULT;
 	return 0;
 }
@@ -615,7 +619,8 @@ EXPORT_SYMBOL(__get_user_pages);
 
 bool vma_permits_fault(struct vm_area_struct *vma, unsigned int fault_flags)
 {
-	bool write = !!(fault_flags & FAULT_FLAG_WRITE);
+	bool write   = !!(fault_flags & FAULT_FLAG_WRITE);
+	bool foreign = !!(fault_flags & FAULT_FLAG_REMOTE);
 	vm_flags_t vm_flags = write ? VM_WRITE : VM_READ;
 
 	if (!(vm_flags & vma->vm_flags))
@@ -623,9 +628,9 @@ bool vma_permits_fault(struct vm_area_st
 
 	/*
 	 * The architecture might have a hardware protection
-	 * mechanism other than read/write that can deny access
+	 * mechanism other than read/write that can deny access.
 	 */
-	if (!arch_vma_access_permitted(vma, write))
+	if (!arch_vma_access_permitted(vma, write, foreign))
 		return false;
 
 	return true;
diff -puN mm/ksm.c~pkeys-14-gup-fault-foreign-flag mm/ksm.c
--- a/mm/ksm.c~pkeys-14-gup-fault-foreign-flag	2016-02-12 10:44:22.663540630 -0800
+++ b/mm/ksm.c	2016-02-12 10:44:22.673541087 -0800
@@ -359,6 +359,10 @@ static inline bool ksm_test_exit(struct
  * in case the application has unmapped and remapped mm,addr meanwhile.
  * Could a ksm page appear anywhere else?  Actually yes, in a VM_PFNMAP
  * mmap of /dev/mem or /dev/kmem, where we would not want to touch it.
+ *
+ * FAULT_FLAG/FOLL_REMOTE are because we do this outside the context
+ * of the process that owns 'vma'.  We also do not want to enforce
+ * protection keys here anyway.
  */
 static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 {
@@ -367,12 +371,14 @@ static int break_ksm(struct vm_area_stru
 
 	do {
 		cond_resched();
-		page = follow_page(vma, addr, FOLL_GET | FOLL_MIGRATION);
+		page = follow_page(vma, addr,
+				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
 		if (IS_ERR_OR_NULL(page))
 			break;
 		if (PageKsm(page))
 			ret = handle_mm_fault(vma->vm_mm, vma, addr,
-							FAULT_FLAG_WRITE);
+							FAULT_FLAG_WRITE |
+							FAULT_FLAG_REMOTE);
 		else
 			ret = VM_FAULT_WRITE;
 		put_page(page);
diff -puN mm/memory.c~pkeys-14-gup-fault-foreign-flag mm/memory.c
--- a/mm/memory.c~pkeys-14-gup-fault-foreign-flag	2016-02-12 10:44:22.665540721 -0800
+++ b/mm/memory.c	2016-02-12 10:44:22.674541132 -0800
@@ -3358,7 +3358,8 @@ static int __handle_mm_fault(struct mm_s
 	pmd_t *pmd;
 	pte_t *pte;
 
-	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE))
+	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
+					    flags & FAULT_FLAG_REMOTE))
 		return VM_FAULT_SIGSEGV;
 
 	if (unlikely(is_vm_hugetlb_page(vma)))
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 21/33] x86, pkeys: optimize fault handling in access_error()
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (19 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 20/33] mm: do not enforce PKEY permissions on "foreign" mm access Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:23   ` [tip:mm/pkeys] x86/mm/pkeys: Optimize " tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 22/33] x86, pkeys: differentiate instruction fetches Dave Hansen
                   ` (12 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

We might not strictly have to make modifications to
access_error() to check the VMA here.

If we do not, we will do this:
1. app sets VMA pkey to K
2. app touches a !present page
3. do_page_fault(), allocates and maps page, sets pte.pkey=K
4. return to userspace
5. touch instruction reexecutes, but triggers PF_PK
6. do PKEY signal

What happens with this patch applied:
1. app sets VMA pkey to K
2. app touches a !present page
3. do_page_fault() notices that K is inaccessible
4. do PKEY signal

We basically skip the fault that does an allocation.

So what this lets us do is protect areas from even being
*populated* unless they are accessible according to protection
keys.  That seems handy to me and makes protection keys work
more like an mprotect()'d mapping.
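
The user-visible effect, sketched with hypothetical helpers (the
pkey syscalls are not part of this series):

	ptr = mmap(NULL, size, PROT_READ|PROT_WRITE,
		   MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	mprotect_key(ptr, size, 4);	/* hypothetical: tag with key 4 */
	wrpkru(rdpkru() | (0x1 << 8));	/* access-disable key 4 */
	*ptr = 1;	/* SIGSEGV before any page is ever allocated */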

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/mm/fault.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff -puN arch/x86/mm/fault.c~pkeys-15-access_error arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-15-access_error	2016-02-12 10:44:23.285569064 -0800
+++ b/arch/x86/mm/fault.c	2016-02-12 10:44:23.288569201 -0800
@@ -900,10 +900,16 @@ bad_area(struct pt_regs *regs, unsigned
 static inline bool bad_area_access_from_pkeys(unsigned long error_code,
 		struct vm_area_struct *vma)
 {
+	/* This code is always called on the current mm */
+	bool foreign = false;
+
 	if (!boot_cpu_has(X86_FEATURE_OSPKE))
 		return false;
 	if (error_code & PF_PK)
 		return true;
+	/* this checks permission keys on the VMA: */
+	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE), foreign))
+		return true;
 	return false;
 }
 
@@ -1091,6 +1097,8 @@ int show_unhandled_signals = 1;
 static inline int
 access_error(unsigned long error_code, struct vm_area_struct *vma)
 {
+	/* This is only called for the current mm, so: */
+	bool foreign = false;
 	/*
 	 * Access or read was blocked by protection keys. We do
 	 * this check before any others because we do not want
@@ -1099,6 +1107,13 @@ access_error(unsigned long error_code, s
 	 */
 	if (error_code & PF_PK)
 		return 1;
+	/*
+	 * Make sure to check the VMA so that we do not perform
+	 * faults just to hit a PF_PK as soon as we fill in a
+	 * page.
+	 */
+	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE), foreign))
+		return 1;
 
 	if (error_code & PF_WRITE) {
 		/* write, present and write, not present: */
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 22/33] x86, pkeys: differentiate instruction fetches
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (20 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 21/33] x86, pkeys: optimize fault handling in access_error() Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:23   ` [tip:mm/pkeys] mm/core, x86/mm/pkeys: Differentiate " tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 23/33] x86, pkeys: dump PKRU with other kernel registers Dave Hansen
                   ` (11 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

As discussed earlier, we attempt to enforce protection keys in
software.

However, the code checks all faults to ensure that they are not
violating protection key permissions.  It was assumed that all
faults are either write faults where we check PKRU[key].WD (write
disable) or read faults where we check the AD (access disable)
bit.

But, there is a third category of faults for protection keys:
instruction faults.  Instruction faults never run afoul of
protection keys, because protection keys do not affect
instruction fetches.

So, plumb the PF_INSTR bit down into the
arch_vma_access_permitted() function where we do the protection
key checks.

We also add a new FAULT_FLAG_INSTRUCTION.  This is because
handle_mm_fault() is not passed the architecture-specific
error_code where we keep PF_INSTR, so we need to encode the
instruction fetch information into the arch-generic fault
flags.
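
This is what ultimately allows execute-only memory: an
instruction fetch succeeds even when PKRU denies all data access
to the page's key.  A userspace sketch (assuming the execute-only
support introduced elsewhere in this series):

	/* Sketch: code that can be executed but not read as data */
	void *code = mmap(NULL, 4096, PROT_EXEC,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* jumping into 'code' works; reading *(char *)code faults */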

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

---

 b/arch/powerpc/include/asm/mmu_context.h |    2 +-
 b/arch/s390/include/asm/mmu_context.h    |    2 +-
 b/arch/x86/include/asm/mmu_context.h     |    5 ++++-
 b/arch/x86/mm/fault.c                    |    8 ++++++--
 b/include/asm-generic/mm_hooks.h         |    2 +-
 b/include/linux/mm.h                     |    1 +
 b/mm/gup.c                               |   11 +++++++++--
 b/mm/memory.c                            |    1 +
 8 files changed, 24 insertions(+), 8 deletions(-)

diff -puN arch/powerpc/include/asm/mmu_context.h~pkeys-16-allow-execute-on-unreadable arch/powerpc/include/asm/mmu_context.h
--- a/arch/powerpc/include/asm/mmu_context.h~pkeys-16-allow-execute-on-unreadable	2016-02-12 10:44:23.698587944 -0800
+++ b/arch/powerpc/include/asm/mmu_context.h	2016-02-12 10:44:23.713588630 -0800
@@ -149,7 +149,7 @@ static inline void arch_bprm_mm_init(str
 }
 
 static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
-		bool write, bool foreign)
+		bool write, bool execute, bool foreign)
 {
 	/* by default, allow everything */
 	return true;
diff -puN arch/s390/include/asm/mmu_context.h~pkeys-16-allow-execute-on-unreadable arch/s390/include/asm/mmu_context.h
--- a/arch/s390/include/asm/mmu_context.h~pkeys-16-allow-execute-on-unreadable	2016-02-12 10:44:23.700588036 -0800
+++ b/arch/s390/include/asm/mmu_context.h	2016-02-12 10:44:23.713588630 -0800
@@ -131,7 +131,7 @@ static inline void arch_bprm_mm_init(str
 }
 
 static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
-		bool write, bool foreign)
+		bool write, bool execute, bool foreign)
 {
 	/* by default, allow everything */
 	return true;
diff -puN arch/x86/include/asm/mmu_context.h~pkeys-16-allow-execute-on-unreadable arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~pkeys-16-allow-execute-on-unreadable	2016-02-12 10:44:23.701588081 -0800
+++ b/arch/x86/include/asm/mmu_context.h	2016-02-12 10:44:23.714588675 -0800
@@ -323,8 +323,11 @@ static inline bool vma_is_foreign(struct
 }
 
 static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
-		bool write, bool foreign)
+		bool write, bool execute, bool foreign)
 {
+	/* pkeys never affect instruction fetches */
+	if (execute)
+		return true;
 	/* allow access if the VMA is not one from this process */
 	if (foreign || vma_is_foreign(vma))
 		return true;
diff -puN arch/x86/mm/fault.c~pkeys-16-allow-execute-on-unreadable arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-16-allow-execute-on-unreadable	2016-02-12 10:44:23.703588173 -0800
+++ b/arch/x86/mm/fault.c	2016-02-12 10:44:23.714588675 -0800
@@ -908,7 +908,8 @@ static inline bool bad_area_access_from_
 	if (error_code & PF_PK)
 		return true;
 	/* this checks permission keys on the VMA: */
-	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE), foreign))
+	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE),
+				(error_code & PF_INSTR), foreign))
 		return true;
 	return false;
 }
@@ -1112,7 +1113,8 @@ access_error(unsigned long error_code, s
 	 * faults just to hit a PF_PK as soon as we fill in a
 	 * page.
 	 */
-	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE), foreign))
+	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE),
+				(error_code & PF_INSTR), foreign))
 		return 1;
 
 	if (error_code & PF_WRITE) {
@@ -1267,6 +1269,8 @@ __do_page_fault(struct pt_regs *regs, un
 
 	if (error_code & PF_WRITE)
 		flags |= FAULT_FLAG_WRITE;
+	if (error_code & PF_INSTR)
+		flags |= FAULT_FLAG_INSTRUCTION;
 
 	/*
 	 * When running in the kernel we expect faults to occur only to
diff -puN include/asm-generic/mm_hooks.h~pkeys-16-allow-execute-on-unreadable include/asm-generic/mm_hooks.h
--- a/include/asm-generic/mm_hooks.h~pkeys-16-allow-execute-on-unreadable	2016-02-12 10:44:23.705588264 -0800
+++ b/include/asm-generic/mm_hooks.h	2016-02-12 10:44:23.715588721 -0800
@@ -27,7 +27,7 @@ static inline void arch_bprm_mm_init(str
 }
 
 static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
-		bool write, bool foreign)
+		bool write, bool execute, bool foreign)
 {
 	/* by default, allow everything */
 	return true;
diff -puN include/linux/mm.h~pkeys-16-allow-execute-on-unreadable include/linux/mm.h
--- a/include/linux/mm.h~pkeys-16-allow-execute-on-unreadable	2016-02-12 10:44:23.706588310 -0800
+++ b/include/linux/mm.h	2016-02-12 10:44:23.716588767 -0800
@@ -252,6 +252,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_TRIED	0x20	/* Second try */
 #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
 #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
+#define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
 
 /*
  * vm_fault is filled by the the pagefault handler and passed to the vma's
diff -puN mm/gup.c~pkeys-16-allow-execute-on-unreadable mm/gup.c
--- a/mm/gup.c~pkeys-16-allow-execute-on-unreadable	2016-02-12 10:44:23.708588401 -0800
+++ b/mm/gup.c	2016-02-12 10:44:23.716588767 -0800
@@ -449,7 +449,11 @@ static int check_vma_flags(struct vm_are
 		if (!(vm_flags & VM_MAYREAD))
 			return -EFAULT;
 	}
-	if (!arch_vma_access_permitted(vma, write, foreign))
+	/*
+	 * gups are always data accesses, not instruction
+	 * fetches, so execute=false here
+	 */
+	if (!arch_vma_access_permitted(vma, write, false, foreign))
 		return -EFAULT;
 	return 0;
 }
@@ -629,8 +633,11 @@ bool vma_permits_fault(struct vm_area_st
 	/*
 	 * The architecture might have a hardware protection
 	 * mechanism other than read/write that can deny access.
+	 *
+	 * gup always represents data access, not instruction
+	 * fetches, so execute=false here:
 	 */
-	if (!arch_vma_access_permitted(vma, write, foreign))
+	if (!arch_vma_access_permitted(vma, write, false, foreign))
 		return false;
 
 	return true;
diff -puN mm/memory.c~pkeys-16-allow-execute-on-unreadable mm/memory.c
--- a/mm/memory.c~pkeys-16-allow-execute-on-unreadable	2016-02-12 10:44:23.710588493 -0800
+++ b/mm/memory.c	2016-02-12 10:44:23.717588813 -0800
@@ -3359,6 +3359,7 @@ static int __handle_mm_fault(struct mm_s
 	pte_t *pte;
 
 	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
+					    flags & FAULT_FLAG_INSTRUCTION,
 					    flags & FAULT_FLAG_REMOTE))
 		return VM_FAULT_SIGSEGV;
 
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 23/33] x86, pkeys: dump PKRU with other kernel registers
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (21 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 22/33] x86, pkeys: differentiate instruction fetches Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:24   ` [tip:mm/pkeys] x86/mm/pkeys: Dump " tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 24/33] x86, pkeys: dump pkey from VMA in /proc/pid/smaps Dave Hansen
                   ` (10 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Protection Keys never affect kernel mappings.  But, they can
affect whether the kernel will fault when it touches a user
mapping.  The kernel doesn't touch user mappings without some
careful choreography and these accesses don't generally result in
oopses.  But, if one does, we definitely want to have PKRU
available so we can figure out if protection keys played a role.
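
In a register dump, the new line lands after the DR registers,
e.g. (value hypothetical):

	DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
	PKRU: 55555554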

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/kernel/process_64.c |    2 ++
 1 file changed, 2 insertions(+)

diff -puN arch/x86/kernel/process_64.c~pkeys-30-kernel-error-dumps arch/x86/kernel/process_64.c
--- a/arch/x86/kernel/process_64.c~pkeys-30-kernel-error-dumps	2016-02-12 10:44:24.287614870 -0800
+++ b/arch/x86/kernel/process_64.c	2016-02-12 10:44:24.291615053 -0800
@@ -116,6 +116,8 @@ void __show_regs(struct pt_regs *regs, i
 	printk(KERN_DEFAULT "DR0: %016lx DR1: %016lx DR2: %016lx\n", d0, d1, d2);
 	printk(KERN_DEFAULT "DR3: %016lx DR6: %016lx DR7: %016lx\n", d3, d6, d7);
 
+	if (boot_cpu_has(X86_FEATURE_OSPKE))
+		printk(KERN_DEFAULT "PKRU: %08x\n", read_pkru());
 }
 
 void release_thread(struct task_struct *dead_task)
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 24/33] x86, pkeys: dump pkey from VMA in /proc/pid/smaps
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (22 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 23/33] x86, pkeys: dump PKRU with other kernel registers Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:24   ` [tip:mm/pkeys] x86/mm/pkeys: Dump pkey from VMA in /proc/pid/ smaps tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 25/33] x86, pkeys: add Kconfig prompt to existing config option Dave Hansen
                   ` (9 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen, vbabka


From: Dave Hansen <dave.hansen@linux.intel.com>

The protection key can now be just as important as read/write
permissions on a VMA.  We need some debug mechanism to help
figure out if it is in play.  smaps seems like a logical
place to expose it.

arch/x86/kernel/setup.c is a bit of a weirdo place to put
this code, but it already had seq_file.h and there was not
a much better existing place to put it.

We also use no #ifdef here.  If protection keys are .config'd
out, the function effectively behaves the same as the weak
generic stub.
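
On pkey hardware, the relevant piece of /proc/pid/smaps would
then look something like this (values hypothetical):

	7f3a00000000-7f3a00001000 rw-p 00000000 00:00 0
	Size:                  4 kB
	ProtectionKey:         4
	VmFlags: rd wr mr mw me ac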

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: vbabka@suse.cz
---

 b/arch/x86/kernel/setup.c |    9 +++++++++
 b/fs/proc/task_mmu.c      |   14 ++++++++++++++
 2 files changed, 23 insertions(+)

diff -puN arch/x86/kernel/setup.c~pkeys-40-smaps arch/x86/kernel/setup.c
--- a/arch/x86/kernel/setup.c~pkeys-40-smaps	2016-02-12 10:44:24.696633567 -0800
+++ b/arch/x86/kernel/setup.c	2016-02-12 10:44:24.701633796 -0800
@@ -112,6 +112,7 @@
 #include <asm/alternative.h>
 #include <asm/prom.h>
 #include <asm/microcode.h>
+#include <asm/mmu_context.h>
 
 /*
  * max_low_pfn_mapped: highest direct mapped pfn under 4GB
@@ -1282,3 +1283,11 @@ static int __init register_kernel_offset
 	return 0;
 }
 __initcall(register_kernel_offset_dumper);
+
+void arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
+{
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return;
+
+	seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
+}
diff -puN fs/proc/task_mmu.c~pkeys-40-smaps fs/proc/task_mmu.c
--- a/fs/proc/task_mmu.c~pkeys-40-smaps	2016-02-12 10:44:24.697633613 -0800
+++ b/fs/proc/task_mmu.c	2016-02-12 10:44:24.701633796 -0800
@@ -660,11 +660,20 @@ static void show_smap_vma_flags(struct s
 		[ilog2(VM_MERGEABLE)]	= "mg",
 		[ilog2(VM_UFFD_MISSING)]= "um",
 		[ilog2(VM_UFFD_WP)]	= "uw",
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+		/* These come out via ProtectionKey: */
+		[ilog2(VM_PKEY_BIT0)]	= "",
+		[ilog2(VM_PKEY_BIT1)]	= "",
+		[ilog2(VM_PKEY_BIT2)]	= "",
+		[ilog2(VM_PKEY_BIT3)]	= "",
+#endif
 	};
 	size_t i;
 
 	seq_puts(m, "VmFlags: ");
 	for (i = 0; i < BITS_PER_LONG; i++) {
+		if (!mnemonics[i][0])
+			continue;
 		if (vma->vm_flags & (1UL << i)) {
 			seq_printf(m, "%c%c ",
 				   mnemonics[i][0], mnemonics[i][1]);
@@ -702,6 +711,10 @@ static int smaps_hugetlb_range(pte_t *pt
 }
 #endif /* HUGETLB_PAGE */
 
+void __weak arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
+{
+}
+
 static int show_smap(struct seq_file *m, void *v, int is_pid)
 {
 	struct vm_area_struct *vma = v;
@@ -783,6 +796,7 @@ static int show_smap(struct seq_file *m,
 		   (vma->vm_flags & VM_LOCKED) ?
 			(unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);
 
+	arch_show_smap(m, vma);
 	show_smap_vma_flags(m, vma);
 	m_cache_vma(m, vma);
 	return 0;
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 25/33] x86, pkeys: add Kconfig prompt to existing config option
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (23 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 24/33] x86, pkeys: dump pkey from VMA in /proc/pid/smaps Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:24   ` [tip:mm/pkeys] x86/mm/pkeys: Add " tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 26/33] x86, pkeys: actually enable Memory Protection Keys in CPU Dave Hansen
                   ` (8 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

I don't have a strong opinion on whether we need this or not.
Protection Keys has relatively little code associated with it,
and it is not a heavyweight feature to keep enabled.  However,
I can imagine that folks would still appreciate being able to
disable it.

Here's the option if folks want it.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/Kconfig |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff -puN arch/x86/Kconfig~pkeys-40-kconfig-prompt arch/x86/Kconfig
--- a/arch/x86/Kconfig~pkeys-40-kconfig-prompt	2016-02-12 10:44:25.134653590 -0800
+++ b/arch/x86/Kconfig	2016-02-12 10:44:25.138653773 -0800
@@ -1715,8 +1715,18 @@ config X86_INTEL_MPX
 	  If unsure, say N.
 
 config X86_INTEL_MEMORY_PROTECTION_KEYS
+	prompt "Intel Memory Protection Keys"
 	def_bool y
+	# Note: only available in 64-bit mode
 	depends on CPU_SUP_INTEL && X86_64
+	---help---
+	  Memory Protection Keys provides a mechanism for enforcing
+	  page-based protections, but without requiring modification of the
+	  page tables when an application changes protection domains.
+
+	  For details, see Documentation/x86/protection-keys.txt
+
+	  If unsure, say Y.
 
 config EFI
 	bool "EFI runtime service support"
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 26/33] x86, pkeys: actually enable Memory Protection Keys in CPU
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (24 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 25/33] x86, pkeys: add Kconfig prompt to existing config option Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:25   ` [tip:mm/pkeys] x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 27/33] mm, multi-arch: pass a protection key in to calc_vm_flag_bits() Dave Hansen
                   ` (7 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

This sets the bit in 'cr4' to actually enable the protection
keys feature.  We also include a boot-time disable for the
feature: "nopku".

Setting X86_CR4_PKE will cause the X86_FEATURE_OSPKE cpuid
bit to appear set.  At this point in boot, identify_cpu()
has already run the actual CPUID instructions and populated
the "cpu features" structures.  We need to go back and
re-run get_cpu_cap() so that it picks up the updated values.

We *could* simply re-populate the 11th word of the cpuid
data, but this is probably quick enough.

Also note that with the cpu_has() check and X86_FEATURE_PKU
present in disabled-features.h, we do not need an #ifdef
for setup_pku().

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/Documentation/kernel-parameters.txt |    3 ++
 b/arch/x86/kernel/cpu/common.c        |   41 ++++++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+)

diff -puN arch/x86/kernel/cpu/common.c~pkeys-50-should-be-last-patch arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~pkeys-50-should-be-last-patch	2016-02-12 10:44:25.545672379 -0800
+++ b/arch/x86/kernel/cpu/common.c	2016-02-12 10:44:25.551672653 -0800
@@ -288,6 +288,46 @@ static __always_inline void setup_smap(s
 }
 
 /*
+ * Protection Keys are not available in 32-bit mode.
+ */
+static bool pku_disabled;
+static __always_inline void setup_pku(struct cpuinfo_x86 *c)
+{
+	if (!cpu_has(c, X86_FEATURE_PKU))
+		return;
+	if (pku_disabled)
+		return;
+
+	cr4_set_bits(X86_CR4_PKE);
+	/*
+	 * Setting X86_CR4_PKE will cause the X86_FEATURE_OSPKE
+	 * cpuid bit to be set.  We need to ensure that we
+	 * update that bit in this CPU's "cpu_info".
+	 */
+	get_cpu_cap(c);
+}
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+static __init int setup_disable_pku(char *arg)
+{
+	/*
+	 * Do not clear the X86_FEATURE_PKU bit.  All of the
+	 * runtime checks are against OSPKE so clearing the
+	 * bit does nothing.
+	 *
+	 * This way, we will see "pku" in cpuinfo, but not
+	 * "ospke", which is exactly what we want.  It shows
+	 * that the CPU has PKU, but the OS has not enabled it.
+	 * This happens to be exactly how a system would look
+	 * if we disabled the config option.
+	 */
+	pr_info("x86: 'nopku' specified, disabling Memory Protection Keys\n");
+	pku_disabled = true;
+	return 1;
+}
+__setup("nopku", setup_disable_pku);
+#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
+
+/*
  * Some CPU features depend on higher CPUID levels, which may not always
  * be available due to CPUID level capping or broken virtualization
  * software.  Add those features to this table to auto-disable them.
@@ -944,6 +984,7 @@ static void identify_cpu(struct cpuinfo_
 	init_hypervisor(c);
 	x86_init_rdrand(c);
 	x86_init_cache_qos(c);
+	setup_pku(c);
 
 	/*
 	 * Clear/Set all flags overriden by options, need do it
diff -puN Documentation/kernel-parameters.txt~pkeys-50-should-be-last-patch Documentation/kernel-parameters.txt
--- a/Documentation/kernel-parameters.txt~pkeys-50-should-be-last-patch	2016-02-12 10:44:25.547672471 -0800
+++ b/Documentation/kernel-parameters.txt	2016-02-12 10:44:25.552672699 -0800
@@ -976,6 +976,9 @@ bytes respectively. Such letter suffixes
 			See Documentation/x86/intel_mpx.txt for more
 			information about the feature.
 
+	nopku		[X86] Disable Memory Protection Keys CPU feature found
+			in some Intel CPUs.
+
 	eagerfpu=	[X86]
 			on	enable eager fpu restore
 			off	disable eager fpu restore
_
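
To make the effect visible: booting with 'nopku' on PKU-capable
hardware leaves "pku" in the /proc/cpuinfo flags but never shows
"ospke" (illustrative excerpt, other flags elided):

	flags		: ... pku ...

A normal boot on the same hardware would show "... pku ospke ..."
instead.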

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 27/33] mm, multi-arch: pass a protection key in to calc_vm_flag_bits()
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (25 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 26/33] x86, pkeys: actually enable Memory Protection Keys in CPU Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:25   ` [tip:mm/pkeys] mm/core, arch, powerpc: Pass " tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 28/33] x86, pkeys: add arch_validate_pkey() Dave Hansen
                   ` (6 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen, linux-api, linux-arch


From: Dave Hansen <dave.hansen@linux.intel.com>

This plumbs a protection key through calc_vm_prot_bits().  We
could have done this in calc_vm_flag_bits() instead; the choice
between the two was fairly arbitrary.
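
For now every caller passes a zero (default) key; a later
pkey-aware call site would look like this sketch (the 'pkey'
variable is hypothetical at this point in the series):

	/* this patch: all existing callers */
	vm_flags = calc_vm_prot_bits(prot, 0);

	/* a later pkey-aware caller (sketch) */
	vm_flags = calc_vm_prot_bits(prot, pkey);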

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
---

 b/arch/powerpc/include/asm/mman.h  |    5 +++--
 b/drivers/char/agp/frontend.c      |    2 +-
 b/drivers/staging/android/ashmem.c |    4 ++--
 b/include/linux/mman.h             |    6 +++---
 b/mm/mmap.c                        |    2 +-
 b/mm/mprotect.c                    |    2 +-
 b/mm/nommu.c                       |    2 +-
 7 files changed, 12 insertions(+), 11 deletions(-)

diff -puN arch/powerpc/include/asm/mman.h~pkeys-70-calc_vm_prot_bits arch/powerpc/include/asm/mman.h
--- a/arch/powerpc/include/asm/mman.h~pkeys-70-calc_vm_prot_bits	2016-02-12 10:44:25.996692997 -0800
+++ b/arch/powerpc/include/asm/mman.h	2016-02-12 10:44:26.009693591 -0800
@@ -18,11 +18,12 @@
 * This file is included by linux/mman.h, so we can't use calc_vm_prot_bits()
  * here.  How important is the optimization?
  */
-static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot)
+static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
+		unsigned long pkey)
 {
 	return (prot & PROT_SAO) ? VM_SAO : 0;
 }
-#define arch_calc_vm_prot_bits(prot) arch_calc_vm_prot_bits(prot)
+#define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
 
 static inline pgprot_t arch_vm_get_page_prot(unsigned long vm_flags)
 {
diff -puN drivers/char/agp/frontend.c~pkeys-70-calc_vm_prot_bits drivers/char/agp/frontend.c
--- a/drivers/char/agp/frontend.c~pkeys-70-calc_vm_prot_bits	2016-02-12 10:44:25.998693088 -0800
+++ b/drivers/char/agp/frontend.c	2016-02-12 10:44:26.009693591 -0800
@@ -156,7 +156,7 @@ static pgprot_t agp_convert_mmap_flags(i
 {
 	unsigned long prot_bits;
 
-	prot_bits = calc_vm_prot_bits(prot) | VM_SHARED;
+	prot_bits = calc_vm_prot_bits(prot, 0) | VM_SHARED;
 	return vm_get_page_prot(prot_bits);
 }
 
diff -puN drivers/staging/android/ashmem.c~pkeys-70-calc_vm_prot_bits drivers/staging/android/ashmem.c
--- a/drivers/staging/android/ashmem.c~pkeys-70-calc_vm_prot_bits	2016-02-12 10:44:25.999693134 -0800
+++ b/drivers/staging/android/ashmem.c	2016-02-12 10:44:26.010693636 -0800
@@ -372,8 +372,8 @@ static int ashmem_mmap(struct file *file
 	}
 
 	/* requested protection bits must match our allowed protection mask */
-	if (unlikely((vma->vm_flags & ~calc_vm_prot_bits(asma->prot_mask)) &
-		     calc_vm_prot_bits(PROT_MASK))) {
+	if (unlikely((vma->vm_flags & ~calc_vm_prot_bits(asma->prot_mask, 0)) &
+		     calc_vm_prot_bits(PROT_MASK, 0))) {
 		ret = -EPERM;
 		goto out;
 	}
diff -puN include/linux/mman.h~pkeys-70-calc_vm_prot_bits include/linux/mman.h
--- a/include/linux/mman.h~pkeys-70-calc_vm_prot_bits	2016-02-12 10:44:26.001693225 -0800
+++ b/include/linux/mman.h	2016-02-12 10:44:26.010693636 -0800
@@ -35,7 +35,7 @@ static inline void vm_unacct_memory(long
  */
 
 #ifndef arch_calc_vm_prot_bits
-#define arch_calc_vm_prot_bits(prot) 0
+#define arch_calc_vm_prot_bits(prot, pkey) 0
 #endif
 
 #ifndef arch_vm_get_page_prot
@@ -70,12 +70,12 @@ static inline int arch_validate_prot(uns
  * Combine the mmap "prot" argument into "vm_flags" used internally.
  */
 static inline unsigned long
-calc_vm_prot_bits(unsigned long prot)
+calc_vm_prot_bits(unsigned long prot, unsigned long pkey)
 {
 	return _calc_vm_trans(prot, PROT_READ,  VM_READ ) |
 	       _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) |
 	       _calc_vm_trans(prot, PROT_EXEC,  VM_EXEC) |
-	       arch_calc_vm_prot_bits(prot);
+	       arch_calc_vm_prot_bits(prot, pkey);
 }
 
 /*
diff -puN mm/mmap.c~pkeys-70-calc_vm_prot_bits mm/mmap.c
--- a/mm/mmap.c~pkeys-70-calc_vm_prot_bits	2016-02-12 10:44:26.002693271 -0800
+++ b/mm/mmap.c	2016-02-12 10:44:26.011693682 -0800
@@ -1313,7 +1313,7 @@ unsigned long do_mmap(struct file *file,
 	 * to. we assume access permissions have been handled by the open
 	 * of the memory object, so we don't do any here.
 	 */
-	vm_flags |= calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
+	vm_flags |= calc_vm_prot_bits(prot, 0) | calc_vm_flag_bits(flags) |
 			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
 
 	if (flags & MAP_LOCKED)
diff -puN mm/mprotect.c~pkeys-70-calc_vm_prot_bits mm/mprotect.c
--- a/mm/mprotect.c~pkeys-70-calc_vm_prot_bits	2016-02-12 10:44:26.004693362 -0800
+++ b/mm/mprotect.c	2016-02-12 10:44:26.012693728 -0800
@@ -378,7 +378,7 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
 	if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
 		prot |= PROT_EXEC;
 
-	vm_flags = calc_vm_prot_bits(prot);
+	vm_flags = calc_vm_prot_bits(prot, 0);
 
 	down_write(&current->mm->mmap_sem);
 
diff -puN mm/nommu.c~pkeys-70-calc_vm_prot_bits mm/nommu.c
--- a/mm/nommu.c~pkeys-70-calc_vm_prot_bits	2016-02-12 10:44:26.006693454 -0800
+++ b/mm/nommu.c	2016-02-12 10:44:26.012693728 -0800
@@ -1082,7 +1082,7 @@ static unsigned long determine_vm_flags(
 {
 	unsigned long vm_flags;
 
-	vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags);
+	vm_flags = calc_vm_prot_bits(prot, 0) | calc_vm_flag_bits(flags);
 	/* vm_flags |= mm->def_flags; */
 
 	if (!(capabilities & NOMMU_MAP_DIRECT)) {
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 28/33] x86, pkeys: add arch_validate_pkey()
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (26 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 27/33] mm, multi-arch: pass a protection key in to calc_vm_flag_bits() Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:25   ` [tip:mm/pkeys] mm/core, x86/mm/pkeys: Add arch_validate_pkey() tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 29/33] x86: separate out LDT init from context init Dave Hansen
                   ` (5 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The syscall-level code is passed a protection key and needs to
return an appropriate error code if the protection key is bogus.
We will be using this in subsequent patches.

Note that this also begins a series of arch-specific calls that
we need to expose in otherwise arch-independent code.  We create
a linux/pkeys.h header where we will put *all* the stubs for
these functions.
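
Concretely, with OSPKE enabled on x86 (arch_max_pkey() == 16),
the new helper behaves like this sketch:

	validate_pkey(-1);	/* false: negative keys are rejected */
	validate_pkey(0);	/* true:  pkey 0 always exists       */
	validate_pkey(15);	/* true:  15 < arch_max_pkey()       */
	validate_pkey(16);	/* false: out of range               */

Without ARCH_HAS_PKEYS (or without OSPKE), arch_max_pkey() is 1
and only pkey 0 validates.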

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/Kconfig             |    1 +
 b/arch/x86/include/asm/pkeys.h |    6 ++++++
 b/include/linux/pkeys.h        |   25 +++++++++++++++++++++++++
 b/mm/Kconfig                   |    2 ++
 4 files changed, 34 insertions(+)

diff -puN /dev/null arch/x86/include/asm/pkeys.h
--- /dev/null	2015-12-10 15:28:13.322405854 -0800
+++ b/arch/x86/include/asm/pkeys.h	2016-02-12 10:44:26.560718780 -0800
@@ -0,0 +1,6 @@
+#ifndef _ASM_X86_PKEYS_H
+#define _ASM_X86_PKEYS_H
+
+#define arch_max_pkey() (boot_cpu_has(X86_FEATURE_OSPKE) ? 16 : 1)
+
+#endif /*_ASM_X86_PKEYS_H */
diff -puN arch/x86/Kconfig~pkeys-71-arch_validate_pkey arch/x86/Kconfig
--- a/arch/x86/Kconfig~pkeys-71-arch_validate_pkey	2016-02-12 10:44:26.555718551 -0800
+++ b/arch/x86/Kconfig	2016-02-12 10:44:26.561718825 -0800
@@ -156,6 +156,7 @@ config X86
 	select X86_DEV_DMA_OPS			if X86_64
 	select X86_FEATURE_NAMES		if PROC_FS
 	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
+	select ARCH_HAS_PKEYS			if X86_INTEL_MEMORY_PROTECTION_KEYS
 
 config INSTRUCTION_DECODER
 	def_bool y
diff -puN /dev/null include/linux/pkeys.h
--- /dev/null	2015-12-10 15:28:13.322405854 -0800
+++ b/include/linux/pkeys.h	2016-02-12 10:44:26.561718825 -0800
@@ -0,0 +1,25 @@
+#ifndef _LINUX_PKEYS_H
+#define _LINUX_PKEYS_H
+
+#include <linux/mm_types.h>
+#include <asm/mmu_context.h>
+
+#ifdef CONFIG_ARCH_HAS_PKEYS
+#include <asm/pkeys.h>
+#else /* ! CONFIG_ARCH_HAS_PKEYS */
+#define arch_max_pkey() (1)
+#endif /* ! CONFIG_ARCH_HAS_PKEYS */
+
+/*
+ * This is called from mprotect_pkey().
+ *
+ * Returns true if the protection key is valid.
+ */
+static inline bool validate_pkey(int pkey)
+{
+	if (pkey < 0)
+		return false;
+	return (pkey < arch_max_pkey());
+}
+
+#endif /* _LINUX_PKEYS_H */
diff -puN mm/Kconfig~pkeys-71-arch_validate_pkey mm/Kconfig
--- a/mm/Kconfig~pkeys-71-arch_validate_pkey	2016-02-12 10:44:26.557718643 -0800
+++ b/mm/Kconfig	2016-02-12 10:44:26.562718871 -0800
@@ -672,3 +672,5 @@ config FRAME_VECTOR
 
 config ARCH_USES_HIGH_VMA_FLAGS
 	bool
+config ARCH_HAS_PKEYS
+	bool
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 29/33] x86: separate out LDT init from context init
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (27 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 28/33] x86, pkeys: add arch_validate_pkey() Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:26   ` [tip:mm/pkeys] x86/mm: Factor " tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 30/33] x86, fpu: allow setting of XSAVE state Dave Hansen
                   ` (4 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The arch-specific mm_context_t is a great place to put
protection-key allocation state.

But, we need to initialize the allocation state because pkey 0 is
always "allocated".  All of the runtime initialization of
mm_context_t is done in *_ldt() manipulation functions.  This
renames the existing LDT functions like this:

	init_new_context() -> init_new_context_ldt()
	destroy_context() -> destroy_context_ldt()

and makes init_new_context() and destroy_context() available for
generic use.
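
To make the motivation concrete, a sketch of where that
initialization would land (the pkey_allocation_map field is
hypothetical; this series does not add it):

	static inline int init_new_context(struct task_struct *tsk,
					   struct mm_struct *mm)
	{
		/* hypothetical follow-on: pkey 0 is always "allocated" */
		/* mm->context.pkey_allocation_map = 0x1; */
		init_new_context_ldt(tsk, mm);
		return 0;
	}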

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/include/asm/mmu_context.h |   21 ++++++++++++++++-----
 b/arch/x86/kernel/ldt.c              |    4 ++--
 2 files changed, 18 insertions(+), 7 deletions(-)

diff -puN arch/x86/include/asm/mmu_context.h~pkeys-72-init-ldt-extricate arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~pkeys-72-init-ldt-extricate	2016-02-12 10:44:27.036740540 -0800
+++ b/arch/x86/include/asm/mmu_context.h	2016-02-12 10:44:27.041740768 -0800
@@ -52,15 +52,15 @@ struct ldt_struct {
 /*
  * Used for LDT copy/destruction.
  */
-int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
-void destroy_context(struct mm_struct *mm);
+int init_new_context_ldt(struct task_struct *tsk, struct mm_struct *mm);
+void destroy_context_ldt(struct mm_struct *mm);
 #else	/* CONFIG_MODIFY_LDT_SYSCALL */
-static inline int init_new_context(struct task_struct *tsk,
-				   struct mm_struct *mm)
+static inline int init_new_context_ldt(struct task_struct *tsk,
+				       struct mm_struct *mm)
 {
 	return 0;
 }
-static inline void destroy_context(struct mm_struct *mm) {}
+static inline void destroy_context_ldt(struct mm_struct *mm) {}
 #endif
 
 static inline void load_mm_ldt(struct mm_struct *mm)
@@ -104,6 +104,17 @@ static inline void enter_lazy_tlb(struct
 #endif
 }
 
+static inline int init_new_context(struct task_struct *tsk,
+				   struct mm_struct *mm)
+{
+	init_new_context_ldt(tsk, mm);
+	return 0;
+}
+static inline void destroy_context(struct mm_struct *mm)
+{
+	destroy_context_ldt(mm);
+}
+
 static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 			     struct task_struct *tsk)
 {
diff -puN arch/x86/kernel/ldt.c~pkeys-72-init-ldt-extricate arch/x86/kernel/ldt.c
--- a/arch/x86/kernel/ldt.c~pkeys-72-init-ldt-extricate	2016-02-12 10:44:27.037740585 -0800
+++ b/arch/x86/kernel/ldt.c	2016-02-12 10:44:27.041740768 -0800
@@ -103,7 +103,7 @@ static void free_ldt_struct(struct ldt_s
  * we do not have to muck with descriptors here, that is
  * done in switch_mm() as needed.
  */
-int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
+int init_new_context_ldt(struct task_struct *tsk, struct mm_struct *mm)
 {
 	struct ldt_struct *new_ldt;
 	struct mm_struct *old_mm;
@@ -144,7 +144,7 @@ out_unlock:
  *
  * 64bit: Don't touch the LDT register - we're already in the next thread.
  */
-void destroy_context(struct mm_struct *mm)
+void destroy_context_ldt(struct mm_struct *mm)
 {
 	free_ldt_struct(mm->context.ldt);
 	mm->context.ldt = NULL;
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 30/33] x86, fpu: allow setting of XSAVE state
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (28 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 29/33] x86: separate out LDT init from context init Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:26   ` [tip:mm/pkeys] x86/fpu: Allow " tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 31/33] x86, pkeys: allow kernel to modify user pkey rights register Dave Hansen
                   ` (3 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

We want to modify the Protection Key rights inside the kernel, so
we need to change PKRU's contents.  But, if we do a plain
'wrpkru', when we return to userspace we might do an XRSTOR and
wipe out the kernel's 'wrpkru'.  So, we need to go after PKRU in
the xsave buffer.

We do this by (condensed into a sketch after this list):
1. Ensuring that we have the XSAVE registers (fpregs) in the
   kernel FPU buffer (fpstate)
2. Looking up the location of a given state in the buffer
3. Filling in the state
4. Ensuring that the hardware knows that state is present there
   (basically that the 'init optimization' is not in place).
5. Copying the newly-modified state back to the registers if
   necessary.
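
Condensed into a sketch ('src' and 'len' stand in for the
caller-provided data, and PKRU serves as the example feature):

	struct xregs_state *xsave = &current->thread.fpu.state.xsave;
	void *dst;

	fpu__current_fpstate_write_begin();			/* step 1 */
	dst = __raw_xsave_addr(xsave, XFEATURE_MASK_PKRU);	/* step 2 */
	memcpy(dst, src, len);					/* step 3 */
	xsave->header.xfeatures |= XFEATURE_MASK_PKRU;		/* step 4 */
	fpu__current_fpstate_write_end();			/* step 5 */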

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/include/asm/fpu/internal.h |    2 
 b/arch/x86/kernel/fpu/core.c          |   63 +++++++++++++++++++++
 b/arch/x86/kernel/fpu/xstate.c        |   98 +++++++++++++++++++++++++++++++++-
 3 files changed, 161 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/fpu/internal.h~pkeys-76-xsave-set arch/x86/include/asm/fpu/internal.h
--- a/arch/x86/include/asm/fpu/internal.h~pkeys-76-xsave-set	2016-02-12 10:44:27.469760334 -0800
+++ b/arch/x86/include/asm/fpu/internal.h	2016-02-12 10:44:27.475760609 -0800
@@ -24,6 +24,8 @@
 extern void fpu__activate_curr(struct fpu *fpu);
 extern void fpu__activate_fpstate_read(struct fpu *fpu);
 extern void fpu__activate_fpstate_write(struct fpu *fpu);
+extern void fpu__current_fpstate_write_begin(void);
+extern void fpu__current_fpstate_write_end(void);
 extern void fpu__save(struct fpu *fpu);
 extern void fpu__restore(struct fpu *fpu);
 extern int  fpu__restore_sig(void __user *buf, int ia32_frame);
diff -puN arch/x86/kernel/fpu/core.c~pkeys-76-xsave-set arch/x86/kernel/fpu/core.c
--- a/arch/x86/kernel/fpu/core.c~pkeys-76-xsave-set	2016-02-12 10:44:27.470760380 -0800
+++ b/arch/x86/kernel/fpu/core.c	2016-02-12 10:44:27.476760654 -0800
@@ -352,6 +352,69 @@ void fpu__activate_fpstate_write(struct
 }
 
 /*
+ * This function must be called before we write the current
+ * task's fpstate.
+ *
+ * This call gets the current FPU register state and moves
+ * it in to the 'fpstate'.  Preemption is disabled so that
+ * no writes to the 'fpstate' can occur from context
+ * switches.
+ *
+ * Must be followed by a fpu__current_fpstate_write_end().
+ */
+void fpu__current_fpstate_write_begin(void)
+{
+	struct fpu *fpu = &current->thread.fpu;
+
+	/*
+	 * Ensure that the context-switching code does not write
+	 * over the fpstate while we are doing our update.
+	 */
+	preempt_disable();
+
+	/*
+	 * Move the fpregs in to the fpu's 'fpstate'.
+	 */
+	fpu__activate_fpstate_read(fpu);
+
+	/*
+	 * The caller is about to write to 'fpu'.  Ensure that no
+	 * CPU thinks that its fpregs match the fpstate.  This
+	 * ensures we will not be lazy and skip a XRSTOR in the
+	 * future.
+	 */
+	fpu->last_cpu = -1;
+}
+
+/*
+ * This function must be paired with fpu__current_fpstate_write_begin()
+ *
+ * This will ensure that the modified fpstate gets placed back in
+ * the fpregs if necessary.
+ *
+ * Note: This function may be called whether or not an _actual_
+ * write to the fpstate occurred.
+ */
+void fpu__current_fpstate_write_end(void)
+{
+	struct fpu *fpu = &current->thread.fpu;
+
+	/*
+	 * 'fpu' now has an updated copy of the state, but the
+	 * registers may still be out of date.  Update them with
+	 * an XRSTOR if they are active.
+	 */
+	if (fpregs_active())
+		copy_kernel_to_fpregs(&fpu->state);
+
+	/*
+	 * Our update is done and the fpregs/fpstate are in sync
+	 * if necessary.  Context switches can happen again.
+	 */
+	preempt_enable();
+}
+
+/*
  * 'fpu__restore()' is called to copy FPU registers from
  * the FPU fpstate to the live hw registers and to activate
  * access to the hardware registers, so that FPU instructions
diff -puN arch/x86/kernel/fpu/xstate.c~pkeys-76-xsave-set arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~pkeys-76-xsave-set	2016-02-12 10:44:27.472760471 -0800
+++ b/arch/x86/kernel/fpu/xstate.c	2016-02-12 10:44:27.476760654 -0800
@@ -679,6 +679,19 @@ void fpu__resume_cpu(void)
 }
 
 /*
+ * Given an xstate feature mask, calculate where in the xsave
+ * buffer the state is.  Callers should ensure that the buffer
+ * is valid.
+ *
+ * Note: does not work for compacted buffers.
+ */
+void *__raw_xsave_addr(struct xregs_state *xsave, int xstate_feature_mask)
+{
+	int feature_nr = fls64(xstate_feature_mask) - 1;
+
+	return (void *)xsave + xstate_comp_offsets[feature_nr];
+}
+/*
  * Given the xsave area and a state inside, this function returns the
  * address of the state.
  *
@@ -698,7 +711,6 @@ void fpu__resume_cpu(void)
  */
 void *get_xsave_addr(struct xregs_state *xsave, int xstate_feature)
 {
-	int feature_nr = fls64(xstate_feature) - 1;
 	/*
 	 * Do we even *have* xsave state?
 	 */
@@ -726,7 +738,7 @@ void *get_xsave_addr(struct xregs_state
 	if (!(xsave->header.xfeatures & xstate_feature))
 		return NULL;
 
-	return (void *)xsave + xstate_comp_offsets[feature_nr];
+	return __raw_xsave_addr(xsave, xstate_feature);
 }
 EXPORT_SYMBOL_GPL(get_xsave_addr);
 
@@ -761,3 +773,85 @@ const void *get_xsave_field_ptr(int xsav
 
 	return get_xsave_addr(&fpu->state.xsave, xsave_state);
 }
+
+
+/*
+ * Set xfeatures (aka XSTATE_BV) bit for a feature that we want
+ * to take out of its "init state".  This will ensure that an
+ * XRSTOR actually restores the state.
+ */
+static void fpu__xfeature_set_non_init(struct xregs_state *xsave,
+		int xstate_feature_mask)
+{
+	xsave->header.xfeatures |= xstate_feature_mask;
+}
+
+/*
+ * This function is safe to call whether the FPU is in use or not.
+ *
+ * Note that this only works on the current task.
+ *
+ * Inputs:
+ *	@xstate_feature_mask: which state to write, as defined in
+ *	xsave.h (e.g. XFEATURE_MASK_FP, XFEATURE_MASK_SSE, etc...)
+ *	@xstate_feature_src: a pointer to a copy of the state that
+ *	you would like written in to the current task's FPU xsave
+ *	state.  This pointer must not be located in the current
+ *	task's xsave area itself.
+ *	@len: the size, in bytes, of the state pointed to by
+ *	@xstate_feature_src.
+ */
+static void fpu__xfeature_set_state(int xstate_feature_mask,
+		void *xstate_feature_src, size_t len)
+{
+	struct xregs_state *xsave = &current->thread.fpu.state.xsave;
+	struct fpu *fpu = &current->thread.fpu;
+	void *dst;
+
+	if (!boot_cpu_has(X86_FEATURE_XSAVE)) {
+		WARN_ONCE(1, "%s() attempted with no xsave support", __func__);
+		return;
+	}
+
+	/*
+	 * Tell the FPU code that we need the FPU state to be in
+	 * 'fpu' (not in the registers), and that we need it to
+	 * be stable while we write to it.
+	 */
+	fpu__current_fpstate_write_begin();
+
+	/*
+	 * This method *WILL* *NOT* work for compact-format
+	 * buffers.  If the 'xstate_feature_mask' is unset in
+	 * xcomp_bv then we may need to move other feature state
+	 * "up" in the buffer.
+	 */
+	if (xsave->header.xcomp_bv & xstate_feature_mask) {
+		WARN_ON_ONCE(1);
+		goto out;
+	}
+
+	/* find the location in the xsave buffer of the desired state */
+	dst = __raw_xsave_addr(&fpu->state.xsave, xstate_feature_mask);
+
+	/*
+	 * Make sure that the pointer being passed in did not
+	 * come from the xsave buffer itself.
+	 */
+	WARN_ONCE(xstate_feature_src == dst, "set from xsave buffer itself");
+
+	/* put the caller-provided data in the location */
+	memcpy(dst, xstate_feature_src, len);
+
+	/*
+	 * Mark the xfeature so that the CPU knows there is state
+	 * in the buffer now.
+	 */
+	fpu__xfeature_set_non_init(xsave, xstate_feature_mask);
+out:
+	/*
+	 * We are done writing to the 'fpu'.  Re-enable preemption
+	 * and (possibly) move the fpstate back in to the fpregs.
+	 */
+	fpu__current_fpstate_write_end();
+}
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 31/33] x86, pkeys: allow kernel to modify user pkey rights register
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (29 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 30/33] x86, fpu: allow setting of XSAVE state Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:27   ` [tip:mm/pkeys] x86/mm/pkeys: Allow " tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 32/33] x86, pkeys: create an x86 arch_calc_vm_prot_bits() for VMA flags Dave Hansen
                   ` (2 subsequent siblings)
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

The Protection Key Rights for User memory (PKRU) is a 32-bit
user-accessible register.  It contains two bits for each
protection key: one to write-disable (WD) access to memory
covered by the key and another to access-disable (AD).

Userspace can read/write the register with the RDPKRU and WRPKRU
instructions.  But, the register is saved and restored with the
XSAVE family of instructions, which means we have to treat it
like a floating point register.

The kernel needs to write to the register if it wants to
implement execute-only memory or if it implements a system call
to change PKRU.

To do this, we need to create a 'pkru_state' buffer, read the old
contents in to it, modify it, and then tell the FPU code that
there is modified data in there so it can (possibly) move the
buffer back in to the registers.

This uses the fpu__xfeature_set_state() function that we defined
in the previous patch.
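
To make the layout concrete (pkey 1 chosen arbitrarily):

	pkey_shift = 1 * PKRU_BITS_PER_PKEY;		/* == 2 */
	PKRU_AD_BIT << pkey_shift;	/* 0x4: access-disable pkey 1 */
	PKRU_WD_BIT << pkey_shift;	/* 0x8: write-disable  pkey 1 */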

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/include/asm/pgtable.h |    5 +-
 b/arch/x86/include/asm/pkeys.h   |    3 +
 b/arch/x86/kernel/fpu/xstate.c   |   74 +++++++++++++++++++++++++++++++++++++++
 b/include/linux/pkeys.h          |    5 ++
 4 files changed, 85 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/pgtable.h~pkeys-77-arch_set_user_pkey_access arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h~pkeys-77-arch_set_user_pkey_access	2016-02-12 10:44:27.929781363 -0800
+++ b/arch/x86/include/asm/pgtable.h	2016-02-12 10:44:27.938781774 -0800
@@ -921,16 +921,17 @@ static inline pte_t pte_swp_clear_soft_d
 
 #define PKRU_AD_BIT 0x1
 #define PKRU_WD_BIT 0x2
+#define PKRU_BITS_PER_PKEY 2
 
 static inline bool __pkru_allows_read(u32 pkru, u16 pkey)
 {
-	int pkru_pkey_bits = pkey * 2;
+	int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
 	return !(pkru & (PKRU_AD_BIT << pkru_pkey_bits));
 }
 
 static inline bool __pkru_allows_write(u32 pkru, u16 pkey)
 {
-	int pkru_pkey_bits = pkey * 2;
+	int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
 	/*
 	 * Access-disable disables writes too so we need to check
 	 * both bits here.
diff -puN arch/x86/include/asm/pkeys.h~pkeys-77-arch_set_user_pkey_access arch/x86/include/asm/pkeys.h
--- a/arch/x86/include/asm/pkeys.h~pkeys-77-arch_set_user_pkey_access	2016-02-12 10:44:27.931781454 -0800
+++ b/arch/x86/include/asm/pkeys.h	2016-02-12 10:44:27.938781774 -0800
@@ -3,4 +3,7 @@
 
 #define arch_max_pkey() (boot_cpu_has(X86_FEATURE_OSPKE) ? 16 : 1)
 
+extern int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+		unsigned long init_val);
+
 #endif /*_ASM_X86_PKEYS_H */
diff -puN arch/x86/kernel/fpu/xstate.c~pkeys-77-arch_set_user_pkey_access arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~pkeys-77-arch_set_user_pkey_access	2016-02-12 10:44:27.933781546 -0800
+++ b/arch/x86/kernel/fpu/xstate.c	2016-02-12 10:44:27.938781774 -0800
@@ -5,6 +5,7 @@
  */
 #include <linux/compat.h>
 #include <linux/cpu.h>
+#include <linux/pkeys.h>
 
 #include <asm/fpu/api.h>
 #include <asm/fpu/internal.h>
@@ -855,3 +856,76 @@ out:
 	 */
 	fpu__current_fpstate_write_end();
 }
+
+#define NR_VALID_PKRU_BITS (CONFIG_NR_PROTECTION_KEYS * 2)
+#define PKRU_VALID_MASK (NR_VALID_PKRU_BITS - 1)
+
+/*
+ * This will go out and modify the XSAVE buffer so that PKRU is
+ * set to a particular state for access to 'pkey'.
+ *
+ * PKRU state does affect kernel access to user memory.  We do
+ * not modify PKRU *itself* here, only the XSAVE state that will
+ * be restored in to PKRU when we return back to userspace.
+ */
+int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+		unsigned long init_val)
+{
+	struct xregs_state *xsave = &tsk->thread.fpu.state.xsave;
+	struct pkru_state *old_pkru_state;
+	struct pkru_state new_pkru_state;
+	int pkey_shift = (pkey * PKRU_BITS_PER_PKEY);
+	u32 new_pkru_bits = 0;
+
+	if (!validate_pkey(pkey))
+		return -EINVAL;
+	/*
+	 * This check implies XSAVE support.  OSPKE only gets
+	 * set if we enable XSAVE and we enable PKU in XCR0.
+	 */
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return -EINVAL;
+
+	/* Set the bits we need in PKRU  */
+	if (init_val & PKEY_DISABLE_ACCESS)
+		new_pkru_bits |= PKRU_AD_BIT;
+	if (init_val & PKEY_DISABLE_WRITE)
+		new_pkru_bits |= PKRU_WD_BIT;
+
+	/* Shift the bits in to the correct place in PKRU for pkey. */
+	new_pkru_bits <<= pkey_shift;
+
+	/* Locate old copy of the state in the xsave buffer */
+	old_pkru_state = get_xsave_addr(xsave, XFEATURE_MASK_PKRU);
+
+	/*
+	 * When state is not in the buffer, it is in the init
+	 * state, set it manually.  Otherwise, copy out the old
+	 * state.
+	 */
+	if (!old_pkru_state)
+		new_pkru_state.pkru = 0;
+	else
+		new_pkru_state.pkru = old_pkru_state->pkru;
+
+	/* mask off any old bits in place */
+	new_pkru_state.pkru &= ~((PKRU_AD_BIT|PKRU_WD_BIT) << pkey_shift);
+	/* Set the newly-requested bits */
+	new_pkru_state.pkru |= new_pkru_bits;
+
+	/*
+	 * We could theoretically live without zeroing pkru.pad.
+	 * The current XSAVE feature state definition says that
+	 * only bytes 0->3 are used.  But we do not want to
+	 * chance leaking kernel stack out to userspace in case a
+	 * memcpy() of the whole xsave buffer was done.
+	 *
+	 * They're in the same cacheline anyway.
+	 */
+	new_pkru_state.pad = 0;
+
+	fpu__xfeature_set_state(XFEATURE_MASK_PKRU, &new_pkru_state,
+			sizeof(new_pkru_state));
+
+	return 0;
+}
diff -puN include/linux/pkeys.h~pkeys-77-arch_set_user_pkey_access include/linux/pkeys.h
--- a/include/linux/pkeys.h~pkeys-77-arch_set_user_pkey_access	2016-02-12 10:44:27.934781591 -0800
+++ b/include/linux/pkeys.h	2016-02-12 10:44:27.939781820 -0800
@@ -4,6 +4,11 @@
 #include <linux/mm_types.h>
 #include <asm/mmu_context.h>
 
+#define PKEY_DISABLE_ACCESS	0x1
+#define PKEY_DISABLE_WRITE	0x2
+#define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
+				 PKEY_DISABLE_WRITE)
+
 #ifdef CONFIG_ARCH_HAS_PKEYS
 #include <asm/pkeys.h>
 #else /* ! CONFIG_ARCH_HAS_PKEYS */
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 32/33] x86, pkeys: create an x86 arch_calc_vm_prot_bits() for VMA flags
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (30 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 31/33] x86, pkeys: allow kernel to modify user pkey rights register Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-18 20:27   ` [tip:mm/pkeys] x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits () " tip-bot for Dave Hansen
  2016-02-12 21:02 ` [PATCH 33/33] x86, pkeys: execute-only support Dave Hansen
  2016-02-16  9:29 ` [PATCH 00/33] x86: Memory Protection Keys (v10) Ingo Molnar
  33 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

calc_vm_prot_bits() takes PROT_{READ,WRITE,EXECUTE} bits and
turns them in to the vma->vm_flags/VM_* bits.  We need to do a
similar thing for protection keys.

We take a protection key (4 bits) and encode it in to the 4
VM_PKEY_* bits.

Note: this code is not new.  It was simply a part of the
mprotect_pkey() patch in the past.  I broke it out for use
in the execute-only support.
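
A worked example of the encoding (key value 11 chosen
arbitrarily):

	/* key 11 == 0b1011: bits 0, 1 and 3 are set, so: */
	arch_calc_vm_prot_bits(prot, 11)
		== (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT3)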

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---

 b/arch/x86/include/uapi/asm/mman.h |    6 ++++++
 1 file changed, 6 insertions(+)

diff -puN arch/x86/include/uapi/asm/mman.h~pkeys-78-arch_calc_vm_prot_bits arch/x86/include/uapi/asm/mman.h
--- a/arch/x86/include/uapi/asm/mman.h~pkeys-78-arch_calc_vm_prot_bits	2016-02-12 10:44:28.418803718 -0800
+++ b/arch/x86/include/uapi/asm/mman.h	2016-02-12 10:44:28.421803855 -0800
@@ -20,6 +20,12 @@
 		((vm_flags) & VM_PKEY_BIT1 ? _PAGE_PKEY_BIT1 : 0) |	\
 		((vm_flags) & VM_PKEY_BIT2 ? _PAGE_PKEY_BIT2 : 0) |	\
 		((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))
+
+#define arch_calc_vm_prot_bits(prot, key) (		\
+		((key) & 0x1 ? VM_PKEY_BIT0 : 0) |      \
+		((key) & 0x2 ? VM_PKEY_BIT1 : 0) |      \
+		((key) & 0x4 ? VM_PKEY_BIT2 : 0) |      \
+		((key) & 0x8 ? VM_PKEY_BIT3 : 0))
 #endif
 
 #include <asm-generic/mman.h>
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 33/33] x86, pkeys: execute-only support
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (31 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 32/33] x86, pkeys: create an x86 arch_calc_vm_prot_bits() for VMA flags Dave Hansen
@ 2016-02-12 21:02 ` Dave Hansen
  2016-02-17 21:27   ` Kees Cook
  2016-02-18 20:27   ` [tip:mm/pkeys] mm/core, x86/mm/pkeys: Add execute-only protection keys support tip-bot for Dave Hansen
  2016-02-16  9:29 ` [PATCH 00/33] x86: Memory Protection Keys (v10) Ingo Molnar
  33 siblings, 2 replies; 84+ messages in thread
From: Dave Hansen @ 2016-02-12 21:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, x86, torvalds, Dave Hansen, dave.hansen, akpm, keescook, luto


From: Dave Hansen <dave.hansen@linux.intel.com>

Protection keys provide new page-based protection in hardware.
But, they have an interesting attribute: they only affect data
accesses and never affect instruction fetches.  That means that
if we set up some memory which is set as "access-disabled" via
protection keys, we can still execute from it.

This patch uses protection keys to set up mappings to do just that.
If a user calls:

	mmap(..., PROT_EXEC);
or
	mprotect(ptr, sz, PROT_EXEC);

(note PROT_EXEC-only without PROT_READ/WRITE), the kernel will
notice this, and set a special protection key on the memory.  It
also sets the appropriate bits in the Protection Keys User Rights
(PKRU) register so that the memory becomes unreadable and
unwritable.

I haven't found any userspace that does this today.  With this
facility in place, we expect userspace to start using it
eventually.  Userspace _could_ start doing this today.  Any
PROT_EXEC calls get converted to PROT_READ inside the kernel, and
would transparently be upgraded to "true" PROT_EXEC with this
code.  IOW, userspace never has to do any PROT_EXEC runtime
detection.

This feature provides enhanced protection against leaking
executable memory contents.  This helps thwart attacks which are
attempting to find ROP gadgets on the fly.
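
A userspace sketch of the effect (error handling omitted; the
fault assumes OSPKE hardware):

	char *p = mmap(NULL, 4096, PROT_EXEC,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	/* the kernel silently assigns the execute-only pkey to 'p' */
	char c = *p;	/* data read: SIGSEGV, with PF_PK set */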

But, the security provided by this approach is not comprehensive.
The PKRU register which controls access permissions is a normal
user register writable from unprivileged userspace.  An attacker
who can execute the 'wrpkru' instruction can easily disable the
protection provided by this feature.

The protection key that is used for execute-only support is
permanently dedicated at compile time.  This is fine for now
because there is currently no API to set a protection key other
than this one.

Despite there being a constant PKRU value across the entire
system, we do not set it unless this feature is in use in a
process.  That is to preserve the PKRU XSAVE 'init state',
which can lead to faster context switches.

PKRU *is* a user register and the kernel is modifying it.  That
means that code doing:

	pkru = rdpkru();
	pkru |= 0x100;
	mmap(..., PROT_EXEC);
	wrpkru(pkru);

could lose the bits in PKRU that enforce execute-only
permissions.  To avoid this, we suggest avoiding ever calling
mmap() or mprotect() when the PKRU value is expected to be
unstable.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: x86@kernel.org
Cc: torvalds@linux-foundation.org
Cc: akpm@linux-foundation.org
Cc: linux-mm@kvack.org
Cc: keescook@google.com
Cc: luto@amacapital.net
---

 b/arch/x86/include/asm/pkeys.h |   25 ++++++++++
 b/arch/x86/kernel/fpu/xstate.c |    2 
 b/arch/x86/mm/Makefile         |    2 
 b/arch/x86/mm/fault.c          |   10 ++++
 b/arch/x86/mm/pkeys.c          |  101 +++++++++++++++++++++++++++++++++++++++++
 b/include/linux/pkeys.h        |    3 +
 b/mm/mmap.c                    |   10 +++-
 b/mm/mprotect.c                |    8 +--
 8 files changed, 154 insertions(+), 7 deletions(-)

diff -puN arch/x86/include/asm/pkeys.h~pkeys-79-xonly arch/x86/include/asm/pkeys.h
--- a/arch/x86/include/asm/pkeys.h~pkeys-79-xonly	2016-02-12 10:45:12.465817219 -0800
+++ b/arch/x86/include/asm/pkeys.h	2016-02-12 10:45:12.478817813 -0800
@@ -6,4 +6,29 @@
 extern int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 		unsigned long init_val);
 
+/*
+ * Try to dedicate one of the protection keys to be used as an
+ * execute-only protection key.
+ */
+#define PKEY_DEDICATED_EXECUTE_ONLY 15
+extern int __execute_only_pkey(struct mm_struct *mm);
+static inline int execute_only_pkey(struct mm_struct *mm)
+{
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return 0;
+
+	return __execute_only_pkey(mm);
+}
+
+extern int __arch_override_mprotect_pkey(struct vm_area_struct *vma,
+		int prot, int pkey);
+static inline int arch_override_mprotect_pkey(struct vm_area_struct *vma,
+		int prot, int pkey)
+{
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return 0;
+
+	return __arch_override_mprotect_pkey(vma, prot, pkey);
+}
+
 #endif /*_ASM_X86_PKEYS_H */
diff -puN arch/x86/kernel/fpu/xstate.c~pkeys-79-xonly arch/x86/kernel/fpu/xstate.c
--- a/arch/x86/kernel/fpu/xstate.c~pkeys-79-xonly	2016-02-12 10:45:12.466817265 -0800
+++ b/arch/x86/kernel/fpu/xstate.c	2016-02-12 10:45:12.478817813 -0800
@@ -877,8 +877,6 @@ int arch_set_user_pkey_access(struct tas
 	int pkey_shift = (pkey * PKRU_BITS_PER_PKEY);
 	u32 new_pkru_bits = 0;
 
-	if (!validate_pkey(pkey))
-		return -EINVAL;
 	/*
 	 * This check implies XSAVE support.  OSPKE only gets
 	 * set if we enable XSAVE and we enable PKU in XCR0.
diff -puN arch/x86/mm/fault.c~pkeys-79-xonly arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-79-xonly	2016-02-12 10:45:12.468817356 -0800
+++ b/arch/x86/mm/fault.c	2016-02-12 10:45:12.479817859 -0800
@@ -1108,6 +1108,16 @@ access_error(unsigned long error_code, s
 	 */
 	if (error_code & PF_PK)
 		return 1;
+
+	if (!(error_code & PF_INSTR)) {
+		/*
+		 * Assume all accesses require either read or execute
+		 * permissions.  This is not an instruction access, so
+		 * it requires read permissions.
+		 */
+		if (!(vma->vm_flags & VM_READ))
+			return 1;
+	}
 	/*
 	 * Make sure to check the VMA so that we do not perform
 	 * faults just to hit a PF_PK as soon as we fill in a
diff -puN arch/x86/mm/Makefile~pkeys-79-xonly arch/x86/mm/Makefile
--- a/arch/x86/mm/Makefile~pkeys-79-xonly	2016-02-12 10:45:12.469817402 -0800
+++ b/arch/x86/mm/Makefile	2016-02-12 10:45:12.479817859 -0800
@@ -34,3 +34,5 @@ obj-$(CONFIG_ACPI_NUMA)		+= srat.o
 obj-$(CONFIG_NUMA_EMU)		+= numa_emulation.o
 
 obj-$(CONFIG_X86_INTEL_MPX)	+= mpx.o
+obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
+
diff -puN /dev/null arch/x86/mm/pkeys.c
--- /dev/null	2015-12-10 15:28:13.322405854 -0800
+++ b/arch/x86/mm/pkeys.c	2016-02-12 10:45:12.479817859 -0800
@@ -0,0 +1,101 @@
+/*
+ * Intel Memory Protection Keys management
+ * Copyright (c) 2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#include <linux/mm_types.h>             /* mm_struct, vma, etc...       */
+#include <linux/pkeys.h>                /* PKEY_*                       */
+#include <uapi/asm-generic/mman-common.h>
+
+#include <asm/cpufeature.h>             /* boot_cpu_has, ...            */
+#include <asm/mmu_context.h>            /* vma_pkey()                   */
+#include <asm/fpu/internal.h>           /* fpregs_active()              */
+
+int __execute_only_pkey(struct mm_struct *mm)
+{
+	int ret;
+
+	/*
+	 * We do not want to go through the relatively costly
+	 * dance to set PKRU if we do not need to.  Check it
+	 * first and assume that if the execute-only pkey is
+	 * access-disabled that we do not have to set it
+	 * ourselves.  We need preempt off so that nobody
+	 * can make fpregs inactive.
+	 */
+	preempt_disable();
+	if (fpregs_active() &&
+	    !__pkru_allows_read(read_pkru(), PKEY_DEDICATED_EXECUTE_ONLY)) {
+		preempt_enable();
+		return PKEY_DEDICATED_EXECUTE_ONLY;
+	}
+	preempt_enable();
+	ret = arch_set_user_pkey_access(current, PKEY_DEDICATED_EXECUTE_ONLY,
+			PKEY_DISABLE_ACCESS);
+	/*
+	 * If the PKRU-set operation failed somehow, just return
+	 * 0 and effectively disable execute-only support.
+	 */
+	if (ret)
+		return 0;
+
+	return PKEY_DEDICATED_EXECUTE_ONLY;
+}
+
+static inline bool vma_is_pkey_exec_only(struct vm_area_struct *vma)
+{
+	/* Do this check first since the vm_flags should be hot */
+	if ((vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) != VM_EXEC)
+		return false;
+	if (vma_pkey(vma) != PKEY_DEDICATED_EXECUTE_ONLY)
+		return false;
+
+	return true;
+}
+
+/*
+ * This is only called for *plain* mprotect calls.
+ */
+int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot, int pkey)
+{
+	/*
+	 * Is this an mprotect_pkey() call?  If so, never
+	 * override the value that came from the user.
+	 */
+	if (pkey != -1)
+		return pkey;
+	/*
+	 * Look for a protection-key-driven execute-only mapping
+	 * which is now being given permissions that are not
+	 * execute-only.  Move it back to the default pkey.
+	 */
+	if (vma_is_pkey_exec_only(vma) &&
+	    (prot & (PROT_READ|PROT_WRITE))) {
+		return 0;
+	}
+	/*
+	 * The mapping is execute-only.  Go try to get the
+	 * execute-only protection key.  If we fail to do that,
+	 * fall through as if we do not have execute-only
+	 * support.
+	 */
+	if (prot == PROT_EXEC) {
+		pkey = execute_only_pkey(vma->vm_mm);
+		if (pkey > 0)
+			return pkey;
+	}
+	/*
+	 * This is a vanilla, non-pkey mprotect (or we failed to
+	 * setup execute-only), inherit the pkey from the VMA we
+	 * are working on.
+	 */
+	return vma_pkey(vma);
+}
diff -puN include/linux/pkeys.h~pkeys-79-xonly include/linux/pkeys.h
--- a/include/linux/pkeys.h~pkeys-79-xonly	2016-02-12 10:45:12.471817493 -0800
+++ b/include/linux/pkeys.h	2016-02-12 10:45:12.480817905 -0800
@@ -13,6 +13,9 @@
 #include <asm/pkeys.h>
 #else /* ! CONFIG_ARCH_HAS_PKEYS */
 #define arch_max_pkey() (1)
+#define execute_only_pkey(mm) (0)
+#define arch_override_mprotect_pkey(vma, prot, pkey) (0)
+#define PKEY_DEDICATED_EXECUTE_ONLY 0
 #endif /* ! CONFIG_ARCH_HAS_PKEYS */
 
 /*
diff -puN mm/mmap.c~pkeys-79-xonly mm/mmap.c
--- a/mm/mmap.c~pkeys-79-xonly	2016-02-12 10:45:12.473817585 -0800
+++ b/mm/mmap.c	2016-02-12 10:45:12.481817951 -0800
@@ -43,6 +43,7 @@
 #include <linux/printk.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/moduleparam.h>
+#include <linux/pkeys.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -1270,6 +1271,7 @@ unsigned long do_mmap(struct file *file,
 			unsigned long pgoff, unsigned long *populate)
 {
 	struct mm_struct *mm = current->mm;
+	int pkey = 0;
 
 	*populate = 0;
 
@@ -1309,11 +1311,17 @@ unsigned long do_mmap(struct file *file,
 	if (offset_in_page(addr))
 		return addr;
 
+	if (prot == PROT_EXEC) {
+		pkey = execute_only_pkey(mm);
+		if (pkey < 0)
+			pkey = 0;
+	}
+
 	/* Do simple checking here so the lower-level routines won't have
 	 * to. we assume access permissions have been handled by the open
 	 * of the memory object, so we don't do any here.
 	 */
-	vm_flags |= calc_vm_prot_bits(prot, 0) | calc_vm_flag_bits(flags) |
+	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
 			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
 
 	if (flags & MAP_LOCKED)
diff -puN mm/mprotect.c~pkeys-79-xonly mm/mprotect.c
--- a/mm/mprotect.c~pkeys-79-xonly	2016-02-12 10:45:12.475817676 -0800
+++ b/mm/mprotect.c	2016-02-12 10:45:12.481817951 -0800
@@ -24,6 +24,7 @@
 #include <linux/migrate.h>
 #include <linux/perf_event.h>
 #include <linux/ksm.h>
+#include <linux/pkeys.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
@@ -352,7 +353,7 @@ fail:
 SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
 		unsigned long, prot)
 {
-	unsigned long vm_flags, nstart, end, tmp, reqprot;
+	unsigned long nstart, end, tmp, reqprot;
 	struct vm_area_struct *vma, *prev;
 	int error = -EINVAL;
 	const int grows = prot & (PROT_GROWSDOWN|PROT_GROWSUP);
@@ -378,8 +379,6 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
 	if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
 		prot |= PROT_EXEC;
 
-	vm_flags = calc_vm_prot_bits(prot, 0);
-
 	down_write(&current->mm->mmap_sem);
 
 	vma = find_vma(current->mm, start);
@@ -409,10 +408,11 @@ SYSCALL_DEFINE3(mprotect, unsigned long,
 
 	for (nstart = start ; ; ) {
 		unsigned long newflags;
+		int pkey = arch_override_mprotect_pkey(vma, prot, -1);
 
 		/* Here we know that vma->vm_start <= nstart < vma->vm_end. */
 
-		newflags = vm_flags;
+		newflags = calc_vm_prot_bits(prot, pkey);
 		newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
 
 		/* newflags >> 4 shift VM_MAY% in place of VM_% */
_

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 01/33] mm: introduce get_user_pages_remote()
  2016-02-12 21:01 ` [PATCH 01/33] mm: introduce get_user_pages_remote() Dave Hansen
@ 2016-02-15  6:09   ` Balbir Singh
  2016-02-15 16:29     ` Dave Hansen
  2016-02-15  6:14   ` Srikar Dronamraju
  2016-02-16 12:14   ` [tip:x86/pkeys] mm/gup: Introduce get_user_pages_remote() tip-bot for Dave Hansen
  2 siblings, 1 reply; 84+ messages in thread
From: Balbir Singh @ 2016-02-15  6:09 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel
  Cc: linux-mm, x86, torvalds, dave.hansen, srikar, vbabka, akpm,
	kirill.shutemov, aarcange, n-horiguchi, jack

On Fri, 2016-02-12 at 13:01 -0800, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> For protection keys, we need to understand whether protections
> should be enforced in software or not.  In general, we enforce
> protections when working on our own task, but not when on others.
> We call these "current" and "remote" operations.
> 
> This patch introduces a new get_user_pages() variant:
> 
>         get_user_pages_remote()
> 
> Which is a replacement for when get_user_pages() is called on
> non-current tsk/mm.
> 

In summary then, get_user_pages_remote() does not enforce protections?

Balbir Singh.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 01/33] mm: introduce get_user_pages_remote()
  2016-02-12 21:01 ` [PATCH 01/33] mm: introduce get_user_pages_remote() Dave Hansen
  2016-02-15  6:09   ` Balbir Singh
@ 2016-02-15  6:14   ` Srikar Dronamraju
  2016-02-16 12:14   ` [tip:x86/pkeys] mm/gup: Introduce get_user_pages_remote() tip-bot for Dave Hansen
  2 siblings, 0 replies; 84+ messages in thread
From: Srikar Dronamraju @ 2016-02-15  6:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, x86, torvalds, dave.hansen, vbabka, akpm,
	kirill.shutemov, aarcange, n-horiguchi, jack

> diff -puN kernel/events/uprobes.c~introduce-get_user_pages_remote kernel/events/uprobes.c
> --- a/kernel/events/uprobes.c~introduce-get_user_pages_remote	2016-02-12 10:44:13.178107026 -0800
> +++ b/kernel/events/uprobes.c	2016-02-12 10:44:13.193107711 -0800
> @@ -299,7 +299,7 @@ int uprobe_write_opcode(struct mm_struct
> 
>  retry:
>  	/* Read the page with vaddr into memory */
> -	ret = get_user_pages(NULL, mm, vaddr, 1, 0, 1, &old_page, &vma);
> +	ret = get_user_pages_remote(NULL, mm, vaddr, 1, 0, 1, &old_page, &vma);
>  	if (ret <= 0)
>  		return ret;
> 
> @@ -1700,7 +1700,13 @@ static int is_trap_at_addr(struct mm_str
>  	if (likely(result == 0))
>  		goto out;
> 
> -	result = get_user_pages(NULL, mm, vaddr, 1, 0, 1, &page, NULL);
> +	/*
> +	 * The NULL 'tsk' here ensures that any faults that occur here
> +	 * will not be accounted to the task.  'mm' *is* current->mm,
> +	 * but we treat this as a 'remote' access since it is
> +	 * essentially a kernel access to the memory.
> +	 */
> +	result = get_user_pages_remote(NULL, mm, vaddr, 1, 0, 1, &page, NULL);
>  	if (result < 0)
>  		return result;
> 

Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 01/33] mm: introduce get_user_pages_remote()
  2016-02-15  6:09   ` Balbir Singh
@ 2016-02-15 16:29     ` Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: Dave Hansen @ 2016-02-15 16:29 UTC (permalink / raw)
  To: Balbir Singh, linux-kernel
  Cc: linux-mm, x86, torvalds, dave.hansen, srikar, vbabka, akpm,
	kirill.shutemov, aarcange, n-horiguchi, jack

On 02/14/2016 10:09 PM, Balbir Singh wrote:
>> > For protection keys, we need to understand whether protections
>> > should be enforced in software or not.  In general, we enforce
>> > protections when working on our own task, but not when on others.
>> > We call these "current" and "remote" operations.
>> > 
>> > This patch introduces a new get_user_pages() variant:
>> > 
>> >         get_user_pages_remote()
>> > 
>> > Which is a replacement for when get_user_pages() is called on
>> > non-current tsk/mm.
>> > 
> In summary then, get_user_pages_remote() does not enforce protections?

Yes, exactly.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 02/33] mm: overload get_user_pages() functions
  2016-02-12 21:01 ` [PATCH 02/33] mm: overload get_user_pages() functions Dave Hansen
@ 2016-02-16  8:36   ` Ingo Molnar
  2016-02-17 18:15     ` Dave Hansen
  2016-02-18 20:15   ` [tip:mm/pkeys] mm/gup: Overload " tip-bot for Dave Hansen
  1 sibling, 1 reply; 84+ messages in thread
From: Ingo Molnar @ 2016-02-16  8:36 UTC (permalink / raw)
  To: Dave Hansen, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, x86, torvalds, dave.hansen, srikar,
	vbabka, akpm, kirill.shutemov, aarcange, n-horiguchi, jack


* Dave Hansen <dave@sr71.net> wrote:

> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> The concept here was a suggestion from Ingo.  The implementation
> horrors are all mine.
> 
> This allows get_user_pages(), get_user_pages_unlocked(), and
> get_user_pages_locked() to be called with or without the
> leading tsk/mm arguments.  We will give a compile-time warning
> about the old style being __deprecated and we will also
> WARN_ON() if the non-remote version is used for a remote-style
> access.

So at minimum this should be WARN_ON_ONCE(), to make it easier to recover some 
meaningful kernel log from such incidents.

But:

> Doing this, folks will get nice warnings and will not break the
> build.  This should be nice for -next and will hopefully let
> developers fix up their own code instead of maintainers needing
> to do it at merge time.
> 
> The way we do this is hideous.  It uses the __VA_ARGS__ macro
> functionality to call different functions based on the number
> of arguments passed to the macro.
> 
> There's an additional hack to ensure that our EXPORT_SYMBOL()
> of the deprecated symbols doesn't trigger a warning.
> 
> We should be able to remove this mess as soon as -rc1 hits in
> the release after this is merged.

So when I suggested this, it looked a _lot_ cleaner to me, in my head!

OTOH this, if factored out a bit perhaps, could be the basis for a useful 
technical model to do 'phased in, -next invariant' prototype migrations in the 
future, especially when it involves lots of subsystems.

Strictly only in cases where -rc1 will truly get rid of the __VA_ARGS__ hackery - 
which we'd do in this case.

Nevertheless I'd love to have a high level buy-in from either Linus or Andrew that 
we can do it this way, as the hackery looks very hideous...

The alternative would be to allow the -next churn and to allow the occasional 
(fairly trivial but tester-disruptive) build breakage.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/33] x86: Memory Protection Keys (v10)
  2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
                   ` (32 preceding siblings ...)
  2016-02-12 21:02 ` [PATCH 33/33] x86, pkeys: execute-only support Dave Hansen
@ 2016-02-16  9:29 ` Ingo Molnar
  33 siblings, 0 replies; 84+ messages in thread
From: Ingo Molnar @ 2016-02-16  9:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, x86, torvalds, linux-api, linux-arch,
	aarcange, akpm, jack, kirill.shutemov, n-horiguchi, vbabka


* Dave Hansen <dave@sr71.net> wrote:

>  81 files changed, 1393 insertions(+), 233 deletions(-)

The MIPS defconfig cross-build failed for me - log attached below.

Thanks,

	Ingo

===============>
  CC      security/keys/request_key_auth.o
In file included from /home/mingo/tip/arch/mips/include/asm/termios.h:12:0,
                 from /home/mingo/tip/include/uapi/linux/termios.h:5,
                 from /home/mingo/tip/include/linux/tty.h:6,
                 from /home/mingo/tip/kernel/signal.c:18:
/home/mingo/tip/kernel/signal.c: In function 'copy_siginfo_to_user':
/home/mingo/tip/include/uapi/asm-generic/siginfo.h:145:37: error: 'struct <anonymous>' has no member named '_pkey'
 #define si_pkey  _sifields._sigfault._pkey
                                     ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:441:15: note: in definition of macro '__put_user_nocheck'
  __typeof__(*(ptr)) __pu_val;     \
               ^
/home/mingo/tip/kernel/signal.c:2714:11: note: in expansion of macro '__put_user'
    err |= __put_user(from->si_pkey, &to->si_pkey);
           ^
/home/mingo/tip/kernel/signal.c:2714:42: note: in expansion of macro 'si_pkey'
    err |= __put_user(from->si_pkey, &to->si_pkey);
                                          ^
/home/mingo/tip/include/uapi/asm-generic/siginfo.h:145:37: error: 'const struct <anonymous>' has no member named '_pkey'
 #define si_pkey  _sifields._sigfault._pkey
                                     ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:444:14: note: in definition of macro '__put_user_nocheck'
  __pu_val = (x);       \
              ^
/home/mingo/tip/kernel/signal.c:2714:11: note: in expansion of macro '__put_user'
    err |= __put_user(from->si_pkey, &to->si_pkey);
           ^
/home/mingo/tip/kernel/signal.c:2714:28: note: in expansion of macro 'si_pkey'
    err |= __put_user(from->si_pkey, &to->si_pkey);
                            ^
/home/mingo/tip/include/uapi/asm-generic/siginfo.h:145:37: error: 'struct <anonymous>' has no member named '_pkey'
 #define si_pkey  _sifields._sigfault._pkey
                                     ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:430:10: note: in definition of macro '__put_user_common'
  switch (size) {       \
          ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:446:3: note: in expansion of macro '__put_kernel_common'
   __put_kernel_common(ptr, size);    \
   ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:214:2: note: in expansion of macro '__put_user_nocheck'
  __put_user_nocheck((x), (ptr), sizeof(*(ptr)))
  ^
/home/mingo/tip/kernel/signal.c:2714:11: note: in expansion of macro '__put_user'
    err |= __put_user(from->si_pkey, &to->si_pkey);
           ^
/home/mingo/tip/kernel/signal.c:2714:42: note: in expansion of macro 'si_pkey'
    err |= __put_user(from->si_pkey, &to->si_pkey);
                                          ^
/home/mingo/tip/include/uapi/asm-generic/siginfo.h:145:37: error: 'struct <anonymous>' has no member named '_pkey'
 #define si_pkey  _sifields._sigfault._pkey
                                     ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:241:51: note: in definition of macro '__m'
 #define __m(x) (*(struct __large_struct __user *)(x))
                                                   ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:431:10: note: in expansion of macro '__put_data_asm'
  case 1: __put_data_asm(user_sb, ptr); break;   \
          ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:384:40: note: in expansion of macro '__put_user_common'
 #define __put_kernel_common(ptr, size) __put_user_common(ptr, size)
                                        ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:446:3: note: in expansion of macro '__put_kernel_common'
   __put_kernel_common(ptr, size);    \
   ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:214:2: note: in expansion of macro '__put_user_nocheck'
  __put_user_nocheck((x), (ptr), sizeof(*(ptr)))
  ^
/home/mingo/tip/kernel/signal.c:2714:11: note: in expansion of macro '__put_user'
    err |= __put_user(from->si_pkey, &to->si_pkey);
           ^
/home/mingo/tip/kernel/signal.c:2714:42: note: in expansion of macro 'si_pkey'
    err |= __put_user(from->si_pkey, &to->si_pkey);
                                          ^
/home/mingo/tip/include/uapi/asm-generic/siginfo.h:145:37: error: 'struct <anonymous>' has no member named '_pkey'
 #define si_pkey  _sifields._sigfault._pkey
                                     ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:241:51: note: in definition of macro '__m'
 #define __m(x) (*(struct __large_struct __user *)(x))
                                                   ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:432:10: note: in expansion of macro '__put_data_asm'
  case 2: __put_data_asm(user_sh, ptr); break;   \
          ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:384:40: note: in expansion of macro '__put_user_common'
 #define __put_kernel_common(ptr, size) __put_user_common(ptr, size)
                                        ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:446:3: note: in expansion of macro '__put_kernel_common'
   __put_kernel_common(ptr, size);    \
   ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:214:2: note: in expansion of macro '__put_user_nocheck'
  __put_user_nocheck((x), (ptr), sizeof(*(ptr)))
  ^
/home/mingo/tip/kernel/signal.c:2714:11: note: in expansion of macro '__put_user'
    err |= __put_user(from->si_pkey, &to->si_pkey);
           ^
/home/mingo/tip/kernel/signal.c:2714:42: note: in expansion of macro 'si_pkey'
    err |= __put_user(from->si_pkey, &to->si_pkey);
                                          ^
/home/mingo/tip/include/uapi/asm-generic/siginfo.h:145:37: error: 'struct <anonymous>' has no member named '_pkey'
 #define si_pkey  _sifields._sigfault._pkey
                                     ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:241:51: note: in definition of macro '__m'
 #define __m(x) (*(struct __large_struct __user *)(x))
                                                   ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:433:10: note: in expansion of macro '__put_data_asm'
  case 4: __put_data_asm(user_sw, ptr); break;   \
          ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:384:40: note: in expansion of macro '__put_user_common'
 #define __put_kernel_common(ptr, size) __put_user_common(ptr, size)
                                        ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:446:3: note: in expansion of macro '__put_kernel_common'
   __put_kernel_common(ptr, size);    \
   ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:214:2: note: in expansion of macro '__put_user_nocheck'
  __put_user_nocheck((x), (ptr), sizeof(*(ptr)))
  ^
/home/mingo/tip/kernel/signal.c:2714:11: note: in expansion of macro '__put_user'
    err |= __put_user(from->si_pkey, &to->si_pkey);
           ^
/home/mingo/tip/kernel/signal.c:2714:42: note: in expansion of macro 'si_pkey'
    err |= __put_user(from->si_pkey, &to->si_pkey);
                                          ^
/home/mingo/tip/include/uapi/asm-generic/siginfo.h:145:37: error: 'struct <anonymous>' has no member named '_pkey'
 #define si_pkey  _sifields._sigfault._pkey
                                     ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:505:34: note: in definition of macro '__put_data_asm_ll32'
  : "0" (0), "r" (__pu_val), "r" (ptr),    \
                                  ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:434:10: note: in expansion of macro '__PUT_DW'
  case 8: __PUT_DW(user_sd, ptr); break;    \
          ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:384:40: note: in expansion of macro '__put_user_common'
 #define __put_kernel_common(ptr, size) __put_user_common(ptr, size)
                                        ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:446:3: note: in expansion of macro '__put_kernel_common'
   __put_kernel_common(ptr, size);    \
   ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:214:2: note: in expansion of macro '__put_user_nocheck'
  __put_user_nocheck((x), (ptr), sizeof(*(ptr)))
  ^
/home/mingo/tip/kernel/signal.c:2714:11: note: in expansion of macro '__put_user'
    err |= __put_user(from->si_pkey, &to->si_pkey);
           ^
/home/mingo/tip/kernel/signal.c:2714:42: note: in expansion of macro 'si_pkey'
    err |= __put_user(from->si_pkey, &to->si_pkey);
                                          ^
/home/mingo/tip/include/uapi/asm-generic/siginfo.h:145:37: error: 'struct <anonymous>' has no member named '_pkey'
 #define si_pkey  _sifields._sigfault._pkey
                                     ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:430:10: note: in definition of macro '__put_user_common'
  switch (size) {       \
          ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:214:2: note: in expansion of macro '__put_user_nocheck'
  __put_user_nocheck((x), (ptr), sizeof(*(ptr)))
  ^
/home/mingo/tip/kernel/signal.c:2714:11: note: in expansion of macro '__put_user'
    err |= __put_user(from->si_pkey, &to->si_pkey);
           ^
/home/mingo/tip/kernel/signal.c:2714:42: note: in expansion of macro 'si_pkey'
    err |= __put_user(from->si_pkey, &to->si_pkey);
                                          ^
/home/mingo/tip/include/uapi/asm-generic/siginfo.h:145:37: error: 'struct <anonymous>' has no member named '_pkey'
 #define si_pkey  _sifields._sigfault._pkey
                                     ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:241:51: note: in definition of macro '__m'
 #define __m(x) (*(struct __large_struct __user *)(x))
                                                   ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:431:10: note: in expansion of macro '__put_data_asm'
  case 1: __put_data_asm(user_sb, ptr); break;   \
          ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:449:3: note: in expansion of macro '__put_user_common'
   __put_user_common(ptr, size);    \
   ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:214:2: note: in expansion of macro '__put_user_nocheck'
  __put_user_nocheck((x), (ptr), sizeof(*(ptr)))
  ^
/home/mingo/tip/kernel/signal.c:2714:11: note: in expansion of macro '__put_user'
    err |= __put_user(from->si_pkey, &to->si_pkey);
           ^
/home/mingo/tip/kernel/signal.c:2714:42: note: in expansion of macro 'si_pkey'
    err |= __put_user(from->si_pkey, &to->si_pkey);
                                          ^
  CC      security/keys/user_defined.o
/home/mingo/tip/include/uapi/asm-generic/siginfo.h:145:37: error: 'struct <anonymous>' has no member named '_pkey'
 #define si_pkey  _sifields._sigfault._pkey
                                     ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:241:51: note: in definition of macro '__m'
 #define __m(x) (*(struct __large_struct __user *)(x))
                                                   ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:432:10: note: in expansion of macro '__put_data_asm'
  case 2: __put_data_asm(user_sh, ptr); break;   \
          ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:449:3: note: in expansion of macro '__put_user_common'
   __put_user_common(ptr, size);    \
   ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:214:2: note: in expansion of macro '__put_user_nocheck'
  __put_user_nocheck((x), (ptr), sizeof(*(ptr)))
  ^
/home/mingo/tip/kernel/signal.c:2714:11: note: in expansion of macro '__put_user'
    err |= __put_user(from->si_pkey, &to->si_pkey);
           ^
/home/mingo/tip/kernel/signal.c:2714:42: note: in expansion of macro 'si_pkey'
    err |= __put_user(from->si_pkey, &to->si_pkey);
                                          ^
/home/mingo/tip/include/uapi/asm-generic/siginfo.h:145:37: error: 'struct <anonymous>' has no member named '_pkey'
 #define si_pkey  _sifields._sigfault._pkey
                                     ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:241:51: note: in definition of macro '__m'
 #define __m(x) (*(struct __large_struct __user *)(x))
                                                   ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:433:10: note: in expansion of macro '__put_data_asm'
  case 4: __put_data_asm(user_sw, ptr); break;   \
          ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:449:3: note: in expansion of macro '__put_user_common'
   __put_user_common(ptr, size);    \
   ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:214:2: note: in expansion of macro '__put_user_nocheck'
  __put_user_nocheck((x), (ptr), sizeof(*(ptr)))
  ^
/home/mingo/tip/kernel/signal.c:2714:11: note: in expansion of macro '__put_user'
    err |= __put_user(from->si_pkey, &to->si_pkey);
           ^
/home/mingo/tip/kernel/signal.c:2714:42: note: in expansion of macro 'si_pkey'
    err |= __put_user(from->si_pkey, &to->si_pkey);
                                          ^
/home/mingo/tip/include/uapi/asm-generic/siginfo.h:145:37: error: 'struct <anonymous>' has no member named '_pkey'
 #define si_pkey  _sifields._sigfault._pkey
                                     ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:505:34: note: in definition of macro '__put_data_asm_ll32'
  : "0" (0), "r" (__pu_val), "r" (ptr),    \
                                  ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:434:10: note: in expansion of macro '__PUT_DW'
  case 8: __PUT_DW(user_sd, ptr); break;    \
          ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:449:3: note: in expansion of macro '__put_user_common'
   __put_user_common(ptr, size);    \
   ^
/home/mingo/tip/arch/mips/include/asm/uaccess.h:214:2: note: in expansion of macro '__put_user_nocheck'
  __put_user_nocheck((x), (ptr), sizeof(*(ptr)))
  ^
/home/mingo/tip/kernel/signal.c:2714:11: note: in expansion of macro '__put_user'
    err |= __put_user(from->si_pkey, &to->si_pkey);
           ^
/home/mingo/tip/kernel/signal.c:2714:42: note: in expansion of macro 'si_pkey'
    err |= __put_user(from->si_pkey, &to->si_pkey);
                                          ^
...
/home/mingo/tip/scripts/Makefile.build:258: recipe for target 'kernel/signal.o' failed
make[2]: *** [kernel/signal.o] Error 1
make[2]: *** Waiting for unfinished jobs....
...
/home/mingo/tip/mm/page_alloc.c: In function 'free_area_init_nodes':
/home/mingo/tip/mm/page_alloc.c:5712:34: warning: array subscript is below array bounds [-Warray-bounds]
    arch_zone_highest_possible_pfn[i-1];
                                  ^
...
/home/mingo/tip/Makefile:950: recipe for target 'kernel' failed
make[1]: *** [kernel] Error 2
make[1]: *** Waiting for unfinished jobs....
...

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [tip:x86/pkeys] mm/gup: Introduce get_user_pages_remote()
  2016-02-12 21:01 ` [PATCH 01/33] mm: introduce get_user_pages_remote() Dave Hansen
  2016-02-15  6:09   ` Balbir Singh
  2016-02-15  6:14   ` Srikar Dronamraju
@ 2016-02-16 12:14   ` tip-bot for Dave Hansen
  2016-02-20  6:25     ` Konstantin Khlebnikov
  2 siblings, 1 reply; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-16 12:14 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, riel, dave, dvlasenk, kirill.shutemov, luto, brgerst,
	n-horiguchi, torvalds, bp, mingo, akpm, dave.hansen, vbabka,
	peterz, srikar, linux-kernel, hpa, aarcange

Commit-ID:  1e9877902dc7e11d2be038371c6fbf2dfcd469d7
Gitweb:     http://git.kernel.org/tip/1e9877902dc7e11d2be038371c6fbf2dfcd469d7
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:01:54 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 16 Feb 2016 10:04:09 +0100

mm/gup: Introduce get_user_pages_remote()

For protection keys, we need to understand whether protections
should be enforced in software or not.  In general, we enforce
protections when working on our own task, but not when working on
another task's.
We call these "current" and "remote" operations.

This patch introduces a new get_user_pages() variant:

        get_user_pages_remote()

which is a replacement for get_user_pages() when it is called on a
non-current tsk/mm.

We also introduce a new gup flag, FOLL_REMOTE, which can be used
with the "__" gup variants to get this new behavior.

The uprobes is_trap_at_addr() location holds mmap_sem and
calls get_user_pages(current->mm) on an instruction address.  This
makes it a pretty unique gup caller.  Being an instruction access
and also really originating from the kernel (vs. the app), I opted
to consider this a 'remote' access where protection keys will not
be enforced.

Without protection keys, this patch should not change any behavior.
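
To make the intended split concrete, here is a sketch of the two call
styles as of this patch ('tsk' and 'addr' are placeholders; note that
both variants still take the explicit tsk/mm pair at this point in the
series):

    struct page *page;

    /* Acting on our own address space: protection keys are enforced. */
    get_user_pages(current, current->mm, addr, 1, 0, 0, &page, NULL);

    /* Acting on another task's address space (ptrace-style access):
     * passes FOLL_REMOTE internally, so pkeys are not enforced. */
    get_user_pages_remote(tsk, tsk->mm, addr, 1, 0, 0, &page, NULL);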

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: jack@suse.cz
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210154.3F0E51EA@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 drivers/gpu/drm/etnaviv/etnaviv_gem.c   |  6 +++---
 drivers/gpu/drm/i915/i915_gem_userptr.c | 10 +++++-----
 drivers/infiniband/core/umem_odp.c      |  8 ++++----
 fs/exec.c                               |  8 ++++++--
 include/linux/mm.h                      |  5 +++++
 kernel/events/uprobes.c                 | 10 ++++++++--
 mm/gup.c                                | 27 ++++++++++++++++++++++-----
 mm/memory.c                             |  2 +-
 mm/process_vm_access.c                  | 11 ++++++++---
 security/tomoyo/domain.c                |  9 ++++++++-
 virt/kvm/async_pf.c                     |  8 +++++++-
 11 files changed, 77 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem.c b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
index 4b519e4..97d4457 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_gem.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
@@ -753,9 +753,9 @@ static struct page **etnaviv_gem_userptr_do_get_pages(
 
 	down_read(&mm->mmap_sem);
 	while (pinned < npages) {
-		ret = get_user_pages(task, mm, ptr, npages - pinned,
-				     !etnaviv_obj->userptr.ro, 0,
-				     pvec + pinned, NULL);
+		ret = get_user_pages_remote(task, mm, ptr, npages - pinned,
+					    !etnaviv_obj->userptr.ro, 0,
+					    pvec + pinned, NULL);
 		if (ret < 0)
 			break;
 
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 59e45b3..90dbf81 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -584,11 +584,11 @@ __i915_gem_userptr_get_pages_worker(struct work_struct *_work)
 
 		down_read(&mm->mmap_sem);
 		while (pinned < npages) {
-			ret = get_user_pages(work->task, mm,
-					     obj->userptr.ptr + pinned * PAGE_SIZE,
-					     npages - pinned,
-					     !obj->userptr.read_only, 0,
-					     pvec + pinned, NULL);
+			ret = get_user_pages_remote(work->task, mm,
+					obj->userptr.ptr + pinned * PAGE_SIZE,
+					npages - pinned,
+					!obj->userptr.read_only, 0,
+					pvec + pinned, NULL);
 			if (ret < 0)
 				break;
 
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index e69bf26..75077a0 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -572,10 +572,10 @@ int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 user_virt, u64 bcnt,
 		 * complex (and doesn't gain us much performance in most use
 		 * cases).
 		 */
-		npages = get_user_pages(owning_process, owning_mm, user_virt,
-					gup_num_pages,
-					access_mask & ODP_WRITE_ALLOWED_BIT, 0,
-					local_page_list, NULL);
+		npages = get_user_pages_remote(owning_process, owning_mm,
+				user_virt, gup_num_pages,
+				access_mask & ODP_WRITE_ALLOWED_BIT,
+				0, local_page_list, NULL);
 		up_read(&owning_mm->mmap_sem);
 
 		if (npages < 0)
diff --git a/fs/exec.c b/fs/exec.c
index dcd4ac7..d885b98 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -198,8 +198,12 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
 			return NULL;
 	}
 #endif
-	ret = get_user_pages(current, bprm->mm, pos,
-			1, write, 1, &page, NULL);
+	/*
+	 * We are doing an exec().  'current' is the process
+	 * doing the exec and bprm->mm is the new process's mm.
+	 */
+	ret = get_user_pages_remote(current, bprm->mm, pos, 1, write,
+			1, &page, NULL);
 	if (ret <= 0)
 		return NULL;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b1d4b8c..faf3b70 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1225,6 +1225,10 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		      unsigned long start, unsigned long nr_pages,
 		      unsigned int foll_flags, struct page **pages,
 		      struct vm_area_struct **vmas, int *nonblocking);
+long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+			    unsigned long start, unsigned long nr_pages,
+			    int write, int force, struct page **pages,
+			    struct vm_area_struct **vmas);
 long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		    unsigned long start, unsigned long nr_pages,
 		    int write, int force, struct page **pages,
@@ -2170,6 +2174,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
 #define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
 #define FOLL_MLOCK	0x1000	/* lock present pages */
+#define FOLL_REMOTE	0x2000	/* we are working on non-current tsk/mm */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 0167679..8eef5f5 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -299,7 +299,7 @@ int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr,
 
 retry:
 	/* Read the page with vaddr into memory */
-	ret = get_user_pages(NULL, mm, vaddr, 1, 0, 1, &old_page, &vma);
+	ret = get_user_pages_remote(NULL, mm, vaddr, 1, 0, 1, &old_page, &vma);
 	if (ret <= 0)
 		return ret;
 
@@ -1700,7 +1700,13 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
 	if (likely(result == 0))
 		goto out;
 
-	result = get_user_pages(NULL, mm, vaddr, 1, 0, 1, &page, NULL);
+	/*
+	 * The NULL 'tsk' here ensures that any faults that occur here
+	 * will not be accounted to the task.  'mm' *is* current->mm,
+	 * but we treat this as a 'remote' access since it is
+	 * essentially a kernel access to the memory.
+	 */
+	result = get_user_pages_remote(NULL, mm, vaddr, 1, 0, 1, &page, NULL);
 	if (result < 0)
 		return result;
 
diff --git a/mm/gup.c b/mm/gup.c
index 7bf19ff..36ca850 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -870,7 +870,7 @@ long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
 EXPORT_SYMBOL(get_user_pages_unlocked);
 
 /*
- * get_user_pages() - pin user pages in memory
+ * get_user_pages_remote() - pin user pages in memory
  * @tsk:	the task_struct to use for page fault accounting, or
  *		NULL if faults are not to be recorded.
  * @mm:		mm_struct of target mm
@@ -924,12 +924,29 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
  * should use get_user_pages because it cannot pass
  * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault.
  */
-long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		unsigned long start, unsigned long nr_pages, int write,
-		int force, struct page **pages, struct vm_area_struct **vmas)
+long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, unsigned long nr_pages,
+		int write, int force, struct page **pages,
+		struct vm_area_struct **vmas)
 {
 	return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
-				       pages, vmas, NULL, false, FOLL_TOUCH);
+				       pages, vmas, NULL, false,
+				       FOLL_TOUCH | FOLL_REMOTE);
+}
+EXPORT_SYMBOL(get_user_pages_remote);
+
+/*
+ * This is the same as get_user_pages_remote() for the time
+ * being.
+ */
+long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, unsigned long nr_pages,
+		int write, int force, struct page **pages,
+		struct vm_area_struct **vmas)
+{
+	return __get_user_pages_locked(tsk, mm, start, nr_pages,
+				       write, force, pages, vmas, NULL, false,
+				       FOLL_TOUCH);
 }
 EXPORT_SYMBOL(get_user_pages);
 
diff --git a/mm/memory.c b/mm/memory.c
index 38090ca..8bfbad0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3685,7 +3685,7 @@ static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
 		void *maddr;
 		struct page *page = NULL;
 
-		ret = get_user_pages(tsk, mm, addr, 1,
+		ret = get_user_pages_remote(tsk, mm, addr, 1,
 				write, 1, &page, &vma);
 		if (ret <= 0) {
 #ifndef CONFIG_HAVE_IOREMAP_PROT
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 5d453e5..07514d4 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -98,9 +98,14 @@ static int process_vm_rw_single_vec(unsigned long addr,
 		int pages = min(nr_pages, max_pages_per_loop);
 		size_t bytes;
 
-		/* Get the pages we're interested in */
-		pages = get_user_pages_unlocked(task, mm, pa, pages,
-						vm_write, 0, process_pages);
+		/*
+		 * Get the pages we're interested in.  We must
+		 * add FOLL_REMOTE because task/mm might not
+		 * be current/current->mm.
+		 */
+		pages = __get_user_pages_unlocked(task, mm, pa, pages,
+						  vm_write, 0, process_pages,
+						  FOLL_REMOTE);
 		if (pages <= 0)
 			return -EFAULT;
 
diff --git a/security/tomoyo/domain.c b/security/tomoyo/domain.c
index 3865145..ade7c6c 100644
--- a/security/tomoyo/domain.c
+++ b/security/tomoyo/domain.c
@@ -874,7 +874,14 @@ bool tomoyo_dump_page(struct linux_binprm *bprm, unsigned long pos,
 	}
 	/* Same with get_arg_page(bprm, pos, 0) in fs/exec.c */
 #ifdef CONFIG_MMU
-	if (get_user_pages(current, bprm->mm, pos, 1, 0, 1, &page, NULL) <= 0)
+	/*
+	 * This is called at execve() time in order to dig around
+	 * in the argv/environment of the new process
+	 * (represented by bprm).  'current' is the process doing
+	 * the execve().
+	 */
+	if (get_user_pages_remote(current, bprm->mm, pos, 1,
+				0, 1, &page, NULL) <= 0)
 		return false;
 #else
 	page = bprm->page[pos / PAGE_SIZE];
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 3531599..d604e87 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -79,7 +79,13 @@ static void async_pf_execute(struct work_struct *work)
 
 	might_sleep();
 
-	get_user_pages_unlocked(NULL, mm, addr, 1, 1, 0, NULL);
+	/*
+	 * This work is run asynchronously to the task which owns
+	 * mm and might be done in another context, so we must
+	 * use FOLL_REMOTE.
+	 */
+	__get_user_pages_unlocked(NULL, mm, addr, 1, 1, 0, NULL, FOLL_REMOTE);
+
 	kvm_async_page_present_sync(vcpu, apf);
 
 	spin_lock(&vcpu->async_pf.lock);

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH 02/33] mm: overload get_user_pages() functions
  2016-02-16  8:36   ` Ingo Molnar
@ 2016-02-17 18:15     ` Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: Dave Hansen @ 2016-02-17 18:15 UTC (permalink / raw)
  To: Ingo Molnar, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, x86, dave.hansen, srikar, vbabka,
	kirill.shutemov, aarcange, n-horiguchi, jack

On 02/16/2016 12:36 AM, Ingo Molnar wrote:
>> > From: Dave Hansen <dave.hansen@linux.intel.com>
>> > 
>> > The concept here was a suggestion from Ingo.  The implementation
>> > horrors are all mine.
>> > 
>> > This allows get_user_pages(), get_user_pages_unlocked(), and
>> > get_user_pages_locked() to be called with or without the
>> > leading tsk/mm arguments.  We will give a compile-time warning
>> > about the old style being __deprecated and we will also
>> > WARN_ON() if the non-remote version is used for a remote-style
>> > access.
> So at minimum this should be WARN_ON_ONCE(), to make it easier to recover some 
> meaningful kernel log from such incidents.

I went to fix this in the code but realized that I had coded it up as
WARN_ONCE().  The description was just imprecise.  So I won't be sending
a code fix for this.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 33/33] x86, pkeys: execute-only support
  2016-02-12 21:02 ` [PATCH 33/33] x86, pkeys: execute-only support Dave Hansen
@ 2016-02-17 21:27   ` Kees Cook
  2016-02-17 21:33     ` Dave Hansen
  2016-02-17 22:17     ` Andy Lutomirski
  2016-02-18 20:27   ` [tip:mm/pkeys] mm/core, x86/mm/pkeys: Add execute-only protection keys support tip-bot for Dave Hansen
  1 sibling, 2 replies; 84+ messages in thread
From: Kees Cook @ 2016-02-17 21:27 UTC (permalink / raw)
  To: Dave Hansen
  Cc: LKML, Linux-MM, x86, Linus Torvalds, Dave Hansen, Andrew Morton,
	Andy Lutomirski

On Fri, Feb 12, 2016 at 1:02 PM, Dave Hansen <dave@sr71.net> wrote:
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> Protection keys provide new page-based protection in hardware.
> But, they have an interesting attribute: they only affect data
> accesses and never affect instruction fetches.  That means that
> if we set up some memory which is set as "access-disabled" via
> protection keys, we can still execute from it.
>
> This patch uses protection keys to set up mappings to do just that.
> If a user calls:
>
>         mmap(..., PROT_EXEC);
> or
>         mprotect(ptr, sz, PROT_EXEC);
>
> (note PROT_EXEC-only without PROT_READ/WRITE), the kernel will
> notice this, and set a special protection key on the memory.  It
> also sets the appropriate bits in the Protection Keys User Rights
> (PKRU) register so that the memory becomes unreadable and
> unwritable.
>
> I haven't found any userspace that does this today.  With this
> facility in place, we expect userspace to move to use it
> eventually.  Userspace _could_ start doing this today.  Any
> PROT_EXEC calls get converted to PROT_READ inside the kernel, and
> would transparently be upgraded to "true" PROT_EXEC with this
> code.  IOW, userspace never has to do any PROT_EXEC runtime
> detection.
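
In userspace terms, the pattern the patch enables would look roughly
like this (a minimal sketch only; error handling is omitted and 'len'
is assumed to be page-aligned):

    #include <string.h>
    #include <sys/mman.h>

    /* Stage the code through a writable mapping, then drop it to
     * PROT_EXEC-only; with pkeys the result is executable but
     * neither readable nor writable. */
    static void *make_exec_only(const void *code, size_t len)
    {
            void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            memcpy(buf, code, len);
            mprotect(buf, len, PROT_EXEC);
            return buf;
    }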

Random thought while skimming email:

Is there a way to detect this feature's availability without userspace
having to set up a segv handler and attempting to read a
PROT_EXEC-only region? (i.e. cpu flag for protection keys, or a way to
check the protection to see if PROT_READ got added automatically,
etc?)

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 33/33] x86, pkeys: execute-only support
  2016-02-17 21:27   ` Kees Cook
@ 2016-02-17 21:33     ` Dave Hansen
  2016-02-17 21:36       ` Kees Cook
  2016-02-17 22:17     ` Andy Lutomirski
  1 sibling, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-17 21:33 UTC (permalink / raw)
  To: Kees Cook
  Cc: LKML, Linux-MM, x86, Linus Torvalds, Dave Hansen, Andrew Morton,
	Andy Lutomirski

On 02/17/2016 01:27 PM, Kees Cook wrote:
> Is there a way to detect this feature's availability without userspace
> having to set up a segv handler and attempting to read a
> PROT_EXEC-only region? (i.e. cpu flag for protection keys, or a way to
> check the protection to see if PROT_READ got added automatically,
> etc?)

You can kinda do it with /proc/$pid/(s)maps.  Here's smaps, for instance:

> 00401000-00402000 --xp 00001000 08:14 4897479                            /root/pkeys/pkey-xonly
> Size:                  4 kB
> Rss:                   4 kB
...
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
> Locked:                0 kB
> ProtectionKey:        15
> VmFlags: ex mr mw me dw 

You can see "--x" and the ProtectionKey itself being nonzero.  That's a
reasonable indication.  There's also the "OSPKE" cpuid bit which only
shows up when the kernel has enabled protection keys.  This is
_separate_ from the bit that says whether the processor supports pkeys.

I check them in test code like this:

> static inline void __cpuid(unsigned int *eax, unsigned int *ebx,
>                                 unsigned int *ecx, unsigned int *edx)
> {
>         /* ecx is often an input as well as an output. */
>         asm volatile(
>                 "cpuid;"
>                 : "=a" (*eax),
>                   "=b" (*ebx),
>                   "=c" (*ecx),
>                   "=d" (*edx)
>                 : "0" (*eax), "2" (*ecx));
> }
> 
> /* Intel-defined CPU features, CPUID level 0x00000007:0 (ecx) */
> #define X86_FEATURE_PKU        (1<<3) /* Protection Keys for Userspace */
> #define X86_FEATURE_OSPKE      (1<<4) /* OS Protection Keys Enable */
> 
> static inline int cpu_has_pku(void)
> {
>         unsigned int eax;
>         unsigned int ebx;
>         unsigned int ecx;
>         unsigned int edx;
>         eax = 0x7;
>         ecx = 0x0;
>         __cpuid(&eax, &ebx, &ecx, &edx);
> 
>         if (!(ecx & X86_FEATURE_PKU)) {
>                 dprintf2("cpu does not have PKU\n");
>                 return 0;
>         }
>         if (!(ecx & X86_FEATURE_OSPKE)) {
>                 dprintf2("cpu does not have OSPKE\n");
>                 return 0;
>         }
>         return 1;
> }

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 33/33] x86, pkeys: execute-only support
  2016-02-17 21:33     ` Dave Hansen
@ 2016-02-17 21:36       ` Kees Cook
  0 siblings, 0 replies; 84+ messages in thread
From: Kees Cook @ 2016-02-17 21:36 UTC (permalink / raw)
  To: Dave Hansen
  Cc: LKML, Linux-MM, x86, Linus Torvalds, Dave Hansen, Andrew Morton,
	Andy Lutomirski

On Wed, Feb 17, 2016 at 1:33 PM, Dave Hansen <dave@sr71.net> wrote:
> On 02/17/2016 01:27 PM, Kees Cook wrote:
>> Is there a way to detect this feature's availability without userspace
>> having to set up a segv handler and attempting to read a
>> PROT_EXEC-only region? (i.e. cpu flag for protection keys, or a way to
>> check the protection to see if PROT_READ got added automatically,
>> etc?)
>
> You can kinda do it with /proc/$pid/(s)maps.  Here's smaps, for instance:
>
>> 00401000-00402000 --xp 00001000 08:14 4897479                            /root/pkeys/pkey-xonly
>> Size:                  4 kB
>> Rss:                   4 kB
> ...
>> KernelPageSize:        4 kB
>> MMUPageSize:           4 kB
>> Locked:                0 kB
>> ProtectionKey:        15
>> VmFlags: ex mr mw me dw

Ah-ha, perfect. Thanks!

> You can see "--x" and the ProtectionKey itself being nonzero.  That's a
> reasonable indication.  There's also the "OSPKE" cpuid bit which only
> shows up when the kernel has enabled protection keys.  This is
> _separate_ from the bit that says whether the processor support pkeys.
>
> I check them in test code like this:
>
>> static inline void __cpuid(unsigned int *eax, unsigned int *ebx,
>>                                 unsigned int *ecx, unsigned int *edx)
>> {
>>         /* ecx is often an input as well as an output. */
>>         asm volatile(
>>                 "cpuid;"
>>                 : "=a" (*eax),
>>                   "=b" (*ebx),
>>                   "=c" (*ecx),
>>                   "=d" (*edx)
>>                 : "0" (*eax), "2" (*ecx));
>> }
>>
>> /* Intel-defined CPU features, CPUID level 0x00000007:0 (ecx) */
>> #define X86_FEATURE_PKU        (1<<3) /* Protection Keys for Userspace */
>> #define X86_FEATURE_OSPKE      (1<<4) /* OS Protection Keys Enable */
>>
>> static inline int cpu_has_pku(void)
>> {
>>         unsigned int eax;
>>         unsigned int ebx;
>>         unsigned int ecx;
>>         unsigned int edx;
>>         eax = 0x7;
>>         ecx = 0x0;
>>         __cpuid(&eax, &ebx, &ecx, &edx);
>>
>>         if (!(ecx & X86_FEATURE_PKU)) {
>>                 dprintf2("cpu does not have PKU\n");
>>                 return 0;
>>         }
>>         if (!(ecx & X86_FEATURE_OSPKE)) {
>>                 dprintf2("cpu does not have OSPKE\n");
>>                 return 0;
>>         }
>>         return 1;
>> }
>

Great, thanks for the example!

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 33/33] x86, pkeys: execute-only support
  2016-02-17 21:27   ` Kees Cook
  2016-02-17 21:33     ` Dave Hansen
@ 2016-02-17 22:17     ` Andy Lutomirski
  2016-02-17 22:53       ` Dave Hansen
  1 sibling, 1 reply; 84+ messages in thread
From: Andy Lutomirski @ 2016-02-17 22:17 UTC (permalink / raw)
  To: Kees Cook
  Cc: Dave Hansen, X86 ML, Linux-MM, Andrew Morton, LKML,
	Linus Torvalds, Dave Hansen

On Feb 17, 2016 1:27 PM, "Kees Cook" <keescook@google.com> wrote:
>
> On Fri, Feb 12, 2016 at 1:02 PM, Dave Hansen <dave@sr71.net> wrote:
> >
> > From: Dave Hansen <dave.hansen@linux.intel.com>
> >
> > Protection keys provide new page-based protection in hardware.
> > But, they have an interesting attribute: they only affect data
> > accesses and never affect instruction fetches.  That means that
> > if we set up some memory which is set as "access-disabled" via
> > protection keys, we can still execute from it.
> >
> > This patch uses protection keys to set up mappings to do just that.
> > If a user calls:
> >
> >         mmap(..., PROT_EXEC);
> > or
> >         mprotect(ptr, sz, PROT_EXEC);
> >
> > (note PROT_EXEC-only without PROT_READ/WRITE), the kernel will
> > notice this, and set a special protection key on the memory.  It
> > also sets the appropriate bits in the Protection Keys User Rights
> > (PKRU) register so that the memory becomes unreadable and
> > unwritable.
> >
> > I haven't found any userspace that does this today.  With this
> > facility in place, we expect userspace to move to use it
> > eventually.  Userspace _could_ start doing this today.  Any
> > PROT_EXEC calls get converted to PROT_READ inside the kernel, and
> > would transparently be upgraded to "true" PROT_EXEC with this
> > code.  IOW, userspace never has to do any PROT_EXEC runtime
> > detection.
>
> Random thought while skimming email:
>
> Is there a way to detect this feature's availability without userspace
> having to set up a segv handler and attempting to read a
> PROT_EXEC-only region? (i.e. cpu flag for protection keys, or a way to
> check the protection to see if PROT_READ got added automatically,
> etc?)
>

We could add an HWCAP.

--Andy

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 33/33] x86, pkeys: execute-only support
  2016-02-17 22:17     ` Andy Lutomirski
@ 2016-02-17 22:53       ` Dave Hansen
  2016-02-18  0:46         ` Andy Lutomirski
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-17 22:53 UTC (permalink / raw)
  To: Andy Lutomirski, Kees Cook
  Cc: Dave Hansen, X86 ML, Linux-MM, Andrew Morton, LKML, Linus Torvalds

On 02/17/2016 02:17 PM, Andy Lutomirski wrote:
>> > Is there a way to detect this feature's availability without userspace
>> > having to set up a segv handler and attempting to read a
>> > PROT_EXEC-only region? (i.e. cpu flag for protection keys, or a way to
>> > check the protection to see if PROT_READ got added automatically,
>> > etc?)
>> >
> We could add an HWCAP.

I'll bite.  What's an HWCAP?

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 33/33] x86, pkeys: execute-only support
  2016-02-17 22:53       ` Dave Hansen
@ 2016-02-18  0:46         ` Andy Lutomirski
  0 siblings, 0 replies; 84+ messages in thread
From: Andy Lutomirski @ 2016-02-18  0:46 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kees Cook, Dave Hansen, X86 ML, Linux-MM, Andrew Morton, LKML,
	Linus Torvalds

On Wed, Feb 17, 2016 at 2:53 PM, Dave Hansen <dave@sr71.net> wrote:
> On 02/17/2016 02:17 PM, Andy Lutomirski wrote:
>>> > Is there a way to detect this feature's availability without userspace
>>> > having to set up a segv handler and attempting to read a
>>> > PROT_EXEC-only region? (i.e. cpu flag for protection keys, or a way to
>>> > check the protection to see if PROT_READ got added automatically,
>>> > etc?)
>>> >
>> We could add an HWCAP.
>
> I'll bite.  What's an HWCAP?

It's a CPU capability vector that's passed to every program as an auxv
entry.  On x86, ELF_HWCAP is useless (it's already fully used up on
pointless CPUID-derived bits), but ELF_HWCAP2 could be added and
a bit could be defined like HWCAP2_PROT_EXEC_ONLY.

Some day, WRFSBASE, etc will be advertised via ELF_HWCAP2, I suspect.
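
For illustration, userspace would then test such a bit with glibc's
getauxval() (getauxval() and AT_HWCAP2 are real; the
HWCAP2_PROT_EXEC_ONLY value is made up here, since the bit doesn't
exist yet):

    #include <stdio.h>
    #include <sys/auxv.h>

    #define HWCAP2_PROT_EXEC_ONLY	(1UL << 0)	/* hypothetical bit */

    int main(void)
    {
            unsigned long hwcap2 = getauxval(AT_HWCAP2);

            printf("exec-only pkeys: %s\n",
                   (hwcap2 & HWCAP2_PROT_EXEC_ONLY) ? "yes" : "no");
            return 0;
    }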

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [tip:mm/pkeys] mm/gup: Overload get_user_pages() functions
  2016-02-12 21:01 ` [PATCH 02/33] mm: overload get_user_pages() functions Dave Hansen
  2016-02-16  8:36   ` Ingo Molnar
@ 2016-02-18 20:15   ` tip-bot for Dave Hansen
  1 sibling, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:15 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, jack, paul.gortmaker, xiexiuqi, akpm, mguzik, n-horiguchi,
	srikar, mhocko, linux-kernel, leon, aarcange, geliangtang,
	kirill.shutemov, mingo, yamada.masahiro, vdavydov, viro, peterz,
	dan.j.williams, dave, oleg, torvalds, mcoquelin.stm32, hannes,
	kuleshovmail, dave.hansen, vbabka, koct9i, dingel, hpa

Commit-ID:  cde70140fed8429acf7a14e2e2cbd3e329036653
Gitweb:     http://git.kernel.org/tip/cde70140fed8429acf7a14e2e2cbd3e329036653
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:01:55 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 16 Feb 2016 10:11:12 +0100

mm/gup: Overload get_user_pages() functions

The concept here was a suggestion from Ingo.  The implementation
horrors are all mine.

This allows get_user_pages(), get_user_pages_unlocked(), and
get_user_pages_locked() to be called with or without the
leading tsk/mm arguments.  We will give a compile-time warning
about the old style being __deprecated and we will also
WARN_ON() if the non-remote version is used for a remote-style
access.

Doing this, folks will get nice warnings and will not break the
build.  This should be nice for -next and will hopefully let
developers fix up their own code instead of maintainers needing
to do it at merge time.

The way we do this is hideous.  It uses the __VA_ARGS__ macro
functionality to call different functions based on the number
of arguments passed to the macro.

There's an additional hack to ensure that our EXPORT_SYMBOL()
of the deprecated symbols doesn't trigger a warning.

We should be able to remove this mess as soon as -rc1 hits in
the release after this is merged.
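
For reference, the arity-dispatch trick in isolation (a standalone
userspace sketch, not the kernel code in the diff below):

    #include <stdio.h>

    static int f2(int a, int b)        { return a + b; }
    static int f3(int a, int b, int c) { return a + b + c; }

    /* Expand to the 4th argument; 2- and 3-argument calls shift a
     * different function name into that slot. */
    #define PICK(_1, _2, _3, fn, ...)	fn
    #define add(...)	PICK(__VA_ARGS__, f3, f2, x)(__VA_ARGS__)

    int main(void)
    {
            printf("%d %d\n", add(1, 2), add(1, 2, 3));	/* prints: 3 6 */
            return 0;
    }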

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
Cc: Geliang Tang <geliangtang@163.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Leon Romanovsky <leon@leon.nu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210155.73222EE1@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mm.h | 74 ++++++++++++++++++++++++++++++++++++++++++++++--------
 mm/gup.c           | 62 ++++++++++++++++++++++++++++++++++-----------
 mm/nommu.c         | 64 ++++++++++++++++++++++++++++++++--------------
 mm/util.c          |  4 +--
 4 files changed, 158 insertions(+), 46 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index faf3b70..4c73178 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1229,24 +1229,78 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 			    unsigned long start, unsigned long nr_pages,
 			    int write, int force, struct page **pages,
 			    struct vm_area_struct **vmas);
-long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		    unsigned long start, unsigned long nr_pages,
-		    int write, int force, struct page **pages,
-		    struct vm_area_struct **vmas);
-long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
-		    unsigned long start, unsigned long nr_pages,
-		    int write, int force, struct page **pages,
-		    int *locked);
+long get_user_pages6(unsigned long start, unsigned long nr_pages,
+			    int write, int force, struct page **pages,
+			    struct vm_area_struct **vmas);
+long get_user_pages_locked6(unsigned long start, unsigned long nr_pages,
+		    int write, int force, struct page **pages, int *locked);
 long __get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
 			       unsigned long start, unsigned long nr_pages,
 			       int write, int force, struct page **pages,
 			       unsigned int gup_flags);
-long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
-		    unsigned long start, unsigned long nr_pages,
+long get_user_pages_unlocked5(unsigned long start, unsigned long nr_pages,
 		    int write, int force, struct page **pages);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
 
+/* suppress warnings from use in EXPORT_SYMBOL() */
+#ifndef __DISABLE_GUP_DEPRECATED
+#define __gup_deprecated __deprecated
+#else
+#define __gup_deprecated
+#endif
+/*
+ * These macros provide backward-compatibility with the old
+ * get_user_pages() variants which took tsk/mm.  These
+ * functions/macros provide both compile-time __deprecated so we
+ * can catch old-style use and not break the build.  The actual
+ * functions also have WARN_ON()s to let us know at runtime if
+ * the get_user_pages() should have been the "remote" variant.
+ *
+ * These are hideous, but temporary.
+ *
+ * If you run into one of these __deprecated warnings, look
+ * at how you are calling get_user_pages().  If you are calling
+ * it with current/current->mm as the first two arguments,
+ * simply remove those arguments.  The behavior will be the same
+ * as it is now.  If you are calling it on another task, use
+ * get_user_pages_remote() instead.
+ *
+ * Any questions?  Ask Dave Hansen <dave@sr71.net>
+ */
+long
+__gup_deprecated
+get_user_pages8(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, unsigned long nr_pages,
+		int write, int force, struct page **pages,
+		struct vm_area_struct **vmas);
+#define GUP_MACRO(_1, _2, _3, _4, _5, _6, _7, _8, get_user_pages, ...)	\
+	get_user_pages
+#define get_user_pages(...) GUP_MACRO(__VA_ARGS__,	\
+		get_user_pages8, x,			\
+		get_user_pages6, x, x, x, x, x)(__VA_ARGS__)
+
+__gup_deprecated
+long get_user_pages_locked8(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, unsigned long nr_pages,
+		int write, int force, struct page **pages,
+		int *locked);
+#define GUPL_MACRO(_1, _2, _3, _4, _5, _6, _7, _8, get_user_pages_locked, ...)	\
+	get_user_pages_locked
+#define get_user_pages_locked(...) GUPL_MACRO(__VA_ARGS__,	\
+		get_user_pages_locked8,	x,			\
+		get_user_pages_locked6, x, x, x, x)(__VA_ARGS__)
+
+__gup_deprecated
+long get_user_pages_unlocked7(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, unsigned long nr_pages,
+		int write, int force, struct page **pages);
+#define GUPU_MACRO(_1, _2, _3, _4, _5, _6, _7, get_user_pages_unlocked, ...)	\
+	get_user_pages_unlocked
+#define get_user_pages_unlocked(...) GUPU_MACRO(__VA_ARGS__,	\
+		get_user_pages_unlocked7, x,			\
+		get_user_pages_unlocked5, x, x, x, x)(__VA_ARGS__)
+
 /* Container for pinned pfns / pages */
 struct frame_vector {
 	unsigned int nr_allocated;	/* Number of frames we have space for */
diff --git a/mm/gup.c b/mm/gup.c
index 36ca850..8a035e0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1,3 +1,4 @@
+#define __DISABLE_GUP_DEPRECATED 1
 #include <linux/kernel.h>
 #include <linux/errno.h>
 #include <linux/err.h>
@@ -807,15 +808,15 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
  *      if (locked)
  *          up_read(&mm->mmap_sem);
  */
-long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
-			   unsigned long start, unsigned long nr_pages,
+long get_user_pages_locked6(unsigned long start, unsigned long nr_pages,
 			   int write, int force, struct page **pages,
 			   int *locked)
 {
-	return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
-				       pages, NULL, locked, true, FOLL_TOUCH);
+	return __get_user_pages_locked(current, current->mm, start, nr_pages,
+				       write, force, pages, NULL, locked, true,
+				       FOLL_TOUCH);
 }
-EXPORT_SYMBOL(get_user_pages_locked);
+EXPORT_SYMBOL(get_user_pages_locked6);
 
 /*
  * Same as get_user_pages_unlocked(...., FOLL_TOUCH) but it allows to
@@ -860,14 +861,13 @@ EXPORT_SYMBOL(__get_user_pages_unlocked);
  * or if "force" shall be set to 1 (get_user_pages_fast misses the
  * "force" parameter).
  */
-long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
-			     unsigned long start, unsigned long nr_pages,
+long get_user_pages_unlocked5(unsigned long start, unsigned long nr_pages,
 			     int write, int force, struct page **pages)
 {
-	return __get_user_pages_unlocked(tsk, mm, start, nr_pages, write,
-					 force, pages, FOLL_TOUCH);
+	return __get_user_pages_unlocked(current, current->mm, start, nr_pages,
+					 write, force, pages, FOLL_TOUCH);
 }
-EXPORT_SYMBOL(get_user_pages_unlocked);
+EXPORT_SYMBOL(get_user_pages_unlocked5);
 
 /*
  * get_user_pages_remote() - pin user pages in memory
@@ -939,16 +939,15 @@ EXPORT_SYMBOL(get_user_pages_remote);
  * This is the same as get_user_pages_remote() for the time
  * being.
  */
-long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		unsigned long start, unsigned long nr_pages,
+long get_user_pages6(unsigned long start, unsigned long nr_pages,
 		int write, int force, struct page **pages,
 		struct vm_area_struct **vmas)
 {
-	return __get_user_pages_locked(tsk, mm, start, nr_pages,
+	return __get_user_pages_locked(current, current->mm, start, nr_pages,
 				       write, force, pages, vmas, NULL, false,
 				       FOLL_TOUCH);
 }
-EXPORT_SYMBOL(get_user_pages);
+EXPORT_SYMBOL(get_user_pages6);
 
 /**
  * populate_vma_page_range() -  populate a range of pages in the vma.
@@ -1484,3 +1483,38 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 }
 
 #endif /* CONFIG_HAVE_GENERIC_RCU_GUP */
+
+long get_user_pages8(struct task_struct *tsk, struct mm_struct *mm,
+		     unsigned long start, unsigned long nr_pages,
+		     int write, int force, struct page **pages,
+		     struct vm_area_struct **vmas)
+{
+	WARN_ONCE(tsk != current, "get_user_pages() called on remote task");
+	WARN_ONCE(mm != current->mm, "get_user_pages() called on remote mm");
+
+	return get_user_pages6(start, nr_pages, write, force, pages, vmas);
+}
+EXPORT_SYMBOL(get_user_pages8);
+
+long get_user_pages_locked8(struct task_struct *tsk, struct mm_struct *mm,
+			    unsigned long start, unsigned long nr_pages,
+			    int write, int force, struct page **pages, int *locked)
+{
+	WARN_ONCE(tsk != current, "get_user_pages_locked() called on remote task");
+	WARN_ONCE(mm != current->mm, "get_user_pages_locked() called on remote mm");
+
+	return get_user_pages_locked6(start, nr_pages, write, force, pages, locked);
+}
+EXPORT_SYMBOL(get_user_pages_locked8);
+
+long get_user_pages_unlocked7(struct task_struct *tsk, struct mm_struct *mm,
+				  unsigned long start, unsigned long nr_pages,
+				  int write, int force, struct page **pages)
+{
+	WARN_ONCE(tsk != current, "get_user_pages_unlocked() called on remote task");
+	WARN_ONCE(mm != current->mm, "get_user_pages_unlocked() called on remote mm");
+
+	return get_user_pages_unlocked5(start, nr_pages, write, force, pages);
+}
+EXPORT_SYMBOL(get_user_pages_unlocked7);
+
diff --git a/mm/nommu.c b/mm/nommu.c
index fbf6f0f1..b64d04d 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -15,6 +15,8 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#define __DISABLE_GUP_DEPRECATED
+
 #include <linux/export.h>
 #include <linux/mm.h>
 #include <linux/vmacache.h>
@@ -182,8 +184,7 @@ finish_or_fault:
  *   slab page or a secondary page from a compound page
  * - don't permit access to VMAs that don't support it, such as I/O mappings
  */
-long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		    unsigned long start, unsigned long nr_pages,
+long get_user_pages6(unsigned long start, unsigned long nr_pages,
 		    int write, int force, struct page **pages,
 		    struct vm_area_struct **vmas)
 {
@@ -194,20 +195,18 @@ long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 	if (force)
 		flags |= FOLL_FORCE;
 
-	return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas,
-				NULL);
+	return __get_user_pages(current, current->mm, start, nr_pages, flags,
+				pages, vmas, NULL);
 }
-EXPORT_SYMBOL(get_user_pages);
+EXPORT_SYMBOL(get_user_pages6);
 
-long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
-			   unsigned long start, unsigned long nr_pages,
-			   int write, int force, struct page **pages,
-			   int *locked)
+long get_user_pages_locked6(unsigned long start, unsigned long nr_pages,
+			    int write, int force, struct page **pages,
+			    int *locked)
 {
-	return get_user_pages(tsk, mm, start, nr_pages, write, force,
-			      pages, NULL);
+	return get_user_pages6(start, nr_pages, write, force, pages, NULL);
 }
-EXPORT_SYMBOL(get_user_pages_locked);
+EXPORT_SYMBOL(get_user_pages_locked6);
 
 long __get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
 			       unsigned long start, unsigned long nr_pages,
@@ -216,21 +215,20 @@ long __get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
 {
 	long ret;
 	down_read(&mm->mmap_sem);
-	ret = get_user_pages(tsk, mm, start, nr_pages, write, force,
-			     pages, NULL);
+	ret = __get_user_pages(tsk, mm, start, nr_pages, gup_flags, pages,
+				NULL, NULL);
 	up_read(&mm->mmap_sem);
 	return ret;
 }
 EXPORT_SYMBOL(__get_user_pages_unlocked);
 
-long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
-			     unsigned long start, unsigned long nr_pages,
+long get_user_pages_unlocked5(unsigned long start, unsigned long nr_pages,
 			     int write, int force, struct page **pages)
 {
-	return __get_user_pages_unlocked(tsk, mm, start, nr_pages, write,
-					 force, pages, 0);
+	return __get_user_pages_unlocked(current, current->mm, start, nr_pages,
+					 write, force, pages, 0);
 }
-EXPORT_SYMBOL(get_user_pages_unlocked);
+EXPORT_SYMBOL(get_user_pages_unlocked5);
 
 /**
  * follow_pfn - look up PFN at a user virtual address
@@ -2108,3 +2106,31 @@ static int __meminit init_admin_reserve(void)
 	return 0;
 }
 subsys_initcall(init_admin_reserve);
+
+long get_user_pages8(struct task_struct *tsk, struct mm_struct *mm,
+		     unsigned long start, unsigned long nr_pages,
+		     int write, int force, struct page **pages,
+		     struct vm_area_struct **vmas)
+{
+	return get_user_pages6(start, nr_pages, write, force, pages, vmas);
+}
+EXPORT_SYMBOL(get_user_pages8);
+
+long get_user_pages_locked8(struct task_struct *tsk, struct mm_struct *mm,
+			    unsigned long start, unsigned long nr_pages,
+			    int write, int force, struct page **pages,
+			    int *locked)
+{
+	return get_user_pages_locked6(start, nr_pages, write,
+				      force, pages, locked);
+}
+EXPORT_SYMBOL(get_user_pages_locked8);
+
+long get_user_pages_unlocked7(struct task_struct *tsk, struct mm_struct *mm,
+			      unsigned long start, unsigned long nr_pages,
+			      int write, int force, struct page **pages)
+{
+	return get_user_pages_unlocked5(start, nr_pages, write, force, pages);
+}
+EXPORT_SYMBOL(get_user_pages_unlocked7);
+
diff --git a/mm/util.c b/mm/util.c
index 4fb14ca..1e60116 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -283,9 +283,7 @@ EXPORT_SYMBOL_GPL(__get_user_pages_fast);
 int __weak get_user_pages_fast(unsigned long start,
 				int nr_pages, int write, struct page **pages)
 {
-	struct mm_struct *mm = current->mm;
-	return get_user_pages_unlocked(current, mm, start, nr_pages,
-				       write, 0, pages);
+	return get_user_pages_unlocked(start, nr_pages, write, 0, pages);
 }
 EXPORT_SYMBOL_GPL(get_user_pages_fast);
 

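For illustration, the variadic-macro trick above selects between the
6- and 8-argument variants purely by arity: the caller's arguments
shift a list of candidate function names so that a fixed parameter
position always holds the right one.  A minimal, self-contained
sketch of the same idea (hypothetical names, any C99 compiler):

	#include <stdio.h>

	static long f2(int a, int b)        { return a + b; }
	static long f3(int a, int b, int c) { return a + b + c; }

	/* Select the 4th argument; extra caller args push f3/f2 into place. */
	#define PICK4(_1, _2, _3, fn, ...) fn
	#define f(...) PICK4(__VA_ARGS__, f3, f2, x)(__VA_ARGS__)

	int main(void)
	{
		printf("%ld %ld\n", f(1, 2), f(1, 2, 3));	/* prints: 3 6 */
		return 0;
	}

GUP_MACRO() is the same construction with eight slots:
get_user_pages8 is picked for old-style eight-argument calls and
get_user_pages6 for new-style six-argument calls, while the 'x'
fillers occupy arities that should never occur.  The compile-time
__deprecated marking catches old-style callers, and the WARN_ONCE()s
in the *8 functions catch remote-task misuse at runtime.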

* [tip:mm/pkeys] mm/gup: Switch all callers of get_user_pages() to not pass tsk/mm
  2016-02-12 21:01 ` [PATCH 03/33] mm, gup: switch callers of get_user_pages() to not pass tsk/mm Dave Hansen
@ 2016-02-18 20:16   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:16 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: akpm, linux-kernel, hpa, torvalds, aarcange, brgerst,
	dave.hansen, srikar, riel, n-horiguchi, vbabka, kirill.shutemov,
	tglx, mingo, dave, peterz, luto, dvlasenk, bp

Commit-ID:  d4edcf0d56958db0aca0196314ca38a5e730ea92
Gitweb:     http://git.kernel.org/tip/d4edcf0d56958db0aca0196314ca38a5e730ea92
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:01:56 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 16 Feb 2016 10:11:12 +0100

mm/gup: Switch all callers of get_user_pages() to not pass tsk/mm

We will soon modify the vanilla get_user_pages() so it can no
longer be used on mm/tasks other than 'current/current->mm',
which is by far the most common way it is called.  For now,
we allow the old-style calls, but warn when they are used.
(implemented in previous patch)

This patch switches all callers of:

	get_user_pages()
	get_user_pages_unlocked()
	get_user_pages_locked()

to stop passing tsk/mm so they will no longer see the warnings.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: jack@suse.cz
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210156.113E9407@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/cris/arch-v32/drivers/cryptocop.c      |  8 ++------
 arch/ia64/kernel/err_inject.c               |  3 +--
 arch/mips/mm/gup.c                          |  3 +--
 arch/s390/mm/gup.c                          |  4 +---
 arch/sh/mm/gup.c                            |  2 +-
 arch/sparc/mm/gup.c                         |  2 +-
 arch/x86/mm/gup.c                           |  2 +-
 arch/x86/mm/mpx.c                           |  4 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c     |  3 +--
 drivers/gpu/drm/radeon/radeon_ttm.c         |  3 +--
 drivers/gpu/drm/via/via_dmablit.c           |  3 +--
 drivers/infiniband/core/umem.c              |  2 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |  3 +--
 drivers/infiniband/hw/qib/qib_user_pages.c  |  3 +--
 drivers/infiniband/hw/usnic/usnic_uiom.c    |  2 +-
 drivers/media/pci/ivtv/ivtv-udma.c          |  4 ++--
 drivers/media/pci/ivtv/ivtv-yuv.c           | 10 ++++------
 drivers/media/v4l2-core/videobuf-dma-sg.c   |  3 +--
 drivers/misc/mic/scif/scif_rma.c            |  2 --
 drivers/misc/sgi-gru/grufault.c             |  3 +--
 drivers/scsi/st.c                           |  2 --
 drivers/video/fbdev/pvr2fb.c                |  4 ++--
 drivers/virt/fsl_hypervisor.c               |  5 ++---
 mm/frame_vector.c                           |  2 +-
 mm/gup.c                                    |  6 ++++--
 mm/ksm.c                                    |  2 +-
 mm/mempolicy.c                              |  6 +++---
 net/ceph/pagevec.c                          |  2 +-
 virt/kvm/kvm_main.c                         | 10 +++++-----
 29 files changed, 44 insertions(+), 64 deletions(-)

diff --git a/arch/cris/arch-v32/drivers/cryptocop.c b/arch/cris/arch-v32/drivers/cryptocop.c
index 877da19..617645d 100644
--- a/arch/cris/arch-v32/drivers/cryptocop.c
+++ b/arch/cris/arch-v32/drivers/cryptocop.c
@@ -2719,9 +2719,7 @@ static int cryptocop_ioctl_process(struct inode *inode, struct file *filp, unsig
 	/* Acquire the mm page semaphore. */
 	down_read(&current->mm->mmap_sem);
 
-	err = get_user_pages(current,
-			     current->mm,
-			     (unsigned long int)(oper.indata + prev_ix),
+	err = get_user_pages((unsigned long int)(oper.indata + prev_ix),
 			     noinpages,
 			     0,  /* read access only for in data */
 			     0, /* no force */
@@ -2736,9 +2734,7 @@ static int cryptocop_ioctl_process(struct inode *inode, struct file *filp, unsig
 	}
 	noinpages = err;
 	if (oper.do_cipher){
-		err = get_user_pages(current,
-				     current->mm,
-				     (unsigned long int)oper.cipher_outdata,
+		err = get_user_pages((unsigned long int)oper.cipher_outdata,
 				     nooutpages,
 				     1, /* write access for out data */
 				     0, /* no force */
diff --git a/arch/ia64/kernel/err_inject.c b/arch/ia64/kernel/err_inject.c
index 0c161ed..09f8457 100644
--- a/arch/ia64/kernel/err_inject.c
+++ b/arch/ia64/kernel/err_inject.c
@@ -142,8 +142,7 @@ store_virtual_to_phys(struct device *dev, struct device_attribute *attr,
 	u64 virt_addr=simple_strtoull(buf, NULL, 16);
 	int ret;
 
-        ret = get_user_pages(current, current->mm, virt_addr,
-                        1, VM_READ, 0, NULL, NULL);
+	ret = get_user_pages(virt_addr, 1, VM_READ, 0, NULL, NULL);
 	if (ret<=0) {
 #ifdef ERR_INJ_DEBUG
 		printk("Virtual address %lx is not existing.\n",virt_addr);
diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c
index 1afd87c..982e83f 100644
--- a/arch/mips/mm/gup.c
+++ b/arch/mips/mm/gup.c
@@ -286,8 +286,7 @@ slow_irqon:
 	start += nr << PAGE_SHIFT;
 	pages += nr;
 
-	ret = get_user_pages_unlocked(current, mm, start,
-				      (end - start) >> PAGE_SHIFT,
+	ret = get_user_pages_unlocked(start, (end - start) >> PAGE_SHIFT,
 				      write, 0, pages);
 
 	/* Have to be a bit careful with return values */
diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 13dab0c..49a1c84 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -210,7 +210,6 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages)
 {
-	struct mm_struct *mm = current->mm;
 	int nr, ret;
 
 	might_sleep();
@@ -222,8 +221,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	/* Try to get the remaining pages with get_user_pages */
 	start += nr << PAGE_SHIFT;
 	pages += nr;
-	ret = get_user_pages_unlocked(current, mm, start,
-			     nr_pages - nr, write, 0, pages);
+	ret = get_user_pages_unlocked(start, nr_pages - nr, write, 0, pages);
 	/* Have to be a bit careful with return values */
 	if (nr > 0)
 		ret = (ret < 0) ? nr : ret + nr;
diff --git a/arch/sh/mm/gup.c b/arch/sh/mm/gup.c
index e7af6a6..40fa6c8 100644
--- a/arch/sh/mm/gup.c
+++ b/arch/sh/mm/gup.c
@@ -257,7 +257,7 @@ slow_irqon:
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
-		ret = get_user_pages_unlocked(current, mm, start,
+		ret = get_user_pages_unlocked(start,
 			(end - start) >> PAGE_SHIFT, write, 0, pages);
 
 		/* Have to be a bit careful with return values */
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index eb3d8e8..4e06750 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -237,7 +237,7 @@ slow:
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
-		ret = get_user_pages_unlocked(current, mm, start,
+		ret = get_user_pages_unlocked(start,
 			(end - start) >> PAGE_SHIFT, write, 0, pages);
 
 		/* Have to be a bit careful with return values */
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 6d5eb59..ce5e454 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -422,7 +422,7 @@ slow_irqon:
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
-		ret = get_user_pages_unlocked(current, mm, start,
+		ret = get_user_pages_unlocked(start,
 					      (end - start) >> PAGE_SHIFT,
 					      write, 0, pages);
 
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index b2fd67d..84fa4a4 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -546,8 +546,8 @@ static int mpx_resolve_fault(long __user *addr, int write)
 	int nr_pages = 1;
 	int force = 0;
 
-	gup_ret = get_user_pages(current, current->mm, (unsigned long)addr,
-				 nr_pages, write, force, NULL, NULL);
+	gup_ret = get_user_pages((unsigned long)addr, nr_pages, write,
+			force, NULL, NULL);
 	/*
 	 * get_user_pages() returns number of pages gotten.
 	 * 0 means we failed to fault in and get anything,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 6442a06..5fedfb6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -518,8 +518,7 @@ static int amdgpu_ttm_tt_pin_userptr(struct ttm_tt *ttm)
 		uint64_t userptr = gtt->userptr + pinned * PAGE_SIZE;
 		struct page **pages = ttm->pages + pinned;
 
-		r = get_user_pages(current, current->mm, userptr, num_pages,
-				   write, 0, pages, NULL);
+		r = get_user_pages(userptr, num_pages, write, 0, pages, NULL);
 		if (r < 0)
 			goto release_pages;
 
diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
index e343074..927a9f2 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -554,8 +554,7 @@ static int radeon_ttm_tt_pin_userptr(struct ttm_tt *ttm)
 		uint64_t userptr = gtt->userptr + pinned * PAGE_SIZE;
 		struct page **pages = ttm->pages + pinned;
 
-		r = get_user_pages(current, current->mm, userptr, num_pages,
-				   write, 0, pages, NULL);
+		r = get_user_pages(userptr, num_pages, write, 0, pages, NULL);
 		if (r < 0)
 			goto release_pages;
 
diff --git a/drivers/gpu/drm/via/via_dmablit.c b/drivers/gpu/drm/via/via_dmablit.c
index d0cbd5e..e797dfc 100644
--- a/drivers/gpu/drm/via/via_dmablit.c
+++ b/drivers/gpu/drm/via/via_dmablit.c
@@ -239,8 +239,7 @@ via_lock_all_dma_pages(drm_via_sg_info_t *vsg,  drm_via_dmablit_t *xfer)
 	if (NULL == vsg->pages)
 		return -ENOMEM;
 	down_read(&current->mm->mmap_sem);
-	ret = get_user_pages(current, current->mm,
-			     (unsigned long)xfer->mem_addr,
+	ret = get_user_pages((unsigned long)xfer->mem_addr,
 			     vsg->num_pages,
 			     (vsg->direction == DMA_FROM_DEVICE),
 			     0, vsg->pages, NULL);
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 38acb3c..fe4d2e1 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -188,7 +188,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	sg_list_start = umem->sg_head.sgl;
 
 	while (npages) {
-		ret = get_user_pages(current, current->mm, cur_base,
+		ret = get_user_pages(cur_base,
 				     min_t(unsigned long, npages,
 					   PAGE_SIZE / sizeof (struct page *)),
 				     1, !umem->writable, page_list, vma_list);
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c b/drivers/infiniband/hw/mthca/mthca_memfree.c
index 7d2e42d..6c00d04 100644
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -472,8 +472,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar,
 		goto out;
 	}
 
-	ret = get_user_pages(current, current->mm, uaddr & PAGE_MASK, 1, 1, 0,
-			     pages, NULL);
+	ret = get_user_pages(uaddr & PAGE_MASK, 1, 1, 0, pages, NULL);
 	if (ret < 0)
 		goto out;
 
diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index 74f90b2..2d2b94f 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -66,8 +66,7 @@ static int __qib_get_user_pages(unsigned long start_page, size_t num_pages,
 	}
 
 	for (got = 0; got < num_pages; got += ret) {
-		ret = get_user_pages(current, current->mm,
-				     start_page + got * PAGE_SIZE,
+		ret = get_user_pages(start_page + got * PAGE_SIZE,
 				     num_pages - got, 1, 1,
 				     p + got, NULL);
 		if (ret < 0)
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index 645a5f6..7209fbc 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -144,7 +144,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
 	ret = 0;
 
 	while (npages) {
-		ret = get_user_pages(current, current->mm, cur_base,
+		ret = get_user_pages(cur_base,
 					min_t(unsigned long, npages,
 					PAGE_SIZE / sizeof(struct page *)),
 					1, !writable, page_list, NULL);
diff --git a/drivers/media/pci/ivtv/ivtv-udma.c b/drivers/media/pci/ivtv/ivtv-udma.c
index 24152ac..4769469 100644
--- a/drivers/media/pci/ivtv/ivtv-udma.c
+++ b/drivers/media/pci/ivtv/ivtv-udma.c
@@ -124,8 +124,8 @@ int ivtv_udma_setup(struct ivtv *itv, unsigned long ivtv_dest_addr,
 	}
 
 	/* Get user pages for DMA Xfer */
-	err = get_user_pages_unlocked(current, current->mm,
-			user_dma.uaddr, user_dma.page_count, 0, 1, dma->map);
+	err = get_user_pages_unlocked(user_dma.uaddr, user_dma.page_count, 0,
+			1, dma->map);
 
 	if (user_dma.page_count != err) {
 		IVTV_DEBUG_WARN("failed to map user pages, returned %d instead of %d\n",
diff --git a/drivers/media/pci/ivtv/ivtv-yuv.c b/drivers/media/pci/ivtv/ivtv-yuv.c
index 2b8e7b2..b094054 100644
--- a/drivers/media/pci/ivtv/ivtv-yuv.c
+++ b/drivers/media/pci/ivtv/ivtv-yuv.c
@@ -75,14 +75,12 @@ static int ivtv_yuv_prep_user_dma(struct ivtv *itv, struct ivtv_user_dma *dma,
 	ivtv_udma_get_page_info (&uv_dma, (unsigned long)args->uv_source, 360 * uv_decode_height);
 
 	/* Get user pages for DMA Xfer */
-	y_pages = get_user_pages_unlocked(current, current->mm,
-				y_dma.uaddr, y_dma.page_count, 0, 1,
-				&dma->map[0]);
+	y_pages = get_user_pages_unlocked(y_dma.uaddr,
+			y_dma.page_count, 0, 1, &dma->map[0]);
 	uv_pages = 0; /* silence gcc. value is set and consumed only if: */
 	if (y_pages == y_dma.page_count) {
-		uv_pages = get_user_pages_unlocked(current, current->mm,
-					uv_dma.uaddr, uv_dma.page_count, 0, 1,
-					&dma->map[y_pages]);
+		uv_pages = get_user_pages_unlocked(uv_dma.uaddr,
+				uv_dma.page_count, 0, 1, &dma->map[y_pages]);
 	}
 
 	if (y_pages != y_dma.page_count || uv_pages != uv_dma.page_count) {
diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
index f669ced..df4c052c 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -181,8 +181,7 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma,
 	dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
 		data, size, dma->nr_pages);
 
-	err = get_user_pages(current, current->mm,
-			     data & PAGE_MASK, dma->nr_pages,
+	err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
 			     rw == READ, 1, /* force */
 			     dma->pages, NULL);
 
diff --git a/drivers/misc/mic/scif/scif_rma.c b/drivers/misc/mic/scif/scif_rma.c
index 8310b4d..0fa0d242 100644
--- a/drivers/misc/mic/scif/scif_rma.c
+++ b/drivers/misc/mic/scif/scif_rma.c
@@ -1394,8 +1394,6 @@ retry:
 		}
 
 		pinned_pages->nr_pages = get_user_pages(
-				current,
-				mm,
 				(u64)addr,
 				nr_pages,
 				!!(prot & SCIF_PROT_WRITE),
diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
index f74fc0c..a2d97b9 100644
--- a/drivers/misc/sgi-gru/grufault.c
+++ b/drivers/misc/sgi-gru/grufault.c
@@ -198,8 +198,7 @@ static int non_atomic_pte_lookup(struct vm_area_struct *vma,
 #else
 	*pageshift = PAGE_SHIFT;
 #endif
-	if (get_user_pages
-	    (current, current->mm, vaddr, 1, write, 0, &page, NULL) <= 0)
+	if (get_user_pages(vaddr, 1, write, 0, &page, NULL) <= 0)
 		return -EFAULT;
 	*paddr = page_to_phys(page);
 	put_page(page);
diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
index 2e52295..664852a 100644
--- a/drivers/scsi/st.c
+++ b/drivers/scsi/st.c
@@ -4817,8 +4817,6 @@ static int sgl_map_user_pages(struct st_buffer *STbp,
         /* Try to fault in all of the necessary pages */
         /* rw==READ means read from drive, write into memory area */
 	res = get_user_pages_unlocked(
-		current,
-		current->mm,
 		uaddr,
 		nr_pages,
 		rw == READ,
diff --git a/drivers/video/fbdev/pvr2fb.c b/drivers/video/fbdev/pvr2fb.c
index 0e24eb9..71a923e 100644
--- a/drivers/video/fbdev/pvr2fb.c
+++ b/drivers/video/fbdev/pvr2fb.c
@@ -686,8 +686,8 @@ static ssize_t pvr2fb_write(struct fb_info *info, const char *buf,
 	if (!pages)
 		return -ENOMEM;
 
-	ret = get_user_pages_unlocked(current, current->mm, (unsigned long)buf,
-				      nr_pages, WRITE, 0, pages);
+	ret = get_user_pages_unlocked((unsigned long)buf, nr_pages, WRITE,
+			0, pages);
 
 	if (ret < nr_pages) {
 		nr_pages = ret;
diff --git a/drivers/virt/fsl_hypervisor.c b/drivers/virt/fsl_hypervisor.c
index 32c8fc5..60bdad3 100644
--- a/drivers/virt/fsl_hypervisor.c
+++ b/drivers/virt/fsl_hypervisor.c
@@ -244,9 +244,8 @@ static long ioctl_memcpy(struct fsl_hv_ioctl_memcpy __user *p)
 
 	/* Get the physical addresses of the source buffer */
 	down_read(&current->mm->mmap_sem);
-	num_pinned = get_user_pages(current, current->mm,
-		param.local_vaddr - lb_offset, num_pages,
-		(param.source == -1) ? READ : WRITE,
+	num_pinned = get_user_pages(param.local_vaddr - lb_offset,
+		num_pages, (param.source == -1) ? READ : WRITE,
 		0, pages, NULL);
 	up_read(&current->mm->mmap_sem);
 
diff --git a/mm/frame_vector.c b/mm/frame_vector.c
index 7cf2b71..381bb07 100644
--- a/mm/frame_vector.c
+++ b/mm/frame_vector.c
@@ -58,7 +58,7 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
 	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
 		vec->got_ref = true;
 		vec->is_pfns = false;
-		ret = get_user_pages_locked(current, mm, start, nr_frames,
+		ret = get_user_pages_locked(start, nr_frames,
 			write, force, (struct page **)(vec->ptrs), &locked);
 		goto out;
 	}
diff --git a/mm/gup.c b/mm/gup.c
index 8a035e0..de24ef4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -936,8 +936,10 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 EXPORT_SYMBOL(get_user_pages_remote);
 
 /*
- * This is the same as get_user_pages_remote() for the time
- * being.
+ * This is the same as get_user_pages_remote(), just with a
+ * less-flexible calling convention where we assume that the task
+ * and mm being operated on are the current task's.  We also
+ * obviously don't pass FOLL_REMOTE in here.
  */
 long get_user_pages6(unsigned long start, unsigned long nr_pages,
 		int write, int force, struct page **pages,
diff --git a/mm/ksm.c b/mm/ksm.c
index ca6d2a0..c2013f6 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -352,7 +352,7 @@ static inline bool ksm_test_exit(struct mm_struct *mm)
 /*
  * We use break_ksm to break COW on a ksm page: it's a stripped down
  *
- *	if (get_user_pages(current, mm, addr, 1, 1, 1, &page, NULL) == 1)
+ *	if (get_user_pages(addr, 1, 1, 1, &page, NULL) == 1)
  *		put_page(page);
  *
  * but taking great care only to touch a ksm page, in a VM_MERGEABLE vma,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4c4187c..dd0ce7f 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -844,12 +844,12 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
 	}
 }
 
-static int lookup_node(struct mm_struct *mm, unsigned long addr)
+static int lookup_node(unsigned long addr)
 {
 	struct page *p;
 	int err;
 
-	err = get_user_pages(current, mm, addr & PAGE_MASK, 1, 0, 0, &p, NULL);
+	err = get_user_pages(addr & PAGE_MASK, 1, 0, 0, &p, NULL);
 	if (err >= 0) {
 		err = page_to_nid(p);
 		put_page(p);
@@ -904,7 +904,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 
 	if (flags & MPOL_F_NODE) {
 		if (flags & MPOL_F_ADDR) {
-			err = lookup_node(mm, addr);
+			err = lookup_node(addr);
 			if (err < 0)
 				goto out;
 			*policy = err;
diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
index d4f5f22..10297f7 100644
--- a/net/ceph/pagevec.c
+++ b/net/ceph/pagevec.c
@@ -24,7 +24,7 @@ struct page **ceph_get_direct_page_vector(const void __user *data,
 		return ERR_PTR(-ENOMEM);
 
 	while (got < num_pages) {
-		rc = get_user_pages_unlocked(current, current->mm,
+		rc = get_user_pages_unlocked(
 		    (unsigned long)data + ((unsigned long)got * PAGE_SIZE),
 		    num_pages - got, write_page, 0, pages + got);
 		if (rc < 0)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a11cfd2..0253ad9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1264,15 +1264,16 @@ unsigned long kvm_vcpu_gfn_to_hva_prot(struct kvm_vcpu *vcpu, gfn_t gfn, bool *w
 	return gfn_to_hva_memslot_prot(slot, gfn, writable);
 }
 
-static int get_user_page_nowait(struct task_struct *tsk, struct mm_struct *mm,
-	unsigned long start, int write, struct page **page)
+static int get_user_page_nowait(unsigned long start, int write,
+		struct page **page)
 {
 	int flags = FOLL_TOUCH | FOLL_NOWAIT | FOLL_HWPOISON | FOLL_GET;
 
 	if (write)
 		flags |= FOLL_WRITE;
 
-	return __get_user_pages(tsk, mm, start, 1, flags, page, NULL, NULL);
+	return __get_user_pages(current, current->mm, start, 1, flags, page,
+			NULL, NULL);
 }
 
 static inline int check_user_page_hwpoison(unsigned long addr)
@@ -1334,8 +1335,7 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
 
 	if (async) {
 		down_read(&current->mm->mmap_sem);
-		npages = get_user_page_nowait(current, current->mm,
-					      addr, write_fault, page);
+		npages = get_user_page_nowait(addr, write_fault, page);
 		up_read(&current->mm->mmap_sem);
 	} else
 		npages = __get_user_pages_unlocked(current, current->mm, addr, 1,

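For a reader following along, the conversion is mechanical.  A
hypothetical caller (not taken from the patch) that really operates
on the current process changes like this:

	#include <linux/mm.h>

	/* Hypothetical example; addr/nr/pages as in any GUP user. */
	static long pin_current_pages(unsigned long addr, unsigned long nr,
				      struct page **pages)
	{
		/* Before: get_user_pages(current, current->mm, addr, nr,
		 *			  1, 0, pages, NULL);
		 */
		return get_user_pages(addr, nr, 1, 0, pages, NULL);
	}

Callers that genuinely target another task's mm are expected to move
to get_user_pages_remote(tsk, mm, ...) instead, which keeps the
explicit task/mm arguments.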

* [tip:mm/pkeys] x86/fpu: Add placeholder for 'Processor Trace' XSAVE state
  2016-02-12 21:01 ` [PATCH 04/33] x86, fpu: add placeholder for Processor Trace XSAVE state Dave Hansen
@ 2016-02-18 20:16   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:16 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, torvalds, dvlasenk, linux-kernel, mingo, dave, bp, luto,
	riel, peterz, hpa, dave.hansen, brgerst, akpm

Commit-ID:  1f96b1efbad4bb753e7fd265753f6cac1cdc5648
Gitweb:     http://git.kernel.org/tip/1f96b1efbad4bb753e7fd265753f6cac1cdc5648
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:01:58 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 16 Feb 2016 10:11:13 +0100

x86/fpu: Add placeholder for 'Processor Trace' XSAVE state

There is an XSAVE state component for Intel Processor Trace (PT).
But, we do not currently use it.

We add a placeholder for it in the code so that it is not a mystery,
and so that the Protection Keys entry added in a moment does not
need an explicit enum initialization.

Why don't we use it?

We might end up using this at _some_ point in the future.  But,
this is a "system" state which requires using the currently
unsupported XSAVES feature.  Unlike all the other XSAVE states,
PT state is also not directly tied to a thread.  You might
context-switch between threads, but not want to change any of the
PT state.  Or, you might switch between threads, and *do* want to
change PT state, all depending on what is being traced.

We currently just manually set some MSRs to do this PT context
switching, and it is unclear whether replacing our direct MSR use
with XSAVE will be a net win or loss, both in code complexity and
performance.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: fenghua.yu@intel.com
Cc: linux-mm@kvack.org
Cc: yu-cheng.yu@intel.com
Link: http://lkml.kernel.org/r/20160212210158.5E4BCAE2@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/fpu/types.h |  1 +
 arch/x86/kernel/fpu/xstate.c     | 10 ++++++++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 1c6f6ac..aad3181 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -108,6 +108,7 @@ enum xfeature {
 	XFEATURE_OPMASK,
 	XFEATURE_ZMM_Hi256,
 	XFEATURE_Hi16_ZMM,
+	XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
 
 	XFEATURE_MAX,
 };
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d425cda5..c2e2349 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -13,6 +13,11 @@
 
 #include <asm/tlbflush.h>
 
+/*
+ * Although we spell it out in here, the Processor Trace
+ * xfeature is completely unused.  We use other mechanisms
+ * to save/restore PT state in Linux.
+ */
 static const char *xfeature_names[] =
 {
 	"x87 floating point registers"	,
@@ -23,7 +28,7 @@ static const char *xfeature_names[] =
 	"AVX-512 opmask"		,
 	"AVX-512 Hi256"			,
 	"AVX-512 ZMM_Hi256"		,
-	"unknown xstate feature"	,
+	"Processor Trace (unused)"	,
 };
 
 /*
@@ -470,7 +475,8 @@ static void check_xstate_against_struct(int nr)
 	 * numbers.
 	 */
 	if ((nr < XFEATURE_YMM) ||
-	    (nr >= XFEATURE_MAX)) {
+	    (nr >= XFEATURE_MAX) ||
+	    (nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR)) {
 		WARN_ONCE(1, "no structure for xstate: %d\n", nr);
 		XSTATE_WARN_ON(1);
 	}

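One way to see why the placeholder matters (illustrative sketch, not
from the patch): the xfeature enum relies on implicit sequential
values that must line up with the hardware's XSAVE component
numbers, so omitting PT would silently shift everything after it:

	/* Illustrative only: enum values are assigned implicitly. */
	enum xfeature_sketch {
		SKETCH_Hi16_ZMM = 7,	/* last AVX-512 component	*/
		SKETCH_PT,		/* 8: placeholder, unused	*/
		SKETCH_PKRU,		/* 9: matches XCR0 bit 9	*/
	};

Without the PT entry, XFEATURE_PKRU would land on 8, and every mask
built as (1 << XFEATURE_PKRU) would point at the wrong hardware
state component.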

* [tip:mm/pkeys] x86/mm/pkeys: Add Kconfig option
  2016-02-12 21:02 ` [PATCH 05/33] x86, pkeys: Add Kconfig option Dave Hansen
@ 2016-02-18 20:16   ` tip-bot for Dave Hansen
  2016-02-19 11:27     ` [PATCH] x86/mm/pkeys: Do not enable them by default Borislav Petkov
  0 siblings, 1 reply; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:16 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, brgerst, bp, tglx, torvalds, hpa, dave.hansen,
	akpm, peterz, mingo, luto, dvlasenk, dave, riel

Commit-ID:  35e97790f5f1e5cf2b5522c55e3e31d5c81bd226
Gitweb:     http://git.kernel.org/tip/35e97790f5f1e5cf2b5522c55e3e31d5c81bd226
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:00 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 16 Feb 2016 10:11:13 +0100

x86/mm/pkeys: Add Kconfig option

I don't have a strong opinion on whether we need a Kconfig prompt
or not.  Protection Keys has relatively little code associated
with it, and it is not a heavyweight feature to keep enabled.
However, I can imagine that folks would still appreciate being
able to disable it.

Note that, with disabled-features.h, the checks in the code
for protection keys are always the same:

	cpu_has(c, X86_FEATURE_PKU)

With the config option disabled, this essentially turns into an #ifdef.

We will hide the prompt for now.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210200.DB7055E8@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/Kconfig | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ab2ed53..3632cdd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1714,6 +1714,10 @@ config X86_INTEL_MPX
 
 	  If unsure, say N.
 
+config X86_INTEL_MEMORY_PROTECTION_KEYS
+	def_bool y
+	depends on CPU_SUP_INTEL && X86_64
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI

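A hedged sketch of the pattern the changelog alludes to (the call
site and helper are hypothetical, not from this patch): with the
option off, X86_FEATURE_PKU lands in the disabled-features mask,
cpu_has() folds to a compile-time constant, and the guarded code is
discarded just like an #ifdef'd-out block:

	/* Hypothetical call site, not from this patch. */
	static void pkeys_init_sketch(struct cpuinfo_x86 *c)
	{
		if (cpu_has(c, X86_FEATURE_PKU)) {
			/*
			 * With CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS=n,
			 * DISABLED_MASK_BIT_SET(X86_FEATURE_PKU) is true,
			 * the condition folds to constant 0, and the
			 * compiler discards this whole block.
			 */
			enable_pku_sketch(c);	/* hypothetical helper */
		}
	}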

* [tip:mm/pkeys] x86/cpufeature, x86/mm/pkeys: Add protection keys related CPUID definitions
  2016-02-12 21:02 ` [PATCH 06/33] x86, pkeys: cpuid bit definition Dave Hansen
@ 2016-02-18 20:17   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:17 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, tglx, peterz, torvalds, dave, riel, dave.hansen, akpm,
	luto, hpa, brgerst, linux-kernel, dvlasenk, bp

Commit-ID:  dfb4a70f20c5b3880da56ee4c9484bdb4e8f1e65
Gitweb:     http://git.kernel.org/tip/dfb4a70f20c5b3880da56ee4c9484bdb4e8f1e65
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:01 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 16 Feb 2016 10:11:13 +0100

x86/cpufeature, x86/mm/pkeys: Add protection keys related CPUID definitions

There are two CPUID bits for protection keys.  One indicates
whether the CPU supports the feature, and the other will appear set
once the OS enables protection keys.  Specifically:

	Bit 04: OSPKE. If 1, OS has set CR4.PKE to enable
	Protection keys (and the RDPKRU/WRPKRU instructions)

This is because userspace cannot see CR4 contents, but it can
see CPUID contents.

X86_FEATURE_PKU is referred to as "PKU" in the hardware documentation:

	CPUID.(EAX=07H,ECX=0H):ECX.PKU [bit 3]

X86_FEATURE_OSPKE is "OSPKU":

	CPUID.(EAX=07H,ECX=0H):ECX.OSPKE [bit 4]

These are the first CPU features which need to look at the
ECX word in CPUID leaf 0x7, so this patch also includes
fetching that word into the cpuinfo->x86_capability[] array.

Add it to the disabled-features mask when its config option is
off.  Even though we are not using it here, we also extend the
REQUIRED_MASK_BIT_SET() macro to keep it mirroring the
DISABLED_MASK_BIT_SET() version.

This means that in almost all code, you should use:

	cpu_has(c, X86_FEATURE_PKU)

and *not* the CONFIG option.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210201.7714C250@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/cpufeature.h        | 59 +++++++++++++++++++++-----------
 arch/x86/include/asm/cpufeatures.h       |  2 +-
 arch/x86/include/asm/disabled-features.h | 15 ++++++++
 arch/x86/include/asm/required-features.h |  7 ++++
 arch/x86/kernel/cpu/common.c             |  1 +
 5 files changed, 63 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 68e4e82..50e292a 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -26,6 +26,7 @@ enum cpuid_leafs
 	CPUID_8000_0008_EBX,
 	CPUID_6_EAX,
 	CPUID_8000_000A_EDX,
+	CPUID_7_ECX,
 };
 
 #ifdef CONFIG_X86_FEATURE_NAMES
@@ -48,28 +49,42 @@ extern const char * const x86_bug_flags[NBUGINTS*32];
 	 test_bit(bit, (unsigned long *)((c)->x86_capability))
 
 #define REQUIRED_MASK_BIT_SET(bit)					\
-	 ( (((bit)>>5)==0 && (1UL<<((bit)&31) & REQUIRED_MASK0)) ||	\
-	   (((bit)>>5)==1 && (1UL<<((bit)&31) & REQUIRED_MASK1)) ||	\
-	   (((bit)>>5)==2 && (1UL<<((bit)&31) & REQUIRED_MASK2)) ||	\
-	   (((bit)>>5)==3 && (1UL<<((bit)&31) & REQUIRED_MASK3)) ||	\
-	   (((bit)>>5)==4 && (1UL<<((bit)&31) & REQUIRED_MASK4)) ||	\
-	   (((bit)>>5)==5 && (1UL<<((bit)&31) & REQUIRED_MASK5)) ||	\
-	   (((bit)>>5)==6 && (1UL<<((bit)&31) & REQUIRED_MASK6)) ||	\
-	   (((bit)>>5)==7 && (1UL<<((bit)&31) & REQUIRED_MASK7)) ||	\
-	   (((bit)>>5)==8 && (1UL<<((bit)&31) & REQUIRED_MASK8)) ||	\
-	   (((bit)>>5)==9 && (1UL<<((bit)&31) & REQUIRED_MASK9)) )
+	 ( (((bit)>>5)==0  && (1UL<<((bit)&31) & REQUIRED_MASK0 )) ||	\
+	   (((bit)>>5)==1  && (1UL<<((bit)&31) & REQUIRED_MASK1 )) ||	\
+	   (((bit)>>5)==2  && (1UL<<((bit)&31) & REQUIRED_MASK2 )) ||	\
+	   (((bit)>>5)==3  && (1UL<<((bit)&31) & REQUIRED_MASK3 )) ||	\
+	   (((bit)>>5)==4  && (1UL<<((bit)&31) & REQUIRED_MASK4 )) ||	\
+	   (((bit)>>5)==5  && (1UL<<((bit)&31) & REQUIRED_MASK5 )) ||	\
+	   (((bit)>>5)==6  && (1UL<<((bit)&31) & REQUIRED_MASK6 )) ||	\
+	   (((bit)>>5)==7  && (1UL<<((bit)&31) & REQUIRED_MASK7 )) ||	\
+	   (((bit)>>5)==8  && (1UL<<((bit)&31) & REQUIRED_MASK8 )) ||	\
+	   (((bit)>>5)==9  && (1UL<<((bit)&31) & REQUIRED_MASK9 )) ||	\
+	   (((bit)>>5)==10 && (1UL<<((bit)&31) & REQUIRED_MASK10)) ||	\
+	   (((bit)>>5)==11 && (1UL<<((bit)&31) & REQUIRED_MASK11)) ||	\
+	   (((bit)>>5)==12 && (1UL<<((bit)&31) & REQUIRED_MASK12)) ||	\
+	   (((bit)>>5)==13 && (1UL<<((bit)&31) & REQUIRED_MASK13)) ||	\
+	   (((bit)>>5)==14 && (1UL<<((bit)&31) & REQUIRED_MASK14)) ||	\
+	   (((bit)>>5)==15 && (1UL<<((bit)&31) & REQUIRED_MASK15)) ||	\
+	   (((bit)>>5)==16 && (1UL<<((bit)&31) & REQUIRED_MASK16)) )
 
 #define DISABLED_MASK_BIT_SET(bit)					\
-	 ( (((bit)>>5)==0 && (1UL<<((bit)&31) & DISABLED_MASK0)) ||	\
-	   (((bit)>>5)==1 && (1UL<<((bit)&31) & DISABLED_MASK1)) ||	\
-	   (((bit)>>5)==2 && (1UL<<((bit)&31) & DISABLED_MASK2)) ||	\
-	   (((bit)>>5)==3 && (1UL<<((bit)&31) & DISABLED_MASK3)) ||	\
-	   (((bit)>>5)==4 && (1UL<<((bit)&31) & DISABLED_MASK4)) ||	\
-	   (((bit)>>5)==5 && (1UL<<((bit)&31) & DISABLED_MASK5)) ||	\
-	   (((bit)>>5)==6 && (1UL<<((bit)&31) & DISABLED_MASK6)) ||	\
-	   (((bit)>>5)==7 && (1UL<<((bit)&31) & DISABLED_MASK7)) ||	\
-	   (((bit)>>5)==8 && (1UL<<((bit)&31) & DISABLED_MASK8)) ||	\
-	   (((bit)>>5)==9 && (1UL<<((bit)&31) & DISABLED_MASK9)) )
+	 ( (((bit)>>5)==0  && (1UL<<((bit)&31) & DISABLED_MASK0 )) ||	\
+	   (((bit)>>5)==1  && (1UL<<((bit)&31) & DISABLED_MASK1 )) ||	\
+	   (((bit)>>5)==2  && (1UL<<((bit)&31) & DISABLED_MASK2 )) ||	\
+	   (((bit)>>5)==3  && (1UL<<((bit)&31) & DISABLED_MASK3 )) ||	\
+	   (((bit)>>5)==4  && (1UL<<((bit)&31) & DISABLED_MASK4 )) ||	\
+	   (((bit)>>5)==5  && (1UL<<((bit)&31) & DISABLED_MASK5 )) ||	\
+	   (((bit)>>5)==6  && (1UL<<((bit)&31) & DISABLED_MASK6 )) ||	\
+	   (((bit)>>5)==7  && (1UL<<((bit)&31) & DISABLED_MASK7 )) ||	\
+	   (((bit)>>5)==8  && (1UL<<((bit)&31) & DISABLED_MASK8 )) ||	\
+	   (((bit)>>5)==9  && (1UL<<((bit)&31) & DISABLED_MASK9 )) ||	\
+	   (((bit)>>5)==10 && (1UL<<((bit)&31) & DISABLED_MASK10)) ||	\
+	   (((bit)>>5)==11 && (1UL<<((bit)&31) & DISABLED_MASK11)) ||	\
+	   (((bit)>>5)==12 && (1UL<<((bit)&31) & DISABLED_MASK12)) ||	\
+	   (((bit)>>5)==13 && (1UL<<((bit)&31) & DISABLED_MASK13)) ||	\
+	   (((bit)>>5)==14 && (1UL<<((bit)&31) & DISABLED_MASK14)) ||	\
+	   (((bit)>>5)==15 && (1UL<<((bit)&31) & DISABLED_MASK15)) ||	\
+	   (((bit)>>5)==16 && (1UL<<((bit)&31) & DISABLED_MASK16)) )
 
 #define cpu_has(c, bit)							\
 	(__builtin_constant_p(bit) && REQUIRED_MASK_BIT_SET(bit) ? 1 :	\
@@ -79,6 +94,10 @@ extern const char * const x86_bug_flags[NBUGINTS*32];
 	(__builtin_constant_p(bit) && REQUIRED_MASK_BIT_SET(bit) ? 1 : 	\
 	 x86_this_cpu_test_bit(bit, (unsigned long *)&cpu_info.x86_capability))
 
+/* Intel-defined CPU features, CPUID level 0x00000007:0 (ecx), word 16 */
+#define X86_FEATURE_PKU		(16*32+ 3) /* Protection Keys for Userspace */
+#define X86_FEATURE_OSPKE	(16*32+ 4) /* OS Protection Keys Enable */
+
 /*
  * This macro is for detection of features which need kernel
  * infrastructure to be used.  It may *not* directly test the CPU
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 0ceb6ad..cbb2c56 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -12,7 +12,7 @@
 /*
  * Defines x86 CPU feature bits
  */
-#define NCAPINTS	16	/* N 32-bit words worth of info */
+#define NCAPINTS	17	/* N 32-bit words worth of info */
 #define NBUGINTS	1	/* N 32-bit bug flags */
 
 /*
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index f226df0..39343be 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -28,6 +28,14 @@
 # define DISABLE_CENTAUR_MCR	0
 #endif /* CONFIG_X86_64 */
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+# define DISABLE_PKU		(1<<(X86_FEATURE_PKU))
+# define DISABLE_OSPKE		(1<<(X86_FEATURE_OSPKE))
+#else
+# define DISABLE_PKU		0
+# define DISABLE_OSPKE		0
+#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -41,5 +49,12 @@
 #define DISABLED_MASK7	0
 #define DISABLED_MASK8	0
 #define DISABLED_MASK9	(DISABLE_MPX)
+#define DISABLED_MASK10	0
+#define DISABLED_MASK11	0
+#define DISABLED_MASK12	0
+#define DISABLED_MASK13	0
+#define DISABLED_MASK14	0
+#define DISABLED_MASK15	0
+#define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE)
 
 #endif /* _ASM_X86_DISABLED_FEATURES_H */
diff --git a/arch/x86/include/asm/required-features.h b/arch/x86/include/asm/required-features.h
index 5c6e4fb..4916144 100644
--- a/arch/x86/include/asm/required-features.h
+++ b/arch/x86/include/asm/required-features.h
@@ -92,5 +92,12 @@
 #define REQUIRED_MASK7	0
 #define REQUIRED_MASK8	0
 #define REQUIRED_MASK9	0
+#define REQUIRED_MASK10	0
+#define REQUIRED_MASK11	0
+#define REQUIRED_MASK12	0
+#define REQUIRED_MASK13	0
+#define REQUIRED_MASK14	0
+#define REQUIRED_MASK15	0
+#define REQUIRED_MASK16	0
 
 #endif /* _ASM_X86_REQUIRED_FEATURES_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f4bb2c4..a719ad7 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -627,6 +627,7 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
 		c->x86_capability[CPUID_7_0_EBX] = ebx;
 
 		c->x86_capability[CPUID_6_EAX] = cpuid_eax(0x00000006);
+		c->x86_capability[CPUID_7_ECX] = ecx;
 	}
 
 	/* Extended state features: level 0x0000000d */

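Because user space can see CPUID but not CR4, an application can
probe both bits directly.  A minimal user-space sketch (assumes
GCC/Clang's <cpuid.h> and a CPU that implements leaf 0x7):

	#include <stdio.h>
	#include <cpuid.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		/* Leaf 0x7, subleaf 0: PKU is ECX[3], OSPKE is ECX[4]. */
		__cpuid_count(7, 0, eax, ebx, ecx, edx);
		printf("PKU:   %u\n", (ecx >> 3) & 1);
		printf("OSPKE: %u\n", (ecx >> 4) & 1);
		return 0;
	}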

* [tip:mm/pkeys] x86/cpu, x86/mm/pkeys: Define new CR4 bit
  2016-02-12 21:02 ` [PATCH 07/33] x86, pkeys: define new CR4 bit Dave Hansen
@ 2016-02-18 20:17   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:17 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, riel, dave.hansen, akpm, peterz, luto, torvalds,
	dvlasenk, mingo, bp, brgerst, hpa, tglx, dave

Commit-ID:  f28b49d2bcdb9ef9e771b3d6750f40be9d453316
Gitweb:     http://git.kernel.org/tip/f28b49d2bcdb9ef9e771b3d6750f40be9d453316
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:02 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 16 Feb 2016 10:11:14 +0100

x86/cpu, x86/mm/pkeys: Define new CR4 bit

There is a new bit in CR4 for enabling protection keys.  We
will actually enable it later in the series.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210202.3CFC3DB2@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/uapi/asm/processor-flags.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index 79887ab..567de50 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -118,6 +118,8 @@
 #define X86_CR4_SMEP		_BITUL(X86_CR4_SMEP_BIT)
 #define X86_CR4_SMAP_BIT	21 /* enable SMAP support */
 #define X86_CR4_SMAP		_BITUL(X86_CR4_SMAP_BIT)
+#define X86_CR4_PKE_BIT		22 /* enable Protection Keys support */
+#define X86_CR4_PKE		_BITUL(X86_CR4_PKE_BIT)
 
 /*
  * x86-64 Task Priority Register, CR8

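Only the definition is added here; the actual enabling happens later
in the series.  As a rough sketch of what that later step amounts to
(hypothetical boot-time CPU setup, not the later patch itself):

	/* Hypothetical sketch; the real enabling patch appears later. */
	static void setup_pke_sketch(struct cpuinfo_x86 *c)
	{
		if (cpu_has(c, X86_FEATURE_PKU))
			cr4_set_bits(X86_CR4_PKE);	/* CPUID.OSPKE reads 1 afterwards */
	}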

* [tip:mm/pkeys] x86/fpu, x86/mm/pkeys: Add PKRU xsave fields and data structures
  2016-02-12 21:02 ` [PATCH 08/33] x86, pkeys: add PKRU xsave fields and data structure(s) Dave Hansen
@ 2016-02-18 20:17   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:17 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: dvlasenk, tglx, riel, luto, torvalds, dave, mingo, hpa, bp,
	dave.hansen, brgerst, akpm, peterz, linux-kernel

Commit-ID:  c8df40098451ba18a43f22b563c9129182353158
Gitweb:     http://git.kernel.org/tip/c8df40098451ba18a43f22b563c9129182353158
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:04 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 16 Feb 2016 10:11:14 +0100

x86/fpu, x86/mm/pkeys: Add PKRU xsave fields and data structures

The protection keys register (PKRU) is saved and restored using
xsave.  Define the data structure that we will use to access it
inside the xsave buffer.

Note that we also have to widen the printk of the xsave feature
masks since this is feature 0x200 and we only did two characters
before.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210204.56DF8F7B@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/fpu/types.h  | 11 +++++++++++
 arch/x86/include/asm/fpu/xstate.h |  3 ++-
 arch/x86/kernel/fpu/xstate.c      |  7 ++++++-
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index aad3181..36b90bb 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -109,6 +109,7 @@ enum xfeature {
 	XFEATURE_ZMM_Hi256,
 	XFEATURE_Hi16_ZMM,
 	XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
+	XFEATURE_PKRU,
 
 	XFEATURE_MAX,
 };
@@ -121,6 +122,7 @@ enum xfeature {
 #define XFEATURE_MASK_OPMASK		(1 << XFEATURE_OPMASK)
 #define XFEATURE_MASK_ZMM_Hi256		(1 << XFEATURE_ZMM_Hi256)
 #define XFEATURE_MASK_Hi16_ZMM		(1 << XFEATURE_Hi16_ZMM)
+#define XFEATURE_MASK_PKRU		(1 << XFEATURE_PKRU)
 
 #define XFEATURE_MASK_FPSSE		(XFEATURE_MASK_FP | XFEATURE_MASK_SSE)
 #define XFEATURE_MASK_AVX512		(XFEATURE_MASK_OPMASK \
@@ -213,6 +215,15 @@ struct avx_512_hi16_state {
 	struct reg_512_bit		hi16_zmm[16];
 } __packed;
 
+/*
+ * State component 9: 32-bit PKRU register.  The state is
+ * 8 bytes long but only 4 bytes are currently used.
+ */
+struct pkru_state {
+	u32				pkru;
+	u32				pad;
+} __packed;
+
 struct xstate_header {
 	u64				xfeatures;
 	u64				xcomp_bv;
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index af30fde..9994d42 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -28,7 +28,8 @@
 				 XFEATURE_MASK_YMM | \
 				 XFEATURE_MASK_OPMASK | \
 				 XFEATURE_MASK_ZMM_Hi256 | \
-				 XFEATURE_MASK_Hi16_ZMM)
+				 XFEATURE_MASK_Hi16_ZMM	 | \
+				 XFEATURE_MASK_PKRU)
 
 /* All currently supported features */
 #define XCNTXT_MASK	(XFEATURE_MASK_LAZY | XFEATURE_MASK_EAGER)
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c2e2349..a63ca80 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -29,6 +29,8 @@ static const char *xfeature_names[] =
 	"AVX-512 Hi256"			,
 	"AVX-512 ZMM_Hi256"		,
 	"Processor Trace (unused)"	,
+	"Protection Keys User registers",
+	"unknown xstate feature"	,
 };
 
 /*
@@ -58,6 +60,7 @@ void fpu__xstate_clear_all_cpu_caps(void)
 	setup_clear_cpu_cap(X86_FEATURE_AVX512CD);
 	setup_clear_cpu_cap(X86_FEATURE_MPX);
 	setup_clear_cpu_cap(X86_FEATURE_XGETBV1);
+	setup_clear_cpu_cap(X86_FEATURE_PKU);
 }
 
 /*
@@ -236,7 +239,7 @@ static void __init print_xstate_feature(u64 xstate_mask)
 	const char *feature_name;
 
 	if (cpu_has_xfeatures(xstate_mask, &feature_name))
-		pr_info("x86/fpu: Supporting XSAVE feature 0x%02Lx: '%s'\n", xstate_mask, feature_name);
+		pr_info("x86/fpu: Supporting XSAVE feature 0x%03Lx: '%s'\n", xstate_mask, feature_name);
 }
 
 /*
@@ -252,6 +255,7 @@ static void __init print_xstate_features(void)
 	print_xstate_feature(XFEATURE_MASK_OPMASK);
 	print_xstate_feature(XFEATURE_MASK_ZMM_Hi256);
 	print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
+	print_xstate_feature(XFEATURE_MASK_PKRU);
 }
 
 /*
@@ -468,6 +472,7 @@ static void check_xstate_against_struct(int nr)
 	XCHECK_SZ(sz, nr, XFEATURE_OPMASK,    struct avx_512_opmask_state);
 	XCHECK_SZ(sz, nr, XFEATURE_ZMM_Hi256, struct avx_512_zmm_uppers_state);
 	XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM,  struct avx_512_hi16_state);
+	XCHECK_SZ(sz, nr, XFEATURE_PKRU,      struct pkru_state);
 
 	/*
 	 * Make *SURE* to add any feature numbers in below if

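For orientation (not part of the patch): besides living in the xsave
buffer as struct pkru_state, the register is directly readable with
the unprivileged RDPKRU instruction once CR4.PKE is set.  A hedged
sketch using the raw opcode, since older assemblers may not know the
mnemonic:

	/* Minimal RDPKRU sketch (requires OSPKE; GCC/Clang inline asm). */
	static inline unsigned int rdpkru_sketch(void)
	{
		unsigned int pkru, edx;

		/* RDPKRU: ECX must be 0; PKRU returned in EAX, EDX cleared. */
		asm volatile(".byte 0x0f, 0x01, 0xee"
			     : "=a" (pkru), "=d" (edx)
			     : "c" (0));
		return pkru;
	}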

* [tip:mm/pkeys] x86/mm/pkeys: Add PTE bits for storing protection key
  2016-02-12 21:02 ` [PATCH 09/33] x86, pkeys: PTE bits for storing protection key Dave Hansen
@ 2016-02-18 20:18   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, akpm, linux-kernel, tglx, torvalds, hpa, luto, riel,
	dvlasenk, dave, brgerst, bp, mingo, dave.hansen

Commit-ID:  5c1d90f51027e197e1299ab1235a2fed78910905
Gitweb:     http://git.kernel.org/tip/5c1d90f51027e197e1299ab1235a2fed78910905
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:05 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 09:31:44 +0100

x86/mm/pkeys: Add PTE bits for storing protection key

Previous documentation has referred to these 4 bits as "ignored".
That means that software could have made use of them.  But, as
far as I know, the kernel never used them.

They are still ignored when protection keys are not enabled, so
they could theoretically still get used for software purposes.

We also implement "empty" versions so that code that references
them can be optimized away by the compiler when the config
option is not enabled.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210205.81E33ED6@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable_types.h | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 4432ab7..cae10ba 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -20,13 +20,18 @@
 #define _PAGE_BIT_SOFTW2	10	/* " */
 #define _PAGE_BIT_SOFTW3	11	/* " */
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
+#define _PAGE_BIT_SOFTW4	58	/* available for programmer */
+#define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
+#define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
+#define _PAGE_BIT_PKEY_BIT2	61	/* Protection Keys, bit 3/4 */
+#define _PAGE_BIT_PKEY_BIT3	62	/* Protection Keys, bit 4/4 */
+#define _PAGE_BIT_NX		63	/* No execute: only valid after cpuid check */
+
 #define _PAGE_BIT_SPECIAL	_PAGE_BIT_SOFTW1
 #define _PAGE_BIT_CPA_TEST	_PAGE_BIT_SOFTW1
 #define _PAGE_BIT_HIDDEN	_PAGE_BIT_SOFTW3 /* hidden by kmemcheck */
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
-#define _PAGE_BIT_SOFTW4	58	/* available for programmer */
-#define _PAGE_BIT_DEVMAP		_PAGE_BIT_SOFTW4
-#define _PAGE_BIT_NX		63	/* No execute: only valid after cpuid check */
+#define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
 
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
 /* - if the user mapped it with PROT_NONE; pte_present gives true */
@@ -47,6 +52,17 @@
 #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
 #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
 #define _PAGE_CPA_TEST	(_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+#define _PAGE_PKEY_BIT0	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT0)
+#define _PAGE_PKEY_BIT1	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT1)
+#define _PAGE_PKEY_BIT2	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT2)
+#define _PAGE_PKEY_BIT3	(_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT3)
+#else
+#define _PAGE_PKEY_BIT0	(_AT(pteval_t, 0))
+#define _PAGE_PKEY_BIT1	(_AT(pteval_t, 0))
+#define _PAGE_PKEY_BIT2	(_AT(pteval_t, 0))
+#define _PAGE_PKEY_BIT3	(_AT(pteval_t, 0))
+#endif
 #define __HAVE_ARCH_PTE_SPECIAL
 
 #ifdef CONFIG_KMEMCHECK
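
To make the layout concrete, a small illustrative helper (not in the
patch) that pulls the 4-bit key back out of a raw 64-bit PTE value,
using the bit positions defined above:

	/* the protection key lives in PTE bits 59..62 */
	static inline int pteval_to_pkey(unsigned long long pteval)
	{
		return (int)((pteval >> 59) & 0xf);
	}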

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [tip:mm/pkeys] x86/mm/pkeys: Add new 'PF_PK' page fault error code bit
  2016-02-12 21:02 ` [PATCH 10/33] x86, pkeys: new page fault error code bit: PF_PK Dave Hansen
@ 2016-02-18 20:18   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, dave.hansen, akpm, linux-kernel, riel, tglx, dvlasenk,
	peterz, torvalds, bp, brgerst, hpa, dave, luto

Commit-ID:  b3ecd51559ae7a8f40b10443773b9cd0e6a50f5e
Gitweb:     http://git.kernel.org/tip/b3ecd51559ae7a8f40b10443773b9cd0e6a50f5e
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:07 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 09:31:50 +0100

x86/mm/pkeys: Add new 'PF_PK' page fault error code bit

Note: "PK" is how the Intel SDM refers to this bit, so we also
use that nomenclature.

This only defines the bit, it does not plumb it anywhere to be
handled.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210207.DA7B43E6@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/fault.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index eef44d9..9f72f9c 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -33,6 +33,7 @@
  *   bit 2 ==	 0: kernel-mode access	1: user-mode access
  *   bit 3 ==				1: use of reserved bit detected
  *   bit 4 ==				1: fault was an instruction fetch
+ *   bit 5 ==				1: protection keys block access
  */
 enum x86_pf_error_code {
 
@@ -41,6 +42,7 @@ enum x86_pf_error_code {
 	PF_USER		=		1 << 2,
 	PF_RSVD		=		1 << 3,
 	PF_INSTR	=		1 << 4,
+	PF_PK		=		1 << 5,
 };
 
 /*
@@ -916,6 +918,12 @@ static int spurious_fault_check(unsigned long error_code, pte_t *pte)
 
 	if ((error_code & PF_INSTR) && !pte_exec(*pte))
 		return 0;
+	/*
+	 * Note: We do not do lazy flushing on protection key
+	 * changes, so no spurious fault will ever set PF_PK.
+	 */
+	if ((error_code & PF_PK))
+		return 1;
 
 	return 1;
 }
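
For reference, a sketch (illustrative only; the real consumers come
in later patches) of decoding the error code bits documented above:

	#include <stdio.h>

	void describe_fault(unsigned long error_code)
	{
		printf("%s access, %s mode%s\n",
		       (error_code & (1 << 1)) ? "write" : "read",  /* PF_WRITE */
		       (error_code & (1 << 2)) ? "user" : "kernel", /* PF_USER */
		       (error_code & (1 << 5))                      /* PF_PK */
			   ? ", blocked by protection key" : "");
	}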

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [tip:mm/pkeys] mm/core, x86/mm/pkeys: Store protection bits in high VMA flags
  2016-02-12 21:02 ` [PATCH 11/33] x86, pkeys: store protection in high VMA flags Dave Hansen
@ 2016-02-18 20:19   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:19 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, bp, torvalds, tglx, xiexiuqi, koct9i, kirill.shutemov,
	mingo, vdavydov, valentinrothberg, dan.j.williams, oleg, peterz,
	linux-kernel, dave.hansen, jack, brgerst, riel, sasha.levin,
	vbabka, dvlasenk, mgorman, mhocko, akpm, luto, dave

Commit-ID:  63c17fb8e5a46a16e10e82005748837fd11a2024
Gitweb:     http://git.kernel.org/tip/63c17fb8e5a46a16e10e82005748837fd11a2024
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:08 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 09:31:50 +0100

mm/core, x86/mm/pkeys: Store protection bits in high VMA flags

vma->vm_flags is an 'unsigned long', so has space for 32 flags
on 32-bit architectures.  The high 32 bits are unused on 64-bit
platforms.  We've steered away from using the unused high VMA
bits for things because we would have difficulty supporting it
on 32-bit.

Protection Keys are not available in 32-bit mode, so there is
no concern about supporting this feature in 32-bit mode or on
32-bit CPUs.

This patch carves out 4 bits from the high half of
vma->vm_flags and allows architectures to set a config option
to make them available.

Sparse complains about these constants unless we explicitly
call them "UL".

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Valentin Rothberg <valentinrothberg@gmail.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210208.81AF00D5@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/Kconfig   |  1 +
 include/linux/mm.h | 11 +++++++++++
 mm/Kconfig         |  3 +++
 3 files changed, 15 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3632cdd..fb2ebeb 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -155,6 +155,7 @@ config X86
 	select VIRT_TO_BUS
 	select X86_DEV_DMA_OPS			if X86_64
 	select X86_FEATURE_NAMES		if PROC_FS
+	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
 
 config INSTRUCTION_DECODER
 	def_bool y
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4c73178..54d173b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -170,6 +170,17 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_NOHUGEPAGE	0x40000000	/* MADV_NOHUGEPAGE marked this vma */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
 
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+#define VM_HIGH_ARCH_BIT_0	32	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_1	33	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
+#define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
+#define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
+#define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
+#endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
+
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
 #elif defined(CONFIG_PPC)
diff --git a/mm/Kconfig b/mm/Kconfig
index 03cbfa0..6cf4399 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -669,3 +669,6 @@ config ZONE_DEVICE
 
 config FRAME_VECTOR
 	bool
+
+config ARCH_USES_HIGH_VMA_FLAGS
+	bool
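
The "UL" note above matters because BIT() expands to an unsigned-long
shift; a sketch of what the new definitions boil down to on 64-bit:

	/* from <linux/bitops.h>: #define BIT(nr) (1UL << (nr)) */
	unsigned long flags = 1UL << 32;  /* == VM_HIGH_ARCH_0 == BIT(32) */
	/* a plain (1 << 32) would overflow 'int', hence Sparse's complaint */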

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [tip:mm/pkeys] x86/mm/pkeys: Add arch-specific VMA protection bits
  2016-02-12 21:02 ` [PATCH 12/33] x86, pkeys: arch-specific protection bits Dave Hansen
@ 2016-02-18 20:19   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:19 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, hpa, mingo, brgerst, luto, dave, bp, dave.hansen, akpm,
	riel, dvlasenk, torvalds, tglx, linux-kernel

Commit-ID:  8f62c883222c9e3c06d60b5e55e307a3d1f18257
Gitweb:     http://git.kernel.org/tip/8f62c883222c9e3c06d60b5e55e307a3d1f18257
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:10 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 09:31:51 +0100

x86/mm/pkeys: Add arch-specific VMA protection bits

Lots of things seem to do:

        vma->vm_page_prot = vm_get_page_prot(flags);

and the ptes get created right from things we pull out
of ->vm_page_prot.  So it is very convenient if we can
store the protection key in flags and vm_page_prot, just
like the existing permission bits (_PAGE_RW/PRESENT).  It
greatly reduces the amount of plumbing and arch-specific
hacking we have to do in generic code.

This also takes the new PROT_PKEY{0,1,2,3} flags and
turns *those* in to VM_ flags for vma->vm_flags.

The protection key values are stored in 4 places:
	1. "prot" argument to system calls
	2. vma->vm_flags, filled from the mmap "prot"
	3. vma->vm_page_prot, filled from vma->vm_flags
	4. the PTE itself.

The pseudocode for these four steps is as follows:

	mmap(PROT_PKEY*)
	vma->vm_flags 	  = ... | arch_calc_vm_prot_bits(mmap_prot);
	vma->vm_page_prot = ... | arch_vm_get_page_prot(vma->vm_flags);
	pte = pfn | vma->vm_page_prot

Note that this provides a new definition for x86:

	arch_vm_get_page_prot()

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210210.FE483A42@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/mmu_context.h   | 11 +++++++++++
 arch/x86/include/asm/pgtable_types.h | 12 ++++++++++--
 arch/x86/include/uapi/asm/mman.h     | 16 ++++++++++++++++
 include/linux/mm.h                   |  7 +++++++
 4 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index bfd9b2a..94c4c8b 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -275,4 +275,15 @@ static inline void arch_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 		mpx_notify_unmap(mm, vma, start, end);
 }
 
+static inline int vma_pkey(struct vm_area_struct *vma)
+{
+	u16 pkey = 0;
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+	unsigned long vma_pkey_mask = VM_PKEY_BIT0 | VM_PKEY_BIT1 |
+				      VM_PKEY_BIT2 | VM_PKEY_BIT3;
+	pkey = (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
+#endif
+	return pkey;
+}
+
 #endif /* _ASM_X86_MMU_CONTEXT_H */
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index cae10ba..8c35cf0 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -115,7 +115,12 @@
 #define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED |	\
 			 _PAGE_DIRTY)
 
-/* Set of bits not changed in pte_modify */
+/*
+ * Set of bits not changed in pte_modify.  The pte's
+ * protection key is treated like _PAGE_RW, for
+ * instance, and is *not* included in this mask since
+ * pte_modify() does modify it.
+ */
 #define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
 			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
 			 _PAGE_SOFT_DIRTY)
@@ -231,7 +236,10 @@ enum page_cache_mode {
 /* Extracts the PFN from a (pte|pmd|pud|pgd)val_t of a 4KB page */
 #define PTE_PFN_MASK		((pteval_t)PHYSICAL_PAGE_MASK)
 
-/* Extracts the flags from a (pte|pmd|pud|pgd)val_t of a 4KB page */
+/*
+ *  Extracts the flags from a (pte|pmd|pud|pgd)val_t
+ *  This includes the protection key value.
+ */
 #define PTE_FLAGS_MASK		(~PTE_PFN_MASK)
 
 typedef struct pgprot { pgprotval_t pgprot; } pgprot_t;
diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h
index 513b05f..e8562e0 100644
--- a/arch/x86/include/uapi/asm/mman.h
+++ b/arch/x86/include/uapi/asm/mman.h
@@ -6,6 +6,22 @@
 #define MAP_HUGE_2MB    (21 << MAP_HUGE_SHIFT)
 #define MAP_HUGE_1GB    (30 << MAP_HUGE_SHIFT)
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+/*
+ * Take the 4 protection key bits out of the vma->vm_flags
+ * value and turn them in to the bits that we can put in
+ * to a pte.
+ *
+ * Only override these if Protection Keys are available
+ * (which is only on 64-bit).
+ */
+#define arch_vm_get_page_prot(vm_flags)	__pgprot(	\
+		((vm_flags) & VM_PKEY_BIT0 ? _PAGE_PKEY_BIT0 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT1 ? _PAGE_PKEY_BIT1 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT2 ? _PAGE_PKEY_BIT2 : 0) |	\
+		((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))
+#endif
+
 #include <asm-generic/mman.h>
 
 #endif /* _ASM_X86_MMAN_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 54d173b..3056369 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -183,6 +183,13 @@ extern unsigned int kobjsize(const void *objp);
 
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
+#if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)
+# define VM_PKEY_SHIFT	VM_HIGH_ARCH_BIT_0
+# define VM_PKEY_BIT0	VM_HIGH_ARCH_0	/* A protection key is a 4-bit value */
+# define VM_PKEY_BIT1	VM_HIGH_ARCH_1
+# define VM_PKEY_BIT2	VM_HIGH_ARCH_2
+# define VM_PKEY_BIT3	VM_HIGH_ARCH_3
+#endif
 #elif defined(CONFIG_PPC)
 # define VM_SAO		VM_ARCH_1	/* Strong Access Ordering (powerpc) */
 #elif defined(CONFIG_PARISC)
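
With the shifts written out, the four-step plumbing from the
changelog looks roughly like this (values illustrative, not taken
from the patch):

	int pkey = 4;			/* from the mmap()/mprotect() "prot" */
	unsigned long vm_flags =
		(unsigned long)pkey << 32;	/* VM_PKEY_SHIFT -> vma->vm_flags */
	int again = (int)((vm_flags >> 32) & 0xf);	/* what vma_pkey() recovers */
	unsigned long long pte_bits =
		(unsigned long long)again << 59;	/* _PAGE_BIT_PKEY_BIT0, in the PTE */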

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [tip:mm/pkeys] x86/mm/pkeys: Pass VMA down in to fault signal generation code
  2016-02-12 21:02 ` [PATCH 13/33] x86, pkeys: pass VMA down in to fault signal generation code Dave Hansen
@ 2016-02-18 20:19   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:19 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, luto, dvlasenk, dave.hansen, bp, riel, dave, akpm,
	linux-kernel, peterz, hpa, tglx, brgerst, mingo

Commit-ID:  7b2d0dbac4890c8ca4a8acc57709639fc8b158e9
Gitweb:     http://git.kernel.org/tip/7b2d0dbac4890c8ca4a8acc57709639fc8b158e9
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:11 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 09:31:51 +0100

x86/mm/pkeys: Pass VMA down in to fault signal generation code

During a page fault, we look up the VMA to ensure that the fault
is in a region with a valid mapping.  But, in the top-level page
fault code we don't need the VMA for much else.  Once we have
decided that an access is bad, we are going to send a signal no
matter what and do not need the VMA any more.  So we do not pass
it down in to the signal generation code.

But, for protection keys, we need the VMA.  It tells us *which*
protection key we violated if we get a PF_PK.  So, we need to
pass the VMA down and fill in siginfo->si_pkey.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210211.AD3B36A3@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/fault.c | 50 ++++++++++++++++++++++++++++----------------------
 1 file changed, 28 insertions(+), 22 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 9f72f9c..3c51c66 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -171,7 +171,8 @@ is_prefetch(struct pt_regs *regs, unsigned long error_code, unsigned long addr)
 
 static void
 force_sig_info_fault(int si_signo, int si_code, unsigned long address,
-		     struct task_struct *tsk, int fault)
+		     struct task_struct *tsk, struct vm_area_struct *vma,
+		     int fault)
 {
 	unsigned lsb = 0;
 	siginfo_t info;
@@ -656,6 +657,8 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 	struct task_struct *tsk = current;
 	unsigned long flags;
 	int sig;
+	/* No context means no VMA to pass down */
+	struct vm_area_struct *vma = NULL;
 
 	/* Are we prepared to handle this kernel fault? */
 	if (fixup_exception(regs)) {
@@ -679,7 +682,8 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 			tsk->thread.cr2 = address;
 
 			/* XXX: hwpoison faults will set the wrong code. */
-			force_sig_info_fault(signal, si_code, address, tsk, 0);
+			force_sig_info_fault(signal, si_code, address,
+					     tsk, vma, 0);
 		}
 
 		/*
@@ -756,7 +760,8 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
 
 static void
 __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
-		       unsigned long address, int si_code)
+		       unsigned long address, struct vm_area_struct *vma,
+		       int si_code)
 {
 	struct task_struct *tsk = current;
 
@@ -799,7 +804,7 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
 		tsk->thread.error_code	= error_code;
 		tsk->thread.trap_nr	= X86_TRAP_PF;
 
-		force_sig_info_fault(SIGSEGV, si_code, address, tsk, 0);
+		force_sig_info_fault(SIGSEGV, si_code, address, tsk, vma, 0);
 
 		return;
 	}
@@ -812,14 +817,14 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
 
 static noinline void
 bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
-		     unsigned long address)
+		     unsigned long address, struct vm_area_struct *vma)
 {
-	__bad_area_nosemaphore(regs, error_code, address, SEGV_MAPERR);
+	__bad_area_nosemaphore(regs, error_code, address, vma, SEGV_MAPERR);
 }
 
 static void
 __bad_area(struct pt_regs *regs, unsigned long error_code,
-	   unsigned long address, int si_code)
+	   unsigned long address,  struct vm_area_struct *vma, int si_code)
 {
 	struct mm_struct *mm = current->mm;
 
@@ -829,25 +834,25 @@ __bad_area(struct pt_regs *regs, unsigned long error_code,
 	 */
 	up_read(&mm->mmap_sem);
 
-	__bad_area_nosemaphore(regs, error_code, address, si_code);
+	__bad_area_nosemaphore(regs, error_code, address, vma, si_code);
 }
 
 static noinline void
 bad_area(struct pt_regs *regs, unsigned long error_code, unsigned long address)
 {
-	__bad_area(regs, error_code, address, SEGV_MAPERR);
+	__bad_area(regs, error_code, address, NULL, SEGV_MAPERR);
 }
 
 static noinline void
 bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
-		      unsigned long address)
+		      unsigned long address, struct vm_area_struct *vma)
 {
-	__bad_area(regs, error_code, address, SEGV_ACCERR);
+	__bad_area(regs, error_code, address, vma, SEGV_ACCERR);
 }
 
 static void
 do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
-	  unsigned int fault)
+	  struct vm_area_struct *vma, unsigned int fault)
 {
 	struct task_struct *tsk = current;
 	int code = BUS_ADRERR;
@@ -874,12 +879,13 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
 		code = BUS_MCEERR_AR;
 	}
 #endif
-	force_sig_info_fault(SIGBUS, code, address, tsk, fault);
+	force_sig_info_fault(SIGBUS, code, address, tsk, vma, fault);
 }
 
 static noinline void
 mm_fault_error(struct pt_regs *regs, unsigned long error_code,
-	       unsigned long address, unsigned int fault)
+	       unsigned long address, struct vm_area_struct *vma,
+	       unsigned int fault)
 {
 	if (fatal_signal_pending(current) && !(error_code & PF_USER)) {
 		no_context(regs, error_code, address, 0, 0);
@@ -903,9 +909,9 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 	} else {
 		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
 			     VM_FAULT_HWPOISON_LARGE))
-			do_sigbus(regs, error_code, address, fault);
+			do_sigbus(regs, error_code, address, vma, fault);
 		else if (fault & VM_FAULT_SIGSEGV)
-			bad_area_nosemaphore(regs, error_code, address);
+			bad_area_nosemaphore(regs, error_code, address, vma);
 		else
 			BUG();
 	}
@@ -1119,7 +1125,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		 * Don't take the mm semaphore here. If we fixup a prefetch
 		 * fault we could otherwise deadlock:
 		 */
-		bad_area_nosemaphore(regs, error_code, address);
+		bad_area_nosemaphore(regs, error_code, address, NULL);
 
 		return;
 	}
@@ -1132,7 +1138,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		pgtable_bad(regs, error_code, address);
 
 	if (unlikely(smap_violation(error_code, regs))) {
-		bad_area_nosemaphore(regs, error_code, address);
+		bad_area_nosemaphore(regs, error_code, address, NULL);
 		return;
 	}
 
@@ -1141,7 +1147,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	 * in a region with pagefaults disabled then we must not take the fault
 	 */
 	if (unlikely(faulthandler_disabled() || !mm)) {
-		bad_area_nosemaphore(regs, error_code, address);
+		bad_area_nosemaphore(regs, error_code, address, NULL);
 		return;
 	}
 
@@ -1185,7 +1191,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
 		if ((error_code & PF_USER) == 0 &&
 		    !search_exception_tables(regs->ip)) {
-			bad_area_nosemaphore(regs, error_code, address);
+			bad_area_nosemaphore(regs, error_code, address, NULL);
 			return;
 		}
 retry:
@@ -1233,7 +1239,7 @@ retry:
 	 */
 good_area:
 	if (unlikely(access_error(error_code, vma))) {
-		bad_area_access_error(regs, error_code, address);
+		bad_area_access_error(regs, error_code, address, vma);
 		return;
 	}
 
@@ -1271,7 +1277,7 @@ good_area:
 
 	up_read(&mm->mmap_sem);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
-		mm_fault_error(regs, error_code, address, fault);
+		mm_fault_error(regs, error_code, address, vma, fault);
 		return;
 	}
 

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [tip:mm/pkeys] signals, pkeys: Notify userspace about protection key faults
  2016-02-12 21:02 ` [PATCH 14/33] signals, pkeys: notify userspace about protection key faults Dave Hansen
@ 2016-02-18 20:20   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:20 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: akpm, bp, dave, dvlasenk, torvalds, brgerst, tglx, vegard.nossum,
	oleg, linux-kernel, amanieu, peterz, dave.hansen, viro, vdavydov,
	mingo, hpa, riel, richard, luto, sasha.levin, arnd, palmer

Commit-ID:  cd0ea35ff5511cde299a61c21a95889b4a71464e
Gitweb:     http://git.kernel.org/tip/cd0ea35ff5511cde299a61c21a95889b4a71464e
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:12 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 09:32:42 +0100

signals, pkeys: Notify userspace about protection key faults

A protection key fault is very similar to any other access error.
There must be a VMA, etc...  We even want to take the same action
(SIGSEGV) that we do with a normal access fault.

However, we do need to let userspace know that something is
different.  We do this the same way we did with SEGV_BNDERR
with Memory Protection eXtensions (MPX): define a new SEGV code:
SEGV_PKUERR.

We add a siginfo field: si_pkey that reveals to userspace which
protection key was set on the PTE that we faulted on.  There is
no other easy way for userspace to figure this out.  They could
parse smaps but that would be a bit cruel.

We share space in siginfo with _addr_bnd.  #BR faults from
MPX are completely separate from page faults (#PF) that trigger
from protection key violations, so we never need both at the same
time.

Note that _pkey is a 64-bit value.  The current hardware only
supports 4-bit protection keys.  We do this because there is
_plenty_ of space in _sigfault and it is possible that future
processors would support more than 4 bits of protection keys.

The x86 code to actually fill in the siginfo is in the next
patch.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amanieu d'Antras <amanieu@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210212.3A9B83AC@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/uapi/asm-generic/siginfo.h | 17 ++++++++++++-----
 kernel/signal.c                    |  4 ++++
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index 1e35520..90384d5 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -91,10 +91,15 @@ typedef struct siginfo {
 			int _trapno;	/* TRAP # which caused the signal */
 #endif
 			short _addr_lsb; /* LSB of the reported address */
-			struct {
-				void __user *_lower;
-				void __user *_upper;
-			} _addr_bnd;
+			union {
+				/* used when si_code=SEGV_BNDERR */
+				struct {
+					void __user *_lower;
+					void __user *_upper;
+				} _addr_bnd;
+				/* used when si_code=SEGV_PKUERR */
+				u64 _pkey;
+			};
 		} _sigfault;
 
 		/* SIGPOLL */
@@ -137,6 +142,7 @@ typedef struct siginfo {
 #define si_addr_lsb	_sifields._sigfault._addr_lsb
 #define si_lower	_sifields._sigfault._addr_bnd._lower
 #define si_upper	_sifields._sigfault._addr_bnd._upper
+#define si_pkey		_sifields._sigfault._pkey
 #define si_band		_sifields._sigpoll._band
 #define si_fd		_sifields._sigpoll._fd
 #ifdef __ARCH_SIGSYS
@@ -206,7 +212,8 @@ typedef struct siginfo {
 #define SEGV_MAPERR	(__SI_FAULT|1)	/* address not mapped to object */
 #define SEGV_ACCERR	(__SI_FAULT|2)	/* invalid permissions for mapped object */
 #define SEGV_BNDERR	(__SI_FAULT|3)  /* failed address bound checks */
-#define NSIGSEGV	3
+#define SEGV_PKUERR	(__SI_FAULT|4)  /* failed protection key checks */
+#define NSIGSEGV	4
 
 /*
  * SIGBUS si_codes
diff --git a/kernel/signal.c b/kernel/signal.c
index 0508544..fe8ed29 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2709,6 +2709,10 @@ int copy_siginfo_to_user(siginfo_t __user *to, const siginfo_t *from)
 			err |= __put_user(from->si_upper, &to->si_upper);
 		}
 #endif
+#ifdef SEGV_PKUERR
+		if (from->si_signo == SIGSEGV && from->si_code == SEGV_PKUERR)
+			err |= __put_user(from->si_pkey, &to->si_pkey);
+#endif
 		break;
 	case __SI_CHLD:
 		err |= __put_user(from->si_pid, &to->si_pid);
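
A userspace sketch of consuming the new field (assumes headers that
carry SEGV_PKUERR and the si_pkey macro from this patch):

	#include <signal.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	static void segv_handler(int sig, siginfo_t *si, void *ctx)
	{
	#ifdef SEGV_PKUERR
		if (si->si_code == SEGV_PKUERR)
			fprintf(stderr, "pkey %llu denied access to %p\n",
				(unsigned long long)si->si_pkey, si->si_addr);
	#endif
		_exit(1);
	}

	int main(void)
	{
		struct sigaction sa;

		memset(&sa, 0, sizeof(sa));
		sa.sa_sigaction = segv_handler;
		sa.sa_flags = SA_SIGINFO;
		sigaction(SIGSEGV, &sa, NULL);
		/* ... touch pkey-protected memory here ... */
		return 0;
	}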

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [tip:mm/pkeys] x86/mm/pkeys: Fill in pkey field in siginfo
  2016-02-12 21:02 ` [PATCH 15/33] x86, pkeys: fill in pkey field in siginfo Dave Hansen
@ 2016-02-18 20:20   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:20 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, mingo, torvalds, dvlasenk, dave.hansen, peterz,
	hpa, akpm, riel, bp, dave, luto, brgerst, tglx

Commit-ID:  019132ff3daf36c97a4006655dfd00ee42f2b590
Gitweb:     http://git.kernel.org/tip/019132ff3daf36c97a4006655dfd00ee42f2b590
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:14 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 09:32:43 +0100

x86/mm/pkeys: Fill in pkey field in siginfo

This fills in the new siginfo field: si_pkey to indicate to
userspace which protection key was set on the PTE that we faulted
on.

Note though that *ALL* protection key faults have to be generated
by a valid, present PTE at some point.  But this code does no PTE
lookups, which seems odd.  The reason is that we take advantage of
the way we generate PTEs from VMAs.  All PTEs under a VMA share
some attributes.  For instance, they are _all_ either PROT_READ
*OR* PROT_NONE.  They also always share a protection key, so we
never have to walk the page tables; we just use the VMA.

Note that _pkey is a 64-bit value.  The current hardware only
supports 4-bit protection keys.  We do this because there is
_plenty_ of space in _sigfault and it is possible that future
processors would support more than 4 bits of protection keys.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210213.ABC488FA@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable_types.h |  5 +++
 arch/x86/mm/fault.c                  | 64 +++++++++++++++++++++++++++++++++++-
 2 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 8c35cf0..7b5efe2 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -65,6 +65,11 @@
 #endif
 #define __HAVE_ARCH_PTE_SPECIAL
 
+#define _PAGE_PKEY_MASK (_PAGE_PKEY_BIT0 | \
+			 _PAGE_PKEY_BIT1 | \
+			 _PAGE_PKEY_BIT2 | \
+			 _PAGE_PKEY_BIT3)
+
 #ifdef CONFIG_KMEMCHECK
 #define _PAGE_HIDDEN	(_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN)
 #else
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 3c51c66..6e71dcf 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -15,12 +15,14 @@
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
 
+#include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
 #include <asm/pgalloc.h>		/* pgd_*(), ...			*/
 #include <asm/kmemcheck.h>		/* kmemcheck_*(), ...		*/
 #include <asm/fixmap.h>			/* VSYSCALL_ADDR		*/
 #include <asm/vsyscall.h>		/* emulate_vsyscall		*/
 #include <asm/vm86.h>			/* struct vm86			*/
+#include <asm/mmu_context.h>		/* vma_pkey()			*/
 
 #define CREATE_TRACE_POINTS
 #include <asm/trace/exceptions.h>
@@ -169,6 +171,56 @@ is_prefetch(struct pt_regs *regs, unsigned long error_code, unsigned long addr)
 	return prefetch;
 }
 
+/*
+ * A protection key fault means that the PKRU value did not allow
+ * access to some PTE.  Userspace can figure out what PKRU was
+ * from the XSAVE state, and this function fills out a field in
+ * siginfo so userspace can discover which protection key was set
+ * on the PTE.
+ *
+ * If we get here, we know that the hardware signaled a PF_PK
+ * fault and that there was a VMA once we got in the fault
+ * handler.  It does *not* guarantee that the VMA we find here
+ * was the one that we faulted on.
+ *
+ * 1. T1   : mprotect_key(foo, PAGE_SIZE, pkey=4);
+ * 2. T1   : set PKRU to deny access to pkey=4, touches page
+ * 3. T1   : faults...
+ * 4.    T2: mprotect_key(foo, PAGE_SIZE, pkey=5);
+ * 5. T1   : enters fault handler, takes mmap_sem, etc...
+ * 6. T1   : reaches here, sees vma_pkey(vma)=5, when we really
+ *	     faulted on a pte with its pkey=4.
+ */
+static void fill_sig_info_pkey(int si_code, siginfo_t *info,
+		struct vm_area_struct *vma)
+{
+	/* This is effectively an #ifdef */
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return;
+
+	/* Fault not from Protection Keys: nothing to do */
+	if (si_code != SEGV_PKUERR)
+		return;
+	/*
+	 * force_sig_info_fault() is called from a number of
+	 * contexts, some of which have a VMA and some of which
+	 * do not.  The PF_PK handling happens after we have a
+	 * valid VMA, so we should never reach this without a
+	 * valid VMA.
+	 */
+	if (!vma) {
+		WARN_ONCE(1, "PKU fault with no VMA passed in");
+		info->si_pkey = 0;
+		return;
+	}
+	/*
+	 * si_pkey should be thought of as a strong hint, but not
+	 * absolutely guaranteed to be 100% accurate because of
+	 * the race explained above.
+	 */
+	info->si_pkey = vma_pkey(vma);
+}
+
 static void
 force_sig_info_fault(int si_signo, int si_code, unsigned long address,
 		     struct task_struct *tsk, struct vm_area_struct *vma,
@@ -187,6 +239,8 @@ force_sig_info_fault(int si_signo, int si_code, unsigned long address,
 		lsb = PAGE_SHIFT;
 	info.si_addr_lsb = lsb;
 
+	fill_sig_info_pkey(si_code, &info, vma);
+
 	force_sig_info(si_signo, &info, tsk);
 }
 
@@ -847,7 +901,15 @@ static noinline void
 bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
 		      unsigned long address, struct vm_area_struct *vma)
 {
-	__bad_area(regs, error_code, address, vma, SEGV_ACCERR);
+	/*
+	 * This OSPKE check is not strictly necessary at runtime.
+	 * But, doing it this way allows compiler optimizations
+	 * if pkeys are compiled out.
+	 */
+	if (boot_cpu_has(X86_FEATURE_OSPKE) && (error_code & PF_PK))
+		__bad_area(regs, error_code, address, vma, SEGV_PKUERR);
+	else
+		__bad_area(regs, error_code, address, vma, SEGV_ACCERR);
 }
 
 static void

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [tip:mm/pkeys] x86/mm/pkeys: Add functions to fetch PKRU
  2016-02-12 21:02 ` [PATCH 16/33] x86, pkeys: add functions to fetch PKRU Dave Hansen
@ 2016-02-18 20:21   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:21 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, dave, linux-kernel, brgerst, dave.hansen, torvalds, akpm,
	bp, dvlasenk, riel, mingo, peterz, luto, hpa

Commit-ID:  a927cb83f3300bcb1ae18672e58029acddd18b33
Gitweb:     http://git.kernel.org/tip/a927cb83f3300bcb1ae18672e58029acddd18b33
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:15 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 09:32:43 +0100

x86/mm/pkeys: Add functions to fetch PKRU

This adds the raw instruction to access PKRU as well as some
accessor functions that correctly handle when the CPU does not
support the instruction.  We don't use it here, but we will use
read_pkru() in the next patch.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210215.15238D34@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable.h       |  8 ++++++++
 arch/x86/include/asm/special_insns.h | 22 ++++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0687c47..e997dcc 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -99,6 +99,14 @@ static inline int pte_dirty(pte_t pte)
 	return pte_flags(pte) & _PAGE_DIRTY;
 }
 
+
+static inline u32 read_pkru(void)
+{
+	if (boot_cpu_has(X86_FEATURE_OSPKE))
+		return __read_pkru();
+	return 0;
+}
+
 static inline int pte_young(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_ACCESSED;
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 2270e41..aee6e76 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -98,6 +98,28 @@ static inline void native_write_cr8(unsigned long val)
 }
 #endif
 
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+static inline u32 __read_pkru(void)
+{
+	u32 ecx = 0;
+	u32 edx, pkru;
+
+	/*
+	 * "rdpkru" instruction.  Places PKRU contents in to EAX,
+	 * clears EDX and requires that ecx=0.
+	 */
+	asm volatile(".byte 0x0f,0x01,0xee\n\t"
+		     : "=a" (pkru), "=d" (edx)
+		     : "c" (ecx));
+	return pkru;
+}
+#else
+static inline u32 __read_pkru(void)
+{
+	return 0;
+}
+#endif
+
 static inline void native_wbinvd(void)
 {
 	asm volatile("wbinvd": : :"memory");
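
The value read_pkru() returns packs two bits per key: AD (access
disable) in bit 2*k and WD (write disable) in bit 2*k+1.  A sketch of
checking it, along the lines of the __pkru_allows_*() helpers a later
patch in this series adds:

	static int pkey_allows_write(unsigned int pkru, int pkey)
	{
		/* writable only if neither AD nor WD is set for this key */
		return !(pkru & (0x3u << (2 * pkey)));
	}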

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [tip:mm/pkeys] mm/gup: Factor out VMA fault permission checking
  2016-02-12 21:02 ` [PATCH 17/33] mm: factor out VMA fault permission checking Dave Hansen
@ 2016-02-18 20:21   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:21 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: aneesh.kumar, dvlasenk, jason.low2, bp, torvalds, dan.j.williams,
	riel, dave, kirill.shutemov, akpm, hpa, linux-kernel, brgerst,
	luto, dave.hansen, sasha.levin, dingel, peterz, tglx, mingo,
	emunson

Commit-ID:  d4925e00d59698a201231cf99dce47d8b922bb34
Gitweb:     http://git.kernel.org/tip/d4925e00d59698a201231cf99dce47d8b922bb34
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:16 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 09:32:43 +0100

mm/gup: Factor out VMA fault permission checking

This code matches a fault condition up with the VMA and ensures
that the VMA allows the fault to be handled instead of just
erroring out.

We will be extending this in a moment to comprehend protection
keys.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
Cc: Eric B Munson <emunson@akamai.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210216.C3824032@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/gup.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index de24ef4..b935c2c 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -610,6 +610,18 @@ next_page:
 }
 EXPORT_SYMBOL(__get_user_pages);
 
+bool vma_permits_fault(struct vm_area_struct *vma, unsigned int fault_flags)
+{
+	vm_flags_t vm_flags;
+
+	vm_flags = (fault_flags & FAULT_FLAG_WRITE) ? VM_WRITE : VM_READ;
+
+	if (!(vm_flags & vma->vm_flags))
+		return false;
+
+	return true;
+}
+
 /*
  * fixup_user_fault() - manually resolve a user page fault
  * @tsk:	the task_struct to use for page fault accounting, or
@@ -645,7 +657,6 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 		     bool *unlocked)
 {
 	struct vm_area_struct *vma;
-	vm_flags_t vm_flags;
 	int ret, major = 0;
 
 	if (unlocked)
@@ -656,8 +667,7 @@ retry:
 	if (!vma || address < vma->vm_start)
 		return -EFAULT;
 
-	vm_flags = (fault_flags & FAULT_FLAG_WRITE) ? VM_WRITE : VM_READ;
-	if (!(vm_flags & vma->vm_flags))
+	if (!vma_permits_fault(vma, fault_flags))
 		return -EFAULT;
 
 	ret = handle_mm_fault(mm, vma, address, fault_flags);

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [tip:mm/pkeys] x86/mm/gup: Simplify get_user_pages() PTE bit handling
  2016-02-12 21:02 ` [PATCH 18/33] x86, mm: simplify get_user_pages() PTE bit handling Dave Hansen
@ 2016-02-18 20:21   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:21 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, linux-kernel, mingo, riel, brgerst, peterz, bp,
	dave.hansen, luto, dvlasenk, hpa, torvalds, dave, akpm

Commit-ID:  1874f6895c92d991ccf85edcc55a0d9dd552d71c
Gitweb:     http://git.kernel.org/tip/1874f6895c92d991ccf85edcc55a0d9dd552d71c
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:18 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 09:32:44 +0100

x86/mm/gup: Simplify get_user_pages() PTE bit handling

The current get_user_pages() code is a wee bit more complicated
than it needs to be for pte bit checking.  Currently, it establishes
a mask of required pte _PAGE_* bits and ensures that the pte it
goes after has all those bits.

This consolidates the three identical copies of this code.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210218.3A2D4045@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/gup.c | 38 ++++++++++++++++++++++----------------
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index ce5e454..2f0a329 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -75,6 +75,24 @@ static void undo_dev_pagemap(int *nr, int nr_start, struct page **pages)
 }
 
 /*
+ * 'pteval' can come from a pte, pmd or pud.  We only check
+ * _PAGE_PRESENT, _PAGE_USER, and _PAGE_RW in here which are the
+ * same value on all 3 types.
+ */
+static inline int pte_allows_gup(unsigned long pteval, int write)
+{
+	unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;
+
+	if (write)
+		need_pte_bits |= _PAGE_RW;
+
+	if ((pteval & need_pte_bits) != need_pte_bits)
+		return 0;
+
+	return 1;
+}
+
+/*
  * The performance critical leaf functions are made noinline otherwise gcc
  * inlines everything into a single function which results in too much
  * register pressure.
@@ -83,14 +101,9 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
 	struct dev_pagemap *pgmap = NULL;
-	unsigned long mask;
 	int nr_start = *nr;
 	pte_t *ptep;
 
-	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
-
 	ptep = pte_offset_map(&pmd, addr);
 	do {
 		pte_t pte = gup_get_pte(ptep);
@@ -110,7 +123,8 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 				pte_unmap(ptep);
 				return 0;
 			}
-		} else if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
+		} else if (!pte_allows_gup(pte_val(pte), write) ||
+			   pte_special(pte)) {
 			pte_unmap(ptep);
 			return 0;
 		}
@@ -164,14 +178,10 @@ static int __gup_device_huge_pmd(pmd_t pmd, unsigned long addr,
 static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
-	unsigned long mask;
 	struct page *head, *page;
 	int refs;
 
-	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
-	if ((pmd_flags(pmd) & mask) != mask)
+	if (!pte_allows_gup(pmd_val(pmd), write))
 		return 0;
 
 	VM_BUG_ON(!pfn_valid(pmd_pfn(pmd)));
@@ -231,14 +241,10 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
-	unsigned long mask;
 	struct page *head, *page;
 	int refs;
 
-	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
-	if ((pud_flags(pud) & mask) != mask)
+	if (!pte_allows_gup(pud_val(pud), write))
 		return 0;
 	/* hugepages are never "special" */
 	VM_BUG_ON(pud_flags(pud) & _PAGE_SPECIAL);

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [tip:mm/pkeys] mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
  2016-02-12 21:02 ` [PATCH 19/33] x86, pkeys: check VMAs and PTEs for protection keys Dave Hansen
@ 2016-02-18 20:22   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:22 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, dan.j.williams, willy, dvlasenk, mgorman, raindel, mingo,
	dave, riel, david, boaz, tglx, torvalds, jmarchan, vogt, arnd,
	dingel, peterz, sds, mhocko, luto, mpe, dave.hansen, dahi,
	sasha.levin, gxt, david.vrabel, hughd, minchan, luto, benh, bp,
	vbabka, toshi.kani, heiko.carstens, aneesh.kumar, jgross,
	kirill.shutemov, jason.low2, brgerst, akpm, linux-kernel,
	schwidefsky, mpatocka, aik, ldufour, paulus

Commit-ID:  33a709b25a760b91184bb335cf7d7c32b8123013
Gitweb:     http://git.kernel.org/tip/33a709b25a760b91184bb335cf7d7c32b8123013
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:19 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 09:32:44 +0100

mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys

Today, for normal faults and page table walks, we check the VMA
and/or PTE to ensure that it is compatible with the action.  For
instance, if we get a write fault on a non-writeable VMA, we
SIGSEGV.

We try to do the same thing for protection keys.  Basically, we
try to make sure that if a user does this:

	mprotect(ptr, size, PROT_NONE);
	*ptr = foo;

they see the same effects with protection keys when they do this:

	mprotect(ptr, size, PROT_READ|PROT_WRITE);
	set_pkey(ptr, size, 4);
	wrpkru(0xffffff3f); // access disable pkey 4
	*ptr = foo;

The state to do that checking is in the VMA, but we also
sometimes have to do it on the page tables only, like when doing
a get_user_pages_fast() where we have no VMA.

We add two functions and expose them to generic code:

	arch_pte_access_permitted(pte_flags, write)
	arch_vma_access_permitted(vma, write)

These are, of course, backed up in x86 arch code with checks
against the PTE or VMA's protection key.

But, there are also cases where we do not want to respect
protection keys.  When we ptrace(), for instance, we do not want
to apply the tracer's PKRU permissions to the PTEs from the
process being traced.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
Cc: Dominik Vogt <vogt@linux.vnet.ibm.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Shachar Raindel <raindel@mellanox.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-s390@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/20160212210219.14D5D715@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/powerpc/include/asm/mmu_context.h   | 11 +++++++
 arch/s390/include/asm/mmu_context.h      | 11 +++++++
 arch/unicore32/include/asm/mmu_context.h | 11 +++++++
 arch/x86/include/asm/mmu_context.h       | 49 ++++++++++++++++++++++++++++++++
 arch/x86/include/asm/pgtable.h           | 29 +++++++++++++++++++
 arch/x86/mm/fault.c                      | 21 +++++++++++++-
 arch/x86/mm/gup.c                        |  5 ++++
 include/asm-generic/mm_hooks.h           | 11 +++++++
 mm/gup.c                                 | 18 ++++++++++--
 mm/memory.c                              |  4 +++
 10 files changed, 166 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 878c277..a0f1838 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -148,5 +148,16 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
 {
 }
 
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
index fb1b93e..2627b33 100644
--- a/arch/s390/include/asm/mmu_context.h
+++ b/arch/s390/include/asm/mmu_context.h
@@ -130,4 +130,15 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
 {
 }
 
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
 #endif /* __S390_MMU_CONTEXT_H */
diff --git a/arch/unicore32/include/asm/mmu_context.h b/arch/unicore32/include/asm/mmu_context.h
index 1cb5220..3133f94 100644
--- a/arch/unicore32/include/asm/mmu_context.h
+++ b/arch/unicore32/include/asm/mmu_context.h
@@ -97,4 +97,15 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
 {
 }
 
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
 #endif
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 94c4c8b..19036cd 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -286,4 +286,53 @@ static inline int vma_pkey(struct vm_area_struct *vma)
 	return pkey;
 }
 
+static inline bool __pkru_allows_pkey(u16 pkey, bool write)
+{
+	u32 pkru = read_pkru();
+
+	if (!__pkru_allows_read(pkru, pkey))
+		return false;
+	if (write && !__pkru_allows_write(pkru, pkey))
+		return false;
+
+	return true;
+}
+
+/*
+ * We only want to enforce protection keys on the current process
+ * because we effectively have no access to PKRU for other
+ * processes or any way to tell *which* PKRU in a threaded
+ * process we could use.
+ *
+ * So do not enforce things if the VMA is not from the current
+ * mm, or if we are in a kernel thread.
+ */
+static inline bool vma_is_foreign(struct vm_area_struct *vma)
+{
+	if (!current->mm)
+		return true;
+	/*
+	 * Should PKRU be enforced on the access to this VMA?  If
+	 * the VMA is from another process, then PKRU has no
+	 * relevance and should not be enforced.
+	 */
+	if (current->mm != vma->vm_mm)
+		return true;
+
+	return false;
+}
+
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* allow access if the VMA is not one from this process */
+	if (vma_is_foreign(vma))
+		return true;
+	return __pkru_allows_pkey(vma_pkey(vma), write);
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	return __pkru_allows_pkey(pte_flags_pkey(pte_flags(pte)), write);
+}
+
 #endif /* _ASM_X86_MMU_CONTEXT_H */
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e997dcc..3cbfae8 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -919,6 +919,35 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
 }
 #endif
 
+#define PKRU_AD_BIT 0x1
+#define PKRU_WD_BIT 0x2
+
+static inline bool __pkru_allows_read(u32 pkru, u16 pkey)
+{
+	int pkru_pkey_bits = pkey * 2;
+	return !(pkru & (PKRU_AD_BIT << pkru_pkey_bits));
+}
+
+static inline bool __pkru_allows_write(u32 pkru, u16 pkey)
+{
+	int pkru_pkey_bits = pkey * 2;
+	/*
+	 * Access-disable disables writes too so we need to check
+	 * both bits here.
+	 */
+	return !(pkru & ((PKRU_AD_BIT|PKRU_WD_BIT) << pkru_pkey_bits));
+}
+
+static inline u16 pte_flags_pkey(unsigned long pte_flags)
+{
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+	/* ifdef to avoid doing 59-bit shift on 32-bit values */
+	return (pte_flags & _PAGE_PKEY_MASK) >> _PAGE_BIT_PKEY_BIT0;
+#else
+	return 0;
+#endif
+}
+
 #include <asm-generic/pgtable.h>
 #endif	/* __ASSEMBLY__ */
 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 6e71dcf..319331a 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -897,6 +897,16 @@ bad_area(struct pt_regs *regs, unsigned long error_code, unsigned long address)
 	__bad_area(regs, error_code, address, NULL, SEGV_MAPERR);
 }
 
+static inline bool bad_area_access_from_pkeys(unsigned long error_code,
+		struct vm_area_struct *vma)
+{
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return false;
+	if (error_code & PF_PK)
+		return true;
+	return false;
+}
+
 static noinline void
 bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
 		      unsigned long address, struct vm_area_struct *vma)
@@ -906,7 +916,7 @@ bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
 	 * But, doing it this way allows compiler optimizations
 	 * if pkeys are compiled out.
 	 */
-	if (boot_cpu_has(X86_FEATURE_OSPKE) && (error_code & PF_PK))
+	if (bad_area_access_from_pkeys(error_code, vma))
 		__bad_area(regs, error_code, address, vma, SEGV_PKUERR);
 	else
 		__bad_area(regs, error_code, address, vma, SEGV_ACCERR);
@@ -1081,6 +1091,15 @@ int show_unhandled_signals = 1;
 static inline int
 access_error(unsigned long error_code, struct vm_area_struct *vma)
 {
+	/*
+	 * Read or write was blocked by protection keys.  We do
+	 * this check before any others because we do not want
+	 * to, for instance, confuse a protection-key-denied
+	 * write with one for which we should do a COW.
+	 */
+	if (error_code & PF_PK)
+		return 1;
+
 	if (error_code & PF_WRITE) {
 		/* write, present and write, not present: */
 		if (unlikely(!(vma->vm_flags & VM_WRITE)))
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 2f0a329..bab259e 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -11,6 +11,7 @@
 #include <linux/swap.h>
 #include <linux/memremap.h>
 
+#include <asm/mmu_context.h>
 #include <asm/pgtable.h>
 
 static inline pte_t gup_get_pte(pte_t *ptep)
@@ -89,6 +90,10 @@ static inline int pte_allows_gup(unsigned long pteval, int write)
 	if ((pteval & need_pte_bits) != need_pte_bits)
 		return 0;
 
+	/* Check memory protection keys permissions. */
+	if (!__pkru_allows_pkey(pte_flags_pkey(pteval), write))
+		return 0;
+
 	return 1;
 }
 
diff --git a/include/asm-generic/mm_hooks.h b/include/asm-generic/mm_hooks.h
index 866aa46..c1fc5af 100644
--- a/include/asm-generic/mm_hooks.h
+++ b/include/asm-generic/mm_hooks.h
@@ -26,4 +26,15 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
 {
 }
 
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
+
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
 #endif	/* _ASM_GENERIC_MM_HOOKS_H */
diff --git a/mm/gup.c b/mm/gup.c
index b935c2c..e0f5f357 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -15,6 +15,7 @@
 #include <linux/rwsem.h>
 #include <linux/hugetlb.h>
 
+#include <asm/mmu_context.h>
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
 
@@ -444,6 +445,8 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 		if (!(vm_flags & VM_MAYREAD))
 			return -EFAULT;
 	}
+	if (!arch_vma_access_permitted(vma, (gup_flags & FOLL_WRITE)))
+		return -EFAULT;
 	return 0;
 }
 
@@ -612,13 +615,19 @@ EXPORT_SYMBOL(__get_user_pages);
 
 bool vma_permits_fault(struct vm_area_struct *vma, unsigned int fault_flags)
 {
-	vm_flags_t vm_flags;
-
-	vm_flags = (fault_flags & FAULT_FLAG_WRITE) ? VM_WRITE : VM_READ;
+	bool write = !!(fault_flags & FAULT_FLAG_WRITE);
+	vm_flags_t vm_flags = write ? VM_WRITE : VM_READ;
 
 	if (!(vm_flags & vma->vm_flags))
 		return false;
 
+	/*
+	 * The architecture might have a hardware protection
+	 * mechanism other than read/write that can deny access
+	 */
+	if (!arch_vma_access_permitted(vma, write))
+		return false;
+
 	return true;
 }
 
@@ -1172,6 +1181,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 			pte_protnone(pte) || (write && !pte_write(pte)))
 			goto pte_unmap;
 
+		if (!arch_pte_access_permitted(pte, write))
+			goto pte_unmap;
+
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 		head = compound_head(page);
diff --git a/mm/memory.c b/mm/memory.c
index 8bfbad0..d7e84fe 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -65,6 +65,7 @@
 #include <linux/userfaultfd_k.h>
 
 #include <asm/io.h>
+#include <asm/mmu_context.h>
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
 #include <asm/tlb.h>
@@ -3378,6 +3379,9 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	pmd_t *pmd;
 	pte_t *pte;
 
+	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE))
+		return VM_FAULT_SIGSEGV;
+
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return hugetlb_fault(mm, vma, address, flags);
 

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [tip:mm/pkeys] x86/mm/pkeys: Optimize fault handling in access_error()
  2016-02-12 21:02 ` [PATCH 21/33] x86, pkeys: optimize fault handling in access_error() Dave Hansen
@ 2016-02-18 20:23   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:23 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, torvalds, dave, dave.hansen, peterz, luto, brgerst, bp,
	riel, mingo, akpm, dvlasenk, linux-kernel, tglx

Commit-ID:  07f146f53e8de826e4afa3a88ea65bdb13c24959
Gitweb:     http://git.kernel.org/tip/07f146f53e8de826e4afa3a88ea65bdb13c24959
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:22 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 19:46:28 +0100

x86/mm/pkeys: Optimize fault handling in access_error()

We might not strictly have to make modifications to
access_error() to check the VMA here.

If we do not, we will do this:

 1. app sets VMA pkey to K
 2. app touches a !present page
 3. do_page_fault(), allocates and maps page, sets pte.pkey=K
 4. return to userspace
 5. touch instruction reexecutes, but triggers PF_PK
 6. do PKEY signal

What happens with this patch applied:

 1. app sets VMA pkey to K
 2. app touches a !present page
 3. do_page_fault() notices that K is inaccessible
 4. do PKEY signal

We basically skip the fault that does an allocation.

So what this lets us do is protect areas from even being
*populated* unless they are accessible according to protection
keys.  That seems handy to me and makes protection keys work
more like an mprotect()'d mapping.
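
To make the user-visible effect concrete, here is a minimal,
hypothetical sketch.  pkey_alloc() and pkey_mprotect() are from
the follow-on syscall series, not this set, so this is only an
illustration of the intended behavior:

	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	int pkey = pkey_alloc(0, PKEY_DISABLE_ACCESS);

	pkey_mprotect(p, 4096, PROT_READ | PROT_WRITE, pkey);
	/*
	 * With this patch: SEGV_PKUERR before the page is ever
	 * allocated, instead of allocate-then-refault:
	 */
	*p = 1;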

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210222.EBB63D8C@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/fault.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 319331a..68ecdff 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -900,10 +900,16 @@ bad_area(struct pt_regs *regs, unsigned long error_code, unsigned long address)
 static inline bool bad_area_access_from_pkeys(unsigned long error_code,
 		struct vm_area_struct *vma)
 {
+	/* This code is always called on the current mm */
+	bool foreign = false;
+
 	if (!boot_cpu_has(X86_FEATURE_OSPKE))
 		return false;
 	if (error_code & PF_PK)
 		return true;
+	/* this checks protection keys on the VMA: */
+	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE), foreign))
+		return true;
 	return false;
 }
 
@@ -1091,6 +1097,8 @@ int show_unhandled_signals = 1;
 static inline int
 access_error(unsigned long error_code, struct vm_area_struct *vma)
 {
+	/* This is only called for the current mm, so: */
+	bool foreign = false;
 	/*
 	 * Read or write was blocked by protection keys.  We do
 	 * this check before any others because we do not want
@@ -1099,6 +1107,13 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
 	 */
 	if (error_code & PF_PK)
 		return 1;
+	/*
+	 * Make sure to check the VMA so that we do not perform
+	 * faults just to hit a PF_PK as soon as we fill in a
+	 * page.
+	 */
+	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE), foreign))
+		return 1;
 
 	if (error_code & PF_WRITE) {
 		/* write, present and write, not present: */

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [tip:mm/pkeys] mm/core, x86/mm/pkeys: Differentiate instruction fetches
  2016-02-12 21:02 ` [PATCH 22/33] x86, pkeys: differentiate instruction fetches Dave Hansen
@ 2016-02-18 20:23   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:23 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, linux-kernel, peterz, luto, akpm, hpa, riel, tglx,
	torvalds, dvlasenk, dave, brgerst, dave.hansen, bp

Commit-ID:  d61172b4b695b821388cdb6088a41d431bcbb93b
Gitweb:     http://git.kernel.org/tip/d61172b4b695b821388cdb6088a41d431bcbb93b
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:24 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 19:46:29 +0100

mm/core, x86/mm/pkeys: Differentiate instruction fetches

As discussed earlier, we attempt to enforce protection keys in
software.

However, the code checks all faults to ensure that they are not
violating protection key permissions.  It was assumed that all
faults are either write faults where we check PKRU[key].WD (write
disable) or read faults where we check the AD (access disable)
bit.

But, there is a third category of faults for protection keys:
instruction faults.  Instruction faults can never run afoul of
protection keys, because protection keys do not affect
instruction fetches.

So, plumb the PF_INSTR bit down in to the
arch_vma_access_permitted() function where we do the protection
key checks.

We also add a new FAULT_FLAG_INSTRUCTION.  This is because
handle_mm_fault() is not passed the architecture-specific
error_code where we keep PF_INSTR, so we need to encode the
instruction fetch information in to the arch-generic fault
flags.
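
As a hypothetical illustration (again assuming the pkey_alloc()
and pkey_mprotect() syscalls from the follow-on series), an
access-disabled key still permits fetches, which is what makes
execute-only mappings possible:

	int pkey = pkey_alloc(0, PKEY_DISABLE_ACCESS);

	pkey_mprotect(code, 4096, PROT_READ | PROT_EXEC, pkey);
	((void (*)(void))code)();	/* instruction fetch: allowed */
	c = *(unsigned char *)code;	/* data read: SEGV_PKUERR */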

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210224.96928009@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/powerpc/include/asm/mmu_context.h |  2 +-
 arch/s390/include/asm/mmu_context.h    |  2 +-
 arch/x86/include/asm/mmu_context.h     |  5 ++++-
 arch/x86/mm/fault.c                    |  8 ++++++--
 include/asm-generic/mm_hooks.h         |  2 +-
 include/linux/mm.h                     |  1 +
 mm/gup.c                               | 11 +++++++++--
 mm/memory.c                            |  1 +
 8 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index df9bf3e..4eaab40 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -149,7 +149,7 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
 }
 
 static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
-		bool write, bool foreign)
+		bool write, bool execute, bool foreign)
 {
 	/* by default, allow everything */
 	return true;
diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
index 8906600..fa66b6d 100644
--- a/arch/s390/include/asm/mmu_context.h
+++ b/arch/s390/include/asm/mmu_context.h
@@ -131,7 +131,7 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
 }
 
 static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
-		bool write, bool foreign)
+		bool write, bool execute, bool foreign)
 {
 	/* by default, allow everything */
 	return true;
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index b4d939a..6572b94 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -323,8 +323,11 @@ static inline bool vma_is_foreign(struct vm_area_struct *vma)
 }
 
 static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
-		bool write, bool foreign)
+		bool write, bool execute, bool foreign)
 {
+	/* pkeys never affect instruction fetches */
+	if (execute)
+		return true;
 	/* allow access if the VMA is not one from this process */
 	if (foreign || vma_is_foreign(vma))
 		return true;
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 68ecdff..d81744e 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -908,7 +908,8 @@ static inline bool bad_area_access_from_pkeys(unsigned long error_code,
 	if (error_code & PF_PK)
 		return true;
 	/* this checks protection keys on the VMA: */
-	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE), foreign))
+	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE),
+				(error_code & PF_INSTR), foreign))
 		return true;
 	return false;
 }
@@ -1112,7 +1113,8 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
 	 * faults just to hit a PF_PK as soon as we fill in a
 	 * page.
 	 */
-	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE), foreign))
+	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE),
+				(error_code & PF_INSTR), foreign))
 		return 1;
 
 	if (error_code & PF_WRITE) {
@@ -1267,6 +1269,8 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 
 	if (error_code & PF_WRITE)
 		flags |= FAULT_FLAG_WRITE;
+	if (error_code & PF_INSTR)
+		flags |= FAULT_FLAG_INSTRUCTION;
 
 	/*
 	 * When running in the kernel we expect faults to occur only to
diff --git a/include/asm-generic/mm_hooks.h b/include/asm-generic/mm_hooks.h
index d5c9633..cc5d9a1 100644
--- a/include/asm-generic/mm_hooks.h
+++ b/include/asm-generic/mm_hooks.h
@@ -27,7 +27,7 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
 }
 
 static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
-		bool write, bool foreign)
+		bool write, bool execute, bool foreign)
 {
 	/* by default, allow everything */
 	return true;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2aaa0f0..7955c3e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -252,6 +252,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_TRIED	0x20	/* Second try */
 #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
 #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
+#define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
 
 /*
  * vm_fault is filled by the the pagefault handler and passed to the vma's
diff --git a/mm/gup.c b/mm/gup.c
index d276760..7f1c4fb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -449,7 +449,11 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 		if (!(vm_flags & VM_MAYREAD))
 			return -EFAULT;
 	}
-	if (!arch_vma_access_permitted(vma, write, foreign))
+	/*
+	 * gups are always data accesses, not instruction
+	 * fetches, so execute=false here
+	 */
+	if (!arch_vma_access_permitted(vma, write, false, foreign))
 		return -EFAULT;
 	return 0;
 }
@@ -629,8 +633,11 @@ bool vma_permits_fault(struct vm_area_struct *vma, unsigned int fault_flags)
 	/*
 	 * The architecture might have a hardware protection
 	 * mechanism other than read/write that can deny access.
+	 *
+	 * gup always represents data access, not instruction
+	 * fetches, so execute=false here:
 	 */
-	if (!arch_vma_access_permitted(vma, write, foreign))
+	if (!arch_vma_access_permitted(vma, write, false, foreign))
 		return false;
 
 	return true;
diff --git a/mm/memory.c b/mm/memory.c
index 76c44e5..99e9f92 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3380,6 +3380,7 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_t *pte;
 
 	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
+					    flags & FAULT_FLAG_INSTRUCTION,
 					    flags & FAULT_FLAG_REMOTE))
 		return VM_FAULT_SIGSEGV;
 

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [tip:mm/pkeys] x86/mm/pkeys: Dump PKRU with other kernel registers
  2016-02-12 21:02 ` [PATCH 23/33] x86, pkeys: dump PKRU with other kernel registers Dave Hansen
@ 2016-02-18 20:24   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, dvlasenk, riel, dave.hansen, dave, linux-kernel, tglx,
	peterz, bp, akpm, mingo, luto, brgerst, torvalds

Commit-ID:  c0b17b5bd4b7b98e7c6b67c9f69343b64711271b
Gitweb:     http://git.kernel.org/tip/c0b17b5bd4b7b98e7c6b67c9f69343b64711271b
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:25 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 19:46:29 +0100

x86/mm/pkeys: Dump PKRU with other kernel registers

Protection Keys never affect kernel mappings.  But, they can
affect whether the kernel will fault when it touches a user
mapping.  The kernel doesn't touch user mappings without some
careful choreography and these accesses don't generally result in
oopses.  But, if one does, we definitely want to have PKRU
available so we can figure out if protection keys played a role.
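
For reference, a minimal sketch of how a dumped PKRU value
decodes, using the PKRU_AD_BIT/PKRU_WD_BIT layout from earlier
in this series (two bits per key, AD in the low bit of each
pair):

	void decode_pkru(u32 pkru)
	{
		int pkey;

		for (pkey = 0; pkey < 16; pkey++) {
			u32 bits = (pkru >> (pkey * 2)) & 0x3;

			pr_info("pkey %2d:%s%s\n", pkey,
				(bits & PKRU_AD_BIT) ? " AD" : "",
				(bits & PKRU_WD_BIT) ? " WD" : "");
		}
	}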

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210225.BF0D4482@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/process_64.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index b9d99e0..776229e 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -116,6 +116,8 @@ void __show_regs(struct pt_regs *regs, int all)
 	printk(KERN_DEFAULT "DR0: %016lx DR1: %016lx DR2: %016lx\n", d0, d1, d2);
 	printk(KERN_DEFAULT "DR3: %016lx DR6: %016lx DR7: %016lx\n", d3, d6, d7);
 
+	if (boot_cpu_has(X86_FEATURE_OSPKE))
+		printk(KERN_DEFAULT "PKRU: %08x\n", read_pkru());
 }
 
 void release_thread(struct task_struct *dead_task)

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [tip:mm/pkeys] x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
  2016-02-12 21:02 ` [PATCH 24/33] x86, pkeys: dump pkey from VMA in /proc/pid/smaps Dave Hansen
@ 2016-02-18 20:24   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: ldufour, dave.hansen, vbabka, bp, riel, jmarchan, dave, koct9i,
	peterz, hannes, mwilliamson, brgerst, tglx, hpa, linux-kernel,
	bp, viro, pbonzini, luto, bhe, msalter, mingo, mhocko,
	n-horiguchi, kirill.shutemov, jkosina, akpm, torvalds, jroedel,
	dvlasenk, dyoung

Commit-ID:  c1192f8428414679c8126180e690f8daa1d4d98a
Gitweb:     http://git.kernel.org/tip/c1192f8428414679c8126180e690f8daa1d4d98a
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:27 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 19:46:29 +0100

x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps

The protection key can now be just as important as read/write
permissions on a VMA.  We need some debug mechanism to help
figure out if it is in play.  smaps seems like a logical
place to expose it.

arch/x86/kernel/setup.c is a bit of a weirdo place to put
this code, but it already had seq_file.h and there was not
a much better existing place to put it.

We also use no #ifdef.  If protection keys are .config'd out we
will effectively get the same function as if we used the weak
generic function.
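
With OSPKE enabled, each smaps entry then carries the new field
(hypothetical values shown):

	ProtectionKey:         4
	VmFlags: rd wr mr mw me ac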

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Mark Williamson <mwilliamson@undo-software.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210227.4F8EB3F8@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/setup.c |  9 +++++++++
 fs/proc/task_mmu.c      | 14 ++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d3d80e6..7260f99 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -112,6 +112,7 @@
 #include <asm/alternative.h>
 #include <asm/prom.h>
 #include <asm/microcode.h>
+#include <asm/mmu_context.h>
 
 /*
  * max_low_pfn_mapped: highest direct mapped pfn under 4GB
@@ -1282,3 +1283,11 @@ static int __init register_kernel_offset_dumper(void)
 	return 0;
 }
 __initcall(register_kernel_offset_dumper);
+
+void arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
+{
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return;
+
+	seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
+}
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index fa95ab2..9df4316 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -660,11 +660,20 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_MERGEABLE)]	= "mg",
 		[ilog2(VM_UFFD_MISSING)]= "um",
 		[ilog2(VM_UFFD_WP)]	= "uw",
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+		/* These come out via ProtectionKey: */
+		[ilog2(VM_PKEY_BIT0)]	= "",
+		[ilog2(VM_PKEY_BIT1)]	= "",
+		[ilog2(VM_PKEY_BIT2)]	= "",
+		[ilog2(VM_PKEY_BIT3)]	= "",
+#endif
 	};
 	size_t i;
 
 	seq_puts(m, "VmFlags: ");
 	for (i = 0; i < BITS_PER_LONG; i++) {
+		if (!mnemonics[i][0])
+			continue;
 		if (vma->vm_flags & (1UL << i)) {
 			seq_printf(m, "%c%c ",
 				   mnemonics[i][0], mnemonics[i][1]);
@@ -702,6 +711,10 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
 }
 #endif /* HUGETLB_PAGE */
 
+void __weak arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
+{
+}
+
 static int show_smap(struct seq_file *m, void *v, int is_pid)
 {
 	struct vm_area_struct *vma = v;
@@ -783,6 +796,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
 		   (vma->vm_flags & VM_LOCKED) ?
 			(unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);
 
+	arch_show_smap(m, vma);
 	show_smap_vma_flags(m, vma);
 	m_cache_vma(m, vma);
 	return 0;

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [tip:mm/pkeys] x86/mm/pkeys: Add Kconfig prompt to existing config option
  2016-02-12 21:02 ` [PATCH 25/33] x86, pkeys: add Kconfig prompt to existing config option Dave Hansen
@ 2016-02-18 20:24   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: riel, akpm, peterz, mingo, brgerst, hpa, bp, dvlasenk, tglx,
	torvalds, luto, dave, linux-kernel, dave.hansen

Commit-ID:  284244a9876225eb73102aff41d4492f65cb2868
Gitweb:     http://git.kernel.org/tip/284244a9876225eb73102aff41d4492f65cb2868
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:28 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 19:46:30 +0100

x86/mm/pkeys: Add Kconfig prompt to existing config option

I don't have a strong opinion on whether we need this or not.
Protection Keys has relatively little code associated with it,
and it is not a heavyweight feature to keep enabled.  However,
I can imagine that folks would still appreciate being able to
disable it.

Here's the option if folks want it.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210228.7E79386C@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/Kconfig | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fb2ebeb..b875434 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1716,8 +1716,18 @@ config X86_INTEL_MPX
 	  If unsure, say N.
 
 config X86_INTEL_MEMORY_PROTECTION_KEYS
+	prompt "Intel Memory Protection Keys"
 	def_bool y
+	# Note: only available in 64-bit mode
 	depends on CPU_SUP_INTEL && X86_64
+	---help---
+	  Memory Protection Keys provides a mechanism for enforcing
+	  page-based protections, but without requiring modification of the
+	  page tables when an application changes protection domains.
+
+	  For details, see Documentation/x86/protection-keys.txt
+
+	  If unsure, say y.
 
 config EFI
 	bool "EFI runtime service support"

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [tip:mm/pkeys] x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
  2016-02-12 21:02 ` [PATCH 26/33] x86, pkeys: actually enable Memory Protection Keys in CPU Dave Hansen
@ 2016-02-18 20:25   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:25 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, akpm, riel, tglx, torvalds, dvlasenk, luto, linux-kernel,
	bp, dave, brgerst, hpa, peterz, dave.hansen

Commit-ID:  0697694564c84f4c9320e5d103d0191297a20023
Gitweb:     http://git.kernel.org/tip/0697694564c84f4c9320e5d103d0191297a20023
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:29 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 19:46:30 +0100

x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU

This sets the bit in 'cr4' to actually enable the protection
keys feature.  We also include a boot-time disable for the
feature "nopku".

Setting X86_CR4_PKE will cause the X86_FEATURE_OSPKE cpuid
bit to appear set.  At this point in boot, identify_cpu()
has already run the actual CPUID instructions and populated
the "cpu features" structures.  We need to go back and
re-run get_cpu_cap() to make sure it gets updated values.

We *could* simply re-populate the 11th word of the cpuid
data, but this is probably quick enough.

Also note that with the cpu_has() check and X86_FEATURE_PKU
present in disabled-features.h, we do not need an #ifdef
for setup_pku().
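
One observable consequence, shown here with hypothetical
/proc/cpuinfo output: on PKU-capable hardware booted with
'nopku', the "pku" flag stays visible while "ospke" disappears,
exactly as if the config option had been compiled out:

	$ grep -wo 'pku\|ospke' /proc/cpuinfo | sort -u
	pku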

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210229.6708027C@viggo.jf.intel.com
[ Small readability edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 Documentation/kernel-parameters.txt |  3 +++
 arch/x86/kernel/cpu/common.c        | 43 +++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index a37b5bb..acf467d 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -976,6 +976,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			See Documentation/x86/intel_mpx.txt for more
 			information about the feature.
 
+	nopku		[X86] Disable Memory Protection Keys CPU feature found
+			in some Intel CPUs.
+
 	eagerfpu=	[X86]
 			on	enable eager fpu restore
 			off	disable eager fpu restore
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index a719ad7..4fac263 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -304,6 +304,48 @@ static __always_inline void setup_smap(struct cpuinfo_x86 *c)
 }
 
 /*
+ * Protection Keys are not available in 32-bit mode.
+ */
+static bool pku_disabled;
+
+static __always_inline void setup_pku(struct cpuinfo_x86 *c)
+{
+	if (!cpu_has(c, X86_FEATURE_PKU))
+		return;
+	if (pku_disabled)
+		return;
+
+	cr4_set_bits(X86_CR4_PKE);
+	/*
+	 * Setting X86_CR4_PKE will cause the X86_FEATURE_OSPKE
+	 * cpuid bit to be set.  We need to ensure that we
+	 * update that bit in this CPU's "cpu_info".
+	 */
+	get_cpu_cap(c);
+}
+
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+static __init int setup_disable_pku(char *arg)
+{
+	/*
+	 * Do not clear the X86_FEATURE_PKU bit.  All of the
+	 * runtime checks are against OSPKE so clearing the
+	 * bit does nothing.
+	 *
+	 * This way, we will see "pku" in cpuinfo, but not
+	 * "ospke", which is exactly what we want.  It shows
+	 * that the CPU has PKU, but the OS has not enabled it.
+	 * This happens to be exactly how a system would look
+	 * if we disabled the config option.
+	 */
+	pr_info("x86: 'nopku' specified, disabling Memory Protection Keys\n");
+	pku_disabled = true;
+	return 1;
+}
+__setup("nopku", setup_disable_pku);
+#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
+
+/*
  * Some CPU features depend on higher CPUID levels, which may not always
  * be available due to CPUID level capping or broken virtualization
  * software.  Add those features to this table to auto-disable them.
@@ -960,6 +1002,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 	init_hypervisor(c);
 	x86_init_rdrand(c);
 	x86_init_cache_qos(c);
+	setup_pku(c);
 
 	/*
 	 * Clear/Set all flags overriden by options, need do it

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [tip:mm/pkeys] mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
  2016-02-12 21:02 ` [PATCH 27/33] mm, multi-arch: pass a protection key in to calc_vm_flag_bits() Dave Hansen
@ 2016-02-18 20:25   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:25 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: paul.gortmaker, vdavydov, yamada.masahiro, luto, benh, brgerst,
	dave, oleg, mgorman, peterz, airlied, david, leon,
	dan.j.williams, mpe, tglx, dave.hansen, riel, koct9i, mingo,
	gang.chen.5i5j, linux-kernel, dvlasenk, bp, torvalds, ebiederm,
	mcoquelin.stm32, aarcange, gregkh, geliangtang, riandrews, akpm,
	hpa, arve, paulus, kirill.shutemov

Commit-ID:  e6bfb70959a0ca6ddedb29e779a293c6f71ed0e7
Gitweb:     http://git.kernel.org/tip/e6bfb70959a0ca6ddedb29e779a293c6f71ed0e7
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:31 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 19:46:30 +0100

mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()

This plumbs a protection key through calc_vm_prot_bits().  We
could have done this in calc_vm_flag_bits() instead, but I did
not feel super strongly which way to go.  It was pretty
arbitrary which one to use.
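
For context, a hedged sketch of how the x86 side would
eventually consume the new 'pkey' argument (that arch hook lands
in a later patch; the VM_PKEY_BIT* flags are the ones introduced
earlier in this series):

	#define arch_calc_vm_prot_bits(prot, key) (		\
			((key) & 0x1 ? VM_PKEY_BIT0 : 0) |	\
			((key) & 0x2 ? VM_PKEY_BIT1 : 0) |	\
			((key) & 0x4 ? VM_PKEY_BIT2 : 0) |	\
			((key) & 0x8 ? VM_PKEY_BIT3 : 0))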

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: David Airlie <airlied@linux.ie>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Geliang Tang <geliangtang@163.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Leon Romanovsky <leon@leon.nu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Riley Andrews <riandrews@android.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: devel@driverdev.osuosl.org
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/20160212210231.E6F1F0D6@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/powerpc/include/asm/mman.h  | 5 +++--
 drivers/char/agp/frontend.c      | 2 +-
 drivers/staging/android/ashmem.c | 4 ++--
 include/linux/mman.h             | 6 +++---
 mm/mmap.c                        | 2 +-
 mm/mprotect.c                    | 2 +-
 mm/nommu.c                       | 2 +-
 7 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
index 8565c25..2563c43 100644
--- a/arch/powerpc/include/asm/mman.h
+++ b/arch/powerpc/include/asm/mman.h
@@ -18,11 +18,12 @@
  * This file is included by linux/mman.h, so we can't use calc_vm_prot_bits()
  * here.  How important is the optimization?
  */
-static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot)
+static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
+		unsigned long pkey)
 {
 	return (prot & PROT_SAO) ? VM_SAO : 0;
 }
-#define arch_calc_vm_prot_bits(prot) arch_calc_vm_prot_bits(prot)
+#define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
 
 static inline pgprot_t arch_vm_get_page_prot(unsigned long vm_flags)
 {
diff --git a/drivers/char/agp/frontend.c b/drivers/char/agp/frontend.c
index 09f17eb..0f64d14 100644
--- a/drivers/char/agp/frontend.c
+++ b/drivers/char/agp/frontend.c
@@ -156,7 +156,7 @@ static pgprot_t agp_convert_mmap_flags(int prot)
 {
 	unsigned long prot_bits;
 
-	prot_bits = calc_vm_prot_bits(prot) | VM_SHARED;
+	prot_bits = calc_vm_prot_bits(prot, 0) | VM_SHARED;
 	return vm_get_page_prot(prot_bits);
 }
 
diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index 5bb1283..2695ff1 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -372,8 +372,8 @@ static int ashmem_mmap(struct file *file, struct vm_area_struct *vma)
 	}
 
 	/* requested protection bits must match our allowed protection mask */
-	if (unlikely((vma->vm_flags & ~calc_vm_prot_bits(asma->prot_mask)) &
-		     calc_vm_prot_bits(PROT_MASK))) {
+	if (unlikely((vma->vm_flags & ~calc_vm_prot_bits(asma->prot_mask, 0)) &
+		     calc_vm_prot_bits(PROT_MASK, 0))) {
 		ret = -EPERM;
 		goto out;
 	}
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 16373c8..33e17f6 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -35,7 +35,7 @@ static inline void vm_unacct_memory(long pages)
  */
 
 #ifndef arch_calc_vm_prot_bits
-#define arch_calc_vm_prot_bits(prot) 0
+#define arch_calc_vm_prot_bits(prot, pkey) 0
 #endif
 
 #ifndef arch_vm_get_page_prot
@@ -70,12 +70,12 @@ static inline int arch_validate_prot(unsigned long prot)
  * Combine the mmap "prot" argument into "vm_flags" used internally.
  */
 static inline unsigned long
-calc_vm_prot_bits(unsigned long prot)
+calc_vm_prot_bits(unsigned long prot, unsigned long pkey)
 {
 	return _calc_vm_trans(prot, PROT_READ,  VM_READ ) |
 	       _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) |
 	       _calc_vm_trans(prot, PROT_EXEC,  VM_EXEC) |
-	       arch_calc_vm_prot_bits(prot);
+	       arch_calc_vm_prot_bits(prot, pkey);
 }
 
 /*
diff --git a/mm/mmap.c b/mm/mmap.c
index e2e9f48..784d2d6 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1313,7 +1313,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	 * to. we assume access permissions have been handled by the open
 	 * of the memory object, so we don't do any here.
 	 */
-	vm_flags |= calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
+	vm_flags |= calc_vm_prot_bits(prot, 0) | calc_vm_flag_bits(flags) |
 			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
 
 	if (flags & MAP_LOCKED)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index f7cb3d4..3790c8b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -380,7 +380,7 @@ SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
 	if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
 		prot |= PROT_EXEC;
 
-	vm_flags = calc_vm_prot_bits(prot);
+	vm_flags = calc_vm_prot_bits(prot, 0);
 
 	down_write(&current->mm->mmap_sem);
 
diff --git a/mm/nommu.c b/mm/nommu.c
index b64d04d..5ba39b8 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1082,7 +1082,7 @@ static unsigned long determine_vm_flags(struct file *file,
 {
 	unsigned long vm_flags;
 
-	vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags);
+	vm_flags = calc_vm_prot_bits(prot, 0) | calc_vm_flag_bits(flags);
 	/* vm_flags |= mm->def_flags; */
 
 	if (!(capabilities & NOMMU_MAP_DIRECT)) {

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [tip:mm/pkeys] mm/core, x86/mm/pkeys: Add arch_validate_pkey()
  2016-02-12 21:02 ` [PATCH 28/33] x86, pkeys: add arch_validate_pkey() Dave Hansen
@ 2016-02-18 20:25   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:25 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, peterz, brgerst, bp, hpa, riel, torvalds, dave.hansen,
	luto, akpm, linux-kernel, dave, dvlasenk, tglx

Commit-ID:  66d375709d2c891acc639538fd3179fa0cbb0daf
Gitweb:     http://git.kernel.org/tip/66d375709d2c891acc639538fd3179fa0cbb0daf
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:32 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 19:46:31 +0100

mm/core, x86/mm/pkeys: Add arch_validate_pkey()

The syscall-level code is passed a protection key and needs to
return an appropriate error code if the protection key is bogus.
We will be using this in subsequent patches.

Note that this also begins a series of arch-specific calls that
we need to expose in otherwise arch-independent code.  We create
a linux/pkeys.h header where we will put *all* the stubs for
these functions.
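
A hedged sketch of the intended consumer; pkey_mprotect() and
do_mprotect_pkey() are from the follow-on syscall series, so the
names here are illustrative only:

	SYSCALL_DEFINE4(pkey_mprotect, unsigned long, start, size_t, len,
			unsigned long, prot, int, pkey)
	{
		if (!validate_pkey(pkey))
			return -EINVAL;

		return do_mprotect_pkey(start, len, prot, pkey);
	}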

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210232.774EEAAB@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/Kconfig             |  1 +
 arch/x86/include/asm/pkeys.h |  6 ++++++
 include/linux/pkeys.h        | 25 +++++++++++++++++++++++++
 mm/Kconfig                   |  2 ++
 4 files changed, 34 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b875434..eda18ce 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -156,6 +156,7 @@ config X86
 	select X86_DEV_DMA_OPS			if X86_64
 	select X86_FEATURE_NAMES		if PROC_FS
 	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
+	select ARCH_HAS_PKEYS			if X86_INTEL_MEMORY_PROTECTION_KEYS
 
 config INSTRUCTION_DECODER
 	def_bool y
diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
new file mode 100644
index 0000000..04243c2
--- /dev/null
+++ b/arch/x86/include/asm/pkeys.h
@@ -0,0 +1,6 @@
+#ifndef _ASM_X86_PKEYS_H
+#define _ASM_X86_PKEYS_H
+
+#define arch_max_pkey() (boot_cpu_has(X86_FEATURE_OSPKE) ? 16 : 1)
+
+#endif /*_ASM_X86_PKEYS_H */
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
new file mode 100644
index 0000000..55e465f
--- /dev/null
+++ b/include/linux/pkeys.h
@@ -0,0 +1,25 @@
+#ifndef _LINUX_PKEYS_H
+#define _LINUX_PKEYS_H
+
+#include <linux/mm_types.h>
+#include <asm/mmu_context.h>
+
+#ifdef CONFIG_ARCH_HAS_PKEYS
+#include <asm/pkeys.h>
+#else /* ! CONFIG_ARCH_HAS_PKEYS */
+#define arch_max_pkey() (1)
+#endif /* ! CONFIG_ARCH_HAS_PKEYS */
+
+/*
+ * This is called from mprotect_pkey().
+ *
+ * Returns true if the protection key is valid.
+ */
+static inline bool validate_pkey(int pkey)
+{
+	if (pkey < 0)
+		return false;
+	return (pkey < arch_max_pkey());
+}
+
+#endif /* _LINUX_PKEYS_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 6cf4399..2702bb6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -672,3 +672,5 @@ config FRAME_VECTOR
 
 config ARCH_USES_HIGH_VMA_FLAGS
 	bool
+config ARCH_HAS_PKEYS
+	bool

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [tip:mm/pkeys] x86/mm: Factor out LDT init from context init
  2016-02-12 21:02 ` [PATCH 29/33] x86: separate out LDT init from context init Dave Hansen
@ 2016-02-18 20:26   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:26 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: dave, dvlasenk, hpa, peterz, linux-kernel, luto, riel, tglx,
	akpm, torvalds, dave.hansen, mingo, bp, brgerst

Commit-ID:  39a0526fb3f7d93433d146304278477eb463f8af
Gitweb:     http://git.kernel.org/tip/39a0526fb3f7d93433d146304278477eb463f8af
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:34 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 19:46:31 +0100

x86/mm: Factor out LDT init from context init

The arch-specific mm_context_t is a great place to put
protection-key allocation state.

But, we need to initialize the allocation state because pkey 0 is
always "allocated".  All of the runtime initialization of
mm_context_t is done in *_ldt() manipulation functions.  This
renames the existing LDT functions like this:

	init_new_context() -> init_new_context_ldt()
	destroy_context() -> destroy_context_ldt()

and makes init_new_context() and destroy_context() available for
generic use.
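
A minimal sketch of the follow-on use this enables; the
pkey_allocation_map field is from the later pkey allocation
patches and is shown only for illustration:

	static inline int init_new_context(struct task_struct *tsk,
					   struct mm_struct *mm)
	{
	#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
		/* pkey 0 is the default and always "allocated" */
		mm->context.pkey_allocation_map = 0x1;
	#endif
		return init_new_context_ldt(tsk, mm);
	}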

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210234.DB34FCC5@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/mmu_context.h | 21 ++++++++++++++++-----
 arch/x86/kernel/ldt.c              |  4 ++--
 2 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 6572b94..8428002 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -52,15 +52,15 @@ struct ldt_struct {
 /*
  * Used for LDT copy/destruction.
  */
-int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
-void destroy_context(struct mm_struct *mm);
+int init_new_context_ldt(struct task_struct *tsk, struct mm_struct *mm);
+void destroy_context_ldt(struct mm_struct *mm);
 #else	/* CONFIG_MODIFY_LDT_SYSCALL */
-static inline int init_new_context(struct task_struct *tsk,
-				   struct mm_struct *mm)
+static inline int init_new_context_ldt(struct task_struct *tsk,
+				       struct mm_struct *mm)
 {
 	return 0;
 }
-static inline void destroy_context(struct mm_struct *mm) {}
+static inline void destroy_context_ldt(struct mm_struct *mm) {}
 #endif
 
 static inline void load_mm_ldt(struct mm_struct *mm)
@@ -104,6 +104,17 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 #endif
 }
 
+static inline int init_new_context(struct task_struct *tsk,
+				   struct mm_struct *mm)
+{
+	/* propagate any LDT allocation failure: */
+	return init_new_context_ldt(tsk, mm);
+}
+static inline void destroy_context(struct mm_struct *mm)
+{
+	destroy_context_ldt(mm);
+}
+
 static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 			     struct task_struct *tsk)
 {
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 6acc9dd..6707039 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -103,7 +103,7 @@ static void free_ldt_struct(struct ldt_struct *ldt)
  * we do not have to muck with descriptors here, that is
  * done in switch_mm() as needed.
  */
-int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
+int init_new_context_ldt(struct task_struct *tsk, struct mm_struct *mm)
 {
 	struct ldt_struct *new_ldt;
 	struct mm_struct *old_mm;
@@ -144,7 +144,7 @@ out_unlock:
  *
  * 64bit: Don't touch the LDT register - we're already in the next thread.
  */
-void destroy_context(struct mm_struct *mm)
+void destroy_context_ldt(struct mm_struct *mm)
 {
 	free_ldt_struct(mm->context.ldt);
 	mm->context.ldt = NULL;

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [tip:mm/pkeys] x86/fpu: Allow setting of XSAVE state
  2016-02-12 21:02 ` [PATCH 30/33] x86, fpu: allow setting of XSAVE state Dave Hansen
@ 2016-02-18 20:26   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:26 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: akpm, dave, torvalds, oleg, linux-kernel, quentin.casasnovas,
	peterz, bp, luto, hpa, riel, tglx, fenghua.yu, dvlasenk, mingo,
	dave.hansen, brgerst

Commit-ID:  b8b9b6ba9dec3f155c7555cb208ba4078e97aedb
Gitweb:     http://git.kernel.org/tip/b8b9b6ba9dec3f155c7555cb208ba4078e97aedb
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:35 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 19:46:32 +0100

x86/fpu: Allow setting of XSAVE state

We want to modify the Protection Key rights inside the kernel, so
we need to change PKRU's contents.  But, if we do a plain
'wrpkru', when we return to userspace we might do an XRSTOR and
wipe out the kernel's 'wrpkru'.  So, we need to go after PKRU in
the xsave buffer.

We do this by:

  1. Ensuring that we have the XSAVE registers (fpregs) in the
     kernel FPU buffer (fpstate)
  2. Looking up the location of a given state in the buffer
  3. Filling in the state
  4. Ensuring that the hardware knows that state is present there
     (basically that the 'init optimization' is not in place).
  5. Copying the newly-modified state back to the registers if
     necessary.
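
In code, the usage pattern this provides is a simple bracketed
sequence (a minimal sketch; the PKRU-specific caller arrives
later in the series):

	fpu__current_fpstate_write_begin();
	/* ... modify state inside current->thread.fpu.state.xsave ... */
	fpu__current_fpstate_write_end();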

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210235.5A3139BF@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/fpu/internal.h |  2 +
 arch/x86/kernel/fpu/core.c          | 63 ++++++++++++++++++++++++
 arch/x86/kernel/fpu/xstate.c        | 98 ++++++++++++++++++++++++++++++++++++-
 3 files changed, 161 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index a212434..31ac8e6 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -25,6 +25,8 @@
 extern void fpu__activate_curr(struct fpu *fpu);
 extern void fpu__activate_fpstate_read(struct fpu *fpu);
 extern void fpu__activate_fpstate_write(struct fpu *fpu);
+extern void fpu__current_fpstate_write_begin(void);
+extern void fpu__current_fpstate_write_end(void);
 extern void fpu__save(struct fpu *fpu);
 extern void fpu__restore(struct fpu *fpu);
 extern int  fpu__restore_sig(void __user *buf, int ia32_frame);
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 299b58b..dea8e76 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -354,6 +354,69 @@ void fpu__activate_fpstate_write(struct fpu *fpu)
 }
 
 /*
+ * This function must be called before we write the current
+ * task's fpstate.
+ *
+ * This call gets the current FPU register state and moves
+ * it in to the 'fpstate'.  Preemption is disabled so that
+ * no writes to the 'fpstate' can occur from context
+ * switches.
+ *
+ * Must be followed by a fpu__current_fpstate_write_end().
+ */
+void fpu__current_fpstate_write_begin(void)
+{
+	struct fpu *fpu = &current->thread.fpu;
+
+	/*
+	 * Ensure that the context-switching code does not write
+	 * over the fpstate while we are doing our update.
+	 */
+	preempt_disable();
+
+	/*
+	 * Move the fpregs in to the fpu's 'fpstate'.
+	 */
+	fpu__activate_fpstate_read(fpu);
+
+	/*
+	 * The caller is about to write to 'fpu'.  Ensure that no
+	 * CPU thinks that its fpregs match the fpstate.  This
+	 * ensures we will not be lazy and skip a XRSTOR in the
+	 * future.
+	 */
+	fpu->last_cpu = -1;
+}
+
+/*
+ * This function must be paired with fpu__current_fpstate_write_begin()
+ *
+ * This will ensure that the modified fpstate gets placed back in
+ * the fpregs if necessary.
+ *
+ * Note: This function may be called whether or not an _actual_
+ * write to the fpstate occurred.
+ */
+void fpu__current_fpstate_write_end(void)
+{
+	struct fpu *fpu = &current->thread.fpu;
+
+	/*
+	 * 'fpu' now has an updated copy of the state, but the
+	 * registers may still be out of date.  Update them with
+	 * an XRSTOR if they are active.
+	 */
+	if (fpregs_active())
+		copy_kernel_to_fpregs(&fpu->state);
+
+	/*
+	 * Our update is done and the fpregs/fpstate are in sync
+	 * if necessary.  Context switches can happen again.
+	 */
+	preempt_enable();
+}
+
+/*
  * 'fpu__restore()' is called to copy FPU registers from
  * the FPU fpstate to the live hw registers and to activate
  * access to the hardware registers, so that FPU instructions
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index a63ca80..30d144f 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -679,6 +679,19 @@ void fpu__resume_cpu(void)
 }
 
 /*
+ * Given an xstate feature mask, calculate where in the xsave
+ * buffer the state is.  Callers should ensure that the buffer
+ * is valid.
+ *
+ * Note: does not work for compacted buffers.
+ */
+void *__raw_xsave_addr(struct xregs_state *xsave, int xstate_feature_mask)
+{
+	int feature_nr = fls64(xstate_feature_mask) - 1;
+
+	return (void *)xsave + xstate_comp_offsets[feature_nr];
+}
+/*
  * Given the xsave area and a state inside, this function returns the
  * address of the state.
  *
@@ -698,7 +711,6 @@ void fpu__resume_cpu(void)
  */
 void *get_xsave_addr(struct xregs_state *xsave, int xstate_feature)
 {
-	int feature_nr = fls64(xstate_feature) - 1;
 	/*
 	 * Do we even *have* xsave state?
 	 */
@@ -726,7 +738,7 @@ void *get_xsave_addr(struct xregs_state *xsave, int xstate_feature)
 	if (!(xsave->header.xfeatures & xstate_feature))
 		return NULL;
 
-	return (void *)xsave + xstate_comp_offsets[feature_nr];
+	return __raw_xsave_addr(xsave, xstate_feature);
 }
 EXPORT_SYMBOL_GPL(get_xsave_addr);
 
@@ -761,3 +773,85 @@ const void *get_xsave_field_ptr(int xsave_state)
 
 	return get_xsave_addr(&fpu->state.xsave, xsave_state);
 }
+
+
+/*
+ * Set xfeatures (aka XSTATE_BV) bit for a feature that we want
+ * to take out of its "init state".  This will ensure that an
+ * XRSTOR actually restores the state.
+ */
+static void fpu__xfeature_set_non_init(struct xregs_state *xsave,
+		int xstate_feature_mask)
+{
+	xsave->header.xfeatures |= xstate_feature_mask;
+}
+
+/*
+ * This function is safe to call whether the FPU is in use or not.
+ *
+ * Note that this only works on the current task.
+ *
+ * Inputs:
+ *	@xstate_feature_mask: mask of the feature to set, as defined
+ *	in xsave.h (e.g. XFEATURE_MASK_FP, XFEATURE_MASK_SSE, etc...)
+ *	@xstate_feature_src: a pointer to a copy of the state that you
+ *	would like written in to the current task's FPU xsave state.
+ *	This pointer must not be located in the current task's xsave
+ *	area.
+ *	@len: length in bytes of the state pointed to by
+ *	@xstate_feature_src
+ */
+static void fpu__xfeature_set_state(int xstate_feature_mask,
+		void *xstate_feature_src, size_t len)
+{
+	struct xregs_state *xsave = &current->thread.fpu.state.xsave;
+	struct fpu *fpu = &current->thread.fpu;
+	void *dst;
+
+	if (!boot_cpu_has(X86_FEATURE_XSAVE)) {
+		WARN_ONCE(1, "%s() attempted with no xsave support", __func__);
+		return;
+	}
+
+	/*
+	 * Tell the FPU code that we need the FPU state to be in
+	 * 'fpu' (not in the registers), and that we need it to
+	 * be stable while we write to it.
+	 */
+	fpu__current_fpstate_write_begin();
+
+	/*
+	 * This method *WILL* *NOT* work for compact-format
+	 * buffers.  If the 'xstate_feature_mask' is unset in
+	 * xcomp_bv then we may need to move other feature state
+	 * "up" in the buffer.
+	 */
+	if (xsave->header.xcomp_bv & xstate_feature_mask) {
+		WARN_ON_ONCE(1);
+		goto out;
+	}
+
+	/* find the location in the xsave buffer of the desired state */
+	dst = __raw_xsave_addr(&fpu->state.xsave, xstate_feature_mask);
+
+	/*
+	 * Make sure that the pointer being passed in did not
+	 * come from the xsave buffer itself.
+	 */
+	WARN_ONCE(xstate_feature_src == dst, "set from xsave buffer itself");
+
+	/* put the caller-provided data in the location */
+	memcpy(dst, xstate_feature_src, len);
+
+	/*
+	 * Mark the xfeature so that the CPU knows there is state
+	 * in the buffer now.
+	 */
+	fpu__xfeature_set_non_init(xsave, xstate_feature_mask);
+out:
+	/*
+	 * We are done writing to the 'fpu'.  Re-enable preemption
+	 * and (possibly) move the fpstate back in to the fpregs.
+	 */
+	fpu__current_fpstate_write_end();
+}
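
For illustration (not part of the patch), a minimal sketch of how a
caller uses the fpu__current_fpstate_write_begin()/_end() pair added
above; the MXCSR tweak is a made-up example:

	void example_modify_current_fpstate(void)
	{
		struct fpu *fpu = &current->thread.fpu;

		fpu__current_fpstate_write_begin();
		/* preemption is off; 'fpstate' is stable and up to date */
		fpu->state.xsave.i387.mxcsr &= ~0x3f;	/* clear exception flags */
		fpu__current_fpstate_write_end();
	}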

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [tip:mm/pkeys] x86/mm/pkeys: Allow kernel to modify user pkey rights register
  2016-02-12 21:02 ` [PATCH 31/33] x86, pkeys: allow kernel to modify user pkey rights register Dave Hansen
@ 2016-02-18 20:27   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, dave, akpm, bp, luto, tglx, mingo, peterz, hpa,
	riel, torvalds, dave.hansen, brgerst, dvlasenk

Commit-ID:  8459429693395ca9e8d18101300b120ad9171795
Gitweb:     http://git.kernel.org/tip/8459429693395ca9e8d18101300b120ad9171795
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:36 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 19:46:32 +0100

x86/mm/pkeys: Allow kernel to modify user pkey rights register

The Protection Key Rights for User memory (PKRU) is a 32-bit
user-accessible register.  It contains two bits for each
protection key: one to write-disable (WD) access to memory
covered by the key and another to access-disable (AD).

Userspace can read/write the register with the RDPKRU and WRPKRU
instructions.  But, the register is saved and restored with the
XSAVE family of instructions, which means we have to treat it
like a floating point register.

The kernel needs to write to the register if it wants to
implement execute-only memory or if it implements a system call
to change PKRU.

To do this, we need to create a 'pkru_state' buffer, read the old
contents in to it, modify it, and then tell the FPU code that
there is modified data in there so it can (possibly) move the
buffer back in to the registers.

This uses the fpu__xfeature_set_state() function that we defined
in the previous patch.
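
For illustration (not part of this patch), the PKRU bit layout that
the code below relies on, written as a hypothetical helper (it reuses
the PKRU_* macros this series adds to pgtable.h):

	/* two bits per key: AD in the low bit of each pair, WD above it */
	static u32 pkey_to_pkru_bits(int pkey, bool access_disable,
				     bool write_disable)
	{
		u32 bits = 0;

		if (access_disable)
			bits |= PKRU_AD_BIT;
		if (write_disable)
			bits |= PKRU_WD_BIT;
		return bits << (pkey * PKRU_BITS_PER_PKEY);
	}

Disabling all access to pkey 15, for instance, sets bit 30 in PKRU.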

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210236.0BE13217@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable.h |  5 +--
 arch/x86/include/asm/pkeys.h   |  3 ++
 arch/x86/kernel/fpu/xstate.c   | 74 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/pkeys.h          |  5 +++
 4 files changed, 85 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 3cbfae8..1ff49ec 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -921,16 +921,17 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
 
 #define PKRU_AD_BIT 0x1
 #define PKRU_WD_BIT 0x2
+#define PKRU_BITS_PER_PKEY 2
 
 static inline bool __pkru_allows_read(u32 pkru, u16 pkey)
 {
-	int pkru_pkey_bits = pkey * 2;
+	int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
 	return !(pkru & (PKRU_AD_BIT << pkru_pkey_bits));
 }
 
 static inline bool __pkru_allows_write(u32 pkru, u16 pkey)
 {
-	int pkru_pkey_bits = pkey * 2;
+	int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
 	/*
 	 * Access-disable disables writes too so we need to check
 	 * both bits here.
diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index 04243c2..5061aec 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -3,4 +3,7 @@
 
 #define arch_max_pkey() (boot_cpu_has(X86_FEATURE_OSPKE) ? 16 : 1)
 
+extern int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+		unsigned long init_val);
+
 #endif /*_ASM_X86_PKEYS_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 30d144f..50813c3 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -5,6 +5,7 @@
  */
 #include <linux/compat.h>
 #include <linux/cpu.h>
+#include <linux/pkeys.h>
 
 #include <asm/fpu/api.h>
 #include <asm/fpu/internal.h>
@@ -855,3 +856,76 @@ out:
 	 */
 	fpu__current_fpstate_write_end();
 }
+
+#define NR_VALID_PKRU_BITS (CONFIG_NR_PROTECTION_KEYS * 2)
+#define PKRU_VALID_MASK (NR_VALID_PKRU_BITS - 1)
+
+/*
+ * This will go out and modify the XSAVE buffer so that PKRU is
+ * set to a particular state for access to 'pkey'.
+ *
+ * PKRU state does affect kernel access to user memory.  We do
+ * not modify PKRU *itself* here, only the XSAVE state that will
+ * be restored in to PKRU when we return to userspace.
+ */
+int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+		unsigned long init_val)
+{
+	struct xregs_state *xsave = &tsk->thread.fpu.state.xsave;
+	struct pkru_state *old_pkru_state;
+	struct pkru_state new_pkru_state;
+	int pkey_shift = (pkey * PKRU_BITS_PER_PKEY);
+	u32 new_pkru_bits = 0;
+
+	if (!validate_pkey(pkey))
+		return -EINVAL;
+	/*
+	 * This check implies XSAVE support.  OSPKE only gets
+	 * set if we enable XSAVE and we enable PKU in XCR0.
+	 */
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return -EINVAL;
+
+	/* Set the bits we need in PKRU  */
+	if (init_val & PKEY_DISABLE_ACCESS)
+		new_pkru_bits |= PKRU_AD_BIT;
+	if (init_val & PKEY_DISABLE_WRITE)
+		new_pkru_bits |= PKRU_WD_BIT;
+
+	/* Shift the bits in to the correct place in PKRU for pkey. */
+	new_pkru_bits <<= pkey_shift;
+
+	/* Locate old copy of the state in the xsave buffer */
+	old_pkru_state = get_xsave_addr(xsave, XFEATURE_MASK_PKRU);
+
+	/*
+	 * When the state is not in the buffer, it is in its init
+	 * state; set it manually.  Otherwise, copy out the old
+	 * state.
+	 */
+	if (!old_pkru_state)
+		new_pkru_state.pkru = 0;
+	else
+		new_pkru_state.pkru = old_pkru_state->pkru;
+
+	/* mask off any old bits in place */
+	new_pkru_state.pkru &= ~((PKRU_AD_BIT|PKRU_WD_BIT) << pkey_shift);
+	/* Set the newly-requested bits */
+	new_pkru_state.pkru |= new_pkru_bits;
+
+	/*
+	 * We could theoretically live without zeroing pkru.pad.
+	 * The current XSAVE feature state definition says that
+	 * only bytes 0->3 are used.  But we do not want to
+	 * chance leaking kernel stack out to userspace in case a
+	 * memcpy() of the whole xsave buffer was done.
+	 *
+	 * They're in the same cacheline anyway.
+	 */
+	new_pkru_state.pad = 0;
+
+	fpu__xfeature_set_state(XFEATURE_MASK_PKRU, &new_pkru_state,
+			sizeof(new_pkru_state));
+
+	return 0;
+}
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index 55e465f..fc325b3 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -4,6 +4,11 @@
 #include <linux/mm_types.h>
 #include <asm/mmu_context.h>
 
+#define PKEY_DISABLE_ACCESS	0x1
+#define PKEY_DISABLE_WRITE	0x2
+#define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
+				 PKEY_DISABLE_WRITE)
+
 #ifdef CONFIG_ARCH_HAS_PKEYS
 #include <asm/pkeys.h>
 #else /* ! CONFIG_ARCH_HAS_PKEYS */

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [tip:mm/pkeys] x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
  2016-02-12 21:02 ` [PATCH 32/33] x86, pkeys: create an x86 arch_calc_vm_prot_bits() for VMA flags Dave Hansen
@ 2016-02-18 20:27   ` tip-bot for Dave Hansen
  0 siblings, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, dvlasenk, tglx, akpm, mingo, dave.hansen, peterz, brgerst,
	linux-kernel, luto, dave, torvalds, bp, riel

Commit-ID:  878ba03932d757ce4e954db4defec74a0de0435b
Gitweb:     http://git.kernel.org/tip/878ba03932d757ce4e954db4defec74a0de0435b
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:37 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 19:46:32 +0100

x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags

calc_vm_prot_bits() takes PROT_{READ,WRITE,EXECUTE} bits and
turns them in to the vma->vm_flags/VM_* bits.  We need to do a
similar thing for protection keys.

We take a protection key (4 bits) and encode it in to the 4
VM_PKEY_* bits.

Note: this code is not new.  It was simply a part of the
mprotect_pkey() patch in the past.  I broke it out for use
in the execute-only support.
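
For illustration (not part of this patch): with a hypothetical pkey
of 5 (binary 0101), the new macro sets VM_PKEY_BIT0 and VM_PKEY_BIT2:

	vm_flags_t vm_flags = arch_calc_vm_prot_bits(PROT_READ, 5);
	/* vm_flags == VM_PKEY_BIT0 | VM_PKEY_BIT2 */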

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210237.CFB94AD5@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/uapi/asm/mman.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h
index e8562e0..39bca7f 100644
--- a/arch/x86/include/uapi/asm/mman.h
+++ b/arch/x86/include/uapi/asm/mman.h
@@ -20,6 +20,12 @@
 		((vm_flags) & VM_PKEY_BIT1 ? _PAGE_PKEY_BIT1 : 0) |	\
 		((vm_flags) & VM_PKEY_BIT2 ? _PAGE_PKEY_BIT2 : 0) |	\
 		((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))
+
+#define arch_calc_vm_prot_bits(prot, key) (		\
+		((key) & 0x1 ? VM_PKEY_BIT0 : 0) |      \
+		((key) & 0x2 ? VM_PKEY_BIT1 : 0) |      \
+		((key) & 0x4 ? VM_PKEY_BIT2 : 0) |      \
+		((key) & 0x8 ? VM_PKEY_BIT3 : 0))
 #endif
 
 #include <asm-generic/mman.h>

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [tip:mm/pkeys] mm/core, x86/mm/pkeys: Add execute-only protection keys support
  2016-02-12 21:02 ` [PATCH 33/33] x86, pkeys: execute-only support Dave Hansen
  2016-02-17 21:27   ` Kees Cook
@ 2016-02-18 20:27   ` tip-bot for Dave Hansen
  1 sibling, 0 replies; 84+ messages in thread
From: tip-bot for Dave Hansen @ 2016-02-18 20:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: riel, peterz, david, tglx, luto, dvlasenk, brgerst, linux-kernel,
	torvalds, gang.chen.5i5j, keescook, dave, oleg, bp, aarcange,
	sds, luto, aneesh.kumar, mingo, hpa, mgorman, akpm, bp,
	dan.j.williams, kirill.shutemov, dave.hansen, koct9i,
	vladimir.murzin, dahi, will.deacon, kwapulinski.piotr

Commit-ID:  62b5f7d013fc455b8db26cf01e421f4c0d264b92
Gitweb:     http://git.kernel.org/tip/62b5f7d013fc455b8db26cf01e421f4c0d264b92
Author:     Dave Hansen <dave.hansen@linux.intel.com>
AuthorDate: Fri, 12 Feb 2016 13:02:40 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Feb 2016 19:46:33 +0100

mm/core, x86/mm/pkeys: Add execute-only protection keys support

Protection keys provide new page-based protection in hardware.
But, they have an interesting attribute: they only affect data
accesses and never affect instruction fetches.  That means that
if we set up some memory which is set as "access-disabled" via
protection keys, we can still execute from it.

This patch uses protection keys to set up mappings to do just that.
If a user calls:

	mmap(..., PROT_EXEC);
or
	mprotect(ptr, sz, PROT_EXEC);

(note PROT_EXEC-only without PROT_READ/WRITE), the kernel will
notice this, and set a special protection key on the memory.  It
also sets the appropriate bits in the Protection Keys User Rights
(PKRU) register so that the memory becomes unreadable and
unwritable.

I haven't found any userspace that does this today.  With this
facility in place, we expect userspace to move to use it
eventually.  Userspace _could_ start doing this today.  Any
PROT_EXEC calls get converted to PROT_READ inside the kernel, and
would transparently be upgraded to "true" PROT_EXEC with this
code.  IOW, userspace never has to do any PROT_EXEC runtime
detection.

This feature provides enhanced protection against leaking
executable memory contents.  This helps thwart attacks which are
attempting to find ROP gadgets on the fly.

But, the security provided by this approach is not comprehensive.
The PKRU register which controls access permissions is a normal
user register writable from unprivileged userspace.  An attacker
who can execute the 'wrpkru' instruction can easily disable the
protection provided by this feature.

The protection key that is used for execute-only support is
permanently dedicated at compile time.  This is fine for now
because there is currently no API to set a protection key other
than this one.

Despite there being a constant PKRU value across the entire
system, we do not set it unless this feature is in use in a
process.  That is to preserve the PKRU XSAVE 'init state',
which can lead to faster context switches.

PKRU *is* a user register and the kernel is modifying it.  That
means that code doing:

	pkru = rdpkru();
	pkru |= 0x100;
	mmap(..., PROT_EXEC);
	wrpkru(pkru);

could lose the bits in PKRU that enforce execute-only
permissions.  To avoid this, we suggest avoiding ever calling
mmap() or mprotect() when the PKRU value is expected to be
unstable.
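
For illustration (not part of this patch), a sketch of the userspace
pattern that gets transparently upgraded to execute-only; 'payload'
and 'payload_len' are made up and error handling is omitted:

	void *code = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memcpy(code, payload, payload_len);	/* fill in instructions */
	mprotect(code, 4096, PROT_EXEC);	/* becomes true execute-only */
	/* instruction fetches still work; data reads and writes fault */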

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Piotr Kwapulinski <kwapulinski.piotr@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Vladimir Murzin <vladimir.murzin@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: keescook@google.com
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210240.CB4BB5CA@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pkeys.h |  25 +++++++++++
 arch/x86/kernel/fpu/xstate.c |   2 -
 arch/x86/mm/Makefile         |   2 +
 arch/x86/mm/fault.c          |  10 +++++
 arch/x86/mm/pkeys.c          | 101 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/pkeys.h        |   3 ++
 mm/mmap.c                    |  10 ++++-
 mm/mprotect.c                |   8 ++--
 8 files changed, 154 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index 5061aec..7b84565 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -6,4 +6,29 @@
 extern int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 		unsigned long init_val);
 
+/*
+ * Try to dedicate one of the protection keys to be used as an
+ * execute-only protection key.
+ */
+#define PKEY_DEDICATED_EXECUTE_ONLY 15
+extern int __execute_only_pkey(struct mm_struct *mm);
+static inline int execute_only_pkey(struct mm_struct *mm)
+{
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return 0;
+
+	return __execute_only_pkey(mm);
+}
+
+extern int __arch_override_mprotect_pkey(struct vm_area_struct *vma,
+		int prot, int pkey);
+static inline int arch_override_mprotect_pkey(struct vm_area_struct *vma,
+		int prot, int pkey)
+{
+	if (!boot_cpu_has(X86_FEATURE_OSPKE))
+		return 0;
+
+	return __arch_override_mprotect_pkey(vma, prot, pkey);
+}
+
 #endif /*_ASM_X86_PKEYS_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 50813c3..1b19818 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -877,8 +877,6 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 	int pkey_shift = (pkey * PKRU_BITS_PER_PKEY);
 	u32 new_pkru_bits = 0;
 
-	if (!validate_pkey(pkey))
-		return -EINVAL;
 	/*
 	 * This check implies XSAVE support.  OSPKE only gets
 	 * set if we enable XSAVE and we enable PKU in XCR0.
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index f9d38a4..67cf2e1 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -34,3 +34,5 @@ obj-$(CONFIG_ACPI_NUMA)		+= srat.o
 obj-$(CONFIG_NUMA_EMU)		+= numa_emulation.o
 
 obj-$(CONFIG_X86_INTEL_MPX)	+= mpx.o
+obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
+
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index d81744e..5877b92 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1108,6 +1108,16 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
 	 */
 	if (error_code & PF_PK)
 		return 1;
+
+	if (!(error_code & PF_INSTR)) {
+		/*
+		 * Assume all accesses require either read or execute
+		 * permissions.  This is not an instruction access, so
+		 * it requires read permissions.
+		 */
+		if (!(vma->vm_flags & VM_READ))
+			return 1;
+	}
 	/*
 	 * Make sure to check the VMA so that we do not perform
 	 * faults just to hit a PF_PK as soon as we fill in a
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
new file mode 100644
index 0000000..e8c4744
--- /dev/null
+++ b/arch/x86/mm/pkeys.c
@@ -0,0 +1,101 @@
+/*
+ * Intel Memory Protection Keys management
+ * Copyright (c) 2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#include <linux/mm_types.h>             /* mm_struct, vma, etc...       */
+#include <linux/pkeys.h>                /* PKEY_*                       */
+#include <uapi/asm-generic/mman-common.h>
+
+#include <asm/cpufeature.h>             /* boot_cpu_has, ...            */
+#include <asm/mmu_context.h>            /* vma_pkey()                   */
+#include <asm/fpu/internal.h>           /* fpregs_active()              */
+
+int __execute_only_pkey(struct mm_struct *mm)
+{
+	int ret;
+
+	/*
+	 * We do not want to go through the relatively costly
+	 * dance to set PKRU if we do not need to.  Check it
+	 * first and assume that if the execute-only pkey is
+	 * write-disabled, we do not have to set it up
+	 * ourselves.  We need preempt off so that nobody
+	 * can make fpregs inactive.
+	 */
+	preempt_disable();
+	if (fpregs_active() &&
+	    !__pkru_allows_read(read_pkru(), PKEY_DEDICATED_EXECUTE_ONLY)) {
+		preempt_enable();
+		return PKEY_DEDICATED_EXECUTE_ONLY;
+	}
+	preempt_enable();
+	ret = arch_set_user_pkey_access(current, PKEY_DEDICATED_EXECUTE_ONLY,
+			PKEY_DISABLE_ACCESS);
+	/*
+	 * If the PKRU-set operation failed somehow, just return
+	 * 0 and effectively disable execute-only support.
+	 */
+	if (ret)
+		return 0;
+
+	return PKEY_DEDICATED_EXECUTE_ONLY;
+}
+
+static inline bool vma_is_pkey_exec_only(struct vm_area_struct *vma)
+{
+	/* Do this check first since the vm_flags should be hot */
+	if ((vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) != VM_EXEC)
+		return false;
+	if (vma_pkey(vma) != PKEY_DEDICATED_EXECUTE_ONLY)
+		return false;
+
+	return true;
+}
+
+/*
+ * This is only called for *plain* mprotect calls.
+ */
+int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot, int pkey)
+{
+	/*
+	 * Is this an mprotect_pkey() call?  If so, never
+	 * override the value that came from the user.
+	 */
+	if (pkey != -1)
+		return pkey;
+	/*
+	 * Look for a protection-key-driven execute-only mapping
+	 * which is now being given permissions that are not
+	 * execute-only.  Move it back to the default pkey.
+	 */
+	if (vma_is_pkey_exec_only(vma) &&
+	    (prot & (PROT_READ|PROT_WRITE))) {
+		return 0;
+	}
+	/*
+	 * The mapping is execute-only.  Go try to get the
+	 * execute-only protection key.  If we fail to do that,
+	 * fall through as if we do not have execute-only
+	 * support.
+	 */
+	if (prot == PROT_EXEC) {
+		pkey = execute_only_pkey(vma->vm_mm);
+		if (pkey > 0)
+			return pkey;
+	}
+	/*
+	 * This is a vanilla, non-pkey mprotect (or we failed to
+	 * set up execute-only), so inherit the pkey from the VMA we
+	 * are working on.
+	 */
+	return vma_pkey(vma);
+}
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index fc325b3..1d405a2 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -13,6 +13,9 @@
 #include <asm/pkeys.h>
 #else /* ! CONFIG_ARCH_HAS_PKEYS */
 #define arch_max_pkey() (1)
+#define execute_only_pkey(mm) (0)
+#define arch_override_mprotect_pkey(vma, prot, pkey) (0)
+#define PKEY_DEDICATED_EXECUTE_ONLY 0
 #endif /* ! CONFIG_ARCH_HAS_PKEYS */
 
 /*
diff --git a/mm/mmap.c b/mm/mmap.c
index 784d2d6..0175b7d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -43,6 +43,7 @@
 #include <linux/printk.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/moduleparam.h>
+#include <linux/pkeys.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -1270,6 +1271,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			unsigned long pgoff, unsigned long *populate)
 {
 	struct mm_struct *mm = current->mm;
+	int pkey = 0;
 
 	*populate = 0;
 
@@ -1309,11 +1311,17 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	if (offset_in_page(addr))
 		return addr;
 
+	if (prot == PROT_EXEC) {
+		pkey = execute_only_pkey(mm);
+		if (pkey < 0)
+			pkey = 0;
+	}
+
 	/* Do simple checking here so the lower-level routines won't have
 	 * to. we assume access permissions have been handled by the open
 	 * of the memory object, so we don't do any here.
 	 */
-	vm_flags |= calc_vm_prot_bits(prot, 0) | calc_vm_flag_bits(flags) |
+	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
 			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
 
 	if (flags & MAP_LOCKED)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 3790c8b..fa37c4c 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -24,6 +24,7 @@
 #include <linux/migrate.h>
 #include <linux/perf_event.h>
 #include <linux/ksm.h>
+#include <linux/pkeys.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
@@ -354,7 +355,7 @@ fail:
 SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
 		unsigned long, prot)
 {
-	unsigned long vm_flags, nstart, end, tmp, reqprot;
+	unsigned long nstart, end, tmp, reqprot;
 	struct vm_area_struct *vma, *prev;
 	int error = -EINVAL;
 	const int grows = prot & (PROT_GROWSDOWN|PROT_GROWSUP);
@@ -380,8 +381,6 @@ SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
 	if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
 		prot |= PROT_EXEC;
 
-	vm_flags = calc_vm_prot_bits(prot, 0);
-
 	down_write(&current->mm->mmap_sem);
 
 	vma = find_vma(current->mm, start);
@@ -411,10 +410,11 @@ SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
 
 	for (nstart = start ; ; ) {
 		unsigned long newflags;
+		int pkey = arch_override_mprotect_pkey(vma, prot, -1);
 
 		/* Here we know that vma->vm_start <= nstart < vma->vm_end. */
 
-		newflags = vm_flags;
+		newflags = calc_vm_prot_bits(prot, pkey);
 		newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
 
 		/* newflags >> 4 shift VM_MAY% in place of VM_% */

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH] x86/mm/pkeys: Do not enable them by default
  2016-02-18 20:16   ` [tip:mm/pkeys] x86/mm/pkeys: " tip-bot for Dave Hansen
@ 2016-02-19 11:27     ` Borislav Petkov
  2016-02-19 17:11       ` Dave Hansen
  0 siblings, 1 reply; 84+ messages in thread
From: Borislav Petkov @ 2016-02-19 11:27 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-tip-commits, linux-kernel, brgerst, tglx, torvalds, hpa,
	dave.hansen, akpm, peterz, mingo, luto, dvlasenk, dave, riel

On Thu, Feb 18, 2016 at 12:16:47PM -0800, tip-bot for Dave Hansen wrote:
> Commit-ID:  35e97790f5f1e5cf2b5522c55e3e31d5c81bd226
> Gitweb:     http://git.kernel.org/tip/35e97790f5f1e5cf2b5522c55e3e31d5c81bd226
> Author:     Dave Hansen <dave.hansen@linux.intel.com>
> AuthorDate: Fri, 12 Feb 2016 13:02:00 -0800
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Tue, 16 Feb 2016 10:11:13 +0100
> 
> x86/mm/pkeys: Add Kconfig option
> 
> I don't have a strong opinion on whether we need a Kconfig prompt
> or not.  Protection Keys has relatively little code associated
> with it, and it is not a heavyweight feature to keep enabled.
> However, I can imagine that folks would still appreciate being
> able to disable it.
> 
> Note that, with disabled-features.h, the checks in the code
> for protection keys are always the same:
> 
> 	cpu_has(c, X86_FEATURE_PKU)
> 
> With the config option disabled, this essentially turns into an

whoops, something is missing here. An "#ifdef."

...

>  arch/x86/Kconfig | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index ab2ed53..3632cdd 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1714,6 +1714,10 @@ config X86_INTEL_MPX
>  
>  	  If unsure, say N.
>  
> +config X86_INTEL_MEMORY_PROTECTION_KEYS
> +	def_bool y

This is not necessary.

---
From: Borislav Petkov <bp@suse.de>
Date: Fri, 19 Feb 2016 12:19:50 +0100
Subject: [PATCH] x86/mm/pkeys: Do not enable them by default

No need to default to y.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/Kconfig | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d10826d2cb5e..109bc46ccb60 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1719,8 +1719,7 @@ config X86_INTEL_MPX
 	  If unsure, say N.
 
 config X86_INTEL_MEMORY_PROTECTION_KEYS
-	prompt "Intel Memory Protection Keys"
-	def_bool y
+	bool "Intel Memory Protection Keys"
 	# Note: only available in 64-bit mode
 	depends on CPU_SUP_INTEL && X86_64
 	---help---
-- 
2.3.5

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH] x86/mm/pkeys: Do not enable them by default
  2016-02-19 11:27     ` [PATCH] x86/mm/pkeys: Do not enable them by default Borislav Petkov
@ 2016-02-19 17:11       ` Dave Hansen
  2016-02-19 17:23         ` Borislav Petkov
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-19 17:11 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen
  Cc: linux-tip-commits, linux-kernel, brgerst, tglx, torvalds, hpa,
	akpm, peterz, mingo, luto, dvlasenk, riel

On 02/19/2016 03:27 AM, Borislav Petkov wrote:
>  config X86_INTEL_MEMORY_PROTECTION_KEYS
> -	prompt "Intel Memory Protection Keys"
> -	def_bool y
> +	bool "Intel Memory Protection Keys"
>  	# Note: only available in 64-bit mode
>  	depends on CPU_SUP_INTEL && X86_64
>  	---help---

Hi Borislav,

I'd really prefer this be left on by default.  This is a feature that I
expect to be widely enabled in distribution kernels.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] x86/mm/pkeys: Do not enable them by default
  2016-02-19 17:11       ` Dave Hansen
@ 2016-02-19 17:23         ` Borislav Petkov
  2016-02-19 17:49           ` Dave Hansen
  0 siblings, 1 reply; 84+ messages in thread
From: Borislav Petkov @ 2016-02-19 17:23 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, linux-tip-commits, linux-kernel, brgerst, tglx,
	torvalds, hpa, akpm, peterz, mingo, luto, dvlasenk, riel

On Fri, Feb 19, 2016 at 09:11:03AM -0800, Dave Hansen wrote:
> I'd really prefer this be left on by default.  This is a feature that I
> expect to be widely enabled in distribution kernels.

Distribution kernels can enable it without defaulting to y here. Also,
this code doesn't need to be built on the majority of x86 boxes out
there because they don't have the hw support.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] x86/mm/pkeys: Do not enable them by default
  2016-02-19 17:23         ` Borislav Petkov
@ 2016-02-19 17:49           ` Dave Hansen
  2016-02-19 18:31             ` Borislav Petkov
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Hansen @ 2016-02-19 17:49 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Dave Hansen, linux-kernel, brgerst, tglx, torvalds, hpa, akpm,
	peterz, mingo, luto, dvlasenk, riel, Linus Torvalds

Borislav, we're talking about 1566 bytes of text here:

>    text	   data	    bss	    dec	    hex	filename
> 13874312	2633704	3014656	19522672	129e470	64bit-pkey/vmlinux
> 13872746	2633648	3014656	19521050	129de1a	64bit-nopkey/vmlinux

For gains that small, we should barely even allow this thing to be
configurable, much less default it to off.

On 02/19/2016 09:23 AM, Borislav Petkov wrote:
> On Fri, Feb 19, 2016 at 09:11:03AM -0800, Dave Hansen wrote:
>> I'd really prefer this be left on by default.  This is a feature that I
>> expect to be widely enabled in distribution kernels.
> 
> Distribution kernels can enable it without defaulting to y here.

Yes, agreed.  Distros _can_ override things.  But, in general,
*especially* with user-visible effects, I'd really like defconfig (or
other build defaults) to be reasonably close to what distributions do.

> Also, this code doesn't need to be built on the majority of x86 boxes
> out there because they don't have the hw support.

My view has always been that the folks who really care about binary
size are the ones who will be digging through their .configs,
turning things off.

BTW, what percentage of x86 boxes must have a feature before we can
enable it by default?

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] x86/mm/pkeys: Do not enable them by default
  2016-02-19 17:49           ` Dave Hansen
@ 2016-02-19 18:31             ` Borislav Petkov
  0 siblings, 0 replies; 84+ messages in thread
From: Borislav Petkov @ 2016-02-19 18:31 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, linux-kernel, brgerst, tglx, torvalds, hpa, akpm,
	peterz, mingo, luto, dvlasenk, riel

On Fri, Feb 19, 2016 at 09:49:02AM -0800, Dave Hansen wrote:
> For gains that small, we should barely even allow this thing to be
> configurable, much less default it to off.

Well, it's not about size gains only - it is also about adding code to
the kernel which is never going to be executed and also building it each
time on !pkey machines.

> Yes, agreed.  Distros _can_ override things.  But, In general,
> *especially* with user-visible effects, I'd really like defconfig (or
> other build defaults) to be reasonably close to what distributions do.

You can always add it to arch/x86/configs/x86_64_defconfig.

> My view has always been that the folks that really care about binary
> size are the ones that will be the ones digging through their .configs
> turning things off.

Except when automating "make oldconfig" and then wondering why all of a
sudden new features are enabled.

In my experience, it has always been the case that new features default
to off. What you want to achieve can easily be done without the
"def_bool y" thing.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [tip:x86/pkeys] mm/gup: Introduce get_user_pages_remote()
  2016-02-16 12:14   ` [tip:x86/pkeys] mm/gup: Introduce get_user_pages_remote() tip-bot for Dave Hansen
@ 2016-02-20  6:25     ` Konstantin Khlebnikov
  0 siblings, 0 replies; 84+ messages in thread
From: Konstantin Khlebnikov @ 2016-02-20  6:25 UTC (permalink / raw)
  To: Ingo Molnar, Borislav Petkov, Andrew Morton, Dave Hansen,
	Vlastimil Babka, Andrea Arcangeli, H. Peter Anvin,
	Linux Kernel Mailing List, Srikar Dronamraju, Peter Zijlstra,
	Kirill A. Shutemov, dvlasenk, Dave Hansen, Thomas Gleixner,
	Rik van Riel, brgerst, Andy Lutomirski, Naoya Horiguchi,
	Linus Torvalds
  Cc: linux-tip-commits

On Tue, Feb 16, 2016 at 3:14 PM, tip-bot for Dave Hansen
<tipbot@zytor.com> wrote:
> Commit-ID:  1e9877902dc7e11d2be038371c6fbf2dfcd469d7
> Gitweb:     http://git.kernel.org/tip/1e9877902dc7e11d2be038371c6fbf2dfcd469d7
> Author:     Dave Hansen <dave.hansen@linux.intel.com>
> AuthorDate: Fri, 12 Feb 2016 13:01:54 -0800
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Tue, 16 Feb 2016 10:04:09 +0100
>
> mm/gup: Introduce get_user_pages_remote()
>
> For protection keys, we need to understand whether protections
> should be enforced in software or not.  In general, we enforce
> protections when working on our own task, but not when on others.
> We call these "current" and "remote" operations.
>
> This patch introduces a new get_user_pages() variant:
>
>         get_user_pages_remote()
>
> Which is a replacement for when get_user_pages() is called on
> non-current tsk/mm.

As I see it, the task_struct argument can be NULL here, just as in the
old API.  The only usage for it is updating task->maj_flt/min_flt.

Maybe just remove the argument and always account major/minor faults to
the current task: currently the counters are plain unsigned long, so a
remote access could corrupt them.
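
For reference, the accounting in question has roughly this shape in
mm/gup.c's faultin_page() (a sketch, not a verbatim quote):

	if (tsk) {
		if (ret & VM_FAULT_MAJOR)
			tsk->maj_flt++;
		else
			tsk->min_flt++;
	}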

>
> We also introduce a new gup flag: FOLL_REMOTE which can be used
> for the "__" gup variants to get this new behavior.
>
> The uprobes is_trap_at_addr() location holds mmap_sem and
> calls get_user_pages(current->mm) on an instruction address.  This
> makes it a pretty unique gup caller.  Being an instruction access
> and also really originating from the kernel (vs. the app), I opted
> to consider this a 'remote' access where protection keys will not
> be enforced.
>
> Without protection keys, this patch should not change any behavior.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Dave Hansen <dave@sr71.net>
> Cc: Denys Vlasenko <dvlasenk@redhat.com>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: jack@suse.cz
> Cc: linux-mm@kvack.org
> Link: http://lkml.kernel.org/r/20160212210154.3F0E51EA@viggo.jf.intel.com
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  drivers/gpu/drm/etnaviv/etnaviv_gem.c   |  6 +++---
>  drivers/gpu/drm/i915/i915_gem_userptr.c | 10 +++++-----
>  drivers/infiniband/core/umem_odp.c      |  8 ++++----
>  fs/exec.c                               |  8 ++++++--
>  include/linux/mm.h                      |  5 +++++
>  kernel/events/uprobes.c                 | 10 ++++++++--
>  mm/gup.c                                | 27 ++++++++++++++++++++++-----
>  mm/memory.c                             |  2 +-
>  mm/process_vm_access.c                  | 11 ++++++++---
>  security/tomoyo/domain.c                |  9 ++++++++-
>  virt/kvm/async_pf.c                     |  8 +++++++-
>  11 files changed, 77 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem.c b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
> index 4b519e4..97d4457 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_gem.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
> @@ -753,9 +753,9 @@ static struct page **etnaviv_gem_userptr_do_get_pages(
>
>         down_read(&mm->mmap_sem);
>         while (pinned < npages) {
> -               ret = get_user_pages(task, mm, ptr, npages - pinned,
> -                                    !etnaviv_obj->userptr.ro, 0,
> -                                    pvec + pinned, NULL);
> +               ret = get_user_pages_remote(task, mm, ptr, npages - pinned,
> +                                           !etnaviv_obj->userptr.ro, 0,
> +                                           pvec + pinned, NULL);
>                 if (ret < 0)
>                         break;
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index 59e45b3..90dbf81 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -584,11 +584,11 @@ __i915_gem_userptr_get_pages_worker(struct work_struct *_work)
>
>                 down_read(&mm->mmap_sem);
>                 while (pinned < npages) {
> -                       ret = get_user_pages(work->task, mm,
> -                                            obj->userptr.ptr + pinned * PAGE_SIZE,
> -                                            npages - pinned,
> -                                            !obj->userptr.read_only, 0,
> -                                            pvec + pinned, NULL);
> +                       ret = get_user_pages_remote(work->task, mm,
> +                                       obj->userptr.ptr + pinned * PAGE_SIZE,
> +                                       npages - pinned,
> +                                       !obj->userptr.read_only, 0,
> +                                       pvec + pinned, NULL);
>                         if (ret < 0)
>                                 break;
>
> diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
> index e69bf26..75077a0 100644
> --- a/drivers/infiniband/core/umem_odp.c
> +++ b/drivers/infiniband/core/umem_odp.c
> @@ -572,10 +572,10 @@ int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 user_virt, u64 bcnt,
>                  * complex (and doesn't gain us much performance in most use
>                  * cases).
>                  */
> -               npages = get_user_pages(owning_process, owning_mm, user_virt,
> -                                       gup_num_pages,
> -                                       access_mask & ODP_WRITE_ALLOWED_BIT, 0,
> -                                       local_page_list, NULL);
> +               npages = get_user_pages_remote(owning_process, owning_mm,
> +                               user_virt, gup_num_pages,
> +                               access_mask & ODP_WRITE_ALLOWED_BIT,
> +                               0, local_page_list, NULL);
>                 up_read(&owning_mm->mmap_sem);
>
>                 if (npages < 0)
> diff --git a/fs/exec.c b/fs/exec.c
> index dcd4ac7..d885b98 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -198,8 +198,12 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
>                         return NULL;
>         }
>  #endif
> -       ret = get_user_pages(current, bprm->mm, pos,
> -                       1, write, 1, &page, NULL);
> +       /*
> +        * We are doing an exec().  'current' is the process
> +        * doing the exec and bprm->mm is the new process's mm.
> +        */
> +       ret = get_user_pages_remote(current, bprm->mm, pos, 1, write,
> +                       1, &page, NULL);
>         if (ret <= 0)
>                 return NULL;
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b1d4b8c..faf3b70 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1225,6 +1225,10 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>                       unsigned long start, unsigned long nr_pages,
>                       unsigned int foll_flags, struct page **pages,
>                       struct vm_area_struct **vmas, int *nonblocking);
> +long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
> +                           unsigned long start, unsigned long nr_pages,
> +                           int write, int force, struct page **pages,
> +                           struct vm_area_struct **vmas);
>  long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>                     unsigned long start, unsigned long nr_pages,
>                     int write, int force, struct page **pages,
> @@ -2170,6 +2174,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
>  #define FOLL_MIGRATION 0x400   /* wait for page to replace migration entry */
>  #define FOLL_TRIED     0x800   /* a retry, previous pass started an IO */
>  #define FOLL_MLOCK     0x1000  /* lock present pages */
> +#define FOLL_REMOTE    0x2000  /* we are working on non-current tsk/mm */
>
>  typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
>                         void *data);
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 0167679..8eef5f5 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -299,7 +299,7 @@ int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr,
>
>  retry:
>         /* Read the page with vaddr into memory */
> -       ret = get_user_pages(NULL, mm, vaddr, 1, 0, 1, &old_page, &vma);
> +       ret = get_user_pages_remote(NULL, mm, vaddr, 1, 0, 1, &old_page, &vma);
>         if (ret <= 0)
>                 return ret;
>
> @@ -1700,7 +1700,13 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
>         if (likely(result == 0))
>                 goto out;
>
> -       result = get_user_pages(NULL, mm, vaddr, 1, 0, 1, &page, NULL);
> +       /*
> +        * The NULL 'tsk' here ensures that any faults that occur here
> +        * will not be accounted to the task.  'mm' *is* current->mm,
> +        * but we treat this as a 'remote' access since it is
> +        * essentially a kernel access to the memory.
> +        */
> +       result = get_user_pages_remote(NULL, mm, vaddr, 1, 0, 1, &page, NULL);
>         if (result < 0)
>                 return result;
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 7bf19ff..36ca850 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -870,7 +870,7 @@ long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
>  EXPORT_SYMBOL(get_user_pages_unlocked);
>
>  /*
> - * get_user_pages() - pin user pages in memory
> + * get_user_pages_remote() - pin user pages in memory
>   * @tsk:       the task_struct to use for page fault accounting, or
>   *             NULL if faults are not to be recorded.
>   * @mm:                mm_struct of target mm
> @@ -924,12 +924,29 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
>   * should use get_user_pages because it cannot pass
>   * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault.
>   */
> -long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> -               unsigned long start, unsigned long nr_pages, int write,
> -               int force, struct page **pages, struct vm_area_struct **vmas)
> +long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
> +               unsigned long start, unsigned long nr_pages,
> +               int write, int force, struct page **pages,
> +               struct vm_area_struct **vmas)
>  {
>         return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
> -                                      pages, vmas, NULL, false, FOLL_TOUCH);
> +                                      pages, vmas, NULL, false,
> +                                      FOLL_TOUCH | FOLL_REMOTE);
> +}
> +EXPORT_SYMBOL(get_user_pages_remote);
> +
> +/*
> + * This is the same as get_user_pages_remote() for the time
> + * being.
> + */
> +long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> +               unsigned long start, unsigned long nr_pages,
> +               int write, int force, struct page **pages,
> +               struct vm_area_struct **vmas)
> +{
> +       return __get_user_pages_locked(tsk, mm, start, nr_pages,
> +                                      write, force, pages, vmas, NULL, false,
> +                                      FOLL_TOUCH);
>  }
>  EXPORT_SYMBOL(get_user_pages);
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 38090ca..8bfbad0 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3685,7 +3685,7 @@ static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
>                 void *maddr;
>                 struct page *page = NULL;
>
> -               ret = get_user_pages(tsk, mm, addr, 1,
> +               ret = get_user_pages_remote(tsk, mm, addr, 1,
>                                 write, 1, &page, &vma);
>                 if (ret <= 0) {
>  #ifndef CONFIG_HAVE_IOREMAP_PROT
> diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
> index 5d453e5..07514d4 100644
> --- a/mm/process_vm_access.c
> +++ b/mm/process_vm_access.c
> @@ -98,9 +98,14 @@ static int process_vm_rw_single_vec(unsigned long addr,
>                 int pages = min(nr_pages, max_pages_per_loop);
>                 size_t bytes;
>
> -               /* Get the pages we're interested in */
> -               pages = get_user_pages_unlocked(task, mm, pa, pages,
> -                                               vm_write, 0, process_pages);
> +               /*
> +                * Get the pages we're interested in.  We must
> +                * add FOLL_REMOTE because task/mm might not be
> +                * current/current->mm.
> +                */
> +               pages = __get_user_pages_unlocked(task, mm, pa, pages,
> +                                                 vm_write, 0, process_pages,
> +                                                 FOLL_REMOTE);
>                 if (pages <= 0)
>                         return -EFAULT;
>
> diff --git a/security/tomoyo/domain.c b/security/tomoyo/domain.c
> index 3865145..ade7c6c 100644
> --- a/security/tomoyo/domain.c
> +++ b/security/tomoyo/domain.c
> @@ -874,7 +874,14 @@ bool tomoyo_dump_page(struct linux_binprm *bprm, unsigned long pos,
>         }
>         /* Same with get_arg_page(bprm, pos, 0) in fs/exec.c */
>  #ifdef CONFIG_MMU
> -       if (get_user_pages(current, bprm->mm, pos, 1, 0, 1, &page, NULL) <= 0)
> +       /*
> +        * This is called at execve() time in order to dig around
> +        * in the argv/environment of the new process
> +        * (represented by bprm).  'current' is the process doing
> +        * the execve().
> +        */
> +       if (get_user_pages_remote(current, bprm->mm, pos, 1,
> +                               0, 1, &page, NULL) <= 0)
>                 return false;
>  #else
>         page = bprm->page[pos / PAGE_SIZE];
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index 3531599..d604e87 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -79,7 +79,13 @@ static void async_pf_execute(struct work_struct *work)
>
>         might_sleep();
>
> -       get_user_pages_unlocked(NULL, mm, addr, 1, 1, 0, NULL);
> +       /*
> +        * This work is run asynchronously to the task which owns
> +        * mm and might be done in another context, so we must
> +        * use FOLL_REMOTE.
> +        */
> +       __get_user_pages_unlocked(NULL, mm, addr, 1, 1, 0, NULL, FOLL_REMOTE);
> +
>         kvm_async_page_present_sync(vcpu, apf);
>
>         spin_lock(&vcpu->async_pf.lock);

^ permalink raw reply	[flat|nested] 84+ messages in thread

end of thread, other threads:[~2016-02-20  6:26 UTC | newest]

Thread overview: 84+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-12 21:01 [PATCH 00/33] x86: Memory Protection Keys (v10) Dave Hansen
2016-02-12 21:01 ` [PATCH 01/33] mm: introduce get_user_pages_remote() Dave Hansen
2016-02-15  6:09   ` Balbir Singh
2016-02-15 16:29     ` Dave Hansen
2016-02-15  6:14   ` Srikar Dronamraju
2016-02-16 12:14   ` [tip:x86/pkeys] mm/gup: Introduce get_user_pages_remote() tip-bot for Dave Hansen
2016-02-20  6:25     ` Konstantin Khlebnikov
2016-02-12 21:01 ` [PATCH 02/33] mm: overload get_user_pages() functions Dave Hansen
2016-02-16  8:36   ` Ingo Molnar
2016-02-17 18:15     ` Dave Hansen
2016-02-18 20:15   ` [tip:mm/pkeys] mm/gup: Overload " tip-bot for Dave Hansen
2016-02-12 21:01 ` [PATCH 03/33] mm, gup: switch callers of get_user_pages() to not pass tsk/mm Dave Hansen
2016-02-18 20:16   ` [tip:mm/pkeys] mm/gup: Switch all " tip-bot for Dave Hansen
2016-02-12 21:01 ` [PATCH 04/33] x86, fpu: add placeholder for Processor Trace XSAVE state Dave Hansen
2016-02-18 20:16   ` [tip:mm/pkeys] x86/fpu: Add placeholder for 'Processor Trace' " tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 05/33] x86, pkeys: Add Kconfig option Dave Hansen
2016-02-18 20:16   ` [tip:mm/pkeys] x86/mm/pkeys: " tip-bot for Dave Hansen
2016-02-19 11:27     ` [PATCH] x86/mm/pkeys: Do not enable them by default Borislav Petkov
2016-02-19 17:11       ` Dave Hansen
2016-02-19 17:23         ` Borislav Petkov
2016-02-19 17:49           ` Dave Hansen
2016-02-19 18:31             ` Borislav Petkov
2016-02-12 21:02 ` [PATCH 06/33] x86, pkeys: cpuid bit definition Dave Hansen
2016-02-18 20:17   ` [tip:mm/pkeys] x86/cpufeature, x86/mm/pkeys: Add protection keys related CPUID definitions tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 07/33] x86, pkeys: define new CR4 bit Dave Hansen
2016-02-18 20:17   ` [tip:mm/pkeys] x86/cpu, x86/mm/pkeys: Define " tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 08/33] x86, pkeys: add PKRU xsave fields and data structure(s) Dave Hansen
2016-02-18 20:17   ` [tip:mm/pkeys] x86/fpu, x86/mm/pkeys: Add PKRU xsave fields and data structures tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 09/33] x86, pkeys: PTE bits for storing protection key Dave Hansen
2016-02-18 20:18   ` [tip:mm/pkeys] x86/mm/pkeys: Add " tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 10/33] x86, pkeys: new page fault error code bit: PF_PK Dave Hansen
2016-02-18 20:18   ` [tip:mm/pkeys] x86/mm/pkeys: Add new 'PF_PK' page fault error code bit tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 11/33] x86, pkeys: store protection in high VMA flags Dave Hansen
2016-02-18 20:19   ` [tip:mm/pkeys] mm/core, x86/mm/pkeys: Store protection bits " tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 12/33] x86, pkeys: arch-specific protection bits Dave Hansen
2016-02-18 20:19   ` [tip:mm/pkeys] x86/mm/pkeys: Add arch-specific VMA " tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 13/33] x86, pkeys: pass VMA down in to fault signal generation code Dave Hansen
2016-02-18 20:19   ` [tip:mm/pkeys] x86/mm/pkeys: Pass " tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 14/33] signals, pkeys: notify userspace about protection key faults Dave Hansen
2016-02-18 20:20   ` [tip:mm/pkeys] signals, pkeys: Notify " tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 15/33] x86, pkeys: fill in pkey field in siginfo Dave Hansen
2016-02-18 20:20   ` [tip:mm/pkeys] x86/mm/pkeys: Fill " tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 16/33] x86, pkeys: add functions to fetch PKRU Dave Hansen
2016-02-18 20:21   ` [tip:mm/pkeys] x86/mm/pkeys: Add " tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 17/33] mm: factor out VMA fault permission checking Dave Hansen
2016-02-18 20:21   ` [tip:mm/pkeys] mm/gup: Factor " tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 18/33] x86, mm: simplify get_user_pages() PTE bit handling Dave Hansen
2016-02-18 20:21   ` [tip:mm/pkeys] x86/mm/gup: Simplify " tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 19/33] x86, pkeys: check VMAs and PTEs for protection keys Dave Hansen
2016-02-18 20:22   ` [tip:mm/pkeys] mm/gup, x86/mm/pkeys: Check " tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 20/33] mm: do not enforce PKEY permissions on "foreign" mm access Dave Hansen
2016-02-12 21:02 ` [PATCH 21/33] x86, pkeys: optimize fault handling in access_error() Dave Hansen
2016-02-18 20:23   ` [tip:mm/pkeys] x86/mm/pkeys: Optimize " tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 22/33] x86, pkeys: differentiate instruction fetches Dave Hansen
2016-02-18 20:23   ` [tip:mm/pkeys] mm/core, x86/mm/pkeys: Differentiate " tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 23/33] x86, pkeys: dump PKRU with other kernel registers Dave Hansen
2016-02-18 20:24   ` [tip:mm/pkeys] x86/mm/pkeys: Dump " tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 24/33] x86, pkeys: dump pkey from VMA in /proc/pid/smaps Dave Hansen
2016-02-18 20:24   ` [tip:mm/pkeys] x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 25/33] x86, pkeys: add Kconfig prompt to existing config option Dave Hansen
2016-02-18 20:24   ` [tip:mm/pkeys] x86/mm/pkeys: Add " tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 26/33] x86, pkeys: actually enable Memory Protection Keys in CPU Dave Hansen
2016-02-18 20:25   ` [tip:mm/pkeys] x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 27/33] mm, multi-arch: pass a protection key in to calc_vm_flag_bits() Dave Hansen
2016-02-18 20:25   ` [tip:mm/pkeys] mm/core, arch, powerpc: Pass " tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 28/33] x86, pkeys: add arch_validate_pkey() Dave Hansen
2016-02-18 20:25   ` [tip:mm/pkeys] mm/core, x86/mm/pkeys: Add arch_validate_pkey() tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 29/33] x86: separate out LDT init from context init Dave Hansen
2016-02-18 20:26   ` [tip:mm/pkeys] x86/mm: Factor " tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 30/33] x86, fpu: allow setting of XSAVE state Dave Hansen
2016-02-18 20:26   ` [tip:mm/pkeys] x86/fpu: Allow " tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 31/33] x86, pkeys: allow kernel to modify user pkey rights register Dave Hansen
2016-02-18 20:27   ` [tip:mm/pkeys] x86/mm/pkeys: Allow " tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 32/33] x86, pkeys: create an x86 arch_calc_vm_prot_bits() for VMA flags Dave Hansen
2016-02-18 20:27   ` [tip:mm/pkeys] x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() " tip-bot for Dave Hansen
2016-02-12 21:02 ` [PATCH 33/33] x86, pkeys: execute-only support Dave Hansen
2016-02-17 21:27   ` Kees Cook
2016-02-17 21:33     ` Dave Hansen
2016-02-17 21:36       ` Kees Cook
2016-02-17 22:17     ` Andy Lutomirski
2016-02-17 22:53       ` Dave Hansen
2016-02-18  0:46         ` Andy Lutomirski
2016-02-18 20:27   ` [tip:mm/pkeys] mm/core, x86/mm/pkeys: Add execute-only protection keys support tip-bot for Dave Hansen
2016-02-16  9:29 ` [PATCH 00/33] x86: Memory Protection Keys (v10) Ingo Molnar
