linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v41 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
       [not found] <20201112220135.165028-1-jarkko@kernel.org>
@ 2020-11-12 22:01 ` Jarkko Sakkinen
  2020-11-13 10:25   ` Mel Gorman
                     ` (3 more replies)
  2020-11-12 22:01 ` [PATCH v41 11/24] x86/sgx: Add SGX misc driver interface Jarkko Sakkinen
                   ` (2 subsequent siblings)
  3 siblings, 4 replies; 12+ messages in thread
From: Jarkko Sakkinen @ 2020-11-12 22:01 UTC (permalink / raw)
  To: x86, linux-sgx
  Cc: linux-kernel, Sean Christopherson, linux-mm, Andrew Morton,
	Matthew Wilcox, Mel Gorman, Jethro Beekman, Dave Hansen,
	Jarkko Sakkinen, andriy.shevchenko, asapek, bp, cedric.xing,
	chenalexchen, conradparker, cyhanish, haitao.huang, kai.huang,
	kai.svahn, kmoy, ludloff, luto, nhorman, npmccallum, puiterwijk,
	rientjes, tglx, yaozhangx, mikko.ylinen

From: Sean Christopherson <sean.j.christopherson@intel.com>

Background
==========

1. SGX enclave pages are populated with data by copying from normal memory
   via ioctl() (SGX_IOC_ENCLAVE_ADD_PAGES), which will be added later in
   this series.
2. It is desirable to be able to restrict those normal memory data sources.
   For instance, to ensure that the source data is executable before
   copying data to an executable enclave page.
3. Enclave page permissions are dynamic (just like normal permissions) and
   can be adjusted at runtime with mprotect().

This creates a problem because the original data source may have long since
vanished at the time when enclave page permissions are established (mmap()
or mprotect()).

The solution (elsewhere in this series) is to force enclaves creators to
declare their paging permission *intent* up front to the ioctl().  This
intent can be immediately compared to the source data’s mapping and
rejected if necessary.

The “intent” is also stashed off for later comparison with enclave
PTEs. This ensures that any future mmap()/mprotect() operations
performed by the enclave creator or done on behalf of the enclave
can be compared with the earlier declared permissions.

Problem
=======

There is an existing mmap() hook which allows SGX to perform this
permission comparison at mmap() time.  However, there is no corresponding
->mprotect() hook.

Solution
========

Add a vm_ops->mprotect() hook so that mprotect() operations which are
inconsistent with any page's stashed intent can be rejected by the driver.

Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Jethro Beekman <jethro@fortanix.com> # v40
Acked-by: Dave Hansen <dave.hansen@intel.com> # v40
# Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
---
Changes from v40:
* Move mprotect_fixup() outside of the callback. This keeps mm internals
  outside of the callback.

 include/linux/mm.h | 7 +++++++
 mm/mprotect.c      | 7 +++++++
 2 files changed, 14 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3fb4e..1813fa86b981 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -559,6 +559,13 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	int (*split)(struct vm_area_struct * area, unsigned long addr);
 	int (*mremap)(struct vm_area_struct * area);
+	/*
+	 * Called by mprotect() to make driver-specific permission
+	 * checks before mprotect() is finalised.   The VMA must not
+	 * be modified.  Returns 0 if eprotect() can proceed.
+	 */
+	int (*mprotect)(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end, unsigned long newflags);
 	vm_fault_t (*fault)(struct vm_fault *vmf);
 	vm_fault_t (*huge_fault)(struct vm_fault *vmf,
 			enum page_entry_size pe_size);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 56c02beb6041..ab709023e9aa 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -616,9 +616,16 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 		tmp = vma->vm_end;
 		if (tmp > end)
 			tmp = end;
+
+		if (vma->vm_ops && vma->vm_ops->mprotect)
+			error = vma->vm_ops->mprotect(vma, nstart, tmp, newflags);
+		if (error)
+			goto out;
+
 		error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
 		if (error)
 			goto out;
+
 		nstart = tmp;
 
 		if (nstart < prev->vm_end)
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v41 11/24] x86/sgx: Add SGX misc driver interface
       [not found] <20201112220135.165028-1-jarkko@kernel.org>
  2020-11-12 22:01 ` [PATCH v41 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct Jarkko Sakkinen
@ 2020-11-12 22:01 ` Jarkko Sakkinen
  2020-11-12 22:01 ` [PATCH v41 21/24] x86/sgx: Add a page reclaimer Jarkko Sakkinen
  2020-11-12 22:01 ` [PATCH v41 22/24] x86/sgx: Add ptrace() support for the SGX driver Jarkko Sakkinen
  3 siblings, 0 replies; 12+ messages in thread
From: Jarkko Sakkinen @ 2020-11-12 22:01 UTC (permalink / raw)
  To: x86, linux-sgx
  Cc: linux-kernel, Jarkko Sakkinen, linux-security-module, linux-mm,
	Andrew Morton, Matthew Wilcox, Jethro Beekman, andriy.shevchenko,
	asapek, bp, cedric.xing, chenalexchen, conradparker, cyhanish,
	dave.hansen, haitao.huang, kai.huang, kai.svahn, kmoy, ludloff,
	luto, nhorman, npmccallum, puiterwijk, rientjes,
	sean.j.christopherson, tglx, yaozhangx, mikko.ylinen

Intel(R) SGX is new hardware functionality that can be used by applications
to set aside private regions of code and data called enclaves. New hardware
protects enclave code and data from outside access and modification.

Add a driver that presents a device file and ioctl API to build and manage
enclaves.  Subsequent patches will expend the ioctl()’s functionality.

Cc: linux-security-module@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Tested-by: Jethro Beekman <jethro@fortanix.com> # v40
# Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
# Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
---
Changes from v40:
* Do not check !page in sgx_encl_may_map() while iterating.
* Do not check !vm_private_data in sgx_encl_find(), as the enclave is
  created at VFS open.

Changes from v39:
* Rename /dev/sgx/enclave as /dev/sgx_enclave.
* In the page fault handler, do not check for SGX_ENCL_DEAD. This allows
  to do forensics to the memory of debug enclaves.

 arch/x86/kernel/cpu/sgx/Makefile |   2 +
 arch/x86/kernel/cpu/sgx/driver.c | 112 ++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/driver.h |  16 ++++
 arch/x86/kernel/cpu/sgx/encl.c   | 146 +++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/encl.h   |  60 +++++++++++++
 arch/x86/kernel/cpu/sgx/main.c   |  12 ++-
 6 files changed, 347 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/cpu/sgx/driver.c
 create mode 100644 arch/x86/kernel/cpu/sgx/driver.h
 create mode 100644 arch/x86/kernel/cpu/sgx/encl.c
 create mode 100644 arch/x86/kernel/cpu/sgx/encl.h

diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 79510ce01b3b..3fc451120735 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -1,2 +1,4 @@
 obj-y += \
+	driver.o \
+	encl.o \
 	main.o
diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
new file mode 100644
index 000000000000..248213dea78e
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/driver.c
@@ -0,0 +1,112 @@
+// SPDX-License-Identifier: GPL-2.0
+/*  Copyright(c) 2016-20 Intel Corporation. */
+
+#include <linux/acpi.h>
+#include <linux/miscdevice.h>
+#include <linux/mman.h>
+#include <linux/security.h>
+#include <linux/suspend.h>
+#include <asm/traps.h>
+#include "driver.h"
+#include "encl.h"
+
+static int sgx_open(struct inode *inode, struct file *file)
+{
+	struct sgx_encl *encl;
+
+	encl = kzalloc(sizeof(*encl), GFP_KERNEL);
+	if (!encl)
+		return -ENOMEM;
+
+	xa_init(&encl->page_array);
+	mutex_init(&encl->lock);
+
+	file->private_data = encl;
+
+	return 0;
+}
+
+static int sgx_release(struct inode *inode, struct file *file)
+{
+	struct sgx_encl *encl = file->private_data;
+	struct sgx_encl_page *entry;
+	unsigned long index;
+
+	xa_for_each(&encl->page_array, index, entry) {
+		if (entry->epc_page) {
+			sgx_free_epc_page(entry->epc_page);
+			encl->secs_child_cnt--;
+			entry->epc_page = NULL;
+		}
+
+		kfree(entry);
+	}
+
+	xa_destroy(&encl->page_array);
+
+	if (!encl->secs_child_cnt && encl->secs.epc_page) {
+		sgx_free_epc_page(encl->secs.epc_page);
+		encl->secs.epc_page = NULL;
+	}
+
+	/* Detect EPC page leak's. */
+	WARN_ON_ONCE(encl->secs_child_cnt);
+	WARN_ON_ONCE(encl->secs.epc_page);
+
+	kfree(encl);
+	return 0;
+}
+
+static int sgx_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct sgx_encl *encl = file->private_data;
+	int ret;
+
+	ret = sgx_encl_may_map(encl, vma->vm_start, vma->vm_end, vma->vm_flags);
+	if (ret)
+		return ret;
+
+	vma->vm_ops = &sgx_vm_ops;
+	vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO;
+	vma->vm_private_data = encl;
+
+	return 0;
+}
+
+static unsigned long sgx_get_unmapped_area(struct file *file,
+					   unsigned long addr,
+					   unsigned long len,
+					   unsigned long pgoff,
+					   unsigned long flags)
+{
+	if ((flags & MAP_TYPE) == MAP_PRIVATE)
+		return -EINVAL;
+
+	if (flags & MAP_FIXED)
+		return addr;
+
+	return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+}
+
+static const struct file_operations sgx_encl_fops = {
+	.owner			= THIS_MODULE,
+	.open			= sgx_open,
+	.release		= sgx_release,
+	.mmap			= sgx_mmap,
+	.get_unmapped_area	= sgx_get_unmapped_area,
+};
+
+static struct miscdevice sgx_dev_enclave = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "sgx_enclave",
+	.nodename = "sgx_enclave",
+	.fops = &sgx_encl_fops,
+};
+
+int __init sgx_drv_init(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SGX_LC))
+		return -ENODEV;
+
+	return misc_register(&sgx_dev_enclave);
+}
diff --git a/arch/x86/kernel/cpu/sgx/driver.h b/arch/x86/kernel/cpu/sgx/driver.h
new file mode 100644
index 000000000000..cda9c43b7543
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/driver.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ARCH_SGX_DRIVER_H__
+#define __ARCH_SGX_DRIVER_H__
+
+#include <crypto/hash.h>
+#include <linux/kref.h>
+#include <linux/mmu_notifier.h>
+#include <linux/radix-tree.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/workqueue.h>
+#include "sgx.h"
+
+int sgx_drv_init(void);
+
+#endif /* __ARCH_X86_SGX_DRIVER_H__ */
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
new file mode 100644
index 000000000000..848b17a3a028
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -0,0 +1,146 @@
+// SPDX-License-Identifier: GPL-2.0
+/*  Copyright(c) 2016-20 Intel Corporation. */
+
+#include <linux/lockdep.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/shmem_fs.h>
+#include <linux/suspend.h>
+#include <linux/sched/mm.h>
+#include "arch.h"
+#include "encl.h"
+#include "encls.h"
+#include "sgx.h"
+
+static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
+						unsigned long addr,
+						unsigned long vm_flags)
+{
+	unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
+	struct sgx_encl_page *entry;
+
+	entry = xa_load(&encl->page_array, PFN_DOWN(addr));
+	if (!entry)
+		return ERR_PTR(-EFAULT);
+
+	/*
+	 * Verify that the faulted page has equal or higher build time
+	 * permissions than the VMA permissions (i.e. the subset of {VM_READ,
+	 * VM_WRITE, VM_EXECUTE} in vma->vm_flags).
+	 */
+	if ((entry->vm_max_prot_bits & vm_prot_bits) != vm_prot_bits)
+		return ERR_PTR(-EFAULT);
+
+	/* No page found. */
+	if (!entry->epc_page)
+		return ERR_PTR(-EFAULT);
+
+	/* Entry successfully located. */
+	return entry;
+}
+
+static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
+{
+	unsigned long addr = (unsigned long)vmf->address;
+	struct vm_area_struct *vma = vmf->vma;
+	struct sgx_encl_page *entry;
+	unsigned long phys_addr;
+	struct sgx_encl *encl;
+	vm_fault_t ret;
+
+	encl = vma->vm_private_data;
+
+	mutex_lock(&encl->lock);
+
+	entry = sgx_encl_load_page(encl, addr, vma->vm_flags);
+	if (IS_ERR(entry)) {
+		mutex_unlock(&encl->lock);
+
+		return VM_FAULT_SIGBUS;
+	}
+
+	phys_addr = sgx_get_epc_phys_addr(entry->epc_page);
+
+	ret = vmf_insert_pfn(vma, addr, PFN_DOWN(phys_addr));
+	if (ret != VM_FAULT_NOPAGE) {
+		mutex_unlock(&encl->lock);
+
+		return VM_FAULT_SIGBUS;
+	}
+
+	mutex_unlock(&encl->lock);
+
+	return VM_FAULT_NOPAGE;
+}
+
+/**
+ * sgx_encl_may_map() - Check if a requested VMA mapping is allowed
+ * @encl:		an enclave pointer
+ * @start:		lower bound of the address range, inclusive
+ * @end:		upper bound of the address range, exclusive
+ * @vm_flags:		VMA flags
+ *
+ * Iterate through the enclave pages contained within [@start, @end) to verify
+ * that the permissions requested by a subset of {VM_READ, VM_WRITE, VM_EXEC}
+ * does not contain any permissions that are not contained in the build time
+ * permissions of any of the enclave pages within the given address range.
+ *
+ * An enclave creator must declare the strongest permissions that will be
+ * needed for each enclave page  This ensures that mappings  have the identical
+ * or weaker permissions that the earlier declared permissions.
+ *
+ * Return: 0 on success, -EACCES otherwise
+ */
+int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
+		     unsigned long end, unsigned long vm_flags)
+{
+	unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
+	struct sgx_encl_page *page;
+	unsigned long count = 0;
+	int ret = 0;
+
+	XA_STATE(xas, &encl->page_array, PFN_DOWN(start));
+
+	/*
+	 * Disallow READ_IMPLIES_EXEC tasks as their VMA permissions might
+	 * conflict with the enclave page permissions.
+	 */
+	if (current->personality & READ_IMPLIES_EXEC)
+		return -EACCES;
+
+	mutex_lock(&encl->lock);
+	xas_lock(&xas);
+	xas_for_each(&xas, page, PFN_DOWN(end - 1)) {
+		if (~page->vm_max_prot_bits & vm_prot_bits) {
+			ret = -EACCES;
+			break;
+		}
+
+		/* Reschedule on every XA_CHECK_SCHED iteration. */
+		if (!(++count % XA_CHECK_SCHED)) {
+			xas_pause(&xas);
+			xas_unlock(&xas);
+			mutex_unlock(&encl->lock);
+
+			cond_resched();
+
+			mutex_lock(&encl->lock);
+			xas_lock(&xas);
+		}
+	}
+	xas_unlock(&xas);
+	mutex_unlock(&encl->lock);
+
+	return ret;
+}
+
+static int sgx_vma_mprotect(struct vm_area_struct *vma, unsigned long start,
+			    unsigned long end, unsigned long newflags)
+{
+	return sgx_encl_may_map(vma->vm_private_data, start, end, newflags);
+}
+
+const struct vm_operations_struct sgx_vm_ops = {
+	.fault = sgx_vma_fault,
+	.mprotect = sgx_vma_mprotect,
+};
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
new file mode 100644
index 000000000000..b7e02eab5868
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/**
+ * Copyright(c) 2016-20 Intel Corporation.
+ *
+ * Contains the software defined data structures for enclaves.
+ */
+#ifndef _X86_ENCL_H
+#define _X86_ENCL_H
+
+#include <linux/cpumask.h>
+#include <linux/kref.h>
+#include <linux/list.h>
+#include <linux/mm_types.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/srcu.h>
+#include <linux/workqueue.h>
+#include <linux/xarray.h>
+#include "sgx.h"
+
+struct sgx_encl_page {
+	unsigned long desc;
+	unsigned long vm_max_prot_bits;
+	struct sgx_epc_page *epc_page;
+	struct sgx_encl *encl;
+};
+
+struct sgx_encl {
+	unsigned long base;
+	unsigned long size;
+	unsigned int page_cnt;
+	unsigned int secs_child_cnt;
+	struct mutex lock;
+	struct xarray page_array;
+	struct sgx_encl_page secs;
+};
+
+extern const struct vm_operations_struct sgx_vm_ops;
+
+static inline int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
+				struct vm_area_struct **vma)
+{
+	struct vm_area_struct *result;
+	struct sgx_encl *encl;
+
+	result = find_vma(mm, addr);
+	if (!result || result->vm_ops != &sgx_vm_ops || addr < result->vm_start)
+		return -EINVAL;
+
+	encl = result->vm_private_data;
+	*vma = result;
+
+	return 0;
+}
+
+int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
+		     unsigned long end, unsigned long vm_flags);
+
+#endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 2e53afc288a4..38f2e80cc31a 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -9,6 +9,8 @@
 #include <linux/sched/mm.h>
 #include <linux/sched/signal.h>
 #include <linux/slab.h>
+#include "driver.h"
+#include "encl.h"
 #include "encls.h"
 
 struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
@@ -232,9 +234,10 @@ static bool __init sgx_page_cache_init(void)
 
 static void __init sgx_init(void)
 {
+	int ret;
 	int i;
 
-	if (!boot_cpu_has(X86_FEATURE_SGX))
+	if (!cpu_feature_enabled(X86_FEATURE_SGX))
 		return;
 
 	if (!sgx_page_cache_init())
@@ -243,8 +246,15 @@ static void __init sgx_init(void)
 	if (!sgx_page_reclaimer_init())
 		goto err_page_cache;
 
+	ret = sgx_drv_init();
+	if (ret)
+		goto err_kthread;
+
 	return;
 
+err_kthread:
+	kthread_stop(ksgxd_tsk);
+
 err_page_cache:
 	for (i = 0; i < sgx_nr_epc_sections; i++) {
 		vfree(sgx_epc_sections[i].pages);
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v41 21/24] x86/sgx: Add a page reclaimer
       [not found] <20201112220135.165028-1-jarkko@kernel.org>
  2020-11-12 22:01 ` [PATCH v41 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct Jarkko Sakkinen
  2020-11-12 22:01 ` [PATCH v41 11/24] x86/sgx: Add SGX misc driver interface Jarkko Sakkinen
@ 2020-11-12 22:01 ` Jarkko Sakkinen
  2020-11-12 22:01 ` [PATCH v41 22/24] x86/sgx: Add ptrace() support for the SGX driver Jarkko Sakkinen
  3 siblings, 0 replies; 12+ messages in thread
From: Jarkko Sakkinen @ 2020-11-12 22:01 UTC (permalink / raw)
  To: x86, linux-sgx
  Cc: linux-kernel, Jarkko Sakkinen, Hillf Danton, linux-mm,
	Jethro Beekman, akpm, andriy.shevchenko, asapek, bp, cedric.xing,
	chenalexchen, conradparker, cyhanish, dave.hansen, haitao.huang,
	kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman, npmccallum,
	puiterwijk, rientjes, sean.j.christopherson, tglx, yaozhangx,
	mikko.ylinen

Just like normal RAM, there is a limited amount of enclave memory available
and overcommitting it is a very valuable tool to reduce resource use.
Introduce a simple reclaim mechanism for enclave pages.

In contrast to normal page reclaim, the kernel cannot directly access
enclave memory.  To get around this, the SGX architecture provides a set of
functions to help.  Among other things, these functions copy enclave memory
to and from normal memory, encrypting it and protecting its integrity in
the process.

Implement a page reclaimer by using these functions. Picks victim pages in
LRU fashion from all the enclaves running in the system.  A new kernel
thread (ksgxswapd) reclaims pages in the background based on watermarks,
similar to normal kswapd.

All enclave pages can be reclaimed, architecturally.  But, there are some
limits on this, such as the special SECS metadata page which must be
reclaimed last.  The page version array (used to mitigate replaying old
reclaimed pages) is also architecturally reclaimable, but not yet
implemented.  The end result is that the vast majority of enclave pages are
currently reclaimable.

Cc: Hillf Danton <hdanton@sina.com>
Cc: linux-mm@kvack.org
Acked-by: Jethro Beekman <jethro@fortanix.com> # v40
# Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
# Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
---
Changes from v39:
* Remove SGX_ENCL_DEAD checks from the page reclaimer. They are no
  longer needed as enclave contents are deleted only when it is released.

 arch/x86/kernel/cpu/sgx/driver.c |  59 ++--
 arch/x86/kernel/cpu/sgx/encl.c   | 483 ++++++++++++++++++++++++++++++-
 arch/x86/kernel/cpu/sgx/encl.h   |  51 ++++
 arch/x86/kernel/cpu/sgx/ioctl.c  |  89 +++++-
 arch/x86/kernel/cpu/sgx/main.c   | 466 +++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/sgx.h    |  13 +
 6 files changed, 1134 insertions(+), 27 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
index f618a04c4224..f2eac41bb4ff 100644
--- a/arch/x86/kernel/cpu/sgx/driver.c
+++ b/arch/x86/kernel/cpu/sgx/driver.c
@@ -17,13 +17,24 @@ u32 sgx_misc_reserved_mask;
 static int sgx_open(struct inode *inode, struct file *file)
 {
 	struct sgx_encl *encl;
+	int ret;
 
 	encl = kzalloc(sizeof(*encl), GFP_KERNEL);
 	if (!encl)
 		return -ENOMEM;
 
+	kref_init(&encl->refcount);
 	xa_init(&encl->page_array);
 	mutex_init(&encl->lock);
+	INIT_LIST_HEAD(&encl->va_pages);
+	INIT_LIST_HEAD(&encl->mm_list);
+	spin_lock_init(&encl->mm_lock);
+
+	ret = init_srcu_struct(&encl->srcu);
+	if (ret) {
+		kfree(encl);
+		return ret;
+	}
 
 	file->private_data = encl;
 
@@ -33,31 +44,37 @@ static int sgx_open(struct inode *inode, struct file *file)
 static int sgx_release(struct inode *inode, struct file *file)
 {
 	struct sgx_encl *encl = file->private_data;
-	struct sgx_encl_page *entry;
-	unsigned long index;
-
-	xa_for_each(&encl->page_array, index, entry) {
-		if (entry->epc_page) {
-			sgx_free_epc_page(entry->epc_page);
-			encl->secs_child_cnt--;
-			entry->epc_page = NULL;
+	struct sgx_encl_mm *encl_mm;
+
+	/*
+	 * Drain the remaining mm_list entries. At this point the list contains
+	 * entries for processes, which have closed the enclave file but have
+	 * not exited yet. The processes, which have exited, are gone from the
+	 * list by sgx_mmu_notifier_release().
+	 */
+	for ( ; ; )  {
+		spin_lock(&encl->mm_lock);
+
+		if (list_empty(&encl->mm_list)) {
+			encl_mm = NULL;
+		} else {
+			encl_mm = list_first_entry(&encl->mm_list,
+						   struct sgx_encl_mm, list);
+			list_del_rcu(&encl_mm->list);
 		}
 
-		kfree(entry);
-	}
+		spin_unlock(&encl->mm_lock);
 
-	xa_destroy(&encl->page_array);
+		/* The enclave is no longer mapped by any mm. */
+		if (!encl_mm)
+			break;
 
-	if (!encl->secs_child_cnt && encl->secs.epc_page) {
-		sgx_free_epc_page(encl->secs.epc_page);
-		encl->secs.epc_page = NULL;
+		synchronize_srcu(&encl->srcu);
+		mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
+		kfree(encl_mm);
 	}
 
-	/* Detect EPC page leak's. */
-	WARN_ON_ONCE(encl->secs_child_cnt);
-	WARN_ON_ONCE(encl->secs.epc_page);
-
-	kfree(encl);
+	kref_put(&encl->refcount, sgx_encl_release);
 	return 0;
 }
 
@@ -70,6 +87,10 @@ static int sgx_mmap(struct file *file, struct vm_area_struct *vma)
 	if (ret)
 		return ret;
 
+	ret = sgx_encl_mm_add(encl, vma->vm_mm);
+	if (ret)
+		return ret;
+
 	vma->vm_ops = &sgx_vm_ops;
 	vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO;
 	vma->vm_private_data = encl;
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index e60c504f869f..141424657718 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -12,11 +12,90 @@
 #include "encls.h"
 #include "sgx.h"
 
+/*
+ * ELDU: Load an EPC page as unblocked. For more info, see "OS Management of EPC
+ * Pages" in the SDM.
+ */
+static int __sgx_encl_eldu(struct sgx_encl_page *encl_page,
+			   struct sgx_epc_page *epc_page,
+			   struct sgx_epc_page *secs_page)
+{
+	unsigned long va_offset = encl_page->desc & SGX_ENCL_PAGE_VA_OFFSET_MASK;
+	struct sgx_encl *encl = encl_page->encl;
+	struct sgx_pageinfo pginfo;
+	struct sgx_backing b;
+	pgoff_t page_index;
+	int ret;
+
+	if (secs_page)
+		page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);
+	else
+		page_index = PFN_DOWN(encl->size);
+
+	ret = sgx_encl_get_backing(encl, page_index, &b);
+	if (ret)
+		return ret;
+
+	pginfo.addr = encl_page->desc & PAGE_MASK;
+	pginfo.contents = (unsigned long)kmap_atomic(b.contents);
+	pginfo.metadata = (unsigned long)kmap_atomic(b.pcmd) +
+			  b.pcmd_offset;
+
+	if (secs_page)
+		pginfo.secs = (u64)sgx_get_epc_virt_addr(secs_page);
+	else
+		pginfo.secs = 0;
+
+	ret = __eldu(&pginfo, sgx_get_epc_virt_addr(epc_page),
+		     sgx_get_epc_virt_addr(encl_page->va_page->epc_page) + va_offset);
+	if (ret) {
+		if (encls_failed(ret))
+			ENCLS_WARN(ret, "ELDU");
+
+		ret = -EFAULT;
+	}
+
+	kunmap_atomic((void *)(unsigned long)(pginfo.metadata - b.pcmd_offset));
+	kunmap_atomic((void *)(unsigned long)pginfo.contents);
+
+	sgx_encl_put_backing(&b, false);
+
+	return ret;
+}
+
+static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page,
+					  struct sgx_epc_page *secs_page)
+{
+
+	unsigned long va_offset = encl_page->desc & SGX_ENCL_PAGE_VA_OFFSET_MASK;
+	struct sgx_encl *encl = encl_page->encl;
+	struct sgx_epc_page *epc_page;
+	int ret;
+
+	epc_page = sgx_alloc_epc_page(encl_page, false);
+	if (IS_ERR(epc_page))
+		return epc_page;
+
+	ret = __sgx_encl_eldu(encl_page, epc_page, secs_page);
+	if (ret) {
+		sgx_free_epc_page(epc_page);
+		return ERR_PTR(ret);
+	}
+
+	sgx_free_va_slot(encl_page->va_page, va_offset);
+	list_move(&encl_page->va_page->list, &encl->va_pages);
+	encl_page->desc &= ~SGX_ENCL_PAGE_VA_OFFSET_MASK;
+	encl_page->epc_page = epc_page;
+
+	return epc_page;
+}
+
 static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
 						unsigned long addr,
 						unsigned long vm_flags)
 {
 	unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
+	struct sgx_epc_page *epc_page;
 	struct sgx_encl_page *entry;
 
 	entry = xa_load(&encl->page_array, PFN_DOWN(addr));
@@ -31,11 +110,27 @@ static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
 	if ((entry->vm_max_prot_bits & vm_prot_bits) != vm_prot_bits)
 		return ERR_PTR(-EFAULT);
 
-	/* No page found. */
-	if (!entry->epc_page)
-		return ERR_PTR(-EFAULT);
-
 	/* Entry successfully located. */
+	if (entry->epc_page) {
+		if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
+			return ERR_PTR(-EBUSY);
+
+		return entry;
+	}
+
+	if (!(encl->secs.epc_page)) {
+		epc_page = sgx_encl_eldu(&encl->secs, NULL);
+		if (IS_ERR(epc_page))
+			return ERR_CAST(epc_page);
+	}
+
+	epc_page = sgx_encl_eldu(entry, encl->secs.epc_page);
+	if (IS_ERR(epc_page))
+		return ERR_CAST(epc_page);
+
+	encl->secs_child_cnt++;
+	sgx_mark_page_reclaimable(entry->epc_page);
+
 	return entry;
 }
 
@@ -51,12 +146,23 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
 
 	encl = vma->vm_private_data;
 
+	/*
+	 * It's very unlikely but possible that allocating memory for the
+	 * mm_list entry of a forked process failed in sgx_vma_open(). When
+	 * this happens, vm_private_data is set to NULL.
+	 */
+	if (unlikely(!encl))
+		return VM_FAULT_SIGBUS;
+
 	mutex_lock(&encl->lock);
 
 	entry = sgx_encl_load_page(encl, addr, vma->vm_flags);
 	if (IS_ERR(entry)) {
 		mutex_unlock(&encl->lock);
 
+		if (PTR_ERR(entry) == -EBUSY)
+			return VM_FAULT_NOPAGE;
+
 		return VM_FAULT_SIGBUS;
 	}
 
@@ -76,11 +182,29 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
 		return VM_FAULT_SIGBUS;
 	}
 
+	sgx_encl_test_and_clear_young(vma->vm_mm, entry);
 	mutex_unlock(&encl->lock);
 
 	return VM_FAULT_NOPAGE;
 }
 
+static void sgx_vma_open(struct vm_area_struct *vma)
+{
+	struct sgx_encl *encl = vma->vm_private_data;
+
+	/*
+	 * It's possible but unlikely that vm_private_data is NULL. This can
+	 * happen in a grandchild of a process, when sgx_encl_mm_add() had
+	 * failed to allocate memory in this callback.
+	 */
+	if (unlikely(!encl))
+		return;
+
+	if (sgx_encl_mm_add(encl, vma->vm_mm))
+		vma->vm_private_data = NULL;
+}
+
+
 /**
  * sgx_encl_may_map() - Check if a requested VMA mapping is allowed
  * @encl:		an enclave pointer
@@ -151,4 +275,355 @@ static int sgx_vma_mprotect(struct vm_area_struct *vma, unsigned long start,
 const struct vm_operations_struct sgx_vm_ops = {
 	.fault = sgx_vma_fault,
 	.mprotect = sgx_vma_mprotect,
+	.open = sgx_vma_open,
+};
+
+/**
+ * sgx_encl_release - Destroy an enclave instance
+ * @kref:	address of a kref inside &sgx_encl
+ *
+ * Used together with kref_put(). Frees all the resources associated with the
+ * enclave and the instance itself.
+ */
+void sgx_encl_release(struct kref *ref)
+{
+	struct sgx_encl *encl = container_of(ref, struct sgx_encl, refcount);
+	struct sgx_va_page *va_page;
+	struct sgx_encl_page *entry;
+	unsigned long index;
+
+	xa_for_each(&encl->page_array, index, entry) {
+		if (entry->epc_page) {
+			/*
+			 * The page and its radix tree entry cannot be freed
+			 * if the page is being held by the reclaimer.
+			 */
+			if (sgx_unmark_page_reclaimable(entry->epc_page))
+				continue;
+
+			sgx_free_epc_page(entry->epc_page);
+			encl->secs_child_cnt--;
+			entry->epc_page = NULL;
+		}
+
+		kfree(entry);
+	}
+
+	xa_destroy(&encl->page_array);
+
+	if (!encl->secs_child_cnt && encl->secs.epc_page) {
+		sgx_free_epc_page(encl->secs.epc_page);
+		encl->secs.epc_page = NULL;
+	}
+
+	while (!list_empty(&encl->va_pages)) {
+		va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
+					   list);
+		list_del(&va_page->list);
+		sgx_free_epc_page(va_page->epc_page);
+		kfree(va_page);
+	}
+
+	if (encl->backing)
+		fput(encl->backing);
+
+	cleanup_srcu_struct(&encl->srcu);
+
+	WARN_ON_ONCE(!list_empty(&encl->mm_list));
+
+	/* Detect EPC page leak's. */
+	WARN_ON_ONCE(encl->secs_child_cnt);
+	WARN_ON_ONCE(encl->secs.epc_page);
+
+	kfree(encl);
+}
+
+/*
+ * 'mm' is exiting and no longer needs mmu notifications.
+ */
+static void sgx_mmu_notifier_release(struct mmu_notifier *mn,
+				     struct mm_struct *mm)
+{
+	struct sgx_encl_mm *encl_mm = container_of(mn, struct sgx_encl_mm, mmu_notifier);
+	struct sgx_encl_mm *tmp = NULL;
+
+	/*
+	 * The enclave itself can remove encl_mm.  Note, objects can't be moved
+	 * off an RCU protected list, but deletion is ok.
+	 */
+	spin_lock(&encl_mm->encl->mm_lock);
+	list_for_each_entry(tmp, &encl_mm->encl->mm_list, list) {
+		if (tmp == encl_mm) {
+			list_del_rcu(&encl_mm->list);
+			break;
+		}
+	}
+	spin_unlock(&encl_mm->encl->mm_lock);
+
+	if (tmp == encl_mm) {
+		synchronize_srcu(&encl_mm->encl->srcu);
+		mmu_notifier_put(mn);
+	}
+}
+
+static void sgx_mmu_notifier_free(struct mmu_notifier *mn)
+{
+	struct sgx_encl_mm *encl_mm = container_of(mn, struct sgx_encl_mm, mmu_notifier);
+
+	kfree(encl_mm);
+}
+
+static const struct mmu_notifier_ops sgx_mmu_notifier_ops = {
+	.release		= sgx_mmu_notifier_release,
+	.free_notifier		= sgx_mmu_notifier_free,
 };
+
+static struct sgx_encl_mm *sgx_encl_find_mm(struct sgx_encl *encl,
+					    struct mm_struct *mm)
+{
+	struct sgx_encl_mm *encl_mm = NULL;
+	struct sgx_encl_mm *tmp;
+	int idx;
+
+	idx = srcu_read_lock(&encl->srcu);
+
+	list_for_each_entry_rcu(tmp, &encl->mm_list, list) {
+		if (tmp->mm == mm) {
+			encl_mm = tmp;
+			break;
+		}
+	}
+
+	srcu_read_unlock(&encl->srcu, idx);
+
+	return encl_mm;
+}
+
+int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
+{
+	struct sgx_encl_mm *encl_mm;
+	int ret;
+
+	/*
+	 * Even though a single enclave may be mapped into an mm more than once,
+	 * each 'mm' only appears once on encl->mm_list. This is guaranteed by
+	 * holding the mm's mmap lock for write before an mm can be added or
+	 * remove to an encl->mm_list.
+	 */
+	mmap_assert_write_locked(mm);
+
+	/*
+	 * It's possible that an entry already exists in the mm_list, because it
+	 * is removed only on VFS release or process exit.
+	 */
+	if (sgx_encl_find_mm(encl, mm))
+		return 0;
+
+	encl_mm = kzalloc(sizeof(*encl_mm), GFP_KERNEL);
+	if (!encl_mm)
+		return -ENOMEM;
+
+	encl_mm->encl = encl;
+	encl_mm->mm = mm;
+	encl_mm->mmu_notifier.ops = &sgx_mmu_notifier_ops;
+
+	ret = __mmu_notifier_register(&encl_mm->mmu_notifier, mm);
+	if (ret) {
+		kfree(encl_mm);
+		return ret;
+	}
+
+	spin_lock(&encl->mm_lock);
+	list_add_rcu(&encl_mm->list, &encl->mm_list);
+	/* Pairs with smp_rmb() in sgx_reclaimer_block(). */
+	smp_wmb();
+	encl->mm_list_version++;
+	spin_unlock(&encl->mm_lock);
+
+	return 0;
+}
+
+static struct page *sgx_encl_get_backing_page(struct sgx_encl *encl,
+					      pgoff_t index)
+{
+	struct inode *inode = encl->backing->f_path.dentry->d_inode;
+	struct address_space *mapping = inode->i_mapping;
+	gfp_t gfpmask = mapping_gfp_mask(mapping);
+
+	return shmem_read_mapping_page_gfp(mapping, index, gfpmask);
+}
+
+/**
+ * sgx_encl_get_backing() - Pin the backing storage
+ * @encl:	an enclave pointer
+ * @page_index:	enclave page index
+ * @backing:	data for accessing backing storage for the page
+ *
+ * Pin the backing storage pages for storing the encrypted contents and Paging
+ * Crypto MetaData (PCMD) of an enclave page.
+ *
+ * Return:
+ *   0 on success,
+ *   -errno otherwise.
+ */
+int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_index,
+			 struct sgx_backing *backing)
+{
+	pgoff_t pcmd_index = PFN_DOWN(encl->size) + 1 + (page_index >> 5);
+	struct page *contents;
+	struct page *pcmd;
+
+	contents = sgx_encl_get_backing_page(encl, page_index);
+	if (IS_ERR(contents))
+		return PTR_ERR(contents);
+
+	pcmd = sgx_encl_get_backing_page(encl, pcmd_index);
+	if (IS_ERR(pcmd)) {
+		put_page(contents);
+		return PTR_ERR(pcmd);
+	}
+
+	backing->page_index = page_index;
+	backing->contents = contents;
+	backing->pcmd = pcmd;
+	backing->pcmd_offset =
+		(page_index & (PAGE_SIZE / sizeof(struct sgx_pcmd) - 1)) *
+		sizeof(struct sgx_pcmd);
+
+	return 0;
+}
+
+/**
+ * sgx_encl_put_backing() - Unpin the backing storage
+ * @backing:	data for accessing backing storage for the page
+ * @do_write:	mark pages dirty
+ */
+void sgx_encl_put_backing(struct sgx_backing *backing, bool do_write)
+{
+	if (do_write) {
+		set_page_dirty(backing->pcmd);
+		set_page_dirty(backing->contents);
+	}
+
+	put_page(backing->pcmd);
+	put_page(backing->contents);
+}
+
+static int sgx_encl_test_and_clear_young_cb(pte_t *ptep, unsigned long addr,
+					    void *data)
+{
+	pte_t pte;
+	int ret;
+
+	ret = pte_young(*ptep);
+	if (ret) {
+		pte = pte_mkold(*ptep);
+		set_pte_at((struct mm_struct *)data, addr, ptep, pte);
+	}
+
+	return ret;
+}
+
+/**
+ * sgx_encl_test_and_clear_young() - Test and reset the accessed bit
+ * @mm:		mm_struct that is checked
+ * @page:	enclave page to be tested for recent access
+ *
+ * Checks the Access (A) bit from the PTE corresponding to the enclave page and
+ * clears it.
+ *
+ * Return: 1 if the page has been recently accessed and 0 if not.
+ */
+int sgx_encl_test_and_clear_young(struct mm_struct *mm,
+				  struct sgx_encl_page *page)
+{
+	unsigned long addr = page->desc & PAGE_MASK;
+	struct sgx_encl *encl = page->encl;
+	struct vm_area_struct *vma;
+	int ret;
+
+	ret = sgx_encl_find(mm, addr, &vma);
+	if (ret)
+		return 0;
+
+	if (encl != vma->vm_private_data)
+		return 0;
+
+	ret = apply_to_page_range(vma->vm_mm, addr, PAGE_SIZE,
+				  sgx_encl_test_and_clear_young_cb, vma->vm_mm);
+	if (ret < 0)
+		return 0;
+
+	return ret;
+}
+
+/**
+ * sgx_alloc_va_page() - Allocate a Version Array (VA) page
+ *
+ * Allocate a free EPC page and convert it to a Version Array (VA) page.
+ *
+ * Return:
+ *   a VA page,
+ *   -errno otherwise
+ */
+struct sgx_epc_page *sgx_alloc_va_page(void)
+{
+	struct sgx_epc_page *epc_page;
+	int ret;
+
+	epc_page = sgx_alloc_epc_page(NULL, true);
+	if (IS_ERR(epc_page))
+		return ERR_CAST(epc_page);
+
+	ret = __epa(sgx_get_epc_virt_addr(epc_page));
+	if (ret) {
+		WARN_ONCE(1, "EPA returned %d (0x%x)", ret, ret);
+		sgx_free_epc_page(epc_page);
+		return ERR_PTR(-EFAULT);
+	}
+
+	return epc_page;
+}
+
+/**
+ * sgx_alloc_va_slot - allocate a VA slot
+ * @va_page:	a &struct sgx_va_page instance
+ *
+ * Allocates a slot from a &struct sgx_va_page instance.
+ *
+ * Return: offset of the slot inside the VA page
+ */
+unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page)
+{
+	int slot = find_first_zero_bit(va_page->slots, SGX_VA_SLOT_COUNT);
+
+	if (slot < SGX_VA_SLOT_COUNT)
+		set_bit(slot, va_page->slots);
+
+	return slot << 3;
+}
+
+/**
+ * sgx_free_va_slot - free a VA slot
+ * @va_page:	a &struct sgx_va_page instance
+ * @offset:	offset of the slot inside the VA page
+ *
+ * Frees a slot from a &struct sgx_va_page instance.
+ */
+void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset)
+{
+	clear_bit(offset >> 3, va_page->slots);
+}
+
+/**
+ * sgx_va_page_full - is the VA page full?
+ * @va_page:	a &struct sgx_va_page instance
+ *
+ * Return: true if all slots have been taken
+ */
+bool sgx_va_page_full(struct sgx_va_page *va_page)
+{
+	int slot = find_first_zero_bit(va_page->slots, SGX_VA_SLOT_COUNT);
+
+	return slot == SGX_VA_SLOT_COUNT;
+}
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index 34088659cee6..c1982a9ab8df 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -19,11 +19,18 @@
 #include <linux/xarray.h>
 #include "sgx.h"
 
+/* 'desc' bits holding the offset in the VA (version array) page. */
+#define SGX_ENCL_PAGE_VA_OFFSET_MASK	GENMASK_ULL(11, 3)
+
+/* 'desc' bit marking that the page is being reclaimed. */
+#define SGX_ENCL_PAGE_BEING_RECLAIMED	BIT(3)
+
 struct sgx_encl_page {
 	unsigned long desc;
 	unsigned long vm_max_prot_bits;
 	struct sgx_epc_page *epc_page;
 	struct sgx_encl *encl;
+	struct sgx_va_page *va_page;
 };
 
 enum sgx_encl_flags {
@@ -33,6 +40,13 @@ enum sgx_encl_flags {
 	SGX_ENCL_INITIALIZED	= BIT(3),
 };
 
+struct sgx_encl_mm {
+	struct sgx_encl *encl;
+	struct mm_struct *mm;
+	struct list_head list;
+	struct mmu_notifier mmu_notifier;
+};
+
 struct sgx_encl {
 	unsigned long base;
 	unsigned long size;
@@ -44,6 +58,30 @@ struct sgx_encl {
 	struct sgx_encl_page secs;
 	unsigned long attributes;
 	unsigned long attributes_mask;
+
+	cpumask_t cpumask;
+	struct file *backing;
+	struct kref refcount;
+	struct list_head va_pages;
+	unsigned long mm_list_version;
+	struct list_head mm_list;
+	spinlock_t mm_lock;
+	struct srcu_struct srcu;
+};
+
+#define SGX_VA_SLOT_COUNT 512
+
+struct sgx_va_page {
+	struct sgx_epc_page *epc_page;
+	DECLARE_BITMAP(slots, SGX_VA_SLOT_COUNT);
+	struct list_head list;
+};
+
+struct sgx_backing {
+	pgoff_t page_index;
+	struct page *contents;
+	struct page *pcmd;
+	unsigned long pcmd_offset;
 };
 
 extern const struct vm_operations_struct sgx_vm_ops;
@@ -67,4 +105,17 @@ static inline int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
 int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
 		     unsigned long end, unsigned long vm_flags);
 
+void sgx_encl_release(struct kref *ref);
+int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm);
+int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_index,
+			 struct sgx_backing *backing);
+void sgx_encl_put_backing(struct sgx_backing *backing, bool do_write);
+int sgx_encl_test_and_clear_young(struct mm_struct *mm,
+				  struct sgx_encl_page *page);
+
+struct sgx_epc_page *sgx_alloc_va_page(void);
+unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page);
+void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset);
+bool sgx_va_page_full(struct sgx_va_page *va_page);
+
 #endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 0ba0e670e2f0..6d37117ac8a0 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -16,20 +16,77 @@
 #include "encl.h"
 #include "encls.h"
 
+static struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl)
+{
+	struct sgx_va_page *va_page = NULL;
+	void *err;
+
+	BUILD_BUG_ON(SGX_VA_SLOT_COUNT !=
+		(SGX_ENCL_PAGE_VA_OFFSET_MASK >> 3) + 1);
+
+	if (!(encl->page_cnt % SGX_VA_SLOT_COUNT)) {
+		va_page = kzalloc(sizeof(*va_page), GFP_KERNEL);
+		if (!va_page)
+			return ERR_PTR(-ENOMEM);
+
+		va_page->epc_page = sgx_alloc_va_page();
+		if (IS_ERR(va_page->epc_page)) {
+			err = ERR_CAST(va_page->epc_page);
+			kfree(va_page);
+			return err;
+		}
+
+		WARN_ON_ONCE(encl->page_cnt % SGX_VA_SLOT_COUNT);
+	}
+	encl->page_cnt++;
+	return va_page;
+}
+
+static void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page)
+{
+	encl->page_cnt--;
+
+	if (va_page) {
+		sgx_free_epc_page(va_page->epc_page);
+		list_del(&va_page->list);
+		kfree(va_page);
+	}
+}
+
 static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
 {
 	struct sgx_epc_page *secs_epc;
+	struct sgx_va_page *va_page;
 	struct sgx_pageinfo pginfo;
 	struct sgx_secinfo secinfo;
 	unsigned long encl_size;
+	struct file *backing;
 	long ret;
 
+	va_page = sgx_encl_grow(encl);
+	if (IS_ERR(va_page))
+		return PTR_ERR(va_page);
+	else if (va_page)
+		list_add(&va_page->list, &encl->va_pages);
+	/* else the tail page of the VA page list had free slots. */
+
 	/* The extra page goes to SECS. */
 	encl_size = secs->size + PAGE_SIZE;
 
-	secs_epc = __sgx_alloc_epc_page();
-	if (IS_ERR(secs_epc))
-		return PTR_ERR(secs_epc);
+	backing = shmem_file_setup("SGX backing", encl_size + (encl_size >> 5),
+				   VM_NORESERVE);
+	if (IS_ERR(backing)) {
+		ret = PTR_ERR(backing);
+		goto err_out_shrink;
+	}
+
+	encl->backing = backing;
+
+	secs_epc = sgx_alloc_epc_page(&encl->secs, true);
+	if (IS_ERR(secs_epc)) {
+		ret = PTR_ERR(secs_epc);
+		goto err_out_backing;
+	}
 
 	encl->secs.epc_page = secs_epc;
 
@@ -63,6 +120,13 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
 	sgx_free_epc_page(encl->secs.epc_page);
 	encl->secs.epc_page = NULL;
 
+err_out_backing:
+	fput(encl->backing);
+	encl->backing = NULL;
+
+err_out_shrink:
+	sgx_encl_shrink(encl, va_page);
+
 	return ret;
 }
 
@@ -228,21 +292,35 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
 {
 	struct sgx_encl_page *encl_page;
 	struct sgx_epc_page *epc_page;
+	struct sgx_va_page *va_page;
 	int ret;
 
 	encl_page = sgx_encl_page_alloc(encl, offset, secinfo->flags);
 	if (IS_ERR(encl_page))
 		return PTR_ERR(encl_page);
 
-	epc_page = __sgx_alloc_epc_page();
+	epc_page = sgx_alloc_epc_page(encl_page, true);
 	if (IS_ERR(epc_page)) {
 		kfree(encl_page);
 		return PTR_ERR(epc_page);
 	}
 
+	va_page = sgx_encl_grow(encl);
+	if (IS_ERR(va_page)) {
+		ret = PTR_ERR(va_page);
+		goto err_out_free;
+	}
+
 	mmap_read_lock(current->mm);
 	mutex_lock(&encl->lock);
 
+	/*
+	 * Adding to encl->va_pages must be done under encl->lock.  Ditto for
+	 * deleting (via sgx_encl_shrink()) in the error path.
+	 */
+	if (va_page)
+		list_add(&va_page->list, &encl->va_pages);
+
 	/*
 	 * Insert prior to EADD in case of OOM.  EADD modifies MRENCLAVE, i.e.
 	 * can't be gracefully unwound, while failure on EADD/EXTEND is limited
@@ -273,6 +351,7 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
 			goto err_out;
 	}
 
+	sgx_mark_page_reclaimable(encl_page->epc_page);
 	mutex_unlock(&encl->lock);
 	mmap_read_unlock(current->mm);
 	return ret;
@@ -281,9 +360,11 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
 	xa_erase(&encl->page_array, PFN_DOWN(encl_page->desc));
 
 err_out_unlock:
+	sgx_encl_shrink(encl, va_page);
 	mutex_unlock(&encl->lock);
 	mmap_read_unlock(current->mm);
 
+err_out_free:
 	sgx_free_epc_page(epc_page);
 	kfree(encl_page);
 
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 38f2e80cc31a..3426785df457 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -16,6 +16,15 @@
 struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
 static int sgx_nr_epc_sections;
 static struct task_struct *ksgxd_tsk;
+static DECLARE_WAIT_QUEUE_HEAD(ksgxd_waitq);
+
+/*
+ * These variables are part of the state of the reclaimer, and must be accessed
+ * with sgx_reclaimer_lock acquired.
+ */
+static LIST_HEAD(sgx_active_page_list);
+
+static DEFINE_SPINLOCK(sgx_reclaimer_lock);
 
 /*
  * Reset dirty EPC pages to uninitialized state. Laundry can be left with SECS
@@ -50,6 +59,348 @@ static void sgx_sanitize_section(struct sgx_epc_section *section)
 	list_splice(&dirty, &section->laundry_list);
 }
 
+static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page)
+{
+	struct sgx_encl_page *page = epc_page->owner;
+	struct sgx_encl *encl = page->encl;
+	struct sgx_encl_mm *encl_mm;
+	bool ret = true;
+	int idx;
+
+	idx = srcu_read_lock(&encl->srcu);
+
+	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
+		if (!mmget_not_zero(encl_mm->mm))
+			continue;
+
+		mmap_read_lock(encl_mm->mm);
+		ret = !sgx_encl_test_and_clear_young(encl_mm->mm, page);
+		mmap_read_unlock(encl_mm->mm);
+
+		mmput_async(encl_mm->mm);
+
+		if (!ret)
+			break;
+	}
+
+	srcu_read_unlock(&encl->srcu, idx);
+
+	if (!ret)
+		return false;
+
+	return true;
+}
+
+static void sgx_reclaimer_block(struct sgx_epc_page *epc_page)
+{
+	struct sgx_encl_page *page = epc_page->owner;
+	unsigned long addr = page->desc & PAGE_MASK;
+	struct sgx_encl *encl = page->encl;
+	unsigned long mm_list_version;
+	struct sgx_encl_mm *encl_mm;
+	struct vm_area_struct *vma;
+	int idx, ret;
+
+	do {
+		mm_list_version = encl->mm_list_version;
+
+		/* Pairs with smp_rmb() in sgx_encl_mm_add(). */
+		smp_rmb();
+
+		idx = srcu_read_lock(&encl->srcu);
+
+		list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
+			if (!mmget_not_zero(encl_mm->mm))
+				continue;
+
+			mmap_read_lock(encl_mm->mm);
+
+			ret = sgx_encl_find(encl_mm->mm, addr, &vma);
+			if (!ret && encl == vma->vm_private_data)
+				zap_vma_ptes(vma, addr, PAGE_SIZE);
+
+			mmap_read_unlock(encl_mm->mm);
+
+			mmput_async(encl_mm->mm);
+		}
+
+		srcu_read_unlock(&encl->srcu, idx);
+	} while (unlikely(encl->mm_list_version != mm_list_version));
+
+	mutex_lock(&encl->lock);
+
+	ret = __eblock(sgx_get_epc_virt_addr(epc_page));
+	if (encls_failed(ret))
+		ENCLS_WARN(ret, "EBLOCK");
+
+	mutex_unlock(&encl->lock);
+}
+
+static int __sgx_encl_ewb(struct sgx_epc_page *epc_page, void *va_slot,
+			  struct sgx_backing *backing)
+{
+	struct sgx_pageinfo pginfo;
+	int ret;
+
+	pginfo.addr = 0;
+	pginfo.secs = 0;
+
+	pginfo.contents = (unsigned long)kmap_atomic(backing->contents);
+	pginfo.metadata = (unsigned long)kmap_atomic(backing->pcmd) +
+			  backing->pcmd_offset;
+
+	ret = __ewb(&pginfo, sgx_get_epc_virt_addr(epc_page), va_slot);
+
+	kunmap_atomic((void *)(unsigned long)(pginfo.metadata -
+					      backing->pcmd_offset));
+	kunmap_atomic((void *)(unsigned long)pginfo.contents);
+
+	return ret;
+}
+
+static void sgx_ipi_cb(void *info)
+{
+}
+
+static const cpumask_t *sgx_encl_ewb_cpumask(struct sgx_encl *encl)
+{
+	cpumask_t *cpumask = &encl->cpumask;
+	struct sgx_encl_mm *encl_mm;
+	int idx;
+
+	/*
+	 * Can race with sgx_encl_mm_add(), but ETRACK has already been
+	 * executed, which means that the CPUs running in the new mm will enter
+	 * into the enclave with a fresh epoch.
+	 */
+	cpumask_clear(cpumask);
+
+	idx = srcu_read_lock(&encl->srcu);
+
+	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
+		if (!mmget_not_zero(encl_mm->mm))
+			continue;
+
+		cpumask_or(cpumask, cpumask, mm_cpumask(encl_mm->mm));
+
+		mmput_async(encl_mm->mm);
+	}
+
+	srcu_read_unlock(&encl->srcu, idx);
+
+	return cpumask;
+}
+
+/*
+ * Swap page to the regular memory transformed to the blocked state by using
+ * EBLOCK, which means that it can no loger be referenced (no new TLB entries).
+ *
+ * The first trial just tries to write the page assuming that some other thread
+ * has reset the count for threads inside the enlave by using ETRACK, and
+ * previous thread count has been zeroed out. The second trial calls ETRACK
+ * before EWB. If that fails we kick all the HW threads out, and then do EWB,
+ * which should be guaranteed the succeed.
+ */
+static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
+			 struct sgx_backing *backing)
+{
+	struct sgx_encl_page *encl_page = epc_page->owner;
+	struct sgx_encl *encl = encl_page->encl;
+	struct sgx_va_page *va_page;
+	unsigned int va_offset;
+	void *va_slot;
+	int ret;
+
+	encl_page->desc &= ~SGX_ENCL_PAGE_BEING_RECLAIMED;
+
+	va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
+				   list);
+	va_offset = sgx_alloc_va_slot(va_page);
+	va_slot = sgx_get_epc_virt_addr(va_page->epc_page) + va_offset;
+	if (sgx_va_page_full(va_page))
+		list_move_tail(&va_page->list, &encl->va_pages);
+
+	ret = __sgx_encl_ewb(epc_page, va_slot, backing);
+	if (ret == SGX_NOT_TRACKED) {
+		ret = __etrack(sgx_get_epc_virt_addr(encl->secs.epc_page));
+		if (ret) {
+			if (encls_failed(ret))
+				ENCLS_WARN(ret, "ETRACK");
+		}
+
+		ret = __sgx_encl_ewb(epc_page, va_slot, backing);
+		if (ret == SGX_NOT_TRACKED) {
+			/*
+			 * Slow path, send IPIs to kick cpus out of the
+			 * enclave.  Note, it's imperative that the cpu
+			 * mask is generated *after* ETRACK, else we'll
+			 * miss cpus that entered the enclave between
+			 * generating the mask and incrementing epoch.
+			 */
+			on_each_cpu_mask(sgx_encl_ewb_cpumask(encl),
+					 sgx_ipi_cb, NULL, 1);
+			ret = __sgx_encl_ewb(epc_page, va_slot, backing);
+		}
+	}
+
+	if (ret) {
+		if (encls_failed(ret))
+			ENCLS_WARN(ret, "EWB");
+
+		sgx_free_va_slot(va_page, va_offset);
+	} else {
+		encl_page->desc |= va_offset;
+		encl_page->va_page = va_page;
+	}
+}
+
+static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
+				struct sgx_backing *backing)
+{
+	struct sgx_encl_page *encl_page = epc_page->owner;
+	struct sgx_encl *encl = encl_page->encl;
+	struct sgx_backing secs_backing;
+	int ret;
+
+	mutex_lock(&encl->lock);
+
+	sgx_encl_ewb(epc_page, backing);
+	encl_page->epc_page = NULL;
+	encl->secs_child_cnt--;
+
+	if (!encl->secs_child_cnt && test_bit(SGX_ENCL_INITIALIZED, &encl->flags)) {
+		ret = sgx_encl_get_backing(encl, PFN_DOWN(encl->size),
+					   &secs_backing);
+		if (ret)
+			goto out;
+
+		sgx_encl_ewb(encl->secs.epc_page, &secs_backing);
+
+		sgx_free_epc_page(encl->secs.epc_page);
+		encl->secs.epc_page = NULL;
+
+		sgx_encl_put_backing(&secs_backing, true);
+	}
+
+out:
+	mutex_unlock(&encl->lock);
+}
+
+/*
+ * Take a fixed number of pages from the head of the active page pool and
+ * reclaim them to the enclave's private shmem files. Skip the pages, which have
+ * been accessed since the last scan. Move those pages to the tail of active
+ * page pool so that the pages get scanned in LRU like fashion.
+ *
+ * Batch process a chunk of pages (at the moment 16) in order to degrade amount
+ * of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit
+ * among the HW threads with three stage EWB pipeline (EWB, ETRACK + EWB and IPI
+ * + EWB) but not sufficiently. Reclaiming one page at a time would also be
+ * problematic as it would increase the lock contention too much, which would
+ * halt forward progress.
+ */
+static void sgx_reclaim_pages(void)
+{
+	struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
+	struct sgx_backing backing[SGX_NR_TO_SCAN];
+	struct sgx_epc_section *section;
+	struct sgx_encl_page *encl_page;
+	struct sgx_epc_page *epc_page;
+	pgoff_t page_index;
+	int cnt = 0;
+	int ret;
+	int i;
+
+	spin_lock(&sgx_reclaimer_lock);
+	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
+		if (list_empty(&sgx_active_page_list))
+			break;
+
+		epc_page = list_first_entry(&sgx_active_page_list,
+					    struct sgx_epc_page, list);
+		list_del_init(&epc_page->list);
+		encl_page = epc_page->owner;
+
+		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0)
+			chunk[cnt++] = epc_page;
+		else
+			/* The owner is freeing the page. No need to add the
+			 * page back to the list of reclaimable pages.
+			 */
+			epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+	}
+	spin_unlock(&sgx_reclaimer_lock);
+
+	for (i = 0; i < cnt; i++) {
+		epc_page = chunk[i];
+		encl_page = epc_page->owner;
+
+		if (!sgx_reclaimer_age(epc_page))
+			goto skip;
+
+		page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);
+		ret = sgx_encl_get_backing(encl_page->encl, page_index, &backing[i]);
+		if (ret)
+			goto skip;
+
+		mutex_lock(&encl_page->encl->lock);
+		encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED;
+		mutex_unlock(&encl_page->encl->lock);
+		continue;
+
+skip:
+		spin_lock(&sgx_reclaimer_lock);
+		list_add_tail(&epc_page->list, &sgx_active_page_list);
+		spin_unlock(&sgx_reclaimer_lock);
+
+		kref_put(&encl_page->encl->refcount, sgx_encl_release);
+
+		chunk[i] = NULL;
+	}
+
+	for (i = 0; i < cnt; i++) {
+		epc_page = chunk[i];
+		if (epc_page)
+			sgx_reclaimer_block(epc_page);
+	}
+
+	for (i = 0; i < cnt; i++) {
+		epc_page = chunk[i];
+		if (!epc_page)
+			continue;
+
+		encl_page = epc_page->owner;
+		sgx_reclaimer_write(epc_page, &backing[i]);
+		sgx_encl_put_backing(&backing[i], true);
+
+		kref_put(&encl_page->encl->refcount, sgx_encl_release);
+		epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+
+		section = &sgx_epc_sections[epc_page->section];
+		spin_lock(&section->lock);
+		list_add_tail(&epc_page->list, &section->page_list);
+		section->free_cnt++;
+		spin_unlock(&section->lock);
+	}
+}
+
+static unsigned long sgx_nr_free_pages(void)
+{
+	unsigned long cnt = 0;
+	int i;
+
+	for (i = 0; i < sgx_nr_epc_sections; i++)
+		cnt += sgx_epc_sections[i].free_cnt;
+
+	return cnt;
+}
+
+static bool sgx_should_reclaim(unsigned long watermark)
+{
+	return sgx_nr_free_pages() < watermark &&
+	       !list_empty(&sgx_active_page_list);
+}
+
 static int ksgxd(void *p)
 {
 	int i;
@@ -71,6 +422,20 @@ static int ksgxd(void *p)
 			WARN(1, "EPC section %d has unsanitized pages.\n", i);
 	}
 
+	while (!kthread_should_stop()) {
+		if (try_to_freeze())
+			continue;
+
+		wait_event_freezable(ksgxd_waitq,
+				     kthread_should_stop() ||
+				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
+
+		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
+			sgx_reclaim_pages();
+
+		cond_resched();
+	}
+
 	return 0;
 }
 
@@ -100,6 +465,7 @@ static struct sgx_epc_page *__sgx_alloc_epc_page_from_section(struct sgx_epc_sec
 
 	page = list_first_entry(&section->page_list, struct sgx_epc_page, list);
 	list_del_init(&page->list);
+	section->free_cnt--;
 
 	spin_unlock(&section->lock);
 	return page;
@@ -132,6 +498,100 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
 	return ERR_PTR(-ENOMEM);
 }
 
+/**
+ * sgx_mark_page_reclaimable() - Mark a page as reclaimable
+ * @page:	EPC page
+ *
+ * Mark a page as reclaimable and add it to the active page list. Pages
+ * are automatically removed from the active list when freed.
+ */
+void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
+{
+	spin_lock(&sgx_reclaimer_lock);
+	page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
+	list_add_tail(&page->list, &sgx_active_page_list);
+	spin_unlock(&sgx_reclaimer_lock);
+}
+
+/**
+ * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list
+ * @page:	EPC page
+ *
+ * Clear the reclaimable flag and remove the page from the active page list.
+ *
+ * Return:
+ *   0 on success,
+ *   -EBUSY if the page is in the process of being reclaimed
+ */
+int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
+{
+	spin_lock(&sgx_reclaimer_lock);
+	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
+		/* The page is being reclaimed. */
+		if (list_empty(&page->list)) {
+			spin_unlock(&sgx_reclaimer_lock);
+			return -EBUSY;
+		}
+
+		list_del(&page->list);
+		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+	}
+	spin_unlock(&sgx_reclaimer_lock);
+
+	return 0;
+}
+
+/**
+ * sgx_alloc_epc_page() - Allocate an EPC page
+ * @owner:	the owner of the EPC page
+ * @reclaim:	reclaim pages if necessary
+ *
+ * Iterate through EPC sections and borrow a free EPC page to the caller. When a
+ * page is no longer needed it must be released with sgx_free_epc_page(). If
+ * @reclaim is set to true, directly reclaim pages when we are out of pages. No
+ * mm's can be locked when @reclaim is set to true.
+ *
+ * Finally, wake up ksgxd when the number of pages goes below the watermark
+ * before returning back to the caller.
+ *
+ * Return:
+ *   an EPC page,
+ *   -errno on error
+ */
+struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
+{
+	struct sgx_epc_page *page;
+
+	for ( ; ; ) {
+		page = __sgx_alloc_epc_page();
+		if (!IS_ERR(page)) {
+			page->owner = owner;
+			break;
+		}
+
+		if (list_empty(&sgx_active_page_list))
+			return ERR_PTR(-ENOMEM);
+
+		if (!reclaim) {
+			page = ERR_PTR(-EBUSY);
+			break;
+		}
+
+		if (signal_pending(current)) {
+			page = ERR_PTR(-ERESTARTSYS);
+			break;
+		}
+
+		sgx_reclaim_pages();
+		cond_resched();
+	}
+
+	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
+		wake_up(&ksgxd_waitq);
+
+	return page;
+}
+
 /**
  * sgx_free_epc_page() - Free an EPC page
  * @page:	an EPC page
@@ -143,12 +603,15 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 	struct sgx_epc_section *section = &sgx_epc_sections[page->section];
 	int ret;
 
+	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
+
 	ret = __eremove(sgx_get_epc_virt_addr(page));
 	if (WARN_ONCE(ret, "EREMOVE returned %d (0x%x)", ret, ret))
 		return;
 
 	spin_lock(&section->lock);
 	list_add_tail(&page->list, &section->page_list);
+	section->free_cnt++;
 	spin_unlock(&section->lock);
 }
 
@@ -176,9 +639,12 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 
 	for (i = 0; i < nr_pages; i++) {
 		section->pages[i].section = index;
+		section->pages[i].flags = 0;
+		section->pages[i].owner = NULL;
 		list_add_tail(&section->pages[i].list, &section->laundry_list);
 	}
 
+	section->free_cnt = nr_pages;
 	return true;
 }
 
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 91234f425b89..a188a683ffb6 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -15,9 +15,17 @@
 
 #define SGX_MAX_EPC_SECTIONS		8
 #define SGX_EEXTEND_BLOCK_SIZE		256
+#define SGX_NR_TO_SCAN			16
+#define SGX_NR_LOW_PAGES		32
+#define SGX_NR_HIGH_PAGES		64
+
+/* Pages, which are being tracked by the page reclaimer. */
+#define SGX_EPC_PAGE_RECLAIMER_TRACKED	BIT(0)
 
 struct sgx_epc_page {
 	unsigned int section;
+	unsigned int flags;
+	struct sgx_encl_page *owner;
 	struct list_head list;
 };
 
@@ -33,6 +41,7 @@ struct sgx_epc_section {
 	struct list_head page_list;
 	struct list_head laundry_list;
 	struct sgx_epc_page *pages;
+	unsigned long free_cnt;
 	spinlock_t lock;
 };
 
@@ -61,4 +70,8 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
 struct sgx_epc_page *__sgx_alloc_epc_page(void);
 void sgx_free_epc_page(struct sgx_epc_page *page);
 
+void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
+int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
+struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
+
 #endif /* _X86_SGX_H */
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v41 22/24] x86/sgx: Add ptrace() support for the SGX driver
       [not found] <20201112220135.165028-1-jarkko@kernel.org>
                   ` (2 preceding siblings ...)
  2020-11-12 22:01 ` [PATCH v41 21/24] x86/sgx: Add a page reclaimer Jarkko Sakkinen
@ 2020-11-12 22:01 ` Jarkko Sakkinen
  3 siblings, 0 replies; 12+ messages in thread
From: Jarkko Sakkinen @ 2020-11-12 22:01 UTC (permalink / raw)
  To: x86, linux-sgx
  Cc: linux-kernel, Jarkko Sakkinen, linux-mm, Andrew Morton,
	Matthew Wilcox, Jethro Beekman, andriy.shevchenko, asapek, bp,
	cedric.xing, chenalexchen, conradparker, cyhanish, dave.hansen,
	haitao.huang, kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman,
	npmccallum, puiterwijk, rientjes, sean.j.christopherson, tglx,
	yaozhangx, mikko.ylinen

Enclave memory is normally inaccessible from outside the enclave.  This
makes enclaves hard to debug.  However, enclaves can be put in a debug mode
when they are being built. In debug enclaves data *can* be read and/or
written by using the ENCLS[EDBGRD] and ENCLS[EDBGWR] functions.

This is obviously only for debugging and destroys all the protections
afforded to normal enclaves.  But, enclaves know their own debug status and
can adjust their behavior appropriately.

Add a vm_ops->access() implementation which can be used to read and write
memory inside debug enclaves.  This is typically used via ptrace() APIs.

Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Tested-by: Jethro Beekman <jethro@fortanix.com> # v40
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
---
Changes from v39:
* Check only for SGX_ENCL_DEBUG in sgx_vma_access(), so that a debug
  enclave's memory can read and written at any phase of its life-cycle.

 arch/x86/kernel/cpu/sgx/encl.c | 111 +++++++++++++++++++++++++++++++++
 1 file changed, 111 insertions(+)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 141424657718..94e7e41e6adf 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -272,10 +272,121 @@ static int sgx_vma_mprotect(struct vm_area_struct *vma, unsigned long start,
 	return sgx_encl_may_map(vma->vm_private_data, start, end, newflags);
 }
 
+static int sgx_encl_debug_read(struct sgx_encl *encl, struct sgx_encl_page *page,
+			       unsigned long addr, void *data)
+{
+	unsigned long offset = addr & ~PAGE_MASK;
+	int ret;
+
+
+	ret = __edbgrd(sgx_get_epc_virt_addr(page->epc_page) + offset, data);
+	if (ret)
+		return -EIO;
+
+	return 0;
+}
+
+static int sgx_encl_debug_write(struct sgx_encl *encl, struct sgx_encl_page *page,
+				unsigned long addr, void *data)
+{
+	unsigned long offset = addr & ~PAGE_MASK;
+	int ret;
+
+	ret = __edbgwr(sgx_get_epc_virt_addr(page->epc_page) + offset, data);
+	if (ret)
+		return -EIO;
+
+	return 0;
+}
+
+/*
+ * Load an enclave page to EPC if required, and take encl->lock.
+ */
+static struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl,
+						   unsigned long addr,
+						   unsigned long vm_flags)
+{
+	struct sgx_encl_page *entry;
+
+	for ( ; ; ) {
+		mutex_lock(&encl->lock);
+
+		entry = sgx_encl_load_page(encl, addr, vm_flags);
+		if (PTR_ERR(entry) != -EBUSY)
+			break;
+
+		mutex_unlock(&encl->lock);
+	}
+
+	if (IS_ERR(entry))
+		mutex_unlock(&encl->lock);
+
+	return entry;
+}
+
+static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
+			  void *buf, int len, int write)
+{
+	struct sgx_encl *encl = vma->vm_private_data;
+	struct sgx_encl_page *entry = NULL;
+	char data[sizeof(unsigned long)];
+	unsigned long align;
+	int offset;
+	int cnt;
+	int ret = 0;
+	int i;
+
+	/*
+	 * If process was forked, VMA is still there but vm_private_data is set
+	 * to NULL.
+	 */
+	if (!encl)
+		return -EFAULT;
+
+	if (!test_bit(SGX_ENCL_DEBUG, &encl->flags))
+		return -EFAULT;
+
+	for (i = 0; i < len; i += cnt) {
+		entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK,
+					      vma->vm_flags);
+		if (IS_ERR(entry)) {
+			ret = PTR_ERR(entry);
+			break;
+		}
+
+		align = ALIGN_DOWN(addr + i, sizeof(unsigned long));
+		offset = (addr + i) & (sizeof(unsigned long) - 1);
+		cnt = sizeof(unsigned long) - offset;
+		cnt = min(cnt, len - i);
+
+		ret = sgx_encl_debug_read(encl, entry, align, data);
+		if (ret)
+			goto out;
+
+		if (write) {
+			memcpy(data + offset, buf + i, cnt);
+			ret = sgx_encl_debug_write(encl, entry, align, data);
+			if (ret)
+				goto out;
+		} else {
+			memcpy(buf + i, data + offset, cnt);
+		}
+
+out:
+		mutex_unlock(&encl->lock);
+
+		if (ret)
+			break;
+	}
+
+	return ret < 0 ? ret : i;
+}
+
 const struct vm_operations_struct sgx_vm_ops = {
 	.fault = sgx_vma_fault,
 	.mprotect = sgx_vma_mprotect,
 	.open = sgx_vma_open,
+	.access = sgx_vma_access,
 };
 
 /**
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v41 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-12 22:01 ` [PATCH v41 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct Jarkko Sakkinen
@ 2020-11-13 10:25   ` Mel Gorman
  2020-11-17 18:16     ` Jarkko Sakkinen
  2020-11-15 17:08   ` Dr. Greg
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Mel Gorman @ 2020-11-13 10:25 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: x86, linux-sgx, linux-kernel, Sean Christopherson, linux-mm,
	Andrew Morton, Matthew Wilcox, Jethro Beekman, Dave Hansen,
	andriy.shevchenko, asapek, bp, cedric.xing, chenalexchen,
	conradparker, cyhanish, haitao.huang, kai.huang, kai.svahn, kmoy,
	ludloff, luto, nhorman, npmccallum, puiterwijk, rientjes, tglx,
	yaozhangx, mikko.ylinen

On Fri, Nov 13, 2020 at 12:01:21AM +0200, Jarkko Sakkinen wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Background
> ==========
> 
> 1. SGX enclave pages are populated with data by copying from normal memory
>    via ioctl() (SGX_IOC_ENCLAVE_ADD_PAGES), which will be added later in
>    this series.
> 2. It is desirable to be able to restrict those normal memory data sources.
>    For instance, to ensure that the source data is executable before
>    copying data to an executable enclave page.
> 3. Enclave page permissions are dynamic (just like normal permissions) and
>    can be adjusted at runtime with mprotect().
> 
> This creates a problem because the original data source may have long since
> vanished at the time when enclave page permissions are established (mmap()
> or mprotect()).
> 
> The solution (elsewhere in this series) is to force enclaves creators to
> declare their paging permission *intent* up front to the ioctl().  This
> intent can be immediately compared to the source data???s mapping and
> rejected if necessary.
> 
> The ???intent??? is also stashed off for later comparison with enclave
> PTEs. This ensures that any future mmap()/mprotect() operations
> performed by the enclave creator or done on behalf of the enclave
> can be compared with the earlier declared permissions.
> 
> Problem
> =======
> 
> There is an existing mmap() hook which allows SGX to perform this
> permission comparison at mmap() time.  However, there is no corresponding
> ->mprotect() hook.
> 
> Solution
> ========
> 
> Add a vm_ops->mprotect() hook so that mprotect() operations which are
> inconsistent with any page's stashed intent can be rejected by the driver.
> 
> Cc: linux-mm@kvack.org
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Jethro Beekman <jethro@fortanix.com> # v40
> Acked-by: Dave Hansen <dave.hansen@intel.com> # v40
> # Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Co-developed-by: Jarkko Sakkinen <jarkko@kernel.org>
> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v41 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-12 22:01 ` [PATCH v41 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct Jarkko Sakkinen
  2020-11-13 10:25   ` Mel Gorman
@ 2020-11-15 17:08   ` Dr. Greg
  2020-11-15 17:32   ` Matthew Wilcox
  2020-11-18 17:18   ` [tip: x86/sgx] " tip-bot2 for Sean Christopherson
  3 siblings, 0 replies; 12+ messages in thread
From: Dr. Greg @ 2020-11-15 17:08 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: x86, linux-sgx, linux-kernel, Sean Christopherson, linux-mm,
	Andrew Morton, Matthew Wilcox, Mel Gorman, Jethro Beekman,
	Dave Hansen, andriy.shevchenko, asapek, bp, cedric.xing,
	chenalexchen, conradparker, cyhanish, haitao.huang, kai.huang,
	kai.svahn, kmoy, ludloff, luto, nhorman, npmccallum, puiterwijk,
	rientjes, tglx, yaozhangx, mikko.ylinen

On Fri, Nov 13, 2020 at 12:01:21AM +0200, Jarkko Sakkinen wrote:

Good morning, I hope the weekend is going well for everyone.

> From: Sean Christopherson <sean.j.christopherson@intel.com>

We wish Sean well in whatever new avocation he has chosen.

> Background
> ==========
> 
> 1. SGX enclave pages are populated with data by copying from normal memory
>    via ioctl() (SGX_IOC_ENCLAVE_ADD_PAGES), which will be added later in
>    this series.
> 2. It is desirable to be able to restrict those normal memory data sources.
>    For instance, to ensure that the source data is executable before
>    copying data to an executable enclave page.
> 3. Enclave page permissions are dynamic (just like normal permissions) and
>    can be adjusted at runtime with mprotect().
>
> This creates a problem because the original data source may have long since
> vanished at the time when enclave page permissions are established (mmap()
> or mprotect()).
> 
> The solution (elsewhere in this series) is to force enclaves creators to
> declare their paging permission *intent* up front to the ioctl().  This
> intent can be immediately compared to the source data???s mapping and
> rejected if necessary.
> 
> The ???intent??? is also stashed off for later comparison with enclave
> PTEs. This ensures that any future mmap()/mprotect() operations
> performed by the enclave creator or done on behalf of the enclave
> can be compared with the earlier declared permissions.

The new mprotect hook in vm_operations_struct is indeed useful, as I
will demonstrate in a subsequent patch for consideration.  However,
the officially stated intent of this version of the driver is to
implement SGX1 semantics even on hardware (SGX2) that implements the
instructions needed for Enclave Dynamic Memory Management (EDMM).

As a result, at this stage of the driver's development, the
implementation that walks the page permission intents can be
substituted with a simple denial of mmap and mprotect on an
initialized enclave.  With this prohibition in place, the hardware
itself will enforce the page permission intents that were encoded when
the enclave was loaded, thus making the subsequent scan irrelevant.

The following patch implements this functionality.

Dr. Greg

---------------------------------------------------------------------------
Subject: [PATCH] Unconditionally block permission changes on enclave memory.

In SGX there are two levels of memory protection, the classic
page table mechanism and SGX hardware based page protections
that are codified in the Enclave Page Cache Metadata.  A
successful memory access requires that both mechanisms agree
that the access is permitted.

In the initial implementation of SGX (SGX1), the page permissions
are immutable after the enclave is initialized.  Even if classic
page protections are modified via mprotect, any attempt to access
enclave memory with alternative permissions will be blocked.

One of the architectural changes implemented in the second
generation of SGX (SGX2) is the ability for page access
permissions to be dynamically manipulated after the enclave is
initialized.  This requires coordination between trusted code
running in the enclave and untrusted code using mprotect and
special ring-0 instructions.

One of the security threats associated with SGX2 hardware is that
enclave based code can conspire with its untrusted runtime to make
executable enclave memory writable.  This provides the opportunity for
completely anonymous code execution that the operating system has no
visibility into.

All that is needed to, simply, close this vulnerability is to
eliminate the ability to call mprotect or mmap against the virtual
memory range of an enclave after it is initialized.  Any
permission changes made prior to initialization that are inconsistent
with the permissions codified in the enclave will cause initialization
or execution of the enclave to fail.

Tested-by: Dr. Greg <greg@enjellic.com>
Signed-off-by: Dr. Greg <greg@enjellic.com>
---
 arch/x86/kernel/cpu/sgx/encl.c | 50 +++++++++-------------------------
 1 file changed, 13 insertions(+), 37 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 5551c7d36483..3bd770fbfc32 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -212,27 +212,25 @@ static void sgx_vma_open(struct vm_area_struct *vma)
  * @end:		upper bound of the address range, exclusive
  * @vm_flags:		VMA flags
  *
- * Iterate through the enclave pages contained within [@start, @end) to verify
- * that the permissions requested by a subset of {VM_READ, VM_WRITE, VM_EXEC}
- * does not contain any permissions that are not contained in the build time
- * permissions of any of the enclave pages within the given address range.
+ * This function provides a method for determining whether or not mmap
+ * or mprotect can be invoked called on any pages in the virtual
+ * address range of an enclave.
  *
- * An enclave creator must declare the strongest permissions that will be
- * needed for each enclave page  This ensures that mappings  have the identical
- * or weaker permissions that the earlier declared permissions.
+ * Since this driver is not designed to support Enclave Dynamic Memory
+ * Management (EDMM), any attempts to modify enclave memory map after
+ * the enclave is initialized are simply blocked.
+ *
+ * The function signature is left intact since future versions of the
+ * driver may implement verifications that the requested permission
+ * changes are consistent with the desire of the enclave author.
  *
  * Return: 0 on success, -EACCES otherwise
  */
 int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
 		     unsigned long end, unsigned long vm_flags)
 {
-	unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
-	struct sgx_encl_page *page;
-	unsigned long count = 0;
 	int ret = 0;
 
-	XA_STATE(xas, &encl->page_array, PFN_DOWN(start));
-
 	/*
 	 * Disallow READ_IMPLIES_EXEC tasks as their VMA permissions might
 	 * conflict with the enclave page permissions.
@@ -240,31 +238,9 @@ int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
 	if (current->personality & READ_IMPLIES_EXEC)
 		return -EACCES;
 
-	mutex_lock(&encl->lock);
-	xas_lock(&xas);
-	xas_for_each(&xas, page, PFN_DOWN(end - 1)) {
-		if (!page)
-			break;
-
-		if (~page->vm_max_prot_bits & vm_prot_bits) {
-			ret = -EACCES;
-			break;
-		}
-
-		/* Reschedule on every XA_CHECK_SCHED iteration. */
-		if (!(++count % XA_CHECK_SCHED)) {
-			xas_pause(&xas);
-			xas_unlock(&xas);
-			mutex_unlock(&encl->lock);
-
-			cond_resched();
-
-			mutex_lock(&encl->lock);
-			xas_lock(&xas);
-		}
-	}
-	xas_unlock(&xas);
-	mutex_unlock(&encl->lock);
+	/* Disallow mapping on an initialized enclave. */
+	if (test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
+		ret = -EACCES;
 
 	return ret;
 }
-- 
2.19.2

---------------------------------------------------------------------------

As always,
Dr. Greg Wettstein, Ph.D, Worker      Autonomously self-defensive
Enjellic Systems Development, LLC     IOT platforms and edge devices.
4206 N. 19th Ave.
Fargo, ND  58102
PH: 701-281-1686                      EMAIL: dg@enjellic.com
------------------------------------------------------------------------------
"For a successful technology, reality must take precedence over public
 relations, for nature cannot be fooled."
                                -- Richard Feynmann


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v41 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-12 22:01 ` [PATCH v41 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct Jarkko Sakkinen
  2020-11-13 10:25   ` Mel Gorman
  2020-11-15 17:08   ` Dr. Greg
@ 2020-11-15 17:32   ` Matthew Wilcox
  2020-11-15 18:36     ` Dave Hansen
  2020-11-18 17:18   ` [tip: x86/sgx] " tip-bot2 for Sean Christopherson
  3 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2020-11-15 17:32 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: x86, linux-sgx, linux-kernel, Sean Christopherson, linux-mm,
	Andrew Morton, Mel Gorman, Jethro Beekman, Dave Hansen,
	andriy.shevchenko, asapek, bp, cedric.xing, chenalexchen,
	conradparker, cyhanish, haitao.huang, kai.huang, kai.svahn, kmoy,
	ludloff, luto, nhorman, npmccallum, puiterwijk, rientjes, tglx,
	yaozhangx, mikko.ylinen

On Fri, Nov 13, 2020 at 12:01:21AM +0200, Jarkko Sakkinen wrote:
> +++ b/include/linux/mm.h
> @@ -559,6 +559,13 @@ struct vm_operations_struct {
>  	void (*close)(struct vm_area_struct * area);
>  	int (*split)(struct vm_area_struct * area, unsigned long addr);
>  	int (*mremap)(struct vm_area_struct * area);
> +	/*
> +	 * Called by mprotect() to make driver-specific permission
> +	 * checks before mprotect() is finalised.   The VMA must not
> +	 * be modified.  Returns 0 if eprotect() can proceed.
> +	 */

This is the wrong place for this documentation, and it's absurdly
specific to your implementation.  It should be in
Documentation/filesystems/locking.rst.

> +	int (*mprotect)(struct vm_area_struct *vma, unsigned long start,
> +			unsigned long end, unsigned long newflags);
> +
> +		if (vma->vm_ops && vma->vm_ops->mprotect)
> +			error = vma->vm_ops->mprotect(vma, nstart, tmp, newflags);
> +		if (error)
> +			goto out;
> +
>  		error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
>  		if (error)
>  			goto out;
> +
>  		nstart = tmp;

Spurious newline added.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v41 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-15 17:32   ` Matthew Wilcox
@ 2020-11-15 18:36     ` Dave Hansen
  2020-11-16 10:09       ` Mel Gorman
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Hansen @ 2020-11-15 18:36 UTC (permalink / raw)
  To: Matthew Wilcox, Jarkko Sakkinen
  Cc: x86, linux-sgx, linux-kernel, Sean Christopherson, linux-mm,
	Andrew Morton, Mel Gorman, Jethro Beekman, andriy.shevchenko,
	asapek, bp, cedric.xing, chenalexchen, conradparker, cyhanish,
	haitao.huang, kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman,
	npmccallum, puiterwijk, rientjes, tglx, yaozhangx, mikko.ylinen

On 11/15/20 9:32 AM, Matthew Wilcox wrote:
> On Fri, Nov 13, 2020 at 12:01:21AM +0200, Jarkko Sakkinen wrote:
>> +++ b/include/linux/mm.h
>> @@ -559,6 +559,13 @@ struct vm_operations_struct {
>>  	void (*close)(struct vm_area_struct * area);
>>  	int (*split)(struct vm_area_struct * area, unsigned long addr);
>>  	int (*mremap)(struct vm_area_struct * area);
>> +	/*
>> +	 * Called by mprotect() to make driver-specific permission
>> +	 * checks before mprotect() is finalised.   The VMA must not
>> +	 * be modified.  Returns 0 if eprotect() can proceed.
>> +	 */
> 
> This is the wrong place for this documentation, and it's absurdly
> specific to your implementation.  It should be in
> Documentation/filesystems/locking.rst.

I'll let you and Mel duke that one out:

> https://lore.kernel.org/linux-mm/20201106100409.GD3371@techsingularity.net/

As for 'locking.rst', I didn't even know that was there.  It's also a
_bit_ silly that we would depend on folks finding that for code like SGX
that has nothing to do with filesystems.  If we expect folks to update
that, we need a comment spelling that out in the struct.

The comment also _looks_ reasonably generic to me.  None of the
parameters can be modified, so the impact on the core mm code must only
be from the result of the return code, even if the handler does some
other magic in addition to plain checks.

For one thing, I guess we could take the literal mention of "driver"
out.  vm_operations are certainly supplied by code that isn't a "driver"
per se.

>> +	int (*mprotect)(struct vm_area_struct *vma, unsigned long start,
>> +			unsigned long end, unsigned long newflags);
>> +
>> +		if (vma->vm_ops && vma->vm_ops->mprotect)
>> +			error = vma->vm_ops->mprotect(vma, nstart, tmp, newflags);
>> +		if (error)
>> +			goto out;
>> +
>>  		error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
>>  		if (error)
>>  			goto out;
>> +
>>  		nstart = tmp;
> 
> Spurious newline added.

I don't actually think this is spurious.  It ends up grouping the code
into the logical chunks with the ->mprotect() and goto in one chunk and
mprotect_fixup() and its goto in another.  Without this newline, the two
logical parts don't look separate.

It's kinda hard to see with just the diff context, though.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v41 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-15 18:36     ` Dave Hansen
@ 2020-11-16 10:09       ` Mel Gorman
  2020-11-17 19:15         ` Jarkko Sakkinen
  0 siblings, 1 reply; 12+ messages in thread
From: Mel Gorman @ 2020-11-16 10:09 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Matthew Wilcox, Jarkko Sakkinen, x86, linux-sgx, linux-kernel,
	Sean Christopherson, linux-mm, Andrew Morton, Jethro Beekman,
	andriy.shevchenko, asapek, bp, cedric.xing, chenalexchen,
	conradparker, cyhanish, haitao.huang, kai.huang, kai.svahn, kmoy,
	ludloff, luto, nhorman, npmccallum, puiterwijk, rientjes, tglx,
	yaozhangx, mikko.ylinen

On Sun, Nov 15, 2020 at 10:36:51AM -0800, Dave Hansen wrote:
> On 11/15/20 9:32 AM, Matthew Wilcox wrote:
> > On Fri, Nov 13, 2020 at 12:01:21AM +0200, Jarkko Sakkinen wrote:
> >> +++ b/include/linux/mm.h
> >> @@ -559,6 +559,13 @@ struct vm_operations_struct {
> >>  	void (*close)(struct vm_area_struct * area);
> >>  	int (*split)(struct vm_area_struct * area, unsigned long addr);
> >>  	int (*mremap)(struct vm_area_struct * area);
> >> +	/*
> >> +	 * Called by mprotect() to make driver-specific permission
> >> +	 * checks before mprotect() is finalised.   The VMA must not
> >> +	 * be modified.  Returns 0 if eprotect() can proceed.
> >> +	 */
> > 
> > This is the wrong place for this documentation, and it's absurdly
> > specific to your implementation.  It should be in
> > Documentation/filesystems/locking.rst.
> 
> I'll let you and Mel duke that one out:
> 

I suggested placing the comment there to make it clear what the expected
semantics of the hook was to reduce the chances of abuse or surprises. The
hook does not affect locking so Documentation/filesystems/locking.rst
didn't appear appropriate other than maybe adding a note there
that it doesn't affect locks. The hook also is not expecting any
filesystems-specific action that I aware of but a note could be added to
the effect that filesystems should not need to take special action for it.
Protections on the filesystem level are for the inode, I can't imagine what
a filesystem would do with a protection change on the page table level
but maybe I'm not particularly imaginative today.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v41 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-13 10:25   ` Mel Gorman
@ 2020-11-17 18:16     ` Jarkko Sakkinen
  0 siblings, 0 replies; 12+ messages in thread
From: Jarkko Sakkinen @ 2020-11-17 18:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: x86, linux-sgx, linux-kernel, Sean Christopherson, linux-mm,
	Andrew Morton, Matthew Wilcox, Jethro Beekman, Dave Hansen,
	andriy.shevchenko, asapek, bp, cedric.xing, chenalexchen,
	conradparker, cyhanish, haitao.huang, kai.huang, kai.svahn, kmoy,
	ludloff, luto, nhorman, npmccallum, puiterwijk, rientjes, tglx,
	yaozhangx, mikko.ylinen

On Fri, Nov 13, 2020 at 10:25:43AM +0000, Mel Gorman wrote:
> On Fri, Nov 13, 2020 at 12:01:21AM +0200, Jarkko Sakkinen wrote:
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > 
> > Background
> > ==========
> > 
> > 1. SGX enclave pages are populated with data by copying from normal memory
> >    via ioctl() (SGX_IOC_ENCLAVE_ADD_PAGES), which will be added later in
> >    this series.
> > 2. It is desirable to be able to restrict those normal memory data sources.
> >    For instance, to ensure that the source data is executable before
> >    copying data to an executable enclave page.
> > 3. Enclave page permissions are dynamic (just like normal permissions) and
> >    can be adjusted at runtime with mprotect().
> > 
> > This creates a problem because the original data source may have long since
> > vanished at the time when enclave page permissions are established (mmap()
> > or mprotect()).
> > 
> > The solution (elsewhere in this series) is to force enclaves creators to
> > declare their paging permission *intent* up front to the ioctl().  This
> > intent can be immediately compared to the source data???s mapping and
> > rejected if necessary.
> > 
> > The ???intent??? is also stashed off for later comparison with enclave
> > PTEs. This ensures that any future mmap()/mprotect() operations
> > performed by the enclave creator or done on behalf of the enclave
> > can be compared with the earlier declared permissions.
> > 
> > Problem
> > =======
> > 
> > There is an existing mmap() hook which allows SGX to perform this
> > permission comparison at mmap() time.  However, there is no corresponding
> > ->mprotect() hook.
> > 
> > Solution
> > ========
> > 
> > Add a vm_ops->mprotect() hook so that mprotect() operations which are
> > inconsistent with any page's stashed intent can be rejected by the driver.
> > 
> > Cc: linux-mm@kvack.org
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > Cc: Mel Gorman <mgorman@techsingularity.net>
> > Acked-by: Jethro Beekman <jethro@fortanix.com> # v40
> > Acked-by: Dave Hansen <dave.hansen@intel.com> # v40
> > # Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > Co-developed-by: Jarkko Sakkinen <jarkko@kernel.org>
> > Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
> 
> Acked-by: Mel Gorman <mgorman@techsingularity.net>

Thank you.

/Jarkko


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v41 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-16 10:09       ` Mel Gorman
@ 2020-11-17 19:15         ` Jarkko Sakkinen
  0 siblings, 0 replies; 12+ messages in thread
From: Jarkko Sakkinen @ 2020-11-17 19:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Dave Hansen, Matthew Wilcox, x86, linux-sgx, linux-kernel,
	Sean Christopherson, linux-mm, Andrew Morton, Jethro Beekman,
	andriy.shevchenko, asapek, bp, cedric.xing, chenalexchen,
	conradparker, cyhanish, haitao.huang, kai.huang, kai.svahn, kmoy,
	ludloff, luto, nhorman, npmccallum, puiterwijk, rientjes, tglx,
	yaozhangx, mikko.ylinen

On Mon, Nov 16, 2020 at 10:09:57AM +0000, Mel Gorman wrote:
> On Sun, Nov 15, 2020 at 10:36:51AM -0800, Dave Hansen wrote:
> > On 11/15/20 9:32 AM, Matthew Wilcox wrote:
> > > On Fri, Nov 13, 2020 at 12:01:21AM +0200, Jarkko Sakkinen wrote:
> > >> +++ b/include/linux/mm.h
> > >> @@ -559,6 +559,13 @@ struct vm_operations_struct {
> > >>  	void (*close)(struct vm_area_struct * area);
> > >>  	int (*split)(struct vm_area_struct * area, unsigned long addr);
> > >>  	int (*mremap)(struct vm_area_struct * area);
> > >> +	/*
> > >> +	 * Called by mprotect() to make driver-specific permission
> > >> +	 * checks before mprotect() is finalised.   The VMA must not
> > >> +	 * be modified.  Returns 0 if eprotect() can proceed.
> > >> +	 */

Wonder if this should also document the negative case for the return
value, i.e. -EACCES is returned otherwise.

> > > 
> > > This is the wrong place for this documentation, and it's absurdly
> > > specific to your implementation.  It should be in
> > > Documentation/filesystems/locking.rst.
> > 
> > I'll let you and Mel duke that one out:
> > 
> 
> I suggested placing the comment there to make it clear what the expected
> semantics of the hook was to reduce the chances of abuse or surprises. The
> hook does not affect locking so Documentation/filesystems/locking.rst
> didn't appear appropriate other than maybe adding a note there
> that it doesn't affect locks. The hook also is not expecting any
> filesystems-specific action that I aware of but a note could be added to
> the effect that filesystems should not need to take special action for it.
> Protections on the filesystem level are for the inode, I can't imagine what
> a filesystem would do with a protection change on the page table level
> but maybe I'm not particularly imaginative today.

I try to decipher this in generic context.

In a permission check of a filesystem, truncated pages should be
encapsulated in to the permission decision. It's a just a query.

So maybe I'll add something like:

"This callback does only a permission query, and thus does never
return locked pages."

> -- 
> Mel Gorman
> SUSE Labs

/Jarkko


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [tip: x86/sgx] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-12 22:01 ` [PATCH v41 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct Jarkko Sakkinen
                     ` (2 preceding siblings ...)
  2020-11-15 17:32   ` Matthew Wilcox
@ 2020-11-18 17:18   ` tip-bot2 for Sean Christopherson
  3 siblings, 0 replies; 12+ messages in thread
From: tip-bot2 for Sean Christopherson @ 2020-11-18 17:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Sean Christopherson, Jarkko Sakkinen, Borislav Petkov,
	Jethro Beekman, Dave Hansen, Mel Gorman, Hillf Danton, linux-mm,
	x86, linux-kernel

The following commit has been merged into the x86/sgx branch of tip:

Commit-ID:     95bb7c42ac8a94ce3d0eb059ad64430390351ccb
Gitweb:        https://git.kernel.org/tip/95bb7c42ac8a94ce3d0eb059ad64430390351ccb
Author:        Sean Christopherson <sean.j.christopherson@intel.com>
AuthorDate:    Fri, 13 Nov 2020 00:01:21 +02:00
Committer:     Borislav Petkov <bp@suse.de>
CommitterDate: Tue, 17 Nov 2020 14:36:14 +01:00

mm: Add 'mprotect' hook to struct vm_operations_struct

Background
==========

1. SGX enclave pages are populated with data by copying from normal memory
   via ioctl() (SGX_IOC_ENCLAVE_ADD_PAGES), which will be added later in
   this series.
2. It is desirable to be able to restrict those normal memory data sources.
   For instance, to ensure that the source data is executable before
   copying data to an executable enclave page.
3. Enclave page permissions are dynamic (just like normal permissions) and
   can be adjusted at runtime with mprotect().

This creates a problem because the original data source may have long since
vanished at the time when enclave page permissions are established (mmap()
or mprotect()).

The solution (elsewhere in this series) is to force enclave creators to
declare their paging permission *intent* up front to the ioctl().  This
intent can be immediately compared to the source data’s mapping and
rejected if necessary.

The “intent” is also stashed off for later comparison with enclave
PTEs. This ensures that any future mmap()/mprotect() operations
performed by the enclave creator or done on behalf of the enclave
can be compared with the earlier declared permissions.

Problem
=======

There is an existing mmap() hook which allows SGX to perform this
permission comparison at mmap() time.  However, there is no corresponding
->mprotect() hook.

Solution
========

Add a vm_ops->mprotect() hook so that mprotect() operations which are
inconsistent with any page's stashed intent can be rejected by the driver.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hdanton@sina.com>
Cc: linux-mm@kvack.org
Link: https://lkml.kernel.org/r/20201112220135.165028-11-jarkko@kernel.org
---
 include/linux/mm.h | 7 +++++++
 mm/mprotect.c      | 7 +++++++
 2 files changed, 14 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d..1813fa8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -559,6 +559,13 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	int (*split)(struct vm_area_struct * area, unsigned long addr);
 	int (*mremap)(struct vm_area_struct * area);
+	/*
+	 * Called by mprotect() to make driver-specific permission
+	 * checks before mprotect() is finalised.   The VMA must not
+	 * be modified.  Returns 0 if eprotect() can proceed.
+	 */
+	int (*mprotect)(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end, unsigned long newflags);
 	vm_fault_t (*fault)(struct vm_fault *vmf);
 	vm_fault_t (*huge_fault)(struct vm_fault *vmf,
 			enum page_entry_size pe_size);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 56c02be..ab70902 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -616,9 +616,16 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 		tmp = vma->vm_end;
 		if (tmp > end)
 			tmp = end;
+
+		if (vma->vm_ops && vma->vm_ops->mprotect)
+			error = vma->vm_ops->mprotect(vma, nstart, tmp, newflags);
+		if (error)
+			goto out;
+
 		error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
 		if (error)
 			goto out;
+
 		nstart = tmp;
 
 		if (nstart < prev->vm_end)


^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-11-18 17:18 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20201112220135.165028-1-jarkko@kernel.org>
2020-11-12 22:01 ` [PATCH v41 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct Jarkko Sakkinen
2020-11-13 10:25   ` Mel Gorman
2020-11-17 18:16     ` Jarkko Sakkinen
2020-11-15 17:08   ` Dr. Greg
2020-11-15 17:32   ` Matthew Wilcox
2020-11-15 18:36     ` Dave Hansen
2020-11-16 10:09       ` Mel Gorman
2020-11-17 19:15         ` Jarkko Sakkinen
2020-11-18 17:18   ` [tip: x86/sgx] " tip-bot2 for Sean Christopherson
2020-11-12 22:01 ` [PATCH v41 11/24] x86/sgx: Add SGX misc driver interface Jarkko Sakkinen
2020-11-12 22:01 ` [PATCH v41 21/24] x86/sgx: Add a page reclaimer Jarkko Sakkinen
2020-11-12 22:01 ` [PATCH v41 22/24] x86/sgx: Add ptrace() support for the SGX driver Jarkko Sakkinen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).