linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
       [not found] <20201104145430.300542-1-jarkko.sakkinen@linux.intel.com>
@ 2020-11-04 14:54 ` Jarkko Sakkinen
  2020-11-05 16:04   ` Borislav Petkov
                     ` (2 more replies)
  2020-11-04 14:54 ` [PATCH v40 11/24] x86/sgx: Add SGX misc driver interface Jarkko Sakkinen
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 33+ messages in thread
From: Jarkko Sakkinen @ 2020-11-04 14:54 UTC (permalink / raw)
  To: x86, linux-sgx
  Cc: linux-kernel, Sean Christopherson, linux-mm, Andrew Morton,
	Matthew Wilcox, Jethro Beekman, Darren Kenny, Jarkko Sakkinen,
	andriy.shevchenko, asapek, bp, cedric.xing, chenalexchen,
	conradparker, cyhanish, dave.hansen, haitao.huang, kai.huang,
	kai.svahn, kmoy, ludloff, luto, nhorman, npmccallum, puiterwijk,
	rientjes, tglx, yaozhangx, mikko.ylinen

From: Sean Christopherson <sean.j.christopherson@intel.com>

Background
==========

1. SGX enclave pages are populated with data by copying from normal memory
   via ioctl() (SGX_IOC_ENCLAVE_ADD_PAGES), which will be added later in
   this series.
2. It is desirable to be able to restrict those normal memory data sources.
   For instance, to ensure that the source data is executable before
   copying data to an executable enclave page.
3. Enclave page permissions are dynamic (just like normal permissions) and
   can be adjusted at runtime with mprotect().

This creates a problem because the original data source may have long since
vanished at the time when enclave page permissions are established (mmap()
or mprotect()).

The solution (elsewhere in this series) is to force enclaves creators to
declare their paging permission *intent* up front to the ioctl().  This
intent can me immediately compared to the source data’s mapping and
rejected if necessary.

The “intent” is also stashed off for later comparison with enclave
PTEs. This ensures that any future mmap()/mprotect() operations
performed by the enclave creator or done on behalf of the enclave
can be compared with the earlier declared permissions.

Problem
=======

There is an existing mmap() hook which allows SGX to perform this
permission comparison at mmap() time.  However, there is no corresponding
->mprotect() hook.

Solution
========

Add a vm_ops->mprotect() hook so that mprotect() operations which are
inconsistent with any page's stashed intent can be rejected by the driver.

Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
---
 include/linux/mm.h | 3 +++
 mm/mprotect.c      | 5 ++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef360fe70aaf..eb38eabc5039 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -559,6 +559,9 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	int (*split)(struct vm_area_struct * area, unsigned long addr);
 	int (*mremap)(struct vm_area_struct * area);
+	int (*mprotect)(struct vm_area_struct *vma,
+			struct vm_area_struct **pprev, unsigned long start,
+			unsigned long end, unsigned long newflags);
 	vm_fault_t (*fault)(struct vm_fault *vmf);
 	vm_fault_t (*huge_fault)(struct vm_fault *vmf,
 			enum page_entry_size pe_size);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 56c02beb6041..1fd4fa71ce16 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -616,7 +616,10 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 		tmp = vma->vm_end;
 		if (tmp > end)
 			tmp = end;
-		error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
+		if (vma->vm_ops && vma->vm_ops->mprotect)
+			error = vma->vm_ops->mprotect(vma, &prev, nstart, tmp, newflags);
+		else
+			error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
 		if (error)
 			goto out;
 		nstart = tmp;
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v40 11/24] x86/sgx: Add SGX misc driver interface
       [not found] <20201104145430.300542-1-jarkko.sakkinen@linux.intel.com>
  2020-11-04 14:54 ` [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct Jarkko Sakkinen
@ 2020-11-04 14:54 ` Jarkko Sakkinen
       [not found]   ` <20201105011043.GA700495@kernel.org>
  2020-11-04 14:54 ` [PATCH v40 21/24] x86/sgx: Add a page reclaimer Jarkko Sakkinen
  2020-11-04 14:54 ` [PATCH v40 22/24] x86/sgx: Add ptrace() support for the SGX driver Jarkko Sakkinen
  3 siblings, 1 reply; 33+ messages in thread
From: Jarkko Sakkinen @ 2020-11-04 14:54 UTC (permalink / raw)
  To: x86, linux-sgx
  Cc: linux-kernel, Jarkko Sakkinen, linux-security-module, linux-mm,
	Andrew Morton, Matthew Wilcox, Jethro Beekman, Haitao Huang,
	Chunyang Hui, Jordan Hand, Nathaniel McCallum, Seth Moore,
	Darren Kenny, Sean Christopherson, Suresh Siddha,
	andriy.shevchenko, asapek, bp, cedric.xing, chenalexchen,
	conradparker, cyhanish, dave.hansen, haitao.huang, kai.huang,
	kai.svahn, kmoy, ludloff, luto, nhorman, puiterwijk, rientjes,
	tglx, yaozhangx, mikko.ylinen

Intel(R) SGX is new hardware functionality that can be used by applications
to set aside private regions of code and data called enclaves. New hardware
protects enclave code and data from outside access and modification.

Add a driver that presents a device file and ioctl API to build and manage
enclaves.  Subsequent patches will expend the ioctl()’s functionality.

Cc: linux-security-module@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Tested-by: Haitao Huang <haitao.huang@linux.intel.com>
Tested-by: Chunyang Hui <sanqian.hcy@antfin.com>
Tested-by: Jordan Hand <jorhand@linux.microsoft.com>
Tested-by: Nathaniel McCallum <npmccallum@redhat.com>
Tested-by: Seth Moore <sethmo@google.com>
Tested-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
---
Changes from v39:
* Rename /dev/sgx/enclave as /dev/sgx_enclave.
* In the page fault handler, do not check for SGX_ENCL_DEAD. This allows
  to do forensics to the memory of debug enclaves.

 arch/x86/kernel/cpu/sgx/Makefile |   2 +
 arch/x86/kernel/cpu/sgx/driver.c | 112 ++++++++++++++++++
 arch/x86/kernel/cpu/sgx/driver.h |  16 +++
 arch/x86/kernel/cpu/sgx/encl.c   | 188 +++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/encl.h   |  46 ++++++++
 arch/x86/kernel/cpu/sgx/main.c   |  12 +-
 6 files changed, 375 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/cpu/sgx/driver.c
 create mode 100644 arch/x86/kernel/cpu/sgx/driver.h
 create mode 100644 arch/x86/kernel/cpu/sgx/encl.c
 create mode 100644 arch/x86/kernel/cpu/sgx/encl.h

diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 79510ce01b3b..3fc451120735 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -1,2 +1,4 @@
 obj-y += \
+	driver.o \
+	encl.o \
 	main.o
diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
new file mode 100644
index 000000000000..248213dea78e
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/driver.c
@@ -0,0 +1,112 @@
+// SPDX-License-Identifier: GPL-2.0
+/*  Copyright(c) 2016-20 Intel Corporation. */
+
+#include <linux/acpi.h>
+#include <linux/miscdevice.h>
+#include <linux/mman.h>
+#include <linux/security.h>
+#include <linux/suspend.h>
+#include <asm/traps.h>
+#include "driver.h"
+#include "encl.h"
+
+static int sgx_open(struct inode *inode, struct file *file)
+{
+	struct sgx_encl *encl;
+
+	encl = kzalloc(sizeof(*encl), GFP_KERNEL);
+	if (!encl)
+		return -ENOMEM;
+
+	xa_init(&encl->page_array);
+	mutex_init(&encl->lock);
+
+	file->private_data = encl;
+
+	return 0;
+}
+
+static int sgx_release(struct inode *inode, struct file *file)
+{
+	struct sgx_encl *encl = file->private_data;
+	struct sgx_encl_page *entry;
+	unsigned long index;
+
+	xa_for_each(&encl->page_array, index, entry) {
+		if (entry->epc_page) {
+			sgx_free_epc_page(entry->epc_page);
+			encl->secs_child_cnt--;
+			entry->epc_page = NULL;
+		}
+
+		kfree(entry);
+	}
+
+	xa_destroy(&encl->page_array);
+
+	if (!encl->secs_child_cnt && encl->secs.epc_page) {
+		sgx_free_epc_page(encl->secs.epc_page);
+		encl->secs.epc_page = NULL;
+	}
+
+	/* Detect EPC page leak's. */
+	WARN_ON_ONCE(encl->secs_child_cnt);
+	WARN_ON_ONCE(encl->secs.epc_page);
+
+	kfree(encl);
+	return 0;
+}
+
+static int sgx_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct sgx_encl *encl = file->private_data;
+	int ret;
+
+	ret = sgx_encl_may_map(encl, vma->vm_start, vma->vm_end, vma->vm_flags);
+	if (ret)
+		return ret;
+
+	vma->vm_ops = &sgx_vm_ops;
+	vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO;
+	vma->vm_private_data = encl;
+
+	return 0;
+}
+
+static unsigned long sgx_get_unmapped_area(struct file *file,
+					   unsigned long addr,
+					   unsigned long len,
+					   unsigned long pgoff,
+					   unsigned long flags)
+{
+	if ((flags & MAP_TYPE) == MAP_PRIVATE)
+		return -EINVAL;
+
+	if (flags & MAP_FIXED)
+		return addr;
+
+	return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+}
+
+static const struct file_operations sgx_encl_fops = {
+	.owner			= THIS_MODULE,
+	.open			= sgx_open,
+	.release		= sgx_release,
+	.mmap			= sgx_mmap,
+	.get_unmapped_area	= sgx_get_unmapped_area,
+};
+
+static struct miscdevice sgx_dev_enclave = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "sgx_enclave",
+	.nodename = "sgx_enclave",
+	.fops = &sgx_encl_fops,
+};
+
+int __init sgx_drv_init(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SGX_LC))
+		return -ENODEV;
+
+	return misc_register(&sgx_dev_enclave);
+}
diff --git a/arch/x86/kernel/cpu/sgx/driver.h b/arch/x86/kernel/cpu/sgx/driver.h
new file mode 100644
index 000000000000..cda9c43b7543
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/driver.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ARCH_SGX_DRIVER_H__
+#define __ARCH_SGX_DRIVER_H__
+
+#include <crypto/hash.h>
+#include <linux/kref.h>
+#include <linux/mmu_notifier.h>
+#include <linux/radix-tree.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/workqueue.h>
+#include "sgx.h"
+
+int sgx_drv_init(void);
+
+#endif /* __ARCH_X86_SGX_DRIVER_H__ */
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
new file mode 100644
index 000000000000..d47caa106350
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -0,0 +1,188 @@
+// SPDX-License-Identifier: GPL-2.0
+/*  Copyright(c) 2016-20 Intel Corporation. */
+
+#include <linux/lockdep.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/shmem_fs.h>
+#include <linux/suspend.h>
+#include <linux/sched/mm.h>
+#include "arch.h"
+#include "encl.h"
+#include "encls.h"
+#include "sgx.h"
+
+static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
+						unsigned long addr,
+						unsigned long vm_flags)
+{
+	unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
+	struct sgx_encl_page *entry;
+
+	entry = xa_load(&encl->page_array, PFN_DOWN(addr));
+	if (!entry)
+		return ERR_PTR(-EFAULT);
+
+	/*
+	 * Verify that the faulted page has equal or higher build time
+	 * permissions than the VMA permissions (i.e. the subset of {VM_READ,
+	 * VM_WRITE, VM_EXECUTE} in vma->vm_flags).
+	 */
+	if ((entry->vm_max_prot_bits & vm_prot_bits) != vm_prot_bits)
+		return ERR_PTR(-EFAULT);
+
+	/* No page found. */
+	if (!entry->epc_page)
+		return ERR_PTR(-EFAULT);
+
+	/* Entry successfully located. */
+	return entry;
+}
+
+static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
+{
+	unsigned long addr = (unsigned long)vmf->address;
+	struct vm_area_struct *vma = vmf->vma;
+	struct sgx_encl_page *entry;
+	unsigned long phys_addr;
+	struct sgx_encl *encl;
+	vm_fault_t ret;
+
+	encl = vma->vm_private_data;
+
+	mutex_lock(&encl->lock);
+
+	entry = sgx_encl_load_page(encl, addr, vma->vm_flags);
+	if (IS_ERR(entry)) {
+		mutex_unlock(&encl->lock);
+
+		return VM_FAULT_SIGBUS;
+	}
+
+	phys_addr = sgx_get_epc_phys_addr(entry->epc_page);
+
+	ret = vmf_insert_pfn(vma, addr, PFN_DOWN(phys_addr));
+	if (ret != VM_FAULT_NOPAGE) {
+		mutex_unlock(&encl->lock);
+
+		return VM_FAULT_SIGBUS;
+	}
+
+	mutex_unlock(&encl->lock);
+
+	return VM_FAULT_NOPAGE;
+}
+
+/**
+ * sgx_encl_may_map() - Check if a requested VMA mapping is allowed
+ * @encl:		an enclave pointer
+ * @start:		lower bound of the address range, inclusive
+ * @end:		upper bound of the address range, exclusive
+ * @vm_flags:		VMA flags
+ *
+ * Iterate through the enclave pages contained within [@start, @end) to verify
+ * that the permissions requested by a subset of {VM_READ, VM_WRITE, VM_EXEC}
+ * does not contain any permissions that are not contained in the build time
+ * permissions of any of the enclave pages within the given address range.
+ *
+ * An enclave creator must declare the strongest permissions that will be
+ * needed for each enclave page  This ensures that mappings  have the identical
+ * or weaker permissions that the earlier declared permissions.
+ *
+ * Return: 0 on success, -EACCES otherwise
+ */
+int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
+		     unsigned long end, unsigned long vm_flags)
+{
+	unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
+	struct sgx_encl_page *page;
+	unsigned long count = 0;
+	int ret = 0;
+
+	XA_STATE(xas, &encl->page_array, PFN_DOWN(start));
+
+	/*
+	 * Disallow READ_IMPLIES_EXEC tasks as their VMA permissions might
+	 * conflict with the enclave page permissions.
+	 */
+	if (current->personality & READ_IMPLIES_EXEC)
+		return -EACCES;
+
+	mutex_lock(&encl->lock);
+	xas_lock(&xas);
+	xas_for_each(&xas, page, PFN_DOWN(end - 1)) {
+		if (!page)
+			break;
+
+		if (~page->vm_max_prot_bits & vm_prot_bits) {
+			ret = -EACCES;
+			break;
+		}
+
+		/* Reschedule on every XA_CHECK_SCHED iteration. */
+		if (!(++count % XA_CHECK_SCHED)) {
+			xas_pause(&xas);
+			xas_unlock(&xas);
+			mutex_unlock(&encl->lock);
+
+			cond_resched();
+
+			mutex_lock(&encl->lock);
+			xas_lock(&xas);
+		}
+	}
+	xas_unlock(&xas);
+	mutex_unlock(&encl->lock);
+
+	return ret;
+}
+
+static int sgx_vma_mprotect(struct vm_area_struct *vma,
+			    struct vm_area_struct **pprev, unsigned long start,
+			    unsigned long end, unsigned long newflags)
+{
+	int ret;
+
+	ret = sgx_encl_may_map(vma->vm_private_data, start, end, newflags);
+	if (ret)
+		return ret;
+
+	return mprotect_fixup(vma, pprev, start, end, newflags);
+}
+
+const struct vm_operations_struct sgx_vm_ops = {
+	.fault = sgx_vma_fault,
+	.mprotect = sgx_vma_mprotect,
+};
+
+/**
+ * sgx_encl_find - find an enclave
+ * @mm:		mm struct of the current process
+ * @addr:	address in the ELRANGE
+ * @vma:	the resulting VMA
+ *
+ * Find an enclave identified by the given address. Give back a VMA that is
+ * part of the enclave and located in that address. The VMA is given back if it
+ * is a proper enclave VMA even if an &sgx_encl instance does not exist yet
+ * (enclave creation has not been performed).
+ *
+ * Return:
+ *   0 on success,
+ *   -EINVAL if an enclave was not found,
+ *   -ENOENT if the enclave has not been created yet
+ */
+int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
+		  struct vm_area_struct **vma)
+{
+	struct vm_area_struct *result;
+	struct sgx_encl *encl;
+
+	result = find_vma(mm, addr);
+	if (!result || result->vm_ops != &sgx_vm_ops || addr < result->vm_start)
+		return -EINVAL;
+
+	encl = result->vm_private_data;
+	*vma = result;
+
+	return encl ? 0 : -ENOENT;
+}
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
new file mode 100644
index 000000000000..8eb34e95feda
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/**
+ * Copyright(c) 2016-20 Intel Corporation.
+ *
+ * Contains the software defined data structures for enclaves.
+ */
+#ifndef _X86_ENCL_H
+#define _X86_ENCL_H
+
+#include <linux/cpumask.h>
+#include <linux/kref.h>
+#include <linux/list.h>
+#include <linux/mm_types.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/srcu.h>
+#include <linux/workqueue.h>
+#include <linux/xarray.h>
+#include "sgx.h"
+
+struct sgx_encl_page {
+	unsigned long desc;
+	unsigned long vm_max_prot_bits;
+	struct sgx_epc_page *epc_page;
+	struct sgx_encl *encl;
+};
+
+struct sgx_encl {
+	unsigned long base;
+	unsigned long size;
+	unsigned int page_cnt;
+	unsigned int secs_child_cnt;
+	struct mutex lock;
+	struct xarray page_array;
+	struct sgx_encl_page secs;
+};
+
+extern const struct vm_operations_struct sgx_vm_ops;
+
+int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
+		  struct vm_area_struct **vma);
+int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
+		     unsigned long end, unsigned long vm_flags);
+
+#endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index b9ac438a13a4..c2740e0630d1 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -9,6 +9,8 @@
 #include <linux/sched/mm.h>
 #include <linux/sched/signal.h>
 #include <linux/slab.h>
+#include "driver.h"
+#include "encl.h"
 #include "encls.h"
 
 struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
@@ -229,9 +231,10 @@ static bool __init sgx_page_cache_init(void)
 
 static void __init sgx_init(void)
 {
+	int ret;
 	int i;
 
-	if (!boot_cpu_has(X86_FEATURE_SGX))
+	if (!cpu_feature_enabled(X86_FEATURE_SGX))
 		return;
 
 	if (!sgx_page_cache_init())
@@ -240,8 +243,15 @@ static void __init sgx_init(void)
 	if (!sgx_page_reclaimer_init())
 		goto err_page_cache;
 
+	ret = sgx_drv_init();
+	if (ret)
+		goto err_kthread;
+
 	return;
 
+err_kthread:
+	kthread_stop(ksgxswapd_tsk);
+
 err_page_cache:
 	for (i = 0; i < sgx_nr_epc_sections; i++) {
 		vfree(sgx_epc_sections[i].pages);
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v40 21/24] x86/sgx: Add a page reclaimer
       [not found] <20201104145430.300542-1-jarkko.sakkinen@linux.intel.com>
  2020-11-04 14:54 ` [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct Jarkko Sakkinen
  2020-11-04 14:54 ` [PATCH v40 11/24] x86/sgx: Add SGX misc driver interface Jarkko Sakkinen
@ 2020-11-04 14:54 ` Jarkko Sakkinen
  2020-11-08  3:56   ` Hillf Danton
  2020-11-04 14:54 ` [PATCH v40 22/24] x86/sgx: Add ptrace() support for the SGX driver Jarkko Sakkinen
  3 siblings, 1 reply; 33+ messages in thread
From: Jarkko Sakkinen @ 2020-11-04 14:54 UTC (permalink / raw)
  To: x86, linux-sgx
  Cc: linux-kernel, Jarkko Sakkinen, linux-mm, Jethro Beekman,
	Jordan Hand, Nathaniel McCallum, Chunyang Hui, Seth Moore,
	Sean Christopherson, akpm, andriy.shevchenko, asapek, bp,
	cedric.xing, chenalexchen, conradparker, cyhanish, dave.hansen,
	haitao.huang, kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman,
	puiterwijk, rientjes, tglx, yaozhangx, mikko.ylinen

Just like normal RAM, there is a limited amount of enclave memory available
and overcommitting it is a very valuable tool to reduce resource use.
Introduce a simple reclaim mechanism for enclave pages.

In contrast to normal page reclaim, the kernel cannot directly access
enclave memory.  To get around this, the SGX architecture provides a set of
functions to help.  Among other things, these functions copy enclave memory
to and from normal memory, encrypting it and protecting its integrity in
the process.

Implement a page reclaimer by using these functions. Picks victim pages in
LRU fashion from all the enclaves running in the system.  A new kernel
thread (ksgxswapd) reclaims pages in the background based on watermarks,
similar to normal kswapd.

All enclave pages can be reclaimed, architecturally.  But, there are some
limits on this, such as the special SECS metadata page which must be
reclaimed last.  The page version array (used to mitigate replaying old
reclaimed pages) is also architecturally reclaimable, but not yet
implemented.  The end result is that the vast majority of enclave pages are
currently reclaimable.

Cc: linux-mm@kvack.org
Acked-by: Jethro Beekman <jethro@fortanix.com>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Tested-by: Jordan Hand <jorhand@linux.microsoft.com>
Tested-by: Nathaniel McCallum <npmccallum@redhat.com>
Tested-by: Chunyang Hui <sanqian.hcy@antfin.com>
Tested-by: Seth Moore <sethmo@google.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
---
Changes from v39:
* Remove SGX_ENCL_DEAD checks from the page reclaimer. They are no
  longer needed as enclave contents are deleted only when it is
  released.

 arch/x86/kernel/cpu/sgx/driver.c |  59 ++--
 arch/x86/kernel/cpu/sgx/encl.c   | 483 ++++++++++++++++++++++++++++++-
 arch/x86/kernel/cpu/sgx/encl.h   |  51 ++++
 arch/x86/kernel/cpu/sgx/ioctl.c  |  89 +++++-
 arch/x86/kernel/cpu/sgx/main.c   | 466 +++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/sgx.h    |  13 +
 6 files changed, 1134 insertions(+), 27 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
index f618a04c4224..f2eac41bb4ff 100644
--- a/arch/x86/kernel/cpu/sgx/driver.c
+++ b/arch/x86/kernel/cpu/sgx/driver.c
@@ -17,13 +17,24 @@ u32 sgx_misc_reserved_mask;
 static int sgx_open(struct inode *inode, struct file *file)
 {
 	struct sgx_encl *encl;
+	int ret;
 
 	encl = kzalloc(sizeof(*encl), GFP_KERNEL);
 	if (!encl)
 		return -ENOMEM;
 
+	kref_init(&encl->refcount);
 	xa_init(&encl->page_array);
 	mutex_init(&encl->lock);
+	INIT_LIST_HEAD(&encl->va_pages);
+	INIT_LIST_HEAD(&encl->mm_list);
+	spin_lock_init(&encl->mm_lock);
+
+	ret = init_srcu_struct(&encl->srcu);
+	if (ret) {
+		kfree(encl);
+		return ret;
+	}
 
 	file->private_data = encl;
 
@@ -33,31 +44,37 @@ static int sgx_open(struct inode *inode, struct file *file)
 static int sgx_release(struct inode *inode, struct file *file)
 {
 	struct sgx_encl *encl = file->private_data;
-	struct sgx_encl_page *entry;
-	unsigned long index;
-
-	xa_for_each(&encl->page_array, index, entry) {
-		if (entry->epc_page) {
-			sgx_free_epc_page(entry->epc_page);
-			encl->secs_child_cnt--;
-			entry->epc_page = NULL;
+	struct sgx_encl_mm *encl_mm;
+
+	/*
+	 * Drain the remaining mm_list entries. At this point the list contains
+	 * entries for processes, which have closed the enclave file but have
+	 * not exited yet. The processes, which have exited, are gone from the
+	 * list by sgx_mmu_notifier_release().
+	 */
+	for ( ; ; )  {
+		spin_lock(&encl->mm_lock);
+
+		if (list_empty(&encl->mm_list)) {
+			encl_mm = NULL;
+		} else {
+			encl_mm = list_first_entry(&encl->mm_list,
+						   struct sgx_encl_mm, list);
+			list_del_rcu(&encl_mm->list);
 		}
 
-		kfree(entry);
-	}
+		spin_unlock(&encl->mm_lock);
 
-	xa_destroy(&encl->page_array);
+		/* The enclave is no longer mapped by any mm. */
+		if (!encl_mm)
+			break;
 
-	if (!encl->secs_child_cnt && encl->secs.epc_page) {
-		sgx_free_epc_page(encl->secs.epc_page);
-		encl->secs.epc_page = NULL;
+		synchronize_srcu(&encl->srcu);
+		mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
+		kfree(encl_mm);
 	}
 
-	/* Detect EPC page leak's. */
-	WARN_ON_ONCE(encl->secs_child_cnt);
-	WARN_ON_ONCE(encl->secs.epc_page);
-
-	kfree(encl);
+	kref_put(&encl->refcount, sgx_encl_release);
 	return 0;
 }
 
@@ -70,6 +87,10 @@ static int sgx_mmap(struct file *file, struct vm_area_struct *vma)
 	if (ret)
 		return ret;
 
+	ret = sgx_encl_mm_add(encl, vma->vm_mm);
+	if (ret)
+		return ret;
+
 	vma->vm_ops = &sgx_vm_ops;
 	vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO;
 	vma->vm_private_data = encl;
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index dfcc306d297c..328d5f61f1cc 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -12,11 +12,90 @@
 #include "encls.h"
 #include "sgx.h"
 
+/*
+ * ELDU: Load an EPC page as unblocked. For more info, see "OS Management of EPC
+ * Pages" in the SDM.
+ */
+static int __sgx_encl_eldu(struct sgx_encl_page *encl_page,
+			   struct sgx_epc_page *epc_page,
+			   struct sgx_epc_page *secs_page)
+{
+	unsigned long va_offset = encl_page->desc & SGX_ENCL_PAGE_VA_OFFSET_MASK;
+	struct sgx_encl *encl = encl_page->encl;
+	struct sgx_pageinfo pginfo;
+	struct sgx_backing b;
+	pgoff_t page_index;
+	int ret;
+
+	if (secs_page)
+		page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);
+	else
+		page_index = PFN_DOWN(encl->size);
+
+	ret = sgx_encl_get_backing(encl, page_index, &b);
+	if (ret)
+		return ret;
+
+	pginfo.addr = encl_page->desc & PAGE_MASK;
+	pginfo.contents = (unsigned long)kmap_atomic(b.contents);
+	pginfo.metadata = (unsigned long)kmap_atomic(b.pcmd) +
+			  b.pcmd_offset;
+
+	if (secs_page)
+		pginfo.secs = (u64)sgx_get_epc_virt_addr(secs_page);
+	else
+		pginfo.secs = 0;
+
+	ret = __eldu(&pginfo, sgx_get_epc_virt_addr(epc_page),
+		     sgx_get_epc_virt_addr(encl_page->va_page->epc_page) + va_offset);
+	if (ret) {
+		if (encls_failed(ret))
+			ENCLS_WARN(ret, "ELDU");
+
+		ret = -EFAULT;
+	}
+
+	kunmap_atomic((void *)(unsigned long)(pginfo.metadata - b.pcmd_offset));
+	kunmap_atomic((void *)(unsigned long)pginfo.contents);
+
+	sgx_encl_put_backing(&b, false);
+
+	return ret;
+}
+
+static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page,
+					  struct sgx_epc_page *secs_page)
+{
+
+	unsigned long va_offset = encl_page->desc & SGX_ENCL_PAGE_VA_OFFSET_MASK;
+	struct sgx_encl *encl = encl_page->encl;
+	struct sgx_epc_page *epc_page;
+	int ret;
+
+	epc_page = sgx_alloc_epc_page(encl_page, false);
+	if (IS_ERR(epc_page))
+		return epc_page;
+
+	ret = __sgx_encl_eldu(encl_page, epc_page, secs_page);
+	if (ret) {
+		sgx_free_epc_page(epc_page);
+		return ERR_PTR(ret);
+	}
+
+	sgx_free_va_slot(encl_page->va_page, va_offset);
+	list_move(&encl_page->va_page->list, &encl->va_pages);
+	encl_page->desc &= ~SGX_ENCL_PAGE_VA_OFFSET_MASK;
+	encl_page->epc_page = epc_page;
+
+	return epc_page;
+}
+
 static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
 						unsigned long addr,
 						unsigned long vm_flags)
 {
 	unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
+	struct sgx_epc_page *epc_page;
 	struct sgx_encl_page *entry;
 
 	entry = xa_load(&encl->page_array, PFN_DOWN(addr));
@@ -31,11 +110,27 @@ static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
 	if ((entry->vm_max_prot_bits & vm_prot_bits) != vm_prot_bits)
 		return ERR_PTR(-EFAULT);
 
-	/* No page found. */
-	if (!entry->epc_page)
-		return ERR_PTR(-EFAULT);
-
 	/* Entry successfully located. */
+	if (entry->epc_page) {
+		if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
+			return ERR_PTR(-EBUSY);
+
+		return entry;
+	}
+
+	if (!(encl->secs.epc_page)) {
+		epc_page = sgx_encl_eldu(&encl->secs, NULL);
+		if (IS_ERR(epc_page))
+			return ERR_CAST(epc_page);
+	}
+
+	epc_page = sgx_encl_eldu(entry, encl->secs.epc_page);
+	if (IS_ERR(epc_page))
+		return ERR_CAST(epc_page);
+
+	encl->secs_child_cnt++;
+	sgx_mark_page_reclaimable(entry->epc_page);
+
 	return entry;
 }
 
@@ -51,12 +146,23 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
 
 	encl = vma->vm_private_data;
 
+	/*
+	 * It's very unlikely but possible that allocating memory for the
+	 * mm_list entry of a forked process failed in sgx_vma_open(). When
+	 * this happens, vm_private_data is set to NULL.
+	 */
+	if (unlikely(!encl))
+		return VM_FAULT_SIGBUS;
+
 	mutex_lock(&encl->lock);
 
 	entry = sgx_encl_load_page(encl, addr, vma->vm_flags);
 	if (IS_ERR(entry)) {
 		mutex_unlock(&encl->lock);
 
+		if (PTR_ERR(entry) == -EBUSY)
+			return VM_FAULT_NOPAGE;
+
 		return VM_FAULT_SIGBUS;
 	}
 
@@ -76,11 +182,29 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
 		return VM_FAULT_SIGBUS;
 	}
 
+	sgx_encl_test_and_clear_young(vma->vm_mm, entry);
 	mutex_unlock(&encl->lock);
 
 	return VM_FAULT_NOPAGE;
 }
 
+static void sgx_vma_open(struct vm_area_struct *vma)
+{
+	struct sgx_encl *encl = vma->vm_private_data;
+
+	/*
+	 * It's possible but unlikely that vm_private_data is NULL. This can
+	 * happen in a grandchild of a process, when sgx_encl_mm_add() had
+	 * failed to allocate memory in this callback.
+	 */
+	if (unlikely(!encl))
+		return;
+
+	if (sgx_encl_mm_add(encl, vma->vm_mm))
+		vma->vm_private_data = NULL;
+}
+
+
 /**
  * sgx_encl_may_map() - Check if a requested VMA mapping is allowed
  * @encl:		an enclave pointer
@@ -161,6 +285,7 @@ static int sgx_vma_mprotect(struct vm_area_struct *vma,
 const struct vm_operations_struct sgx_vm_ops = {
 	.fault = sgx_vma_fault,
 	.mprotect = sgx_vma_mprotect,
+	.open = sgx_vma_open,
 };
 
 /**
@@ -194,3 +319,353 @@ int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
 
 	return encl ? 0 : -ENOENT;
 }
+
+/**
+ * sgx_encl_release - Destroy an enclave instance
+ * @kref:	address of a kref inside &sgx_encl
+ *
+ * Used together with kref_put(). Frees all the resources associated with the
+ * enclave and the instance itself.
+ */
+void sgx_encl_release(struct kref *ref)
+{
+	struct sgx_encl *encl = container_of(ref, struct sgx_encl, refcount);
+	struct sgx_va_page *va_page;
+	struct sgx_encl_page *entry;
+	unsigned long index;
+
+	xa_for_each(&encl->page_array, index, entry) {
+		if (entry->epc_page) {
+			/*
+			 * The page and its radix tree entry cannot be freed
+			 * if the page is being held by the reclaimer.
+			 */
+			if (sgx_unmark_page_reclaimable(entry->epc_page))
+				continue;
+
+			sgx_free_epc_page(entry->epc_page);
+			encl->secs_child_cnt--;
+			entry->epc_page = NULL;
+		}
+
+		kfree(entry);
+	}
+
+	xa_destroy(&encl->page_array);
+
+	if (!encl->secs_child_cnt && encl->secs.epc_page) {
+		sgx_free_epc_page(encl->secs.epc_page);
+		encl->secs.epc_page = NULL;
+	}
+
+	while (!list_empty(&encl->va_pages)) {
+		va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
+					   list);
+		list_del(&va_page->list);
+		sgx_free_epc_page(va_page->epc_page);
+		kfree(va_page);
+	}
+
+	if (encl->backing)
+		fput(encl->backing);
+
+	cleanup_srcu_struct(&encl->srcu);
+
+	WARN_ON_ONCE(!list_empty(&encl->mm_list));
+
+	/* Detect EPC page leak's. */
+	WARN_ON_ONCE(encl->secs_child_cnt);
+	WARN_ON_ONCE(encl->secs.epc_page);
+
+	kfree(encl);
+}
+
+/*
+ * 'mm' is exiting and no longer needs mmu notifications.
+ */
+static void sgx_mmu_notifier_release(struct mmu_notifier *mn,
+				     struct mm_struct *mm)
+{
+	struct sgx_encl_mm *encl_mm = container_of(mn, struct sgx_encl_mm, mmu_notifier);
+	struct sgx_encl_mm *tmp = NULL;
+
+	/*
+	 * The enclave itself can remove encl_mm.  Note, objects can't be moved
+	 * off an RCU protected list, but deletion is ok.
+	 */
+	spin_lock(&encl_mm->encl->mm_lock);
+	list_for_each_entry(tmp, &encl_mm->encl->mm_list, list) {
+		if (tmp == encl_mm) {
+			list_del_rcu(&encl_mm->list);
+			break;
+		}
+	}
+	spin_unlock(&encl_mm->encl->mm_lock);
+
+	if (tmp == encl_mm) {
+		synchronize_srcu(&encl_mm->encl->srcu);
+		mmu_notifier_put(mn);
+	}
+}
+
+static void sgx_mmu_notifier_free(struct mmu_notifier *mn)
+{
+	struct sgx_encl_mm *encl_mm = container_of(mn, struct sgx_encl_mm, mmu_notifier);
+
+	kfree(encl_mm);
+}
+
+static const struct mmu_notifier_ops sgx_mmu_notifier_ops = {
+	.release		= sgx_mmu_notifier_release,
+	.free_notifier		= sgx_mmu_notifier_free,
+};
+
+static struct sgx_encl_mm *sgx_encl_find_mm(struct sgx_encl *encl,
+					    struct mm_struct *mm)
+{
+	struct sgx_encl_mm *encl_mm = NULL;
+	struct sgx_encl_mm *tmp;
+	int idx;
+
+	idx = srcu_read_lock(&encl->srcu);
+
+	list_for_each_entry_rcu(tmp, &encl->mm_list, list) {
+		if (tmp->mm == mm) {
+			encl_mm = tmp;
+			break;
+		}
+	}
+
+	srcu_read_unlock(&encl->srcu, idx);
+
+	return encl_mm;
+}
+
+int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
+{
+	struct sgx_encl_mm *encl_mm;
+	int ret;
+
+	/*
+	 * Even though a single enclave may be mapped into an mm more than once,
+	 * each 'mm' only appears once on encl->mm_list. This is guaranteed by
+	 * holding the mm's mmap lock for write before an mm can be added or
+	 * remove to an encl->mm_list.
+	 */
+	mmap_assert_write_locked(mm);
+
+	/*
+	 * It's possible that an entry already exists in the mm_list, because it
+	 * is removed only on VFS release or process exit.
+	 */
+	if (sgx_encl_find_mm(encl, mm))
+		return 0;
+
+	encl_mm = kzalloc(sizeof(*encl_mm), GFP_KERNEL);
+	if (!encl_mm)
+		return -ENOMEM;
+
+	encl_mm->encl = encl;
+	encl_mm->mm = mm;
+	encl_mm->mmu_notifier.ops = &sgx_mmu_notifier_ops;
+
+	ret = __mmu_notifier_register(&encl_mm->mmu_notifier, mm);
+	if (ret) {
+		kfree(encl_mm);
+		return ret;
+	}
+
+	spin_lock(&encl->mm_lock);
+	list_add_rcu(&encl_mm->list, &encl->mm_list);
+	/* Pairs with smp_rmb() in sgx_reclaimer_block(). */
+	smp_wmb();
+	encl->mm_list_version++;
+	spin_unlock(&encl->mm_lock);
+
+	return 0;
+}
+
+static struct page *sgx_encl_get_backing_page(struct sgx_encl *encl,
+					      pgoff_t index)
+{
+	struct inode *inode = encl->backing->f_path.dentry->d_inode;
+	struct address_space *mapping = inode->i_mapping;
+	gfp_t gfpmask = mapping_gfp_mask(mapping);
+
+	return shmem_read_mapping_page_gfp(mapping, index, gfpmask);
+}
+
+/**
+ * sgx_encl_get_backing() - Pin the backing storage
+ * @encl:	an enclave pointer
+ * @page_index:	enclave page index
+ * @backing:	data for accessing backing storage for the page
+ *
+ * Pin the backing storage pages for storing the encrypted contents and Paging
+ * Crypto MetaData (PCMD) of an enclave page.
+ *
+ * Return:
+ *   0 on success,
+ *   -errno otherwise.
+ */
+int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_index,
+			 struct sgx_backing *backing)
+{
+	pgoff_t pcmd_index = PFN_DOWN(encl->size) + 1 + (page_index >> 5);
+	struct page *contents;
+	struct page *pcmd;
+
+	contents = sgx_encl_get_backing_page(encl, page_index);
+	if (IS_ERR(contents))
+		return PTR_ERR(contents);
+
+	pcmd = sgx_encl_get_backing_page(encl, pcmd_index);
+	if (IS_ERR(pcmd)) {
+		put_page(contents);
+		return PTR_ERR(pcmd);
+	}
+
+	backing->page_index = page_index;
+	backing->contents = contents;
+	backing->pcmd = pcmd;
+	backing->pcmd_offset =
+		(page_index & (PAGE_SIZE / sizeof(struct sgx_pcmd) - 1)) *
+		sizeof(struct sgx_pcmd);
+
+	return 0;
+}
+
+/**
+ * sgx_encl_put_backing() - Unpin the backing storage
+ * @backing:	data for accessing backing storage for the page
+ * @do_write:	mark pages dirty
+ */
+void sgx_encl_put_backing(struct sgx_backing *backing, bool do_write)
+{
+	if (do_write) {
+		set_page_dirty(backing->pcmd);
+		set_page_dirty(backing->contents);
+	}
+
+	put_page(backing->pcmd);
+	put_page(backing->contents);
+}
+
+static int sgx_encl_test_and_clear_young_cb(pte_t *ptep, unsigned long addr,
+					    void *data)
+{
+	pte_t pte;
+	int ret;
+
+	ret = pte_young(*ptep);
+	if (ret) {
+		pte = pte_mkold(*ptep);
+		set_pte_at((struct mm_struct *)data, addr, ptep, pte);
+	}
+
+	return ret;
+}
+
+/**
+ * sgx_encl_test_and_clear_young() - Test and reset the accessed bit
+ * @mm:		mm_struct that is checked
+ * @page:	enclave page to be tested for recent access
+ *
+ * Checks the Access (A) bit from the PTE corresponding to the enclave page and
+ * clears it.
+ *
+ * Return: 1 if the page has been recently accessed and 0 if not.
+ */
+int sgx_encl_test_and_clear_young(struct mm_struct *mm,
+				  struct sgx_encl_page *page)
+{
+	unsigned long addr = page->desc & PAGE_MASK;
+	struct sgx_encl *encl = page->encl;
+	struct vm_area_struct *vma;
+	int ret;
+
+	ret = sgx_encl_find(mm, addr, &vma);
+	if (ret)
+		return 0;
+
+	if (encl != vma->vm_private_data)
+		return 0;
+
+	ret = apply_to_page_range(vma->vm_mm, addr, PAGE_SIZE,
+				  sgx_encl_test_and_clear_young_cb, vma->vm_mm);
+	if (ret < 0)
+		return 0;
+
+	return ret;
+}
+
+/**
+ * sgx_alloc_va_page() - Allocate a Version Array (VA) page
+ *
+ * Allocate a free EPC page and convert it to a Version Array (VA) page.
+ *
+ * Return:
+ *   a VA page,
+ *   -errno otherwise
+ */
+struct sgx_epc_page *sgx_alloc_va_page(void)
+{
+	struct sgx_epc_page *epc_page;
+	int ret;
+
+	epc_page = sgx_alloc_epc_page(NULL, true);
+	if (IS_ERR(epc_page))
+		return ERR_CAST(epc_page);
+
+	ret = __epa(sgx_get_epc_virt_addr(epc_page));
+	if (ret) {
+		WARN_ONCE(1, "EPA returned %d (0x%x)", ret, ret);
+		sgx_free_epc_page(epc_page);
+		return ERR_PTR(-EFAULT);
+	}
+
+	return epc_page;
+}
+
+/**
+ * sgx_alloc_va_slot - allocate a VA slot
+ * @va_page:	a &struct sgx_va_page instance
+ *
+ * Allocates a slot from a &struct sgx_va_page instance.
+ *
+ * Return: offset of the slot inside the VA page
+ */
+unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page)
+{
+	int slot = find_first_zero_bit(va_page->slots, SGX_VA_SLOT_COUNT);
+
+	if (slot < SGX_VA_SLOT_COUNT)
+		set_bit(slot, va_page->slots);
+
+	return slot << 3;
+}
+
+/**
+ * sgx_free_va_slot - free a VA slot
+ * @va_page:	a &struct sgx_va_page instance
+ * @offset:	offset of the slot inside the VA page
+ *
+ * Frees a slot from a &struct sgx_va_page instance.
+ */
+void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset)
+{
+	clear_bit(offset >> 3, va_page->slots);
+}
+
+/**
+ * sgx_va_page_full - is the VA page full?
+ * @va_page:	a &struct sgx_va_page instance
+ *
+ * Return: true if all slots have been taken
+ */
+bool sgx_va_page_full(struct sgx_va_page *va_page)
+{
+	int slot = find_first_zero_bit(va_page->slots, SGX_VA_SLOT_COUNT);
+
+	return slot == SGX_VA_SLOT_COUNT;
+}
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index cb7495854095..244e1d93fce2 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -19,11 +19,18 @@
 #include <linux/xarray.h>
 #include "sgx.h"
 
+/* 'desc' bits holding the offset in the VA (version array) page. */
+#define SGX_ENCL_PAGE_VA_OFFSET_MASK	GENMASK_ULL(11, 3)
+
+/* 'desc' bit marking that the page is being reclaimed. */
+#define SGX_ENCL_PAGE_BEING_RECLAIMED	BIT(3)
+
 struct sgx_encl_page {
 	unsigned long desc;
 	unsigned long vm_max_prot_bits;
 	struct sgx_epc_page *epc_page;
 	struct sgx_encl *encl;
+	struct sgx_va_page *va_page;
 };
 
 enum sgx_encl_flags {
@@ -33,6 +40,13 @@ enum sgx_encl_flags {
 	SGX_ENCL_INITIALIZED	= BIT(3),
 };
 
+struct sgx_encl_mm {
+	struct sgx_encl *encl;
+	struct mm_struct *mm;
+	struct list_head list;
+	struct mmu_notifier mmu_notifier;
+};
+
 struct sgx_encl {
 	unsigned long base;
 	unsigned long size;
@@ -44,6 +58,30 @@ struct sgx_encl {
 	struct sgx_encl_page secs;
 	unsigned long attributes;
 	unsigned long attributes_mask;
+
+	cpumask_t cpumask;
+	struct file *backing;
+	struct kref refcount;
+	struct list_head va_pages;
+	unsigned long mm_list_version;
+	struct list_head mm_list;
+	spinlock_t mm_lock;
+	struct srcu_struct srcu;
+};
+
+#define SGX_VA_SLOT_COUNT 512
+
+struct sgx_va_page {
+	struct sgx_epc_page *epc_page;
+	DECLARE_BITMAP(slots, SGX_VA_SLOT_COUNT);
+	struct list_head list;
+};
+
+struct sgx_backing {
+	pgoff_t page_index;
+	struct page *contents;
+	struct page *pcmd;
+	unsigned long pcmd_offset;
 };
 
 extern const struct vm_operations_struct sgx_vm_ops;
@@ -53,4 +91,17 @@ int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
 int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
 		     unsigned long end, unsigned long vm_flags);
 
+void sgx_encl_release(struct kref *ref);
+int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm);
+int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_index,
+			 struct sgx_backing *backing);
+void sgx_encl_put_backing(struct sgx_backing *backing, bool do_write);
+int sgx_encl_test_and_clear_young(struct mm_struct *mm,
+				  struct sgx_encl_page *page);
+
+struct sgx_epc_page *sgx_alloc_va_page(void);
+unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page);
+void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset);
+bool sgx_va_page_full(struct sgx_va_page *va_page);
+
 #endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 0ba0e670e2f0..6d37117ac8a0 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -16,20 +16,77 @@
 #include "encl.h"
 #include "encls.h"
 
+static struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl)
+{
+	struct sgx_va_page *va_page = NULL;
+	void *err;
+
+	BUILD_BUG_ON(SGX_VA_SLOT_COUNT !=
+		(SGX_ENCL_PAGE_VA_OFFSET_MASK >> 3) + 1);
+
+	if (!(encl->page_cnt % SGX_VA_SLOT_COUNT)) {
+		va_page = kzalloc(sizeof(*va_page), GFP_KERNEL);
+		if (!va_page)
+			return ERR_PTR(-ENOMEM);
+
+		va_page->epc_page = sgx_alloc_va_page();
+		if (IS_ERR(va_page->epc_page)) {
+			err = ERR_CAST(va_page->epc_page);
+			kfree(va_page);
+			return err;
+		}
+
+		WARN_ON_ONCE(encl->page_cnt % SGX_VA_SLOT_COUNT);
+	}
+	encl->page_cnt++;
+	return va_page;
+}
+
+static void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page)
+{
+	encl->page_cnt--;
+
+	if (va_page) {
+		sgx_free_epc_page(va_page->epc_page);
+		list_del(&va_page->list);
+		kfree(va_page);
+	}
+}
+
 static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
 {
 	struct sgx_epc_page *secs_epc;
+	struct sgx_va_page *va_page;
 	struct sgx_pageinfo pginfo;
 	struct sgx_secinfo secinfo;
 	unsigned long encl_size;
+	struct file *backing;
 	long ret;
 
+	va_page = sgx_encl_grow(encl);
+	if (IS_ERR(va_page))
+		return PTR_ERR(va_page);
+	else if (va_page)
+		list_add(&va_page->list, &encl->va_pages);
+	/* else the tail page of the VA page list had free slots. */
+
 	/* The extra page goes to SECS. */
 	encl_size = secs->size + PAGE_SIZE;
 
-	secs_epc = __sgx_alloc_epc_page();
-	if (IS_ERR(secs_epc))
-		return PTR_ERR(secs_epc);
+	backing = shmem_file_setup("SGX backing", encl_size + (encl_size >> 5),
+				   VM_NORESERVE);
+	if (IS_ERR(backing)) {
+		ret = PTR_ERR(backing);
+		goto err_out_shrink;
+	}
+
+	encl->backing = backing;
+
+	secs_epc = sgx_alloc_epc_page(&encl->secs, true);
+	if (IS_ERR(secs_epc)) {
+		ret = PTR_ERR(secs_epc);
+		goto err_out_backing;
+	}
 
 	encl->secs.epc_page = secs_epc;
 
@@ -63,6 +120,13 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
 	sgx_free_epc_page(encl->secs.epc_page);
 	encl->secs.epc_page = NULL;
 
+err_out_backing:
+	fput(encl->backing);
+	encl->backing = NULL;
+
+err_out_shrink:
+	sgx_encl_shrink(encl, va_page);
+
 	return ret;
 }
 
@@ -228,21 +292,35 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
 {
 	struct sgx_encl_page *encl_page;
 	struct sgx_epc_page *epc_page;
+	struct sgx_va_page *va_page;
 	int ret;
 
 	encl_page = sgx_encl_page_alloc(encl, offset, secinfo->flags);
 	if (IS_ERR(encl_page))
 		return PTR_ERR(encl_page);
 
-	epc_page = __sgx_alloc_epc_page();
+	epc_page = sgx_alloc_epc_page(encl_page, true);
 	if (IS_ERR(epc_page)) {
 		kfree(encl_page);
 		return PTR_ERR(epc_page);
 	}
 
+	va_page = sgx_encl_grow(encl);
+	if (IS_ERR(va_page)) {
+		ret = PTR_ERR(va_page);
+		goto err_out_free;
+	}
+
 	mmap_read_lock(current->mm);
 	mutex_lock(&encl->lock);
 
+	/*
+	 * Adding to encl->va_pages must be done under encl->lock.  Ditto for
+	 * deleting (via sgx_encl_shrink()) in the error path.
+	 */
+	if (va_page)
+		list_add(&va_page->list, &encl->va_pages);
+
 	/*
 	 * Insert prior to EADD in case of OOM.  EADD modifies MRENCLAVE, i.e.
 	 * can't be gracefully unwound, while failure on EADD/EXTEND is limited
@@ -273,6 +351,7 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
 			goto err_out;
 	}
 
+	sgx_mark_page_reclaimable(encl_page->epc_page);
 	mutex_unlock(&encl->lock);
 	mmap_read_unlock(current->mm);
 	return ret;
@@ -281,9 +360,11 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
 	xa_erase(&encl->page_array, PFN_DOWN(encl_page->desc));
 
 err_out_unlock:
+	sgx_encl_shrink(encl, va_page);
 	mutex_unlock(&encl->lock);
 	mmap_read_unlock(current->mm);
 
+err_out_free:
 	sgx_free_epc_page(epc_page);
 	kfree(encl_page);
 
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index c2740e0630d1..1588dad06bab 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -16,6 +16,15 @@
 struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
 static int sgx_nr_epc_sections;
 static struct task_struct *ksgxswapd_tsk;
+static DECLARE_WAIT_QUEUE_HEAD(ksgxswapd_waitq);
+
+/*
+ * These variables are part of the state of the reclaimer, and must be accessed
+ * with sgx_reclaimer_lock acquired.
+ */
+static LIST_HEAD(sgx_active_page_list);
+
+static DEFINE_SPINLOCK(sgx_reclaimer_lock);
 
 /*
  * Reset dirty EPC pages to uninitialized state. Laundry can be left with SECS
@@ -50,6 +59,348 @@ static void sgx_sanitize_section(struct sgx_epc_section *section)
 	list_splice(&dirty, &section->laundry_list);
 }
 
+static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page)
+{
+	struct sgx_encl_page *page = epc_page->owner;
+	struct sgx_encl *encl = page->encl;
+	struct sgx_encl_mm *encl_mm;
+	bool ret = true;
+	int idx;
+
+	idx = srcu_read_lock(&encl->srcu);
+
+	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
+		if (!mmget_not_zero(encl_mm->mm))
+			continue;
+
+		mmap_read_lock(encl_mm->mm);
+		ret = !sgx_encl_test_and_clear_young(encl_mm->mm, page);
+		mmap_read_unlock(encl_mm->mm);
+
+		mmput_async(encl_mm->mm);
+
+		if (!ret)
+			break;
+	}
+
+	srcu_read_unlock(&encl->srcu, idx);
+
+	if (!ret)
+		return false;
+
+	return true;
+}
+
+static void sgx_reclaimer_block(struct sgx_epc_page *epc_page)
+{
+	struct sgx_encl_page *page = epc_page->owner;
+	unsigned long addr = page->desc & PAGE_MASK;
+	struct sgx_encl *encl = page->encl;
+	unsigned long mm_list_version;
+	struct sgx_encl_mm *encl_mm;
+	struct vm_area_struct *vma;
+	int idx, ret;
+
+	do {
+		mm_list_version = encl->mm_list_version;
+
+		/* Pairs with smp_rmb() in sgx_encl_mm_add(). */
+		smp_rmb();
+
+		idx = srcu_read_lock(&encl->srcu);
+
+		list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
+			if (!mmget_not_zero(encl_mm->mm))
+				continue;
+
+			mmap_read_lock(encl_mm->mm);
+
+			ret = sgx_encl_find(encl_mm->mm, addr, &vma);
+			if (!ret && encl == vma->vm_private_data)
+				zap_vma_ptes(vma, addr, PAGE_SIZE);
+
+			mmap_read_unlock(encl_mm->mm);
+
+			mmput_async(encl_mm->mm);
+		}
+
+		srcu_read_unlock(&encl->srcu, idx);
+	} while (unlikely(encl->mm_list_version != mm_list_version));
+
+	mutex_lock(&encl->lock);
+
+	ret = __eblock(sgx_get_epc_virt_addr(epc_page));
+	if (encls_failed(ret))
+		ENCLS_WARN(ret, "EBLOCK");
+
+	mutex_unlock(&encl->lock);
+}
+
+static int __sgx_encl_ewb(struct sgx_epc_page *epc_page, void *va_slot,
+			  struct sgx_backing *backing)
+{
+	struct sgx_pageinfo pginfo;
+	int ret;
+
+	pginfo.addr = 0;
+	pginfo.secs = 0;
+
+	pginfo.contents = (unsigned long)kmap_atomic(backing->contents);
+	pginfo.metadata = (unsigned long)kmap_atomic(backing->pcmd) +
+			  backing->pcmd_offset;
+
+	ret = __ewb(&pginfo, sgx_get_epc_virt_addr(epc_page), va_slot);
+
+	kunmap_atomic((void *)(unsigned long)(pginfo.metadata -
+					      backing->pcmd_offset));
+	kunmap_atomic((void *)(unsigned long)pginfo.contents);
+
+	return ret;
+}
+
+static void sgx_ipi_cb(void *info)
+{
+}
+
+static const cpumask_t *sgx_encl_ewb_cpumask(struct sgx_encl *encl)
+{
+	cpumask_t *cpumask = &encl->cpumask;
+	struct sgx_encl_mm *encl_mm;
+	int idx;
+
+	/*
+	 * Can race with sgx_encl_mm_add(), but ETRACK has already been
+	 * executed, which means that the CPUs running in the new mm will enter
+	 * into the enclave with a fresh epoch.
+	 */
+	cpumask_clear(cpumask);
+
+	idx = srcu_read_lock(&encl->srcu);
+
+	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
+		if (!mmget_not_zero(encl_mm->mm))
+			continue;
+
+		cpumask_or(cpumask, cpumask, mm_cpumask(encl_mm->mm));
+
+		mmput_async(encl_mm->mm);
+	}
+
+	srcu_read_unlock(&encl->srcu, idx);
+
+	return cpumask;
+}
+
+/*
+ * Swap page to the regular memory transformed to the blocked state by using
+ * EBLOCK, which means that it can no loger be referenced (no new TLB entries).
+ *
+ * The first trial just tries to write the page assuming that some other thread
+ * has reset the count for threads inside the enlave by using ETRACK, and
+ * previous thread count has been zeroed out. The second trial calls ETRACK
+ * before EWB. If that fails we kick all the HW threads out, and then do EWB,
+ * which should be guaranteed the succeed.
+ */
+static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
+			 struct sgx_backing *backing)
+{
+	struct sgx_encl_page *encl_page = epc_page->owner;
+	struct sgx_encl *encl = encl_page->encl;
+	struct sgx_va_page *va_page;
+	unsigned int va_offset;
+	void *va_slot;
+	int ret;
+
+	encl_page->desc &= ~SGX_ENCL_PAGE_BEING_RECLAIMED;
+
+	va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
+				   list);
+	va_offset = sgx_alloc_va_slot(va_page);
+	va_slot = sgx_get_epc_virt_addr(va_page->epc_page) + va_offset;
+	if (sgx_va_page_full(va_page))
+		list_move_tail(&va_page->list, &encl->va_pages);
+
+	ret = __sgx_encl_ewb(epc_page, va_slot, backing);
+	if (ret == SGX_NOT_TRACKED) {
+		ret = __etrack(sgx_get_epc_virt_addr(encl->secs.epc_page));
+		if (ret) {
+			if (encls_failed(ret))
+				ENCLS_WARN(ret, "ETRACK");
+		}
+
+		ret = __sgx_encl_ewb(epc_page, va_slot, backing);
+		if (ret == SGX_NOT_TRACKED) {
+			/*
+			 * Slow path, send IPIs to kick cpus out of the
+			 * enclave.  Note, it's imperative that the cpu
+			 * mask is generated *after* ETRACK, else we'll
+			 * miss cpus that entered the enclave between
+			 * generating the mask and incrementing epoch.
+			 */
+			on_each_cpu_mask(sgx_encl_ewb_cpumask(encl),
+					 sgx_ipi_cb, NULL, 1);
+			ret = __sgx_encl_ewb(epc_page, va_slot, backing);
+		}
+	}
+
+	if (ret) {
+		if (encls_failed(ret))
+			ENCLS_WARN(ret, "EWB");
+
+		sgx_free_va_slot(va_page, va_offset);
+	} else {
+		encl_page->desc |= va_offset;
+		encl_page->va_page = va_page;
+	}
+}
+
+static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
+				struct sgx_backing *backing)
+{
+	struct sgx_encl_page *encl_page = epc_page->owner;
+	struct sgx_encl *encl = encl_page->encl;
+	struct sgx_backing secs_backing;
+	int ret;
+
+	mutex_lock(&encl->lock);
+
+	sgx_encl_ewb(epc_page, backing);
+	encl_page->epc_page = NULL;
+	encl->secs_child_cnt--;
+
+	if (!encl->secs_child_cnt && test_bit(SGX_ENCL_INITIALIZED, &encl->flags)) {
+		ret = sgx_encl_get_backing(encl, PFN_DOWN(encl->size),
+					   &secs_backing);
+		if (ret)
+			goto out;
+
+		sgx_encl_ewb(encl->secs.epc_page, &secs_backing);
+
+		sgx_free_epc_page(encl->secs.epc_page);
+		encl->secs.epc_page = NULL;
+
+		sgx_encl_put_backing(&secs_backing, true);
+	}
+
+out:
+	mutex_unlock(&encl->lock);
+}
+
+/*
+ * Take a fixed number of pages from the head of the active page pool and
+ * reclaim them to the enclave's private shmem files. Skip the pages, which have
+ * been accessed since the last scan. Move those pages to the tail of active
+ * page pool so that the pages get scanned in LRU like fashion.
+ *
+ * Batch process a chunk of pages (at the moment 16) in order to degrade amount
+ * of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit
+ * among the HW threads with three stage EWB pipeline (EWB, ETRACK + EWB and IPI
+ * + EWB) but not sufficiently. Reclaiming one page at a time would also be
+ * problematic as it would increase the lock contention too much, which would
+ * halt forward progress.
+ */
+static void sgx_reclaim_pages(void)
+{
+	struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
+	struct sgx_backing backing[SGX_NR_TO_SCAN];
+	struct sgx_epc_section *section;
+	struct sgx_encl_page *encl_page;
+	struct sgx_epc_page *epc_page;
+	pgoff_t page_index;
+	int cnt = 0;
+	int ret;
+	int i;
+
+	spin_lock(&sgx_reclaimer_lock);
+	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
+		if (list_empty(&sgx_active_page_list))
+			break;
+
+		epc_page = list_first_entry(&sgx_active_page_list,
+					    struct sgx_epc_page, list);
+		list_del_init(&epc_page->list);
+		encl_page = epc_page->owner;
+
+		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0)
+			chunk[cnt++] = epc_page;
+		else
+			/* The owner is freeing the page. No need to add the
+			 * page back to the list of reclaimable pages.
+			 */
+			epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+	}
+	spin_unlock(&sgx_reclaimer_lock);
+
+	for (i = 0; i < cnt; i++) {
+		epc_page = chunk[i];
+		encl_page = epc_page->owner;
+
+		if (!sgx_reclaimer_age(epc_page))
+			goto skip;
+
+		page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);
+		ret = sgx_encl_get_backing(encl_page->encl, page_index, &backing[i]);
+		if (ret)
+			goto skip;
+
+		mutex_lock(&encl_page->encl->lock);
+		encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED;
+		mutex_unlock(&encl_page->encl->lock);
+		continue;
+
+skip:
+		spin_lock(&sgx_reclaimer_lock);
+		list_add_tail(&epc_page->list, &sgx_active_page_list);
+		spin_unlock(&sgx_reclaimer_lock);
+
+		kref_put(&encl_page->encl->refcount, sgx_encl_release);
+
+		chunk[i] = NULL;
+	}
+
+	for (i = 0; i < cnt; i++) {
+		epc_page = chunk[i];
+		if (epc_page)
+			sgx_reclaimer_block(epc_page);
+	}
+
+	for (i = 0; i < cnt; i++) {
+		epc_page = chunk[i];
+		if (!epc_page)
+			continue;
+
+		encl_page = epc_page->owner;
+		sgx_reclaimer_write(epc_page, &backing[i]);
+		sgx_encl_put_backing(&backing[i], true);
+
+		kref_put(&encl_page->encl->refcount, sgx_encl_release);
+		epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+
+		section = &sgx_epc_sections[epc_page->section];
+		spin_lock(&section->lock);
+		list_add_tail(&epc_page->list, &section->page_list);
+		section->free_cnt++;
+		spin_unlock(&section->lock);
+	}
+}
+
+static unsigned long sgx_nr_free_pages(void)
+{
+	unsigned long cnt = 0;
+	int i;
+
+	for (i = 0; i < sgx_nr_epc_sections; i++)
+		cnt += sgx_epc_sections[i].free_cnt;
+
+	return cnt;
+}
+
+static bool sgx_should_reclaim(unsigned long watermark)
+{
+	return sgx_nr_free_pages() < watermark &&
+	       !list_empty(&sgx_active_page_list);
+}
+
 static int ksgxswapd(void *p)
 {
 	int i;
@@ -71,6 +422,20 @@ static int ksgxswapd(void *p)
 			WARN(1, "EPC section %d has unsanitized pages.\n", i);
 	}
 
+	while (!kthread_should_stop()) {
+		if (try_to_freeze())
+			continue;
+
+		wait_event_freezable(ksgxswapd_waitq,
+				     kthread_should_stop() ||
+				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
+
+		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
+			sgx_reclaim_pages();
+
+		cond_resched();
+	}
+
 	return 0;
 }
 
@@ -96,6 +461,7 @@ static struct sgx_epc_page *__sgx_alloc_epc_page_from_section(struct sgx_epc_sec
 
 	page = list_first_entry(&section->page_list, struct sgx_epc_page, list);
 	list_del_init(&page->list);
+	section->free_cnt--;
 
 	return page;
 }
@@ -129,6 +495,100 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
 	return ERR_PTR(-ENOMEM);
 }
 
+/**
+ * sgx_mark_page_reclaimable() - Mark a page as reclaimable
+ * @page:	EPC page
+ *
+ * Mark a page as reclaimable and add it to the active page list. Pages
+ * are automatically removed from the active list when freed.
+ */
+void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
+{
+	spin_lock(&sgx_reclaimer_lock);
+	page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
+	list_add_tail(&page->list, &sgx_active_page_list);
+	spin_unlock(&sgx_reclaimer_lock);
+}
+
+/**
+ * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list
+ * @page:	EPC page
+ *
+ * Clear the reclaimable flag and remove the page from the active page list.
+ *
+ * Return:
+ *   0 on success,
+ *   -EBUSY if the page is in the process of being reclaimed
+ */
+int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
+{
+	spin_lock(&sgx_reclaimer_lock);
+	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
+		/* The page is being reclaimed. */
+		if (list_empty(&page->list)) {
+			spin_unlock(&sgx_reclaimer_lock);
+			return -EBUSY;
+		}
+
+		list_del(&page->list);
+		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+	}
+	spin_unlock(&sgx_reclaimer_lock);
+
+	return 0;
+}
+
+/**
+ * sgx_alloc_epc_page() - Allocate an EPC page
+ * @owner:	the owner of the EPC page
+ * @reclaim:	reclaim pages if necessary
+ *
+ * Iterate through EPC sections and borrow a free EPC page to the caller. When a
+ * page is no longer needed it must be released with sgx_free_epc_page(). If
+ * @reclaim is set to true, directly reclaim pages when we are out of pages. No
+ * mm's can be locked when @reclaim is set to true.
+ *
+ * Finally, wake up ksgxswapd when the number of pages goes below the watermark
+ * before returning back to the caller.
+ *
+ * Return:
+ *   an EPC page,
+ *   -errno on error
+ */
+struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
+{
+	struct sgx_epc_page *entry;
+
+	for ( ; ; ) {
+		entry = __sgx_alloc_epc_page();
+		if (!IS_ERR(entry)) {
+			entry->owner = owner;
+			break;
+		}
+
+		if (list_empty(&sgx_active_page_list))
+			return ERR_PTR(-ENOMEM);
+
+		if (!reclaim) {
+			entry = ERR_PTR(-EBUSY);
+			break;
+		}
+
+		if (signal_pending(current)) {
+			entry = ERR_PTR(-ERESTARTSYS);
+			break;
+		}
+
+		sgx_reclaim_pages();
+		schedule();
+	}
+
+	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
+		wake_up(&ksgxswapd_waitq);
+
+	return entry;
+}
+
 /**
  * sgx_free_epc_page() - Free an EPC page
  * @page:	an EPC page
@@ -140,12 +600,15 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 	struct sgx_epc_section *section = &sgx_epc_sections[page->section];
 	int ret;
 
+	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
+
 	ret = __eremove(sgx_get_epc_virt_addr(page));
 	if (WARN_ONCE(ret, "EREMOVE returned %d (0x%x)", ret, ret))
 		return;
 
 	spin_lock(&section->lock);
 	list_add_tail(&page->list, &section->page_list);
+	section->free_cnt++;
 	spin_unlock(&section->lock);
 }
 
@@ -173,9 +636,12 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 
 	for (i = 0; i < nr_pages; i++) {
 		section->pages[i].section = index;
+		section->pages[i].flags = 0;
+		section->pages[i].owner = NULL;
 		list_add_tail(&section->pages[i].list, &section->laundry_list);
 	}
 
+	section->free_cnt = nr_pages;
 	return true;
 }
 
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 91234f425b89..a188a683ffb6 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -15,9 +15,17 @@
 
 #define SGX_MAX_EPC_SECTIONS		8
 #define SGX_EEXTEND_BLOCK_SIZE		256
+#define SGX_NR_TO_SCAN			16
+#define SGX_NR_LOW_PAGES		32
+#define SGX_NR_HIGH_PAGES		64
+
+/* Pages, which are being tracked by the page reclaimer. */
+#define SGX_EPC_PAGE_RECLAIMER_TRACKED	BIT(0)
 
 struct sgx_epc_page {
 	unsigned int section;
+	unsigned int flags;
+	struct sgx_encl_page *owner;
 	struct list_head list;
 };
 
@@ -33,6 +41,7 @@ struct sgx_epc_section {
 	struct list_head page_list;
 	struct list_head laundry_list;
 	struct sgx_epc_page *pages;
+	unsigned long free_cnt;
 	spinlock_t lock;
 };
 
@@ -61,4 +70,8 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
 struct sgx_epc_page *__sgx_alloc_epc_page(void);
 void sgx_free_epc_page(struct sgx_epc_page *page);
 
+void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
+int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
+struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
+
 #endif /* _X86_SGX_H */
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v40 22/24] x86/sgx: Add ptrace() support for the SGX driver
       [not found] <20201104145430.300542-1-jarkko.sakkinen@linux.intel.com>
                   ` (2 preceding siblings ...)
  2020-11-04 14:54 ` [PATCH v40 21/24] x86/sgx: Add a page reclaimer Jarkko Sakkinen
@ 2020-11-04 14:54 ` Jarkko Sakkinen
  3 siblings, 0 replies; 33+ messages in thread
From: Jarkko Sakkinen @ 2020-11-04 14:54 UTC (permalink / raw)
  To: x86, linux-sgx
  Cc: linux-kernel, Jarkko Sakkinen, linux-mm, Andrew Morton,
	Matthew Wilcox, Jethro Beekman, andriy.shevchenko, asapek, bp,
	cedric.xing, chenalexchen, conradparker, cyhanish, dave.hansen,
	haitao.huang, kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman,
	npmccallum, puiterwijk, rientjes, sean.j.christopherson, tglx,
	yaozhangx, mikko.ylinen

Enclave memory is normally inaccessible from outside the enclave.  This
makes enclaves hard to debug.  However, enclaves can be put in a debug mode
when they are being built. In debug enclaves data *can* be read and/or
written by using the ENCLS[EDBGRD] and ENCLS[EDBGWR] functions.

This is obviously only for debugging and destroys all the protections
afforded to normal enclaves.  But, enclaves know their own debug status and
can adjust their behavior appropriately.

Add a vm_ops->access() implementation which can be used to read and write
memory inside debug enclaves.  This is typically used via ptrace() APIs.

Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
---
Changes from v39:
* Check only for SGX_ENCL_DEBUG in sgx_vma_access(), so that a debug
  enclave's memory can read and written at any phase of its life-cycle.

 arch/x86/kernel/cpu/sgx/encl.c | 111 +++++++++++++++++++++++++++++++++
 1 file changed, 111 insertions(+)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 328d5f61f1cc..5551c7d36483 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -282,10 +282,121 @@ static int sgx_vma_mprotect(struct vm_area_struct *vma,
 	return mprotect_fixup(vma, pprev, start, end, newflags);
 }
 
+static int sgx_encl_debug_read(struct sgx_encl *encl, struct sgx_encl_page *page,
+			       unsigned long addr, void *data)
+{
+	unsigned long offset = addr & ~PAGE_MASK;
+	int ret;
+
+
+	ret = __edbgrd(sgx_get_epc_virt_addr(page->epc_page) + offset, data);
+	if (ret)
+		return -EIO;
+
+	return 0;
+}
+
+static int sgx_encl_debug_write(struct sgx_encl *encl, struct sgx_encl_page *page,
+				unsigned long addr, void *data)
+{
+	unsigned long offset = addr & ~PAGE_MASK;
+	int ret;
+
+	ret = __edbgwr(sgx_get_epc_virt_addr(page->epc_page) + offset, data);
+	if (ret)
+		return -EIO;
+
+	return 0;
+}
+
+/*
+ * Load an enclave page to EPC if required, and take encl->lock.
+ */
+static struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl,
+						   unsigned long addr,
+						   unsigned long vm_flags)
+{
+	struct sgx_encl_page *entry;
+
+	for ( ; ; ) {
+		mutex_lock(&encl->lock);
+
+		entry = sgx_encl_load_page(encl, addr, vm_flags);
+		if (PTR_ERR(entry) != -EBUSY)
+			break;
+
+		mutex_unlock(&encl->lock);
+	}
+
+	if (IS_ERR(entry))
+		mutex_unlock(&encl->lock);
+
+	return entry;
+}
+
+static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
+			  void *buf, int len, int write)
+{
+	struct sgx_encl *encl = vma->vm_private_data;
+	struct sgx_encl_page *entry = NULL;
+	char data[sizeof(unsigned long)];
+	unsigned long align;
+	int offset;
+	int cnt;
+	int ret = 0;
+	int i;
+
+	/*
+	 * If process was forked, VMA is still there but vm_private_data is set
+	 * to NULL.
+	 */
+	if (!encl)
+		return -EFAULT;
+
+	if (!test_bit(SGX_ENCL_DEBUG, &encl->flags))
+		return -EFAULT;
+
+	for (i = 0; i < len; i += cnt) {
+		entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK,
+					      vma->vm_flags);
+		if (IS_ERR(entry)) {
+			ret = PTR_ERR(entry);
+			break;
+		}
+
+		align = ALIGN_DOWN(addr + i, sizeof(unsigned long));
+		offset = (addr + i) & (sizeof(unsigned long) - 1);
+		cnt = sizeof(unsigned long) - offset;
+		cnt = min(cnt, len - i);
+
+		ret = sgx_encl_debug_read(encl, entry, align, data);
+		if (ret)
+			goto out;
+
+		if (write) {
+			memcpy(data + offset, buf + i, cnt);
+			ret = sgx_encl_debug_write(encl, entry, align, data);
+			if (ret)
+				goto out;
+		} else {
+			memcpy(buf + i, data + offset, cnt);
+		}
+
+out:
+		mutex_unlock(&encl->lock);
+
+		if (ret)
+			break;
+	}
+
+	return ret < 0 ? ret : i;
+}
+
 const struct vm_operations_struct sgx_vm_ops = {
 	.fault = sgx_vma_fault,
 	.mprotect = sgx_vma_mprotect,
 	.open = sgx_vma_open,
+	.access = sgx_vma_access,
 };
 
 /**
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-04 14:54 ` [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct Jarkko Sakkinen
@ 2020-11-05 16:04   ` Borislav Petkov
  2020-11-05 17:33     ` Dave Hansen
  2020-11-06 10:04   ` Mel Gorman
  2020-11-06 17:43   ` Dr. Greg
  2 siblings, 1 reply; 33+ messages in thread
From: Borislav Petkov @ 2020-11-05 16:04 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: x86, linux-sgx, linux-kernel, Sean Christopherson, linux-mm,
	Andrew Morton, Matthew Wilcox, Jethro Beekman, Darren Kenny,
	andriy.shevchenko, asapek, cedric.xing, chenalexchen,
	conradparker, cyhanish, dave.hansen, haitao.huang, kai.huang,
	kai.svahn, kmoy, ludloff, luto, nhorman, npmccallum, puiterwijk,
	rientjes, tglx, yaozhangx, mikko.ylinen

On Wed, Nov 04, 2020 at 04:54:16PM +0200, Jarkko Sakkinen wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Background
> ==========
> 
> 1. SGX enclave pages are populated with data by copying from normal memory
>    via ioctl() (SGX_IOC_ENCLAVE_ADD_PAGES), which will be added later in
>    this series.
> 2. It is desirable to be able to restrict those normal memory data sources.
>    For instance, to ensure that the source data is executable before
>    copying data to an executable enclave page.
> 3. Enclave page permissions are dynamic (just like normal permissions) and
>    can be adjusted at runtime with mprotect().
> 
> This creates a problem because the original data source may have long since
> vanished at the time when enclave page permissions are established (mmap()
> or mprotect()).
> 
> The solution (elsewhere in this series) is to force enclaves creators to
> declare their paging permission *intent* up front to the ioctl().  This
> intent can me immediately compared to the source data’s mapping and
> rejected if necessary.
> 
> The “intent” is also stashed off for later comparison with enclave
> PTEs. This ensures that any future mmap()/mprotect() operations
> performed by the enclave creator or done on behalf of the enclave
> can be compared with the earlier declared permissions.
> 
> Problem
> =======
> 
> There is an existing mmap() hook which allows SGX to perform this
> permission comparison at mmap() time.  However, there is no corresponding
> ->mprotect() hook.
> 
> Solution
> ========
> 
> Add a vm_ops->mprotect() hook so that mprotect() operations which are
> inconsistent with any page's stashed intent can be rejected by the driver.
> 
> Cc: linux-mm@kvack.org
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Acked-by: Jethro Beekman <jethro@fortanix.com>
> Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Co-developed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> ---
>  include/linux/mm.h | 3 +++
>  mm/mprotect.c      | 5 ++++-
>  2 files changed, 7 insertions(+), 1 deletion(-)

This needs an ACK from an mm person.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 11/24] x86/sgx: Add SGX misc driver interface
       [not found]     ` <20201105011615.GA701257@kernel.org>
@ 2020-11-05 16:05       ` Borislav Petkov
  2020-11-05 17:57         ` Jarkko Sakkinen
  0 siblings, 1 reply; 33+ messages in thread
From: Borislav Petkov @ 2020-11-05 16:05 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Jarkko Sakkinen, x86, linux-sgx, linux-kernel,
	linux-security-module, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Haitao Huang, Chunyang Hui, Jordan Hand,
	Nathaniel McCallum, Seth Moore, Darren Kenny,
	Sean Christopherson, Suresh Siddha, andriy.shevchenko, asapek,
	cedric.xing, chenalexchen, conradparker, cyhanish, dave.hansen,
	haitao.huang, kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman,
	puiterwijk, rientjes, tglx, yaozhangx, mikko.ylinen

On Thu, Nov 05, 2020 at 03:16:15AM +0200, Jarkko Sakkinen wrote:
> Further, I'd declare this as an inline function given how trivial it
> turn into.
> 
...

So are you sending a new version of only this patch as a reply to this
subthread?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-05 16:04   ` Borislav Petkov
@ 2020-11-05 17:33     ` Dave Hansen
  0 siblings, 0 replies; 33+ messages in thread
From: Dave Hansen @ 2020-11-05 17:33 UTC (permalink / raw)
  To: Borislav Petkov, Jarkko Sakkinen
  Cc: x86, linux-sgx, linux-kernel, Sean Christopherson, linux-mm,
	Andrew Morton, Matthew Wilcox, Jethro Beekman, Darren Kenny,
	andriy.shevchenko, asapek, cedric.xing, chenalexchen,
	conradparker, cyhanish, haitao.huang, kai.huang, kai.svahn, kmoy,
	ludloff, luto, nhorman, npmccallum, puiterwijk, rientjes, tglx,
	yaozhangx, mikko.ylinen, Mel Gorman

On 11/5/20 8:04 AM, Borislav Petkov wrote:
...
>> Add a vm_ops->mprotect() hook so that mprotect() operations which are
>> inconsistent with any page's stashed intent can be rejected by the driver.
>>
>> Cc: linux-mm@kvack.org
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Acked-by: Jethro Beekman <jethro@fortanix.com>
>> Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> Co-developed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
>> Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
>> ---
>>  include/linux/mm.h | 3 +++
>>  mm/mprotect.c      | 5 ++++-
>>  2 files changed, 7 insertions(+), 1 deletion(-)
> This needs an ACK from an mm person.

For what it's worth:

Acked-by: Dave Hansen <dave.hansen@intel.com>

The top 5 git-blamees in terms of mprotect.c at the moment are:

     45 Andi Kleen
     50 Peter Xu
     81 Dave Hansen
     90 Mel Gorman
    209 Linus Torvalds


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 11/24] x86/sgx: Add SGX misc driver interface
  2020-11-05 16:05       ` Borislav Petkov
@ 2020-11-05 17:57         ` Jarkko Sakkinen
  2020-11-05 18:10           ` Borislav Petkov
  0 siblings, 1 reply; 33+ messages in thread
From: Jarkko Sakkinen @ 2020-11-05 17:57 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Jarkko Sakkinen, x86, linux-sgx, linux-kernel,
	linux-security-module, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Haitao Huang, Chunyang Hui, Jordan Hand,
	Nathaniel McCallum, Seth Moore, Darren Kenny,
	Sean Christopherson, Suresh Siddha, andriy.shevchenko, asapek,
	cedric.xing, chenalexchen, conradparker, cyhanish, dave.hansen,
	haitao.huang, kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman,
	puiterwijk, rientjes, tglx, yaozhangx, mikko.ylinen

On Thu, Nov 05, 2020 at 05:05:59PM +0100, Borislav Petkov wrote:
> On Thu, Nov 05, 2020 at 03:16:15AM +0200, Jarkko Sakkinen wrote:
> > Further, I'd declare this as an inline function given how trivial it
> > turn into.
> > 
> ...
> 
> So are you sending a new version of only this patch as a reply to this
> subthread?

Just remarked those, so that I will not forget either, e.g. even in the
case the patch was pulled as it is, I would eventually refine these
parts.

I'll rather send a full patch set if required.

> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette

/Jarkko


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 11/24] x86/sgx: Add SGX misc driver interface
  2020-11-05 17:57         ` Jarkko Sakkinen
@ 2020-11-05 18:10           ` Borislav Petkov
  2020-11-06 16:07             ` Jarkko Sakkinen
  0 siblings, 1 reply; 33+ messages in thread
From: Borislav Petkov @ 2020-11-05 18:10 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Jarkko Sakkinen, x86, linux-sgx, linux-kernel,
	linux-security-module, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Haitao Huang, Chunyang Hui, Jordan Hand,
	Nathaniel McCallum, Seth Moore, Darren Kenny,
	Sean Christopherson, Suresh Siddha, andriy.shevchenko, asapek,
	cedric.xing, chenalexchen, conradparker, cyhanish, dave.hansen,
	haitao.huang, kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman,
	puiterwijk, rientjes, tglx, yaozhangx, mikko.ylinen

On Thu, Nov 05, 2020 at 07:57:45PM +0200, Jarkko Sakkinen wrote:
> I'll rather send a full patch set if required.

Why if the changes all belong to this patch and why should I take a
patch which clearly needs improving?

Just send the fixed version of this and I can take it now.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-04 14:54 ` [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct Jarkko Sakkinen
  2020-11-05 16:04   ` Borislav Petkov
@ 2020-11-06 10:04   ` Mel Gorman
  2020-11-06 16:51     ` Jarkko Sakkinen
  2020-11-06 17:43   ` Dr. Greg
  2 siblings, 1 reply; 33+ messages in thread
From: Mel Gorman @ 2020-11-06 10:04 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: x86, linux-sgx, linux-kernel, Sean Christopherson, linux-mm,
	Andrew Morton, Matthew Wilcox, Jethro Beekman, Darren Kenny,
	andriy.shevchenko, asapek, bp, cedric.xing, chenalexchen,
	conradparker, cyhanish, dave.hansen, haitao.huang, kai.huang,
	kai.svahn, kmoy, ludloff, luto, nhorman, npmccallum, puiterwijk,
	rientjes, tglx, yaozhangx, mikko.ylinen, Michal Hocko,
	Vlastimil Babka

On Wed, Nov 04, 2020 at 04:54:16PM +0200, Jarkko Sakkinen wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Background
> ==========
> 
> 1. SGX enclave pages are populated with data by copying from normal memory
>    via ioctl() (SGX_IOC_ENCLAVE_ADD_PAGES), which will be added later in
>    this series.
> 2. It is desirable to be able to restrict those normal memory data sources.
>    For instance, to ensure that the source data is executable before
>    copying data to an executable enclave page.
> 3. Enclave page permissions are dynamic (just like normal permissions) and
>    can be adjusted at runtime with mprotect().
> 
> This creates a problem because the original data source may have long since
> vanished at the time when enclave page permissions are established (mmap()
> or mprotect()).
> 
> The solution (elsewhere in this series) is to force enclaves creators to
> declare their paging permission *intent* up front to the ioctl().  This
> intent can me immediately compared to the source data???s mapping and
> rejected if necessary.
> 
> The ???intent??? is also stashed off for later comparison with enclave
> PTEs. This ensures that any future mmap()/mprotect() operations
> performed by the enclave creator or done on behalf of the enclave
> can be compared with the earlier declared permissions.
> 
> Problem
> =======
> 
> There is an existing mmap() hook which allows SGX to perform this
> permission comparison at mmap() time.  However, there is no corresponding
> ->mprotect() hook.
> 
> Solution
> ========
> 
> Add a vm_ops->mprotect() hook so that mprotect() operations which are
> inconsistent with any page's stashed intent can be rejected by the driver.
> 

I have not read the series so this is superficial only. That said...

> Cc: linux-mm@kvack.org
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Acked-by: Jethro Beekman <jethro@fortanix.com>
> Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Co-developed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> ---
>  include/linux/mm.h | 3 +++
>  mm/mprotect.c      | 5 ++++-
>  2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ef360fe70aaf..eb38eabc5039 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -559,6 +559,9 @@ struct vm_operations_struct {
>  	void (*close)(struct vm_area_struct * area);
>  	int (*split)(struct vm_area_struct * area, unsigned long addr);
>  	int (*mremap)(struct vm_area_struct * area);
> +	int (*mprotect)(struct vm_area_struct *vma,
> +			struct vm_area_struct **pprev, unsigned long start,
> +			unsigned long end, unsigned long newflags);

The first user of this uses the following information

	ret = sgx_encl_may_map(vma->vm_private_data, start, end, newflags);

It only needs start, end and newflags. The pprev is passed in so the
hook can call mprotect_fixup() which is redundant as the caller knows it
should do that. I don't think an arbitrary driver should be responsible
for poking too much into the mm internals to do the fixup because we do
not know what other users of this hook might require in the future.

Hence, I would suggest that the hook receive the minimum possible
information to do the permissions check for the first in-tree user. If
it returns without failure then mm/mprotect.c would always do the fixup.

>  	vm_fault_t (*fault)(struct vm_fault *vmf);
>  	vm_fault_t (*huge_fault)(struct vm_fault *vmf,
>  			enum page_entry_size pe_size);
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 56c02beb6041..1fd4fa71ce16 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -616,7 +616,10 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
>  		tmp = vma->vm_end;
>  		if (tmp > end)
>  			tmp = end;
> -		error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
> +		if (vma->vm_ops && vma->vm_ops->mprotect)
> +			error = vma->vm_ops->mprotect(vma, &prev, nstart, tmp, newflags);
> +		else
> +			error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);

That would then become

if (vma->vm_ops && vma->vm_ops->mprotect)
	error = vma->vm_ops->mprotect(vma, &prev, nstart, tmp, newflags);
if (!error)
	error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);

and mprotect_fixup would be removed from the driver.

While vm_operations_struct has borderline zero documentation, a hook for
one in-kernel user should have a comment explaining what the semantics
of the hook is -- what is it responsible for (permission check), what
can it change (nothing), etc. Maybe something like

	/*
	 * Called by mprotect in the event driver-specific permission
	 * checks need to be made before the mprotect is finalised.
	 * No modifications should be done to the VMA, returns 0
	 * if the mprotect is permitted.
	 */
	int (*mprotect)(struct vm_area_struct *vma,
		unsigned long start, unsigned long end,
		unsigned long newflags);

If a future driver *does* need to poke deeper into the VM for mprotect
then at least they'll have to explain why that's a good idea.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 11/24] x86/sgx: Add SGX misc driver interface
  2020-11-05 18:10           ` Borislav Petkov
@ 2020-11-06 16:07             ` Jarkko Sakkinen
  2020-11-06 17:09               ` Borislav Petkov
  0 siblings, 1 reply; 33+ messages in thread
From: Jarkko Sakkinen @ 2020-11-06 16:07 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Jarkko Sakkinen, x86, linux-sgx, linux-kernel,
	linux-security-module, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Haitao Huang, Chunyang Hui, Jordan Hand,
	Nathaniel McCallum, Seth Moore, Darren Kenny,
	Sean Christopherson, Suresh Siddha, andriy.shevchenko, asapek,
	cedric.xing, chenalexchen, conradparker, cyhanish, dave.hansen,
	haitao.huang, kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman,
	puiterwijk, rientjes, tglx, yaozhangx, mikko.ylinen

[-- Attachment #1: Type: text/plain, Size: 701 bytes --]

On Thu, Nov 05, 2020 at 07:10:47PM +0100, Borislav Petkov wrote:
> On Thu, Nov 05, 2020 at 07:57:45PM +0200, Jarkko Sakkinen wrote:
> > I'll rather send a full patch set if required.
> 
> Why if the changes all belong to this patch and why should I take a
> patch which clearly needs improving?
> 
> Just send the fixed version of this and I can take it now.
> 
> Thx.

Here's an update patch. I kept the name as sgx_encl_find() so and output
argument instead of return value, so that the change is localized. I
think this is good enough, i.e. the semantically obsolete stuff has been
wiped off.

> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette

/Jarkko

[-- Attachment #2: v41-0001-x86-sgx-Add-SGX-misc-driver-interface.patch --]
[-- Type: text/x-diff, Size: 13670 bytes --]

From 1e6735547bf50469cfda971c35c256f992e5b862 Mon Sep 17 00:00:00 2001
From: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Date: Thu, 1 Nov 2018 18:21:58 -0700
Subject: [PATCH v41] x86/sgx: Add SGX misc driver interface
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Intel(R) SGX is new hardware functionality that can be used by applications
to set aside private regions of code and data called enclaves. New hardware
protects enclave code and data from outside access and modification.

Add a driver that presents a device file and ioctl API to build and manage
enclaves.  Subsequent patches will expend the ioctl()’s functionality.

Cc: linux-security-module@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Tested-by: Haitao Huang <haitao.huang@linux.intel.com>
Tested-by: Chunyang Hui <sanqian.hcy@antfin.com>
Tested-by: Jordan Hand <jorhand@linux.microsoft.com>
Tested-by: Nathaniel McCallum <npmccallum@redhat.com>
Tested-by: Seth Moore <sethmo@google.com>
Tested-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
---
Changes from v40:
* Remove !page check from sgx_encl_may_map() as xas_for_each() iterates
  only through valid entries.
* Remove null check for vm_private_data from sgx_encl_find() as since
  v20 an enclave instance has been created at the time when the file
  is opened.

Changes from v39:
* Rename /dev/sgx/enclave as /dev/sgx_enclave.
* In the page fault handler, do not check for SGX_ENCL_DEAD. This allows
  to do forensics to the memory of debug enclaves.

 arch/x86/kernel/cpu/sgx/Makefile |   2 +
 arch/x86/kernel/cpu/sgx/driver.c | 112 ++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/driver.h |  16 ++++
 arch/x86/kernel/cpu/sgx/encl.c   | 153 +++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/encl.h   |  60 ++++++++++++
 arch/x86/kernel/cpu/sgx/main.c   |  12 ++-
 6 files changed, 354 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/cpu/sgx/driver.c
 create mode 100644 arch/x86/kernel/cpu/sgx/driver.h
 create mode 100644 arch/x86/kernel/cpu/sgx/encl.c
 create mode 100644 arch/x86/kernel/cpu/sgx/encl.h

diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 79510ce01b3b..3fc451120735 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -1,2 +1,4 @@
 obj-y += \
+	driver.o \
+	encl.o \
 	main.o
diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
new file mode 100644
index 000000000000..248213dea78e
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/driver.c
@@ -0,0 +1,112 @@
+// SPDX-License-Identifier: GPL-2.0
+/*  Copyright(c) 2016-20 Intel Corporation. */
+
+#include <linux/acpi.h>
+#include <linux/miscdevice.h>
+#include <linux/mman.h>
+#include <linux/security.h>
+#include <linux/suspend.h>
+#include <asm/traps.h>
+#include "driver.h"
+#include "encl.h"
+
+static int sgx_open(struct inode *inode, struct file *file)
+{
+	struct sgx_encl *encl;
+
+	encl = kzalloc(sizeof(*encl), GFP_KERNEL);
+	if (!encl)
+		return -ENOMEM;
+
+	xa_init(&encl->page_array);
+	mutex_init(&encl->lock);
+
+	file->private_data = encl;
+
+	return 0;
+}
+
+static int sgx_release(struct inode *inode, struct file *file)
+{
+	struct sgx_encl *encl = file->private_data;
+	struct sgx_encl_page *entry;
+	unsigned long index;
+
+	xa_for_each(&encl->page_array, index, entry) {
+		if (entry->epc_page) {
+			sgx_free_epc_page(entry->epc_page);
+			encl->secs_child_cnt--;
+			entry->epc_page = NULL;
+		}
+
+		kfree(entry);
+	}
+
+	xa_destroy(&encl->page_array);
+
+	if (!encl->secs_child_cnt && encl->secs.epc_page) {
+		sgx_free_epc_page(encl->secs.epc_page);
+		encl->secs.epc_page = NULL;
+	}
+
+	/* Detect EPC page leak's. */
+	WARN_ON_ONCE(encl->secs_child_cnt);
+	WARN_ON_ONCE(encl->secs.epc_page);
+
+	kfree(encl);
+	return 0;
+}
+
+static int sgx_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct sgx_encl *encl = file->private_data;
+	int ret;
+
+	ret = sgx_encl_may_map(encl, vma->vm_start, vma->vm_end, vma->vm_flags);
+	if (ret)
+		return ret;
+
+	vma->vm_ops = &sgx_vm_ops;
+	vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO;
+	vma->vm_private_data = encl;
+
+	return 0;
+}
+
+static unsigned long sgx_get_unmapped_area(struct file *file,
+					   unsigned long addr,
+					   unsigned long len,
+					   unsigned long pgoff,
+					   unsigned long flags)
+{
+	if ((flags & MAP_TYPE) == MAP_PRIVATE)
+		return -EINVAL;
+
+	if (flags & MAP_FIXED)
+		return addr;
+
+	return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+}
+
+static const struct file_operations sgx_encl_fops = {
+	.owner			= THIS_MODULE,
+	.open			= sgx_open,
+	.release		= sgx_release,
+	.mmap			= sgx_mmap,
+	.get_unmapped_area	= sgx_get_unmapped_area,
+};
+
+static struct miscdevice sgx_dev_enclave = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "sgx_enclave",
+	.nodename = "sgx_enclave",
+	.fops = &sgx_encl_fops,
+};
+
+int __init sgx_drv_init(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SGX_LC))
+		return -ENODEV;
+
+	return misc_register(&sgx_dev_enclave);
+}
diff --git a/arch/x86/kernel/cpu/sgx/driver.h b/arch/x86/kernel/cpu/sgx/driver.h
new file mode 100644
index 000000000000..cda9c43b7543
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/driver.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ARCH_SGX_DRIVER_H__
+#define __ARCH_SGX_DRIVER_H__
+
+#include <crypto/hash.h>
+#include <linux/kref.h>
+#include <linux/mmu_notifier.h>
+#include <linux/radix-tree.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/workqueue.h>
+#include "sgx.h"
+
+int sgx_drv_init(void);
+
+#endif /* __ARCH_X86_SGX_DRIVER_H__ */
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
new file mode 100644
index 000000000000..1757bfff6a59
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -0,0 +1,153 @@
+// SPDX-License-Identifier: GPL-2.0
+/*  Copyright(c) 2016-20 Intel Corporation. */
+
+#include <linux/lockdep.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/shmem_fs.h>
+#include <linux/suspend.h>
+#include <linux/sched/mm.h>
+#include "arch.h"
+#include "encl.h"
+#include "encls.h"
+#include "sgx.h"
+
+static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
+						unsigned long addr,
+						unsigned long vm_flags)
+{
+	unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
+	struct sgx_encl_page *entry;
+
+	entry = xa_load(&encl->page_array, PFN_DOWN(addr));
+	if (!entry)
+		return ERR_PTR(-EFAULT);
+
+	/*
+	 * Verify that the faulted page has equal or higher build time
+	 * permissions than the VMA permissions (i.e. the subset of {VM_READ,
+	 * VM_WRITE, VM_EXECUTE} in vma->vm_flags).
+	 */
+	if ((entry->vm_max_prot_bits & vm_prot_bits) != vm_prot_bits)
+		return ERR_PTR(-EFAULT);
+
+	/* No page found. */
+	if (!entry->epc_page)
+		return ERR_PTR(-EFAULT);
+
+	/* Entry successfully located. */
+	return entry;
+}
+
+static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
+{
+	unsigned long addr = (unsigned long)vmf->address;
+	struct vm_area_struct *vma = vmf->vma;
+	struct sgx_encl_page *entry;
+	unsigned long phys_addr;
+	struct sgx_encl *encl;
+	vm_fault_t ret;
+
+	encl = vma->vm_private_data;
+
+	mutex_lock(&encl->lock);
+
+	entry = sgx_encl_load_page(encl, addr, vma->vm_flags);
+	if (IS_ERR(entry)) {
+		mutex_unlock(&encl->lock);
+
+		return VM_FAULT_SIGBUS;
+	}
+
+	phys_addr = sgx_get_epc_phys_addr(entry->epc_page);
+
+	ret = vmf_insert_pfn(vma, addr, PFN_DOWN(phys_addr));
+	if (ret != VM_FAULT_NOPAGE) {
+		mutex_unlock(&encl->lock);
+
+		return VM_FAULT_SIGBUS;
+	}
+
+	mutex_unlock(&encl->lock);
+
+	return VM_FAULT_NOPAGE;
+}
+
+/**
+ * sgx_encl_may_map() - Check if a requested VMA mapping is allowed
+ * @encl:		an enclave pointer
+ * @start:		lower bound of the address range, inclusive
+ * @end:		upper bound of the address range, exclusive
+ * @vm_flags:		VMA flags
+ *
+ * Iterate through the enclave pages contained within [@start, @end) to verify
+ * that the permissions requested by a subset of {VM_READ, VM_WRITE, VM_EXEC}
+ * does not contain any permissions that are not contained in the build time
+ * permissions of any of the enclave pages within the given address range.
+ *
+ * An enclave creator must declare the strongest permissions that will be
+ * needed for each enclave page  This ensures that mappings  have the identical
+ * or weaker permissions that the earlier declared permissions.
+ *
+ * Return: 0 on success, -EACCES otherwise
+ */
+int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
+		     unsigned long end, unsigned long vm_flags)
+{
+	unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
+	struct sgx_encl_page *page;
+	unsigned long count = 0;
+	int ret = 0;
+
+	XA_STATE(xas, &encl->page_array, PFN_DOWN(start));
+
+	/*
+	 * Disallow READ_IMPLIES_EXEC tasks as their VMA permissions might
+	 * conflict with the enclave page permissions.
+	 */
+	if (current->personality & READ_IMPLIES_EXEC)
+		return -EACCES;
+
+	mutex_lock(&encl->lock);
+	xas_lock(&xas);
+	xas_for_each(&xas, page, PFN_DOWN(end - 1)) {
+		if (~page->vm_max_prot_bits & vm_prot_bits) {
+			ret = -EACCES;
+			break;
+		}
+
+		/* Reschedule on every XA_CHECK_SCHED iteration. */
+		if (!(++count % XA_CHECK_SCHED)) {
+			xas_pause(&xas);
+			xas_unlock(&xas);
+			mutex_unlock(&encl->lock);
+
+			cond_resched();
+
+			mutex_lock(&encl->lock);
+			xas_lock(&xas);
+		}
+	}
+	xas_unlock(&xas);
+	mutex_unlock(&encl->lock);
+
+	return ret;
+}
+
+static int sgx_vma_mprotect(struct vm_area_struct *vma,
+			    struct vm_area_struct **pprev, unsigned long start,
+			    unsigned long end, unsigned long newflags)
+{
+	int ret;
+
+	ret = sgx_encl_may_map(vma->vm_private_data, start, end, newflags);
+	if (ret)
+		return ret;
+
+	return mprotect_fixup(vma, pprev, start, end, newflags);
+}
+
+const struct vm_operations_struct sgx_vm_ops = {
+	.fault = sgx_vma_fault,
+	.mprotect = sgx_vma_mprotect,
+};
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
new file mode 100644
index 000000000000..b7e02eab5868
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/**
+ * Copyright(c) 2016-20 Intel Corporation.
+ *
+ * Contains the software defined data structures for enclaves.
+ */
+#ifndef _X86_ENCL_H
+#define _X86_ENCL_H
+
+#include <linux/cpumask.h>
+#include <linux/kref.h>
+#include <linux/list.h>
+#include <linux/mm_types.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/srcu.h>
+#include <linux/workqueue.h>
+#include <linux/xarray.h>
+#include "sgx.h"
+
+struct sgx_encl_page {
+	unsigned long desc;
+	unsigned long vm_max_prot_bits;
+	struct sgx_epc_page *epc_page;
+	struct sgx_encl *encl;
+};
+
+struct sgx_encl {
+	unsigned long base;
+	unsigned long size;
+	unsigned int page_cnt;
+	unsigned int secs_child_cnt;
+	struct mutex lock;
+	struct xarray page_array;
+	struct sgx_encl_page secs;
+};
+
+extern const struct vm_operations_struct sgx_vm_ops;
+
+static inline int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
+				struct vm_area_struct **vma)
+{
+	struct vm_area_struct *result;
+	struct sgx_encl *encl;
+
+	result = find_vma(mm, addr);
+	if (!result || result->vm_ops != &sgx_vm_ops || addr < result->vm_start)
+		return -EINVAL;
+
+	encl = result->vm_private_data;
+	*vma = result;
+
+	return 0;
+}
+
+int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
+		     unsigned long end, unsigned long vm_flags);
+
+#endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index b9ac438a13a4..c2740e0630d1 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -9,6 +9,8 @@
 #include <linux/sched/mm.h>
 #include <linux/sched/signal.h>
 #include <linux/slab.h>
+#include "driver.h"
+#include "encl.h"
 #include "encls.h"
 
 struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
@@ -229,9 +231,10 @@ static bool __init sgx_page_cache_init(void)
 
 static void __init sgx_init(void)
 {
+	int ret;
 	int i;
 
-	if (!boot_cpu_has(X86_FEATURE_SGX))
+	if (!cpu_feature_enabled(X86_FEATURE_SGX))
 		return;
 
 	if (!sgx_page_cache_init())
@@ -240,8 +243,15 @@ static void __init sgx_init(void)
 	if (!sgx_page_reclaimer_init())
 		goto err_page_cache;
 
+	ret = sgx_drv_init();
+	if (ret)
+		goto err_kthread;
+
 	return;
 
+err_kthread:
+	kthread_stop(ksgxswapd_tsk);
+
 err_page_cache:
 	for (i = 0; i < sgx_nr_epc_sections; i++) {
 		vfree(sgx_epc_sections[i].pages);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-06 10:04   ` Mel Gorman
@ 2020-11-06 16:51     ` Jarkko Sakkinen
  2020-11-06 20:37       ` Borislav Petkov
  0 siblings, 1 reply; 33+ messages in thread
From: Jarkko Sakkinen @ 2020-11-06 16:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jarkko Sakkinen, x86, linux-sgx, linux-kernel,
	Sean Christopherson, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Darren Kenny, andriy.shevchenko, asapek, bp,
	cedric.xing, chenalexchen, conradparker, cyhanish, dave.hansen,
	haitao.huang, kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman,
	npmccallum, puiterwijk, rientjes, tglx, yaozhangx, mikko.ylinen,
	Michal Hocko, Vlastimil Babka

On Fri, Nov 06, 2020 at 10:04:09AM +0000, Mel Gorman wrote:
> On Wed, Nov 04, 2020 at 04:54:16PM +0200, Jarkko Sakkinen wrote:
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > 
> > Background
> > ==========
> > 
> > 1. SGX enclave pages are populated with data by copying from normal memory
> >    via ioctl() (SGX_IOC_ENCLAVE_ADD_PAGES), which will be added later in
> >    this series.
> > 2. It is desirable to be able to restrict those normal memory data sources.
> >    For instance, to ensure that the source data is executable before
> >    copying data to an executable enclave page.
> > 3. Enclave page permissions are dynamic (just like normal permissions) and
> >    can be adjusted at runtime with mprotect().
> > 
> > This creates a problem because the original data source may have long since
> > vanished at the time when enclave page permissions are established (mmap()
> > or mprotect()).
> > 
> > The solution (elsewhere in this series) is to force enclaves creators to
> > declare their paging permission *intent* up front to the ioctl().  This
> > intent can me immediately compared to the source data???s mapping and
> > rejected if necessary.
> > 
> > The ???intent??? is also stashed off for later comparison with enclave
> > PTEs. This ensures that any future mmap()/mprotect() operations
> > performed by the enclave creator or done on behalf of the enclave
> > can be compared with the earlier declared permissions.
> > 
> > Problem
> > =======
> > 
> > There is an existing mmap() hook which allows SGX to perform this
> > permission comparison at mmap() time.  However, there is no corresponding
> > ->mprotect() hook.
> > 
> > Solution
> > ========
> > 
> > Add a vm_ops->mprotect() hook so that mprotect() operations which are
> > inconsistent with any page's stashed intent can be rejected by the driver.
> > 
> 
> I have not read the series so this is superficial only. That said...
> 
> > Cc: linux-mm@kvack.org
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > Acked-by: Jethro Beekman <jethro@fortanix.com>
> > Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > Co-developed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> > Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> > ---
> >  include/linux/mm.h | 3 +++
> >  mm/mprotect.c      | 5 ++++-
> >  2 files changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index ef360fe70aaf..eb38eabc5039 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -559,6 +559,9 @@ struct vm_operations_struct {
> >  	void (*close)(struct vm_area_struct * area);
> >  	int (*split)(struct vm_area_struct * area, unsigned long addr);
> >  	int (*mremap)(struct vm_area_struct * area);
> > +	int (*mprotect)(struct vm_area_struct *vma,
> > +			struct vm_area_struct **pprev, unsigned long start,
> > +			unsigned long end, unsigned long newflags);
> 
> The first user of this uses the following information
> 
> 	ret = sgx_encl_may_map(vma->vm_private_data, start, end, newflags);
> 
> It only needs start, end and newflags. The pprev is passed in so the
> hook can call mprotect_fixup() which is redundant as the caller knows it
> should do that. I don't think an arbitrary driver should be responsible
> for poking too much into the mm internals to do the fixup because we do
> not know what other users of this hook might require in the future.
> 
> Hence, I would suggest that the hook receive the minimum possible
> information to do the permissions check for the first in-tree user. If
> it returns without failure then mm/mprotect.c would always do the fixup.
> 
> >  	vm_fault_t (*fault)(struct vm_fault *vmf);
> >  	vm_fault_t (*huge_fault)(struct vm_fault *vmf,
> >  			enum page_entry_size pe_size);
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 56c02beb6041..1fd4fa71ce16 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -616,7 +616,10 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
> >  		tmp = vma->vm_end;
> >  		if (tmp > end)
> >  			tmp = end;
> > -		error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
> > +		if (vma->vm_ops && vma->vm_ops->mprotect)
> > +			error = vma->vm_ops->mprotect(vma, &prev, nstart, tmp, newflags);
> > +		else
> > +			error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
> 
> That would then become
> 
> if (vma->vm_ops && vma->vm_ops->mprotect)
> 	error = vma->vm_ops->mprotect(vma, &prev, nstart, tmp, newflags);
> if (!error)
> 	error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
> 
> and mprotect_fixup would be removed from the driver.
> 
> While vm_operations_struct has borderline zero documentation, a hook for
> one in-kernel user should have a comment explaining what the semantics
> of the hook is -- what is it responsible for (permission check), what
> can it change (nothing), etc. Maybe something like
> 
> 	/*
> 	 * Called by mprotect in the event driver-specific permission
> 	 * checks need to be made before the mprotect is finalised.
> 	 * No modifications should be done to the VMA, returns 0
> 	 * if the mprotect is permitted.
> 	 */
> 	int (*mprotect)(struct vm_area_struct *vma,
> 		unsigned long start, unsigned long end,
> 		unsigned long newflags);
> 
> If a future driver *does* need to poke deeper into the VM for mprotect
> then at least they'll have to explain why that's a good idea.

Both comments make sense to me. I'll refine this patch on Monday and
also "x86/sgx: Add SGX misc driver interface", which uses this callback.

Thanks a lot for valuable feedback!

> -- 
> Mel Gorman
> SUSE Labs

/Jarkko


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 11/24] x86/sgx: Add SGX misc driver interface
  2020-11-06 16:07             ` Jarkko Sakkinen
@ 2020-11-06 17:09               ` Borislav Petkov
  2020-11-06 22:01                 ` Jarkko Sakkinen
  0 siblings, 1 reply; 33+ messages in thread
From: Borislav Petkov @ 2020-11-06 17:09 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Jarkko Sakkinen, x86, linux-sgx, linux-kernel,
	linux-security-module, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Haitao Huang, Chunyang Hui, Jordan Hand,
	Nathaniel McCallum, Seth Moore, Darren Kenny,
	Sean Christopherson, Suresh Siddha, andriy.shevchenko, asapek,
	cedric.xing, chenalexchen, conradparker, cyhanish, dave.hansen,
	haitao.huang, kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman,
	puiterwijk, rientjes, tglx, yaozhangx, mikko.ylinen

On Fri, Nov 06, 2020 at 06:07:42PM +0200, Jarkko Sakkinen wrote:
> On Thu, Nov 05, 2020 at 07:10:47PM +0100, Borislav Petkov wrote:
> > On Thu, Nov 05, 2020 at 07:57:45PM +0200, Jarkko Sakkinen wrote:
> > > I'll rather send a full patch set if required.
> > 
> > Why if the changes all belong to this patch and why should I take a
> > patch which clearly needs improving?
> > 
> > Just send the fixed version of this and I can take it now.
> > 
> > Thx.
> 
> Here's an update patch. I kept the name as sgx_encl_find() so and output
> argument instead of return value, so that the change is localized. I
> think this is good enough, i.e. the semantically obsolete stuff has been
> wiped off.

Thanks.

> Tested-by: Jethro Beekman <jethro@fortanix.com>
> Tested-by: Haitao Huang <haitao.huang@linux.intel.com>
> Tested-by: Chunyang Hui <sanqian.hcy@antfin.com>
> Tested-by: Jordan Hand <jorhand@linux.microsoft.com>
> Tested-by: Nathaniel McCallum <npmccallum@redhat.com>
> Tested-by: Seth Moore <sethmo@google.com>
> Tested-by: Darren Kenny <darren.kenny@oracle.com>

Btw, you do know that when you change the patch, those tested-by's don't
hold true anymore, right?

The Reviewed-by's too, actually.

I'll zap them.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-04 14:54 ` [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct Jarkko Sakkinen
  2020-11-05 16:04   ` Borislav Petkov
  2020-11-06 10:04   ` Mel Gorman
@ 2020-11-06 17:43   ` Dr. Greg
  2020-11-06 17:54     ` Dave Hansen
  2020-11-06 21:13     ` Matthew Wilcox
  2 siblings, 2 replies; 33+ messages in thread
From: Dr. Greg @ 2020-11-06 17:43 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: x86, linux-sgx, linux-kernel, Sean Christopherson, linux-mm,
	Andrew Morton, Matthew Wilcox, Jethro Beekman, Darren Kenny,
	andriy.shevchenko, asapek, bp, cedric.xing, chenalexchen,
	conradparker, cyhanish, dave.hansen, haitao.huang, kai.huang,
	kai.svahn, kmoy, ludloff, luto, nhorman, npmccallum, puiterwijk,
	rientjes, tglx, yaozhangx, mikko.ylinen

On Wed, Nov 04, 2020 at 04:54:16PM +0200, Jarkko Sakkinen wrote:

Good morning, I hope the week has gone well for everyone.

> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Background
> ==========
> 
> 1. SGX enclave pages are populated with data by copying from normal memory
>    via ioctl() (SGX_IOC_ENCLAVE_ADD_PAGES), which will be added later in
>    this series.
> 2. It is desirable to be able to restrict those normal memory data sources.
>    For instance, to ensure that the source data is executable before
>    copying data to an executable enclave page.
> 3. Enclave page permissions are dynamic (just like normal permissions) and
>    can be adjusted at runtime with mprotect().

Only relevant on SGX2 hardware, see discussion below.

> This creates a problem because the original data source may have
> long since vanished at the time when enclave page permissions are
> established (mmap() or mprotect()).
>
> The solution (elsewhere in this series) is to force enclaves

I don't believe that enclaves should be plural in this context.

> creators to declare their paging permission *intent* up front to the
> ioctl().  This intent can me immediately compared to the source

The 'me' should be 'be' in the above line.

> data???s mapping and rejected if necessary.
>
> The ???intent??? is also stashed off for later comparison with
> enclave PTEs. This ensures that any future mmap()/mprotect()
> operations performed by the enclave creator or done on behalf of the
> enclave can be compared with the earlier declared permissions.

Just some further clarifications that should, arguably, be included in
the kernel documentation given their security implications.

The 900 pound primate in the room, that no one is acknowledging, is
that this technology was designed to not allow the operating system to
have any control over what it is doing.  In the mindset of kernel
developers, the operating system is the absolute authority on
security, so we find ourselves in a situation where the kernel needs
to try and work around this fact so any solutions will be imperfect at
best.

As I've noted before, this is actually a primary objective of enclave
authors, since one of the desires for 'Confidential Computing' is to
hide things like proprietary algorithms from the platform owners.  I
think the driver needs to acknowledge this fact and equip platform
owners with the simplest and most effective security solutions that
are available.

The only reason that mprotect protections are needed in this driver
are to close a security loophole on SGX2 hardware, ie. hardware that
supports the ENCLU[EMODPE] instruction.  This instruction allows an
enclave to modify page permissions that are encoded in the Enclave
Page Cache Metadata (EPCM) at initialization time.  In all likelihood,
this is going to be the only relevant hardware that this driver runs
on.

On SGX2 hardware, enclave based code can conspire with its untrusted
runtime to allow executable regions to have write permissions.  This
would allow the enclave to load and execute whatever code that it
pleases and that the operating system would have no visibility into
whatsoever.

The non-SGX2 platforms don't need mprotect protections since even if
they were to modify at the OS level their page permissions, any
attempts to access a page with modified permissions would be trapped
by the EPCM protections that are unmodifiable after the enclave has
been initialized.

In light of this, given the decision by the driver authors to not
fully equip the driver with EDMM support, the mprotect protection
requirements are straight forward and minimalistic.  All that is
needed is a binary valued variable, set on the command-line, that
either allows or denies anonymous code execution by an enclave,
ie. access to page protection changes after initialization.

The enclave page mapping callback is elegant but has little use if the
objective of all this is to allow the driver to enforce SGX1 semantics
on a platform that has SGX2 instruction support.  Save the elegant
solution until a reasoned arguement can be made as to what anyone
would actually be able to do with the per page permissions checks,
even on an EDMM capable driver.

I could go into detail on that issue as well but I hesitate to be an
insufferable bore.

I hope all of this is helpful.

Have a good weekend.

Dr. Greg

As always,
Dr. Greg Wettstein, Ph.D, Worker      Autonomously self-defensive
Enjellic Systems Development, LLC     IOT platforms and edge devices.
4206 N. 19th Ave.
Fargo, ND  58102
PH: 701-281-1686                      EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"In the future, company names will be a 32-character hex string."
                                -- Bruce Schneier


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-06 17:43   ` Dr. Greg
@ 2020-11-06 17:54     ` Dave Hansen
  2020-11-07 15:09       ` Dr. Greg
  2020-11-06 21:13     ` Matthew Wilcox
  1 sibling, 1 reply; 33+ messages in thread
From: Dave Hansen @ 2020-11-06 17:54 UTC (permalink / raw)
  To: Dr. Greg, Jarkko Sakkinen
  Cc: x86, linux-sgx, linux-kernel, Sean Christopherson, linux-mm,
	Andrew Morton, Matthew Wilcox, Jethro Beekman, Darren Kenny,
	andriy.shevchenko, asapek, bp, cedric.xing, chenalexchen,
	conradparker, cyhanish, haitao.huang, kai.huang, kai.svahn, kmoy,
	ludloff, luto, nhorman, npmccallum, puiterwijk, rientjes, tglx,
	yaozhangx, mikko.ylinen

On 11/6/20 9:43 AM, Dr. Greg wrote:
> In light of this, given the decision by the driver authors to not
> fully equip the driver with EDMM support, the mprotect protection
> requirements are straight forward and minimalistic.  All that is
> needed is a binary valued variable, set on the command-line, that
> either allows or denies anonymous code execution by an enclave,
> ie. access to page protection changes after initialization.

To that, I say NAK.  We need more flexibility than a boot-time-fixed,
system-wide switch.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-06 16:51     ` Jarkko Sakkinen
@ 2020-11-06 20:37       ` Borislav Petkov
  2020-11-06 22:04         ` Jarkko Sakkinen
  0 siblings, 1 reply; 33+ messages in thread
From: Borislav Petkov @ 2020-11-06 20:37 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Mel Gorman, Jarkko Sakkinen, x86, linux-sgx, linux-kernel,
	Sean Christopherson, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Darren Kenny, andriy.shevchenko, asapek,
	cedric.xing, chenalexchen, conradparker, cyhanish, dave.hansen,
	haitao.huang, kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman,
	npmccallum, puiterwijk, rientjes, tglx, yaozhangx, mikko.ylinen,
	Michal Hocko, Vlastimil Babka

On Fri, Nov 06, 2020 at 06:51:07PM +0200, Jarkko Sakkinen wrote:
> Both comments make sense to me. I'll refine this patch on Monday and

And while you're at it, I'd suggest you refine the whole patchset and
send a full v41 instead:

- please audit all your Reviewed-by, Acked-by tags as to for what
versions of the patches they were given. If you've changed those patches
in the meantime, then all those tags are invalid and need to go.

- work in all the change requests

- fix the order of the patches so that each one builds

so that they can be taken cleanly into tip.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-06 17:43   ` Dr. Greg
  2020-11-06 17:54     ` Dave Hansen
@ 2020-11-06 21:13     ` Matthew Wilcox
  2020-11-06 21:23       ` Dave Hansen
  2020-11-07 15:27       ` Dr. Greg
  1 sibling, 2 replies; 33+ messages in thread
From: Matthew Wilcox @ 2020-11-06 21:13 UTC (permalink / raw)
  To: Dr. Greg
  Cc: Jarkko Sakkinen, x86, linux-sgx, linux-kernel,
	Sean Christopherson, linux-mm, Andrew Morton, Jethro Beekman,
	Darren Kenny, andriy.shevchenko, asapek, bp, cedric.xing,
	chenalexchen, conradparker, cyhanish, dave.hansen, haitao.huang,
	kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman, npmccallum,
	puiterwijk, rientjes, tglx, yaozhangx, mikko.ylinen

On Fri, Nov 06, 2020 at 11:43:59AM -0600, Dr. Greg wrote:
> The 900 pound primate in the room, that no one is acknowledging, is
> that this technology was designed to not allow the operating system to
> have any control over what it is doing.  In the mindset of kernel
> developers, the operating system is the absolute authority on
> security, so we find ourselves in a situation where the kernel needs
> to try and work around this fact so any solutions will be imperfect at
> best.
> 
> As I've noted before, this is actually a primary objective of enclave
> authors, since one of the desires for 'Confidential Computing' is to
> hide things like proprietary algorithms from the platform owners.  I
> think the driver needs to acknowledge this fact and equip platform
> owners with the simplest and most effective security solutions that
> are available.

Or we need to not merge "technology" that subverts the owner of
the hardware.  Remember: root kit authors are inventive buggers.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-06 21:13     ` Matthew Wilcox
@ 2020-11-06 21:23       ` Dave Hansen
  2020-11-07 15:27       ` Dr. Greg
  1 sibling, 0 replies; 33+ messages in thread
From: Dave Hansen @ 2020-11-06 21:23 UTC (permalink / raw)
  To: Matthew Wilcox, Dr. Greg
  Cc: Jarkko Sakkinen, x86, linux-sgx, linux-kernel,
	Sean Christopherson, linux-mm, Andrew Morton, Jethro Beekman,
	Darren Kenny, andriy.shevchenko, asapek, bp, cedric.xing,
	chenalexchen, conradparker, cyhanish, haitao.huang, kai.huang,
	kai.svahn, kmoy, ludloff, luto, nhorman, npmccallum, puiterwijk,
	rientjes, tglx, yaozhangx, mikko.ylinen

On 11/6/20 1:13 PM, Matthew Wilcox wrote:
> On Fri, Nov 06, 2020 at 11:43:59AM -0600, Dr. Greg wrote:
>> The 900 pound primate in the room, that no one is acknowledging, is
>> that this technology was designed to not allow the operating system to
>> have any control over what it is doing.  In the mindset of kernel
>> developers, the operating system is the absolute authority on
>> security, so we find ourselves in a situation where the kernel needs
>> to try and work around this fact so any solutions will be imperfect at
>> best.
>>
>> As I've noted before, this is actually a primary objective of enclave
>> authors, since one of the desires for 'Confidential Computing' is to
>> hide things like proprietary algorithms from the platform owners.  I
>> think the driver needs to acknowledge this fact and equip platform
>> owners with the simplest and most effective security solutions that
>> are available.
> Or we need to not merge "technology" that subverts the owner of
> the hardware.  Remember: root kit authors are inventive buggers.

Machine owners have lots of ways to yield their own control of the
hardware.  One is:

	wget http://what-could-go-wrong.com -O foo.sh; sudo foo.sh

Another is to enable SGX in the BIOS.  You're giving up some level of
control and yielding it to the hardware.  If you don't trust the
hardware (aka. Intel), I'd stay far, far away from that BIOS option.

This mprotect() hook is trying to have the kernel enforce some rules
that yield *less* to enclave authors.  Once a rootkit is in play, the
kernel isn't going to be providing meaningful protection and I'd expect
that this hook is rather useless.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 11/24] x86/sgx: Add SGX misc driver interface
  2020-11-06 17:09               ` Borislav Petkov
@ 2020-11-06 22:01                 ` Jarkko Sakkinen
  0 siblings, 0 replies; 33+ messages in thread
From: Jarkko Sakkinen @ 2020-11-06 22:01 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Jarkko Sakkinen, x86, linux-sgx, linux-kernel,
	linux-security-module, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Haitao Huang, Chunyang Hui, Jordan Hand,
	Nathaniel McCallum, Seth Moore, Darren Kenny,
	Sean Christopherson, Suresh Siddha, andriy.shevchenko, asapek,
	cedric.xing, chenalexchen, conradparker, cyhanish, dave.hansen,
	haitao.huang, kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman,
	puiterwijk, rientjes, tglx, yaozhangx, mikko.ylinen

On Fri, Nov 06, 2020 at 06:09:20PM +0100, Borislav Petkov wrote:
> On Fri, Nov 06, 2020 at 06:07:42PM +0200, Jarkko Sakkinen wrote:
> > On Thu, Nov 05, 2020 at 07:10:47PM +0100, Borislav Petkov wrote:
> > > On Thu, Nov 05, 2020 at 07:57:45PM +0200, Jarkko Sakkinen wrote:
> > > > I'll rather send a full patch set if required.
> > > 
> > > Why if the changes all belong to this patch and why should I take a
> > > patch which clearly needs improving?
> > > 
> > > Just send the fixed version of this and I can take it now.
> > > 
> > > Thx.
> > 
> > Here's an update patch. I kept the name as sgx_encl_find() so and output
> > argument instead of return value, so that the change is localized. I
> > think this is good enough, i.e. the semantically obsolete stuff has been
> > wiped off.
> 
> Thanks.
> 
> > Tested-by: Jethro Beekman <jethro@fortanix.com>
> > Tested-by: Haitao Huang <haitao.huang@linux.intel.com>
> > Tested-by: Chunyang Hui <sanqian.hcy@antfin.com>
> > Tested-by: Jordan Hand <jorhand@linux.microsoft.com>
> > Tested-by: Nathaniel McCallum <npmccallum@redhat.com>
> > Tested-by: Seth Moore <sethmo@google.com>
> > Tested-by: Darren Kenny <darren.kenny@oracle.com>
> 
> Btw, you do know that when you change the patch, those tested-by's don't
> hold true anymore, right?
> 
> The Reviewed-by's too, actually.
> 
> I'll zap them.

Yes, I know that. That was something that I should have done for this
version. I was too busy turning every rock to make everything as clean
as possible, sorry about that. I'll also update my tree accordingly.

> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette

/Jarkko


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-06 20:37       ` Borislav Petkov
@ 2020-11-06 22:04         ` Jarkko Sakkinen
  2020-11-06 22:31           ` Borislav Petkov
  0 siblings, 1 reply; 33+ messages in thread
From: Jarkko Sakkinen @ 2020-11-06 22:04 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Mel Gorman, Jarkko Sakkinen, x86, linux-sgx, linux-kernel,
	Sean Christopherson, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Darren Kenny, andriy.shevchenko, asapek,
	cedric.xing, chenalexchen, conradparker, cyhanish, dave.hansen,
	haitao.huang, kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman,
	npmccallum, puiterwijk, rientjes, tglx, yaozhangx, mikko.ylinen,
	Michal Hocko, Vlastimil Babka

On Fri, Nov 06, 2020 at 09:37:25PM +0100, Borislav Petkov wrote:
> On Fri, Nov 06, 2020 at 06:51:07PM +0200, Jarkko Sakkinen wrote:
> > Both comments make sense to me. I'll refine this patch on Monday and
> 
> And while you're at it, I'd suggest you refine the whole patchset and
> send a full v41 instead:
> 
> - please audit all your Reviewed-by, Acked-by tags as to for what
> versions of the patches they were given. If you've changed those patches
> in the meantime, then all those tags are invalid and need to go.
> 
> - work in all the change requests
> 
> - fix the order of the patches so that each one builds
> 
> so that they can be taken cleanly into tip.
> 
> Thx.

OK, everything else is clear except change requests part I want to
check.

There has been a change request to update callback that made perfect
sense to me. Is there something else that I might have missed? Just
checking.

> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette

/Jarkko


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-06 22:04         ` Jarkko Sakkinen
@ 2020-11-06 22:31           ` Borislav Petkov
  0 siblings, 0 replies; 33+ messages in thread
From: Borislav Petkov @ 2020-11-06 22:31 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Mel Gorman, Jarkko Sakkinen, x86, linux-sgx, linux-kernel,
	Sean Christopherson, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Darren Kenny, andriy.shevchenko, asapek,
	cedric.xing, chenalexchen, conradparker, cyhanish, dave.hansen,
	haitao.huang, kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman,
	npmccallum, puiterwijk, rientjes, tglx, yaozhangx, mikko.ylinen,
	Michal Hocko, Vlastimil Babka

On Sat, Nov 07, 2020 at 12:04:02AM +0200, Jarkko Sakkinen wrote:
> There has been a change request to update callback that made perfect
> sense to me. Is there something else that I might have missed? Just
> checking.

With "change requests" I mean the usual going through the replies to a
patchset and working in the changes people requested. Nothing special -
just the usual working in review feedback, that's it.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-06 17:54     ` Dave Hansen
@ 2020-11-07 15:09       ` Dr. Greg
  2020-11-07 19:16         ` Dave Hansen
  0 siblings, 1 reply; 33+ messages in thread
From: Dr. Greg @ 2020-11-07 15:09 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Jarkko Sakkinen, x86, linux-sgx, linux-kernel,
	Sean Christopherson, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Darren Kenny, andriy.shevchenko, asapek, bp,
	cedric.xing, chenalexchen, conradparker, cyhanish, haitao.huang,
	kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman, npmccallum,
	puiterwijk, rientjes, tglx, yaozhangx, mikko.ylinen

On Fri, Nov 06, 2020 at 09:54:19AM -0800, Dave Hansen wrote:

Good morning, I hope the weekend is going well for everyone, beautiful
weather out here in West-Cental Minnesota.

> On 11/6/20 9:43 AM, Dr. Greg wrote:
> > In light of this, given the decision by the driver authors to not
> > fully equip the driver with EDMM support, the mprotect protection
> > requirements are straight forward and minimalistic.  All that is
> > needed is a binary valued variable, set on the command-line, that
> > either allows or denies anonymous code execution by an enclave,
> > ie. access to page protection changes after initialization.

> To that, I say NAK.  We need more flexibility than a boot-time-fixed,
> system-wide switch.

To be clear, I wasn't referring to a global yes/no option in the code
that implements the mprotect callout method in the
vm_operations_struct.  I was referring to the implementation of the
hook in the SGX driver code.

In all of these discussions there hasn't been a refutation of my point
that the only reason this hook is needed is to stop the potential for
anonymous code execution on SGX2 capable hardware.  So we will assume,
that while unspoken, this is the rationale for the hook.

If you are NAK'ing a global enable/disable in the driver code, I think
there needs to be a discussion of why the driver, in its current
state, needs anything other then a yes/no decision on mprotect after
enclave initialization is complete.

At this point in time the driver has no intention of supporting EDMM,
so the simple belt-and-suspenders approach is to deny mprotect on
enclave virtual memory after initialization.  Absent mprotect, the
hardware is perfectly capable of enforcing page permissions that are
only consistent with the initial mapping of the enclave.

If and when EDMM is implemented there might be a rationale for per
page mprotect interrogation.  I won't waste people's time here but I
believe a cogent arguement can be made that there is little utility,
even under EDMM, of making per page permission decisions rather then a
'yes/no' decision by the platform owner as to whether or not they want
to allow anonymous code execution.

Hopefully all of this is a useful clarification.

Have a good weekend.

Dr. Greg

As always,
Dr. Greg Wettstein, Ph.D, Worker      Autonomously self-defensive
Enjellic Systems Development, LLC     IOT platforms and edge devices.
4206 N. 19th Ave.
Fargo, ND  58102
PH: 701-281-1686                      EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"If you ever teach a yodeling class, probably the hardest thing is to
 keep the students from just trying to yodel right off. You see, we build
 to that."
                                -- Jack Handey
                                   Deep Thoughts


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-06 21:13     ` Matthew Wilcox
  2020-11-06 21:23       ` Dave Hansen
@ 2020-11-07 15:27       ` Dr. Greg
  1 sibling, 0 replies; 33+ messages in thread
From: Dr. Greg @ 2020-11-07 15:27 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jarkko Sakkinen, x86, linux-sgx, linux-kernel,
	Sean Christopherson, linux-mm, Andrew Morton, Jethro Beekman,
	Darren Kenny, andriy.shevchenko, asapek, bp, cedric.xing,
	chenalexchen, conradparker, cyhanish, dave.hansen, haitao.huang,
	kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman, npmccallum,
	puiterwijk, rientjes, tglx, yaozhangx, mikko.ylinen

On Fri, Nov 06, 2020 at 09:13:11PM +0000, Matthew Wilcox wrote:
> On Fri, Nov 06, 2020 at 11:43:59AM -0600, Dr. Greg wrote:
> > The 900 pound primate in the room, that no one is acknowledging, is
> > that this technology was designed to not allow the operating system to
> > have any control over what it is doing.  In the mindset of kernel
> > developers, the operating system is the absolute authority on
> > security, so we find ourselves in a situation where the kernel needs
> > to try and work around this fact so any solutions will be imperfect at
> > best.
> > 
> > As I've noted before, this is actually a primary objective of enclave
> > authors, since one of the desires for 'Confidential Computing' is to
> > hide things like proprietary algorithms from the platform owners.  I
> > think the driver needs to acknowledge this fact and equip platform
> > owners with the simplest and most effective security solutions that
> > are available.

> Or we need to not merge "technology" that subverts the owner of the
> hardware.  Remember: root kit authors are inventive buggers.

That will be an interesting philosophical argument for Linux moving
forward.  I've often stated that there is going to be a natural
political tension between the objectives of open-source and advances
in platform security.  By definition, advancing the latter will result
in technology that contrains what can be done with a platform.

It may have made more sense for the SGX driver to be mainline when the
technology was going to be ubiquitous.  Given the decision by Intel to
monetize the platform, by limiting its implementation to high end
server platforms, the case could be made that it is a driver best
supported by the distributions or cloud providers.

I'm neither for or against inclusion, I'm simply advocating that we
make informed decisions on the driver implementation that benefits the
user community.  FWIW, based on knowledge that has come from building
application stacks on top of the technology for a half decade now.

Have a good weekend.

Dr. Greg

As always,
Dr. Greg Wettstein, Ph.D, Worker      Autonomously self-defensive
Enjellic Systems Development, LLC     IOT platforms and edge devices.
4206 N. 19th Ave.
Fargo, ND  58102
PH: 701-281-1686                      EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"Atilla The Hun's Maxim: If you're going to rape, pillage and burn, be sure
 to do things in that order."
                                -- P.J. Plauger
                                   Programming On Purpose


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-07 15:09       ` Dr. Greg
@ 2020-11-07 19:16         ` Dave Hansen
  2020-11-12 20:58           ` Dr. Greg
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Hansen @ 2020-11-07 19:16 UTC (permalink / raw)
  To: Dr. Greg
  Cc: Jarkko Sakkinen, x86, linux-sgx, linux-kernel,
	Sean Christopherson, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Darren Kenny, andriy.shevchenko, asapek, bp,
	cedric.xing, chenalexchen, conradparker, cyhanish, haitao.huang,
	kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman, npmccallum,
	puiterwijk, rientjes, tglx, yaozhangx, mikko.ylinen

On 11/7/20 7:09 AM, Dr. Greg wrote:
> In all of these discussions there hasn't been a refutation of my point
> that the only reason this hook is needed is to stop the potential for
> anonymous code execution on SGX2 capable hardware.  So we will assume,
> that while unspoken, this is the rationale for the hook.

Unspoken?  See from the cover letter:

> 3. Enclave page permissions are dynamic (just like normal permissions) and
>    can be adjusted at runtime with mprotect().

I explicitly chose not to name the instructions, nor the exact version
of the SGX ISA that introduces them.  They're supported in the series,
and that's all that matters.

If you want to advocate for something different to be done, patches are
accepted.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 21/24] x86/sgx: Add a page reclaimer
  2020-11-04 14:54 ` [PATCH v40 21/24] x86/sgx: Add a page reclaimer Jarkko Sakkinen
@ 2020-11-08  3:56   ` Hillf Danton
  2020-11-09 19:59     ` Jarkko Sakkinen
  0 siblings, 1 reply; 33+ messages in thread
From: Hillf Danton @ 2020-11-08  3:56 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: x86, linux-sgx, linux-kernel, linux-mm, Jethro Beekman,
	Jordan Hand, Nathaniel McCallum, Chunyang Hui, Seth Moore,
	Sean Christopherson, akpm, andriy.shevchenko, yaozhangx,
	mikko.ylinen

On Wed,  4 Nov 2020 16:54:27 Jarkko Sakkinen wrote:
[...]
> +/**
> + * sgx_alloc_epc_page() - Allocate an EPC page
> + * @owner:	the owner of the EPC page
> + * @reclaim:	reclaim pages if necessary
> + *
> + * Iterate through EPC sections and borrow a free EPC page to the caller. When a
> + * page is no longer needed it must be released with sgx_free_epc_page(). If
> + * @reclaim is set to true, directly reclaim pages when we are out of pages. No
> + * mm's can be locked when @reclaim is set to true.
> + *
> + * Finally, wake up ksgxswapd when the number of pages goes below the watermark
> + * before returning back to the caller.
> + *
> + * Return:
> + *   an EPC page,
> + *   -errno on error
> + */
> +struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> +{
> +	struct sgx_epc_page *entry;

Nit: s/entry/epc_page/
> +
> +	for ( ; ; ) {
> +		entry = __sgx_alloc_epc_page();
> +		if (!IS_ERR(entry)) {
> +			entry->owner = owner;
> +			break;
> +		}
> +
> +		if (list_empty(&sgx_active_page_list))
> +			return ERR_PTR(-ENOMEM);
> +
> +		if (!reclaim) {
> +			entry = ERR_PTR(-EBUSY);
> +			break;
> +		}
> +
> +		if (signal_pending(current)) {
> +			entry = ERR_PTR(-ERESTARTSYS);
> +			break;
> +		}
> +
> +		sgx_reclaim_pages();

This is the direct reclaim mode with ksgxswapd that works in
the background ignored in the entire for loop. But we can go
with it in parallel, see below, if it tries as hard as it can
to maitain the watermark in which allocators may have no
interest.

> +		schedule();

To cut allocator's latency use cond_resched();

> +	}
> +
> +	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> +		wake_up(&ksgxswapd_waitq);

Nit: s/ksgxswapd/sgxd/ as it seems to have nothing to do with swap,
given sgx itself is clear and good enough.
> +
> +	return entry;
> +}

struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
{
	struct sgx_epc_page *epc_page;

	for (;;) {
		epc_page = __sgx_alloc_epc_page();

		if (!IS_ERR(epc_page)) {
			epc_page->owner = owner;
			return epc_page;
		}

		if (signal_pending(current))
			return ERR_PTR(-ERESTARTSYS);

		if (list_empty(&sgx_active_page_list) || !reclaim)
			return ERR_PTR(-ENOMEM);

		wake_up(&ksgxswapd_waitq);
		cond_resched();
	}
	return ERR_PTR(-ENOMEM);
}


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 21/24] x86/sgx: Add a page reclaimer
  2020-11-08  3:56   ` Hillf Danton
@ 2020-11-09 19:59     ` Jarkko Sakkinen
  0 siblings, 0 replies; 33+ messages in thread
From: Jarkko Sakkinen @ 2020-11-09 19:59 UTC (permalink / raw)
  To: Hillf Danton
  Cc: x86, linux-sgx, linux-kernel, linux-mm, Jethro Beekman,
	Jordan Hand, Nathaniel McCallum, Chunyang Hui, Seth Moore,
	Sean Christopherson, akpm, andriy.shevchenko, yaozhangx,
	mikko.ylinen

On Sun, Nov 08, 2020 at 11:56:30AM +0800, Hillf Danton wrote:
> On Wed,  4 Nov 2020 16:54:27 Jarkko Sakkinen wrote:
> [...]
> > +/**
> > + * sgx_alloc_epc_page() - Allocate an EPC page
> > + * @owner:	the owner of the EPC page
> > + * @reclaim:	reclaim pages if necessary
> > + *
> > + * Iterate through EPC sections and borrow a free EPC page to the caller. When a
> > + * page is no longer needed it must be released with sgx_free_epc_page(). If
> > + * @reclaim is set to true, directly reclaim pages when we are out of pages. No
> > + * mm's can be locked when @reclaim is set to true.
> > + *
> > + * Finally, wake up ksgxswapd when the number of pages goes below the watermark
> > + * before returning back to the caller.
> > + *
> > + * Return:
> > + *   an EPC page,
> > + *   -errno on error
> > + */
> > +struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> > +{
> > +	struct sgx_epc_page *entry;
> 
> Nit: s/entry/epc_page/
> > +
> > +	for ( ; ; ) {
> > +		entry = __sgx_alloc_epc_page();
> > +		if (!IS_ERR(entry)) {
> > +			entry->owner = owner;
> > +			break;
> > +		}
> > +
> > +		if (list_empty(&sgx_active_page_list))
> > +			return ERR_PTR(-ENOMEM);
> > +
> > +		if (!reclaim) {
> > +			entry = ERR_PTR(-EBUSY);
> > +			break;
> > +		}
> > +
> > +		if (signal_pending(current)) {
> > +			entry = ERR_PTR(-ERESTARTSYS);
> > +			break;
> > +		}
> > +
> > +		sgx_reclaim_pages();
> i
> This is the direct reclaim mode with ksgxswapd that works in
> the background ignored in the entire for loop. But we can go
> with it in parallel, see below, if it tries as hard as it can
> to maitain the watermark in which allocators may have no
> interest.

I think this policy should be left at is and once the code in mainline
further refined. Consider it as a baseline/initial version for
reclaiming code.

> > +		schedule();
> 
> To cut allocator's latency use cond_resched();

Thanks, I'll change this.

> > +	}
> > +
> > +	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> > +		wake_up(&ksgxswapd_waitq);
> 
> Nit: s/ksgxswapd/sgxd/ as it seems to have nothing to do with swap,
> given sgx itself is clear and good enough.

Yeah, it also handling kexec() situation, i.e. has multitude of
functions.

> > +
> > +	return entry;
> > +}
> 
> struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> {
> 	struct sgx_epc_page *epc_page;
> 
> 	for (;;) {
> 		epc_page = __sgx_alloc_epc_page();
> 
> 		if (!IS_ERR(epc_page)) {
> 			epc_page->owner = owner;
> 			return epc_page;
> 		}
> 
> 		if (signal_pending(current))
> 			return ERR_PTR(-ERESTARTSYS);
> 
> 		if (list_empty(&sgx_active_page_list) || !reclaim)
> 			return ERR_PTR(-ENOMEM);
> 
> 		wake_up(&ksgxswapd_waitq);
> 		cond_resched();
> 	}
> 	return ERR_PTR(-ENOMEM);
> }

/Jarkko


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-07 19:16         ` Dave Hansen
@ 2020-11-12 20:58           ` Dr. Greg
  2020-11-12 21:31             ` Dave Hansen
  0 siblings, 1 reply; 33+ messages in thread
From: Dr. Greg @ 2020-11-12 20:58 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Jarkko Sakkinen, x86, linux-sgx, linux-kernel,
	Sean Christopherson, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Darren Kenny, andriy.shevchenko, asapek, bp,
	cedric.xing, chenalexchen, conradparker, cyhanish, haitao.huang,
	kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman, npmccallum,
	puiterwijk, rientjes, tglx, yaozhangx, mikko.ylinen

On Sat, Nov 07, 2020 at 11:16:25AM -0800, Dave Hansen wrote:

Good afternoon, I hope the week is going well for everyone.

> On 11/7/20 7:09 AM, Dr. Greg wrote:
> > In all of these discussions there hasn't been a refutation of my point
> > that the only reason this hook is needed is to stop the potential for
> > anonymous code execution on SGX2 capable hardware.  So we will assume,
> > that while unspoken, this is the rationale for the hook.

> Unspoken?  See from the cover letter:

The complexity of the issues involved and the almost complete lack of
understanding of the technology in the Linux user community would
suggest that everyone would benefit from a more detailed explanation
of the issues at hand.

> > 3. Enclave page permissions are dynamic (just like normal permissions) and
> >    can be adjusted at runtime with mprotect().

> I explicitly chose not to name the instructions, nor the exact
> version of the SGX ISA that introduces them.  They're supported in
> the series, and that's all that matters.

When it comes to security issues and risk assessment it always seems
that more information is better, but of course opinions vary, wildly
it would seem in the case of this technology.

I've been told countless times in my career: "What happens if you get
hit by a bus".  I've tried to address those issues by generating
copious amounts of documentation on everything I do.  Having the
relevant issues with respect to the security considerations and
implications of this technology clearly documented would seem to
address the 'hit by a bus' issue for other developers that may need to
look at and understand the code down the road.

> If you want to advocate for something different to be done, patches
> are accepted.

I'm including a patch below that addresses the mprotect vulnerability
as simplistically as I believe it can be addressed and provides some
background information on the issues at hand.  I will let people more
wise then I am decide whether or not the world at large is worse off
for having the information available.

I tested this with our runtime, which is of a significantly different
design then Intel's.  After testing multiple adversarial approaches to
changing page permissions, I'm left struggling to understand what the
page walking code accomplishes, even in the case of mmap.

The ultimate decision with respect to allowed page permissions is
cryptographically encoded in the enclave measurement.  The enclave
won't initialize if changes are made to the desired EPCM permissions.
If an attempt is made to use mmap to alter those permissions at the OS
level they will be inhibited by the EPCM permission verifications.

If one reads the EDMM papers by the Intel engineering team that
designed the technology, they were very concerned about an enclave
having to agree to any virtual memory changes, hence the need for
ENCLU[EACCEPT] and ENCLU[EACCEPTCOPY].  In that respect the behavior
of ENCLU[EMODPE] is somewhat interesting in that it gives untrusted
userspace the ability to thwart the intentions of enclave code.

They may not, however, have had any other choice given that SGX was
designed as a virtual memory play in order to make it an 'easy'
add-on.

Given all of this, it seems to be the case that the only thing needed
to block anonymous code execution is to block mprotect on an
initialized enclave, which the attached patch does.  If and when the
driver supports EDMM there is, I believe, a very open question as to
what type of introspection capability the kernel should have.

More on that in a subsequent post/patch.

Have a good evening.

Dr. Greg

Cut here. -----------------------------------------------------------------
From 68cba86b0cb3c5917e8c42d83edd5220e2890bb1 Mon Sep 17 00:00:00 2001
From: "Dr. Greg" <greg@enjellic.com>
Date: Thu, 12 Nov 2020 04:48:57 -0600
Subject: [PATCH] Unconditionally block mprotect of enclave memory.

In SGX there are two levels of memory protection, the classic
page table mechanism and SGX hardware based page protections
that are codified in the Enclave Page Cache Metadata.  A
successful memory access requires that both mechanisms agree
that the access is permitted.

In the initial implementation of SGX (SGX1), the page permissions
are immutable after the enclave is initialized.  Even if classic
page protections are modified via mprotect, any attempt to access
enclave memory with alternative permissions will be blocked.

One of the architectural changes implemented in the second
generation of SGX (SGX2) is the ability for page access
permissions to be dynamically manipulated after the enclave is
initialized.  This requires coordination between trusted code
running in the enclave and untrusted code using mprotect and
special ring-0 instructions.

One of the security threats associated with SGX2 hardware is that
enclave based code can conspire with its untrusted runtime to make
executable enclave memory writable.  This provides the opportunity for
completely anonymous code execution that the operating system has no
visibility into.

All that is needed to, simply, close this vulnerability is to
eliminate the ability to call mprotect against the virtual memory
range of an enclave after it is initialized.  Any mprotect changes
made prior to initialization that are inconsistent with the
permissions codified in the enclave will cause initialization and/or
access to fail.

Tested-by: Dr. Greg <greg@enjellic.com>
Signed-off-by: Dr. Greg <greg@enjellic.com>
---
 arch/x86/kernel/cpu/sgx/encl.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 139f8c398685..c613482ebb56 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -270,11 +270,10 @@ static int sgx_vma_mprotect(struct vm_area_struct *vma,
 			    struct vm_area_struct **pprev, unsigned long start,
 			    unsigned long end, unsigned long newflags)
 {
-	int ret;
+	struct sgx_encl *encl = vma->vm_private_data;
 
-	ret = sgx_encl_may_map(vma->vm_private_data, start, end, newflags);
-	if (ret)
-		return ret;
+	if ( test_bit(SGX_ENCL_INITIALIZED, &encl->flags) )
+		return -EACCES;
 
 	return mprotect_fixup(vma, pprev, start, end, newflags);
 }
-- 
2.19.2


And here (demonstrating my age). ------------------------------------------

As always,
Dr. Greg Wettstein, Ph.D, Worker      Autonomously self-defensive
Enjellic Systems Development, LLC     IOT platforms and edge devices.
4206 N. 19th Ave.
Fargo, ND  58102
PH: 701-281-1686                      EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"Umm.. the developers behind Flame were able to hijack Windows update,
 gain access to a Microsoft code signing and website signing key while
 staying undetected in the wild for at least 2+ years.

 But System Restore 2.0 is going to stop them?  Your average piece of
 malware can survive a system restore..."
                                -- Slashdot


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-12 20:58           ` Dr. Greg
@ 2020-11-12 21:31             ` Dave Hansen
  2020-11-12 22:41               ` Andy Lutomirski
  2020-11-15 18:59               ` Dr. Greg
  0 siblings, 2 replies; 33+ messages in thread
From: Dave Hansen @ 2020-11-12 21:31 UTC (permalink / raw)
  To: Dr. Greg
  Cc: Jarkko Sakkinen, x86, linux-sgx, linux-kernel,
	Sean Christopherson, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Darren Kenny, andriy.shevchenko, asapek, bp,
	cedric.xing, chenalexchen, conradparker, cyhanish, haitao.huang,
	kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman, npmccallum,
	puiterwijk, rientjes, tglx, yaozhangx, mikko.ylinen

On 11/12/20 12:58 PM, Dr. Greg wrote:
> @@ -270,11 +270,10 @@ static int sgx_vma_mprotect(struct vm_area_struct *vma,
>  			    struct vm_area_struct **pprev, unsigned long start,
>  			    unsigned long end, unsigned long newflags)
>  {
> -	int ret;
> +	struct sgx_encl *encl = vma->vm_private_data;
>  
> -	ret = sgx_encl_may_map(vma->vm_private_data, start, end, newflags);
> -	if (ret)
> -		return ret;
> +	if ( test_bit(SGX_ENCL_INITIALIZED, &encl->flags) )
> +		return -EACCES;
>  
>  	return mprotect_fixup(vma, pprev, start, end, newflags);
>  }

This rules out mprotect() on running enclaves.  Does that break any
expectations from enclave authors, or take away capabilities that folks
need?


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-12 21:31             ` Dave Hansen
@ 2020-11-12 22:41               ` Andy Lutomirski
  2020-11-16 18:00                 ` Dr. Greg
  2020-11-15 18:59               ` Dr. Greg
  1 sibling, 1 reply; 33+ messages in thread
From: Andy Lutomirski @ 2020-11-12 22:41 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dr. Greg, Jarkko Sakkinen, X86 ML, linux-sgx, LKML,
	Sean Christopherson, Linux-MM, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Darren Kenny, Andy Shevchenko, asapek,
	Borislav Petkov, Xing, Cedric, chenalexchen, Conrad Parker,
	cyhanish, Huang, Haitao, Huang, Kai, Svahn, Kai, Keith Moyer,
	Christian Ludloff, Andrew Lutomirski, Neil Horman,
	Nathaniel McCallum, Patrick Uiterwijk, David Rientjes,
	Thomas Gleixner, yaozhangx, Mikko Ylinen

On Thu, Nov 12, 2020 at 1:31 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 11/12/20 12:58 PM, Dr. Greg wrote:
> > @@ -270,11 +270,10 @@ static int sgx_vma_mprotect(struct vm_area_struct *vma,
> >                           struct vm_area_struct **pprev, unsigned long start,
> >                           unsigned long end, unsigned long newflags)
> >  {
> > -     int ret;
> > +     struct sgx_encl *encl = vma->vm_private_data;
> >
> > -     ret = sgx_encl_may_map(vma->vm_private_data, start, end, newflags);
> > -     if (ret)
> > -             return ret;
> > +     if ( test_bit(SGX_ENCL_INITIALIZED, &encl->flags) )
> > +             return -EACCES;
> >
> >       return mprotect_fixup(vma, pprev, start, end, newflags);
> >  }
>
> This rules out mprotect() on running enclaves.  Does that break any
> expectations from enclave authors, or take away capabilities that folks
> need?

It certainly prevents any scheme in which an enclave coordinates with
the outside world to do W-and-then-X JIT inside.  I'm also not
convinced it has any real effect at all unless there's some magic I
missed to prevent someone from using mmap(2) to effectively change
permissions.

Everyone, IMO this SGX1 - vs - SGX2 - vs - EDMM discussion is entirely
missing the point and is a waste of everyone's time.  Let's pretend
we're building a system that has nothing to do with SGX and requires
no special hardware support at all.  It works like this:

A user program opens /dev/xyz and gets back an fd that represents 16
MB of memory.  The user program copies some data from disk (or network
or whatever) into fd (using write(2) or ioctl(2) or mmap(2) and
memcpy) and then mmaps some of the fd as R and some as RW and some as
RX, and then the user program jumps into the RX mapping.

If we replace /dev/xyz with /dev/zero, then this simply does not work
under a reasonably strict W^X policy -- a lot of people think it's
quite reasonable for an OS to prevent a user program from obtaining an
X mapping containing anything other than a mapping from a file on
disk.  To solve this, we can do one of at least three things:

a) You can't use /dev/xyz unless you have permission to create WX
memory or to at least create W memory and then change it to X.

b) You can do whatever you want with /dev/xyz, and LSM policies are
blatantly violated as a result.

c) The /dev/xyz API is clever and tracks, page-by-page, whether the
user intends to ever write and/or execute that page, and behaves
accordingly.

This patchset takes the approach (c).  The actual clever policy isn't
here yet, and we don't really know whether it will ever appear, but
the API is set up to enable such a policy to be written.  This appears
to be a win for everyone, since the code is pretty clean and the API
is straightforward.


Now, back to SGX.  There are only two things that are remotely
SGX-specific here.  First, SGX requires this unusual memory model in
which there is an executable mapping of (part of) a device node. [0]
Second, early SGX hardware had this oddity that the kernel could set a
per-backing-page (as opposed to per-PTE) bit to permanently disable X
on a given /dev/sgx page.  Building a security model around that would
have been a hack, and it DOES NOT WORK on new hardware.  So can we
please stop discussing it?  None of the actual interesting parts of
this have much to do with SGX per se and have nothing whatsoever to do
with EDMM or any other Intel buzzword.

Heck, if anyone actually cared to do so, something with essentially
the same semantics could probably be built using SEV hardware instead
of SGX, and it would have exactly the same issue if we wanted it to
work for tasks that didn't have access to /dev/kvm.


[0] SGX doesn't *really* require this.  We could set things up so that
you do mmap(..., MAP_ANONYMOUS, fd, ...) and then somehow introduce
that mapping to SGX.  I think the result would be too disgusting to
seriously consider.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-12 21:31             ` Dave Hansen
  2020-11-12 22:41               ` Andy Lutomirski
@ 2020-11-15 18:59               ` Dr. Greg
  1 sibling, 0 replies; 33+ messages in thread
From: Dr. Greg @ 2020-11-15 18:59 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Jarkko Sakkinen, x86, linux-sgx, linux-kernel,
	Sean Christopherson, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Darren Kenny, andriy.shevchenko, asapek, bp,
	cedric.xing, chenalexchen, conradparker, cyhanish, haitao.huang,
	kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman, npmccallum,
	puiterwijk, rientjes, tglx, yaozhangx, mikko.ylinen

On Thu, Nov 12, 2020 at 01:31:19PM -0800, Dave Hansen wrote:

Good afternoon to everyone.

> On 11/12/20 12:58 PM, Dr. Greg wrote:
> > @@ -270,11 +270,10 @@ static int sgx_vma_mprotect(struct vm_area_struct *vma,
> >  			    struct vm_area_struct **pprev, unsigned long start,
> >  			    unsigned long end, unsigned long newflags)
> >  {
> > -	int ret;
> > +	struct sgx_encl *encl = vma->vm_private_data;
> >  
> > -	ret = sgx_encl_may_map(vma->vm_private_data, start, end, newflags);
> > -	if (ret)
> > -		return ret;
> > +	if ( test_bit(SGX_ENCL_INITIALIZED, &encl->flags) )
> > +		return -EACCES;
> >  
> >  	return mprotect_fixup(vma, pprev, start, end, newflags);
> >  }

> This rules out mprotect() on running enclaves.  Does that break any
> expectations from enclave authors, or take away capabilities that
> folks need?

As I mentioned an hour or so ago when I posted our updated patch, Sean
and Jarkko have specifically indicated that there is no intention to
support Enclave Dynamic Memory Management (EDMM) at this stage of the
driver.  I believe the intent is to open that can of worms after the
driver is mainlined.

Since the stated intent of the driver is to only implement SGX1
semantics there is no need to allow page permission changes of any
type after the enclave is initialized.  If mmap/mprotect are taken off
the table for an initialized enclave, there is no need to walk the
enclave page permissions since the hardware itself will enforce those
intents.

Runtime support is critical to implementing EDMM.  It seems premature
to place code into the kernel until there is agreement from the
runtime developers as to how page permission intent should be
communicated into the kernel.  Current EDMM implementations simply
allocate a sparse aperture which can be further extended, for example,
to increase heap space or the number of Task Control Structures.

As I've stated previously, there is an open question at this point as
to how useful a mainline driver will be without EDMM support, unless
the distributions or cloud providers are going to patch it in on top
of the mainline driver.  Those players have been copied on all of
these e-mails so I would assume they could/would pipe up with comments
on what type of security architecture should be implemented.

As I've stated before, I believe in the final analysis that the only
relevant question is yes or no with respect to dynamic enclaves.

Have a good remainder of the weekend.

Dr. Greg

As always,
Greg Wettstein, Ph.D, Worker          Autonomously self-defensive
Enjellic Systems Development, LLC     IOT platforms and edge devices.
4206 N. 19th Ave.
Fargo, ND  58102
PH: 701-281-1686                      EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"If you think nobody cares if you're alive, try missing a couple of car
 payments."
                                -- Earl Wilson


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-12 22:41               ` Andy Lutomirski
@ 2020-11-16 18:00                 ` Dr. Greg
  2020-11-19  1:39                   ` Haitao Huang
  0 siblings, 1 reply; 33+ messages in thread
From: Dr. Greg @ 2020-11-16 18:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Jarkko Sakkinen, X86 ML, linux-sgx, LKML,
	Sean Christopherson, Linux-MM, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Darren Kenny, Andy Shevchenko, asapek,
	Borislav Petkov, Xing, Cedric, chenalexchen, Conrad Parker,
	cyhanish, Huang, Haitao, Huang, Kai, Svahn, Kai, Keith Moyer,
	Christian Ludloff, Neil Horman, Nathaniel McCallum,
	Patrick Uiterwijk, David Rientjes, Thomas Gleixner, yaozhangx,
	Mikko Ylinen

On Thu, Nov 12, 2020 at 02:41:00PM -0800, Andy Lutomirski wrote:

Good morning, I hope the week is starting well for everyone.

> On Thu, Nov 12, 2020 at 1:31 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 11/12/20 12:58 PM, Dr. Greg wrote:
> > > @@ -270,11 +270,10 @@ static int sgx_vma_mprotect(struct vm_area_struct *vma,
> > >                           struct vm_area_struct **pprev, unsigned long start,
> > >                           unsigned long end, unsigned long newflags)
> > >  {
> > > -     int ret;
> > > +     struct sgx_encl *encl = vma->vm_private_data;
> > >
> > > -     ret = sgx_encl_may_map(vma->vm_private_data, start, end, newflags);
> > > -     if (ret)
> > > -             return ret;
> > > +     if ( test_bit(SGX_ENCL_INITIALIZED, &encl->flags) )
> > > +             return -EACCES;
> > >
> > >       return mprotect_fixup(vma, pprev, start, end, newflags);
> > >  }
> >
> > This rules out mprotect() on running enclaves.  Does that break any
> > expectations from enclave authors, or take away capabilities that folks
> > need?

> It certainly prevents any scheme in which an enclave coordinates
> with the outside world to do W-and-then-X JIT inside.  I'm also not
> convinced it has any real effect at all unless there's some magic I
> missed to prevent someone from using mmap(2) to effectively change
> permissions.

The patch that I posted yesterday addresses the security issue for
both mmap and mprotect by trapping the permission change request at
the level of the sgx_encl_may_map() function.

With respect to the W-and-then-X JIT issue, the stated purpose of the
driver is to implement basic SGX functionality, which is SGX1
semantics, it has been stated formally for a year by the developers
themselves that they are not entertaining a driver that addresses any
of the issues associated with non-static memory permissions.

As I've noted previously, the hardware itself is capable of enforcing
that after initialization, if mmap/mprotect is blocked on hardware
that supports SGX2 instructions.

> Everyone, IMO this SGX1 - vs - SGX2 - vs - EDMM discussion is
> entirely missing the point and is a waste of everyone's time.  Let's
> pretend we're building a system that has nothing to do with SGX and
> requires no special hardware support at all.  It works like this:

I don't doubt there is a potential bigger vision here, quite frankly
it is probably an open question whether or not SGX is going to be a
part of this future, for a variety of reasons.  I also do not doubt
that you have the skills to define that vision.

Right now, however, the issue is not about pretending but rather one
of getting a driver into the kernel that provides a framework for
building whatever future SGX may have.  Given GKH's comments on LWN
last week in response to the RAPL vulnerability, I'm not sure if it is
a politically done deal that the driver will go in.

SGX has specific hardware characteristics that impact the driver, I
don't see how fitting it into a generic trusted execution model
advances the agenda immediately at hand.  Particularly given the fact
that I'm not even sure people understand the questions that need to be
answered about such a generic model.

> A user program opens /dev/xyz and gets back an fd that represents 16
> MB of memory.  The user program copies some data from disk (or
> network or whatever) into fd (using write(2) or ioctl(2) or mmap(2)
> and memcpy) and then mmaps some of the fd as R and some as RW and
> some as RX, and then the user program jumps into the RX mapping.

This is basically the SGX model in the new driver.  The important
defining characteristic of the driver, that we can't wave away, is
that the hardware requires a specific set of initial page permissions
to be implemented in order for initialization of the memory range
(enclave) to succeed.

This is inherent to the way SGX hardware was designed to work.  The
only difference between SGX1 and SGX2 is that the latter offers a
small number of additional instructions that allow the page
permissions to be dynamically manipulated after initialization is
complete.

From a security perspective, the issue at hand is that the executable
material is not going to come in through the fd, it is going to be
loaded by the enclave over the network.  This isn't fear mongering, it
is the stated intent of what people want to do with the technology as
a integral part of confidential computing.

I've had the opportunity to brief DOD and other entities concerned
with intelligence issues, about these type of potential capabilities.
It isn't hard to envision scenarios of where having potentially
sensitive code and data only ever handled and executed by a trusted
entity, in an environment that is inherently ephemeral with respect to
its persistence, is an important design characteristic.  Thermite has
also been known to play a role in some of these designs prior to the
greater elegance of trusted execution environments.

Ultimately, if you believe the Confidential Computing Consortium, it
is also what people want for their sensitive cloud workloads.  Absent
the thermite of course.

> If we replace /dev/xyz with /dev/zero, then this simply does not work
> under a reasonably strict W^X policy -- a lot of people think it's
> quite reasonable for an OS to prevent a user program from obtaining an
> X mapping containing anything other than a mapping from a file on
> disk.  To solve this, we can do one of at least three things:
> 
> a) You can't use /dev/xyz unless you have permission to create WX
> memory or to at least create W memory and then change it to X.
> 
> b) You can do whatever you want with /dev/xyz, and LSM policies are
> blatantly violated as a result.

I think the important issue at hand is that classic LSM policies
simply are not relevant with respect to how this technology was
designed to operate, and perhaps more importantly, how people want to
use it.

That is why I have consistently stated that I think the only relevant
knob is a binary decision as to whether or not a platform owner wants
to entertain completely anonymous code execution.

> c) The /dev/xyz API is clever and tracks, page-by-page, whether the
> user intends to ever write and/or execute that page, and behaves
> accordingly.
>
> This patchset takes the approach (c).  The actual clever policy
> isn't here yet, and we don't really know whether it will ever
> appear, but the API is set up to enable such a policy to be written.
> This appears to be a win for everyone, since the code is pretty
> clean and the API is straightforward.

I believe I have been clear in stating that I have never doubted the
cleverness of the approach or its potential utility for the future.

The issue at hand is that it simply isn't relevant at this stage of
the driver.  Getting this new vision right is something that would
benefit from a lot of conversations between runtime and kernel
developers.  Arguably, the case can be made that it should have a
second type of implementation to ensure that the approach is generic,
extensible and most importantly secure.

The 'cleverness' of the policy needs to be evaluated in the context of
what does it mean with respect to the risk arbitration decisions that
we are trying to support.  The open question comes down to, in
essence, asking ourselves whether or not we believe that it makes
sense to say that 15 pages of RWX memory is a security threat but 5
are not.

> Now, back to SGX.  There are only two things that are remotely
> SGX-specific here.  First, SGX requires this unusual memory model in
> which there is an executable mapping of (part of) a device node. [0]
> Second, early SGX hardware had this oddity that the kernel could set
> a per-backing-page (as opposed to per-PTE) bit to permanently
> disable X on a given /dev/sgx page.  Building a security model
> around that would have been a hack, and it DOES NOT WORK on new
> hardware.  So can we please stop discussing it?  None of the actual
> interesting parts of this have much to do with SGX per se and have
> nothing whatsoever to do with EDMM or any other Intel buzzword.

Just a clarification for everyone sitting in their recliners eating
popcorn and following along at home.

As I've stated before, I don't argue the potential utility of some new
model, SGX however, has hardware characteristics that cannot be waved
away in this discussion.  The technology was designed to have a
cryptographic measurement that controls whether or not the memory
image is suitable for execution.  The description of that image is
defined by the enclave author when the enclave is signed.

This is why the current EDMM implementation requires that a maximum
aperture range be defined for dynamic memory regions.  Since the
linear address of each page address in the enclave is encoded into the
measurement, enclave initialization will fail unless the loaded memory
image is consistent with the wishes of the enclave signer.  Having 40
pages of potential heap memory when the author called for 39 would be
considered an initialization defect that would be enforced by the
hardware.

The desired page permissions are also encoded into the enclave
measurement.  Since the current implementation takes the maximum
scoped permissions from the security information encoded in the
enclave, it would require that the enclave encode for RWX permissions
if the intent was to dynmically load executable or JIT code after the
enclave was initialized.  If I understand your above analysis
correctly, this would be problematic for current security
models/practices.

Obviously an API could be proposed that allowed this permissable
memory map to be defined independently from the enclave.  This notion,
based on my read of the security risk considerations that went into
the design of SGX, would be problematic, since it would allow an
untrusted party to modify the characteristics that were desired by the
enclave author for the executable image.

> Heck, if anyone actually cared to do so, something with essentially
> the same semantics could probably be built using SEV hardware
> instead of SGX, and it would have exactly the same issue if we
> wanted it to work for tasks that didn't have access to /dev/kvm.

As I noted above, perhaps whatever this driver may become in the
future would benefit from the community creating something like this
as a second reference to build an API on top of.

> [0] SGX doesn't *really* require this.  We could set things up so that
> you do mmap(..., MAP_ANONYMOUS, fd, ...) and then somehow introduce
> that mapping to SGX.  I think the result would be too disgusting to
> seriously consider.

Let me be clear, I certainly would not advocate doing anything too
disgusting to consider.

Hopefully our proposal for simplifying the security model for the
driver, while still allowing the framework for a still unspecified
future pathway, doesn't fit this description.

Best wishes for a productive week to everyone.

Dr. Greg

As always,
Dr. Greg Wettstein, Ph.D, Worker      Autonomously self-defensive
Enjellic Systems Development, LLC     IOT platforms and edge devices.
4206 N. 19th Ave.
Fargo, ND  58102
PH: 701-281-1686                      EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"Boy, it must not take much to make a phone work.  Looking at
 everthing else here it must be the same way with the INTERNET."
                                -- Francis 'Fritz' Wettstein


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-16 18:00                 ` Dr. Greg
@ 2020-11-19  1:39                   ` Haitao Huang
  2020-11-20 17:31                     ` Dr. Greg
  0 siblings, 1 reply; 33+ messages in thread
From: Haitao Huang @ 2020-11-19  1:39 UTC (permalink / raw)
  To: Andy Lutomirski, Dr. Greg
  Cc: Dave Hansen, Jarkko Sakkinen, X86 ML, linux-sgx, LKML,
	Sean Christopherson, Linux-MM, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Darren Kenny, Andy Shevchenko, asapek,
	Borislav Petkov, Xing, Cedric, chenalexchen, Conrad Parker,
	cyhanish, Huang, Haitao, Huang, Kai, Svahn, Kai, Keith Moyer,
	Christian Ludloff, Neil Horman, Nathaniel McCallum,
	Patrick Uiterwijk, David Rientjes, Thomas Gleixner, yaozhangx,
	Mikko Ylinen

On Mon, 16 Nov 2020 12:00:23 -0600, Dr. Greg <greg@enjellic.com> wrote:

> On Thu, Nov 12, 2020 at 02:41:00PM -0800, Andy Lutomirski wrote:
>
> Good morning, I hope the week is starting well for everyone.
>
>> On Thu, Nov 12, 2020 at 1:31 PM Dave Hansen <dave.hansen@intel.com>  
>> wrote:
>> >
>> > On 11/12/20 12:58 PM, Dr. Greg wrote:
>> > > @@ -270,11 +270,10 @@ static int sgx_vma_mprotect(struct  
>> vm_area_struct *vma,
>> > >                           struct vm_area_struct **pprev, unsigned  
>> long start,
>> > >                           unsigned long end, unsigned long newflags)
>> > >  {
>> > > -     int ret;
>> > > +     struct sgx_encl *encl = vma->vm_private_data;
>> > >
>> > > -     ret = sgx_encl_may_map(vma->vm_private_data, start, end,  
>> newflags);
>> > > -     if (ret)
>> > > -             return ret;
>> > > +     if ( test_bit(SGX_ENCL_INITIALIZED, &encl->flags) )
>> > > +             return -EACCES;
>> > >
>> > >       return mprotect_fixup(vma, pprev, start, end, newflags);
>> > >  }
>> >
>> > This rules out mprotect() on running enclaves.  Does that break any
>> > expectations from enclave authors, or take away capabilities that  
>> folks
>> > need?
>
>> It certainly prevents any scheme in which an enclave coordinates
>> with the outside world to do W-and-then-X JIT inside.  I'm also not
>> convinced it has any real effect at all unless there's some magic I
>> missed to prevent someone from using mmap(2) to effectively change
>> permissions.
>
> The patch that I posted yesterday addresses the security issue for
> both mmap and mprotect by trapping the permission change request at
> the level of the sgx_encl_may_map() function.
>
> With respect to the W-and-then-X JIT issue, the stated purpose of the
> driver is to implement basic SGX functionality, which is SGX1
> semantics, it has been stated formally for a year by the developers
> themselves that they are not entertaining a driver that addresses any
> of the issues associated with non-static memory permissions.
>

The JIT issue is applicable even to SGX1 platforms. We can do EADD with  
EPCM.RWX in sec_info and with PTE.RW, EINIT, then mprotect to set PTE.RX  
when JIT is done.

Haitao


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
  2020-11-19  1:39                   ` Haitao Huang
@ 2020-11-20 17:31                     ` Dr. Greg
  0 siblings, 0 replies; 33+ messages in thread
From: Dr. Greg @ 2020-11-20 17:31 UTC (permalink / raw)
  To: Haitao Huang
  Cc: Andy Lutomirski, Dave Hansen, Jarkko Sakkinen, X86 ML, linux-sgx,
	LKML, Sean Christopherson, Linux-MM, Andrew Morton,
	Matthew Wilcox, Jethro Beekman, Darren Kenny, Andy Shevchenko,
	asapek, Borislav Petkov, Xing, Cedric, chenalexchen,
	Conrad Parker, cyhanish, Huang, Haitao, Huang, Kai, Svahn, Kai,
	Keith Moyer, Christian Ludloff, Neil Horman, Nathaniel McCallum,
	Patrick Uiterwijk, David Rientjes, Thomas Gleixner, yaozhangx,
	Mikko Ylinen

On Wed, Nov 18, 2020 at 07:39:50PM -0600, Haitao Huang wrote:

Good morning, I hope the week is ending well for everyone.

> On Mon, 16 Nov 2020 12:00:23 -0600, Dr. Greg <greg@enjellic.com> wrote:
> 
> >On Thu, Nov 12, 2020 at 02:41:00PM -0800, Andy Lutomirski wrote:
> >>It certainly prevents any scheme in which an enclave coordinates
> >>with the outside world to do W-and-then-X JIT inside.  I'm also not
> >>convinced it has any real effect at all unless there's some magic I
> >>missed to prevent someone from using mmap(2) to effectively change
> >>permissions.
> >
> >The patch that I posted yesterday addresses the security issue for
> >both mmap and mprotect by trapping the permission change request at
> >the level of the sgx_encl_may_map() function.
> >
> >With respect to the W-and-then-X JIT issue, the stated purpose of the
> >driver is to implement basic SGX functionality, which is SGX1
> >semantics, it has been stated formally for a year by the developers
> >themselves that they are not entertaining a driver that addresses any
> >of the issues associated with non-static memory permissions.

> The JIT issue is applicable even to SGX1 platforms. We can do EADD
> with EPCM.RWX in sec_info and with PTE.RW, EINIT, then mprotect to
> set PTE.RX when JIT is done.

Correct, which further underscores the point that I am trying make, it
is unclear what the current mmap/mprotect protection architecture is
accomplishing, since it only enforces EPCM permissions.  The hardware
is perfectly capable of doing so without assistance from the operating
system, in fact, the very reason it was designed was to provide
protections in the face of an adversarial operating system.

For precisely one year, as of next week, we have engaged in a
re-design of this driver, driven by the concern that the previous
driver allowed execution of code that bypassed the LSM.  So I'm
assuming the community has concerns about the possibility of anonymous
code execution.  If this is the case, the mmap/mprotect protection
code in the driver should be implementing some type of control over
the execution of anonymous memory, which our patch implements, very
simply and very understandably.

The other straight forward alternative is a knob that tells the driver
to disallow the addition of any page that attempts to set EPCM.XW
permissions.

As always, corrections gladly accepted if our analysis of the driver
or how the hardware itself works is incorrect.

> Haitao

Have a good weekend.

Dr. Greg

As always,
Dr. Greg Wettstein, Ph.D, Worker      Autonomously self-defensive
Enjellic Systems Development, LLC     IOT platforms and edge devices.
4206 N. 19th Ave.
Fargo, ND  58102
PH: 701-281-1686                      EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"My second remark is that our intellectual powers are rather geared to
 master static relations and that our powers to visualize processes
 evolving in time are relatively poorly developed."
                                -- Edsger W. Dijkstra


^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2020-11-20 17:31 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20201104145430.300542-1-jarkko.sakkinen@linux.intel.com>
2020-11-04 14:54 ` [PATCH v40 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct Jarkko Sakkinen
2020-11-05 16:04   ` Borislav Petkov
2020-11-05 17:33     ` Dave Hansen
2020-11-06 10:04   ` Mel Gorman
2020-11-06 16:51     ` Jarkko Sakkinen
2020-11-06 20:37       ` Borislav Petkov
2020-11-06 22:04         ` Jarkko Sakkinen
2020-11-06 22:31           ` Borislav Petkov
2020-11-06 17:43   ` Dr. Greg
2020-11-06 17:54     ` Dave Hansen
2020-11-07 15:09       ` Dr. Greg
2020-11-07 19:16         ` Dave Hansen
2020-11-12 20:58           ` Dr. Greg
2020-11-12 21:31             ` Dave Hansen
2020-11-12 22:41               ` Andy Lutomirski
2020-11-16 18:00                 ` Dr. Greg
2020-11-19  1:39                   ` Haitao Huang
2020-11-20 17:31                     ` Dr. Greg
2020-11-15 18:59               ` Dr. Greg
2020-11-06 21:13     ` Matthew Wilcox
2020-11-06 21:23       ` Dave Hansen
2020-11-07 15:27       ` Dr. Greg
2020-11-04 14:54 ` [PATCH v40 11/24] x86/sgx: Add SGX misc driver interface Jarkko Sakkinen
     [not found]   ` <20201105011043.GA700495@kernel.org>
     [not found]     ` <20201105011615.GA701257@kernel.org>
2020-11-05 16:05       ` Borislav Petkov
2020-11-05 17:57         ` Jarkko Sakkinen
2020-11-05 18:10           ` Borislav Petkov
2020-11-06 16:07             ` Jarkko Sakkinen
2020-11-06 17:09               ` Borislav Petkov
2020-11-06 22:01                 ` Jarkko Sakkinen
2020-11-04 14:54 ` [PATCH v40 21/24] x86/sgx: Add a page reclaimer Jarkko Sakkinen
2020-11-08  3:56   ` Hillf Danton
2020-11-09 19:59     ` Jarkko Sakkinen
2020-11-04 14:54 ` [PATCH v40 22/24] x86/sgx: Add ptrace() support for the SGX driver Jarkko Sakkinen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).