linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v39 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct
       [not found] <20201003045059.665934-1-jarkko.sakkinen@linux.intel.com>
@ 2020-10-03  4:50 ` Jarkko Sakkinen
  2020-10-03  4:50 ` [PATCH v39 11/24] x86/sgx: Add SGX enclave driver Jarkko Sakkinen
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 35+ messages in thread
From: Jarkko Sakkinen @ 2020-10-03  4:50 UTC (permalink / raw)
  To: x86, linux-sgx
  Cc: linux-kernel, Sean Christopherson, linux-mm, Andrew Morton,
	Matthew Wilcox, Jethro Beekman, Darren Kenny, Jarkko Sakkinen,
	andriy.shevchenko, asapek, bp, cedric.xing, chenalexchen,
	conradparker, cyhanish, dave.hansen, haitao.huang, kai.huang,
	kai.svahn, kmoy, ludloff, luto, nhorman, npmccallum, puiterwijk,
	rientjes, tglx, yaozhangx, mikko.ylinen

From: Sean Christopherson <sean.j.christopherson@intel.com>

Background
==========

1. SGX enclave pages are populated with data by copying data to them from
   normal memory via an ioctl() (SGX_IOC_ENCLAVE_ADD_PAGES).
2. It is desirable to be able to restrict those normal memory data sources.
   For instance, to ensure that the source data is executable before
   copying data to an executable enclave page.
3. Enclave page permissions are dynamic (just like normal permissions) and
   can be adjusted at runtime with mprotect().
4. The original data source may have long since vanished at the time when
   enclave page permissions are established (mmap() or mprotect()).

The solution (elsewhere in this series) is to force enclaves creators to
declare their paging permission *intent* up front to the ioctl().  This
intent can me immediately compared to the source data’s mapping (and
rejected if necessary).

The intent is also stashed off for later comparison with enclave PTEs.
This ensures that any future mmap()/mprotect() operations performed by the
enclave creator or done on behalf of the enclave can be compared with the
earlier declared permissions.

Problem
=======

There is an existing mmap() hook which allows SGX to perform this
permission comparison at mmap() time.  However, there is no corresponding
->mprotect() hook.

Solution
========

Add a vm_ops->mprotect() hook so that mprotect() operations which are
inconsistent with any page's stashed intent can be rejected by the driver.

Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
---
 include/linux/mm.h | 3 +++
 mm/mprotect.c      | 5 ++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b2f370f0b420..dca57fe80555 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -551,6 +551,9 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	int (*split)(struct vm_area_struct * area, unsigned long addr);
 	int (*mremap)(struct vm_area_struct * area);
+	int (*mprotect)(struct vm_area_struct *vma,
+			struct vm_area_struct **pprev, unsigned long start,
+			unsigned long end, unsigned long newflags);
 	vm_fault_t (*fault)(struct vm_fault *vmf);
 	vm_fault_t (*huge_fault)(struct vm_fault *vmf,
 			enum page_entry_size pe_size);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ce8b8a5eacbb..f170f3da8a4f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -610,7 +610,10 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 		tmp = vma->vm_end;
 		if (tmp > end)
 			tmp = end;
-		error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
+		if (vma->vm_ops && vma->vm_ops->mprotect)
+			error = vma->vm_ops->mprotect(vma, &prev, nstart, tmp, newflags);
+		else
+			error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
 		if (error)
 			goto out;
 		nstart = tmp;
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
       [not found] <20201003045059.665934-1-jarkko.sakkinen@linux.intel.com>
  2020-10-03  4:50 ` [PATCH v39 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct Jarkko Sakkinen
@ 2020-10-03  4:50 ` Jarkko Sakkinen
  2020-10-03 14:39   ` Greg KH
  2020-10-03 19:54   ` Matthew Wilcox
  2020-10-03  4:50 ` [PATCH v39 16/24] x86/sgx: Add a page reclaimer Jarkko Sakkinen
  2020-10-03  4:50 ` [PATCH v39 17/24] x86/sgx: Add ptrace() support for the SGX driver Jarkko Sakkinen
  3 siblings, 2 replies; 35+ messages in thread
From: Jarkko Sakkinen @ 2020-10-03  4:50 UTC (permalink / raw)
  To: x86, linux-sgx
  Cc: linux-kernel, Jarkko Sakkinen, linux-security-module, linux-mm,
	Andrew Morton, Matthew Wilcox, Jethro Beekman, Haitao Huang,
	Chunyang Hui, Jordan Hand, Nathaniel McCallum, Seth Moore,
	Darren Kenny, Sean Christopherson, Suresh Siddha,
	andriy.shevchenko, asapek, bp, cedric.xing, chenalexchen,
	conradparker, cyhanish, dave.hansen, haitao.huang, kai.huang,
	kai.svahn, kmoy, ludloff, luto, nhorman, puiterwijk, rientjes,
	tglx, yaozhangx, mikko.ylinen

Intel Software Guard eXtensions (SGX) is a set of CPU instructions that can
be used by applications to set aside private regions of code and data. The
code outside the SGX hosted software entity is prevented from accessing the
memory inside the enclave by the CPU. We call these entities enclaves.

Add a driver that provides an ioctl API to construct and run enclaves.
Enclaves are constructed from pages residing in reserved physical memory
areas. The contents of these pages can only be accessed when they are
mapped as part of an enclave, by a hardware thread running inside the
enclave.

The starting state of an enclave consists of a fixed measured set of
pages that are copied to the EPC during the construction process by
using the opcode ENCLS leaf functions and Software Enclave Control
Structure (SECS) that defines the enclave properties.

Enclaves are constructed by using ENCLS leaf functions ECREATE, EADD and
EINIT. ECREATE initializes SECS, EADD copies pages from system memory to
the EPC and EINIT checks a given signed measurement and moves the enclave
into a state ready for execution.

An initialized enclave can only be accessed through special Thread Control
Structure (TCS) pages by using ENCLU (ring-3 only) leaf EENTER.  This leaf
function converts a thread into enclave mode and continues the execution in
the offset defined by the TCS provided to EENTER. An enclave is exited
through syscall, exception, interrupts or by explicitly calling another
ENCLU leaf EEXIT.

The mmap() permissions are capped by the contained enclave page
permissions. The mapped areas must also be populated, i.e. each page
address must contain a page. This logic is implemented in
sgx_encl_may_map().

Cc: linux-security-module@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Tested-by: Haitao Huang <haitao.huang@linux.intel.com>
Tested-by: Chunyang Hui <sanqian.hcy@antfin.com>
Tested-by: Jordan Hand <jorhand@linux.microsoft.com>
Tested-by: Nathaniel McCallum <npmccallum@redhat.com>
Tested-by: Seth Moore <sethmo@google.com>
Tested-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
---
 arch/x86/kernel/cpu/sgx/Makefile |   2 +
 arch/x86/kernel/cpu/sgx/driver.c | 173 ++++++++++++++++
 arch/x86/kernel/cpu/sgx/driver.h |  29 +++
 arch/x86/kernel/cpu/sgx/encl.c   | 331 +++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/encl.h   |  85 ++++++++
 arch/x86/kernel/cpu/sgx/main.c   |  11 +
 6 files changed, 631 insertions(+)
 create mode 100644 arch/x86/kernel/cpu/sgx/driver.c
 create mode 100644 arch/x86/kernel/cpu/sgx/driver.h
 create mode 100644 arch/x86/kernel/cpu/sgx/encl.c
 create mode 100644 arch/x86/kernel/cpu/sgx/encl.h

diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 79510ce01b3b..3fc451120735 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -1,2 +1,4 @@
 obj-y += \
+	driver.o \
+	encl.o \
 	main.o
diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
new file mode 100644
index 000000000000..f54da5f19c2b
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/driver.c
@@ -0,0 +1,173 @@
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+// Copyright(c) 2016-18 Intel Corporation.
+
+#include <linux/acpi.h>
+#include <linux/miscdevice.h>
+#include <linux/mman.h>
+#include <linux/security.h>
+#include <linux/suspend.h>
+#include <asm/traps.h>
+#include "driver.h"
+#include "encl.h"
+
+u64 sgx_encl_size_max_32;
+u64 sgx_encl_size_max_64;
+u32 sgx_misc_reserved_mask;
+u64 sgx_attributes_reserved_mask;
+u64 sgx_xfrm_reserved_mask = ~0x3;
+u32 sgx_xsave_size_tbl[64];
+
+static int sgx_open(struct inode *inode, struct file *file)
+{
+	struct sgx_encl *encl;
+	int ret;
+
+	encl = kzalloc(sizeof(*encl), GFP_KERNEL);
+	if (!encl)
+		return -ENOMEM;
+
+	atomic_set(&encl->flags, 0);
+	kref_init(&encl->refcount);
+	xa_init(&encl->page_array);
+	mutex_init(&encl->lock);
+	INIT_LIST_HEAD(&encl->mm_list);
+	spin_lock_init(&encl->mm_lock);
+
+	ret = init_srcu_struct(&encl->srcu);
+	if (ret) {
+		kfree(encl);
+		return ret;
+	}
+
+	file->private_data = encl;
+
+	return 0;
+}
+
+static int sgx_release(struct inode *inode, struct file *file)
+{
+	struct sgx_encl *encl = file->private_data;
+	struct sgx_encl_mm *encl_mm;
+
+	for ( ; ; )  {
+		spin_lock(&encl->mm_lock);
+
+		if (list_empty(&encl->mm_list)) {
+			encl_mm = NULL;
+		} else {
+			encl_mm = list_first_entry(&encl->mm_list,
+						   struct sgx_encl_mm, list);
+			list_del_rcu(&encl_mm->list);
+		}
+
+		spin_unlock(&encl->mm_lock);
+
+		/* The list is empty, ready to go. */
+		if (!encl_mm)
+			break;
+
+		synchronize_srcu(&encl->srcu);
+		mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
+		kfree(encl_mm);
+	}
+
+	mutex_lock(&encl->lock);
+	atomic_or(SGX_ENCL_DEAD, &encl->flags);
+	mutex_unlock(&encl->lock);
+
+	kref_put(&encl->refcount, sgx_encl_release);
+	return 0;
+}
+
+static int sgx_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct sgx_encl *encl = file->private_data;
+	int ret;
+
+	ret = sgx_encl_may_map(encl, vma->vm_start, vma->vm_end, vma->vm_flags);
+	if (ret)
+		return ret;
+
+	ret = sgx_encl_mm_add(encl, vma->vm_mm);
+	if (ret)
+		return ret;
+
+	vma->vm_ops = &sgx_vm_ops;
+	vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO;
+	vma->vm_private_data = encl;
+
+	return 0;
+}
+
+static unsigned long sgx_get_unmapped_area(struct file *file,
+					   unsigned long addr,
+					   unsigned long len,
+					   unsigned long pgoff,
+					   unsigned long flags)
+{
+	if ((flags & MAP_TYPE) == MAP_PRIVATE)
+		return -EINVAL;
+
+	if (flags & MAP_FIXED)
+		return addr;
+
+	return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+}
+
+static const struct file_operations sgx_encl_fops = {
+	.owner			= THIS_MODULE,
+	.open			= sgx_open,
+	.release		= sgx_release,
+	.mmap			= sgx_mmap,
+	.get_unmapped_area	= sgx_get_unmapped_area,
+};
+
+static struct miscdevice sgx_dev_enclave = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "enclave",
+	.nodename = "sgx/enclave",
+	.fops = &sgx_encl_fops,
+};
+
+int __init sgx_drv_init(void)
+{
+	unsigned int eax, ebx, ecx, edx;
+	u64 attr_mask, xfrm_mask;
+	int ret;
+	int i;
+
+	if (!boot_cpu_has(X86_FEATURE_SGX_LC)) {
+		pr_info("The public key MSRs are not writable.\n");
+		return -ENODEV;
+	}
+
+	cpuid_count(SGX_CPUID, 0, &eax, &ebx, &ecx, &edx);
+	sgx_misc_reserved_mask = ~ebx | SGX_MISC_RESERVED_MASK;
+	sgx_encl_size_max_64 = 1ULL << ((edx >> 8) & 0xFF);
+	sgx_encl_size_max_32 = 1ULL << (edx & 0xFF);
+
+	cpuid_count(SGX_CPUID, 1, &eax, &ebx, &ecx, &edx);
+
+	attr_mask = (((u64)ebx) << 32) + (u64)eax;
+	sgx_attributes_reserved_mask = ~attr_mask | SGX_ATTR_RESERVED_MASK;
+
+	if (boot_cpu_has(X86_FEATURE_OSXSAVE)) {
+		xfrm_mask = (((u64)edx) << 32) + (u64)ecx;
+
+		for (i = 2; i < 64; i++) {
+			cpuid_count(0x0D, i, &eax, &ebx, &ecx, &edx);
+			if ((1UL << i) & xfrm_mask)
+				sgx_xsave_size_tbl[i] = eax + ebx;
+		}
+
+		sgx_xfrm_reserved_mask = ~xfrm_mask;
+	}
+
+	ret = misc_register(&sgx_dev_enclave);
+	if (ret) {
+		pr_err("Creating /dev/sgx/enclave failed with %d.\n", ret);
+		return ret;
+	}
+
+	return 0;
+}
diff --git a/arch/x86/kernel/cpu/sgx/driver.h b/arch/x86/kernel/cpu/sgx/driver.h
new file mode 100644
index 000000000000..f7ce40dedc91
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/driver.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+#ifndef __ARCH_SGX_DRIVER_H__
+#define __ARCH_SGX_DRIVER_H__
+
+#include <crypto/hash.h>
+#include <linux/kref.h>
+#include <linux/mmu_notifier.h>
+#include <linux/radix-tree.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/workqueue.h>
+#include "sgx.h"
+
+#define SGX_EINIT_SPIN_COUNT	20
+#define SGX_EINIT_SLEEP_COUNT	50
+#define SGX_EINIT_SLEEP_TIME	20
+
+extern u64 sgx_encl_size_max_32;
+extern u64 sgx_encl_size_max_64;
+extern u32 sgx_misc_reserved_mask;
+extern u64 sgx_attributes_reserved_mask;
+extern u64 sgx_xfrm_reserved_mask;
+extern u32 sgx_xsave_size_tbl[64];
+
+long sgx_ioctl(struct file *filep, unsigned int cmd, unsigned long arg);
+
+int sgx_drv_init(void);
+
+#endif /* __ARCH_X86_SGX_DRIVER_H__ */
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
new file mode 100644
index 000000000000..c2c4a77af36b
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -0,0 +1,331 @@
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+// Copyright(c) 2016-18 Intel Corporation.
+
+#include <linux/lockdep.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/shmem_fs.h>
+#include <linux/suspend.h>
+#include <linux/sched/mm.h>
+#include "arch.h"
+#include "encl.h"
+#include "encls.h"
+#include "sgx.h"
+
+static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
+						unsigned long addr)
+{
+	struct sgx_encl_page *entry;
+	unsigned int flags;
+
+	/* If process was forked, VMA is still there but vm_private_data is set
+	 * to NULL.
+	 */
+	if (!encl)
+		return ERR_PTR(-EFAULT);
+
+	flags = atomic_read(&encl->flags);
+	if ((flags & SGX_ENCL_DEAD) || !(flags & SGX_ENCL_INITIALIZED))
+		return ERR_PTR(-EFAULT);
+
+	entry = xa_load(&encl->page_array, PFN_DOWN(addr));
+	if (!entry)
+		return ERR_PTR(-EFAULT);
+
+	/* Page is already resident in the EPC. */
+	if (entry->epc_page)
+		return entry;
+
+	return ERR_PTR(-EFAULT);
+}
+
+static void sgx_mmu_notifier_release(struct mmu_notifier *mn,
+				     struct mm_struct *mm)
+{
+	struct sgx_encl_mm *encl_mm = container_of(mn, struct sgx_encl_mm, mmu_notifier);
+	struct sgx_encl_mm *tmp = NULL;
+
+	/*
+	 * The enclave itself can remove encl_mm.  Note, objects can't be moved
+	 * off an RCU protected list, but deletion is ok.
+	 */
+	spin_lock(&encl_mm->encl->mm_lock);
+	list_for_each_entry(tmp, &encl_mm->encl->mm_list, list) {
+		if (tmp == encl_mm) {
+			list_del_rcu(&encl_mm->list);
+			break;
+		}
+	}
+	spin_unlock(&encl_mm->encl->mm_lock);
+
+	if (tmp == encl_mm) {
+		synchronize_srcu(&encl_mm->encl->srcu);
+		mmu_notifier_put(mn);
+	}
+}
+
+static void sgx_mmu_notifier_free(struct mmu_notifier *mn)
+{
+	struct sgx_encl_mm *encl_mm = container_of(mn, struct sgx_encl_mm, mmu_notifier);
+
+	kfree(encl_mm);
+}
+
+static const struct mmu_notifier_ops sgx_mmu_notifier_ops = {
+	.release		= sgx_mmu_notifier_release,
+	.free_notifier		= sgx_mmu_notifier_free,
+};
+
+static struct sgx_encl_mm *sgx_encl_find_mm(struct sgx_encl *encl,
+					    struct mm_struct *mm)
+{
+	struct sgx_encl_mm *encl_mm = NULL;
+	struct sgx_encl_mm *tmp;
+	int idx;
+
+	idx = srcu_read_lock(&encl->srcu);
+
+	list_for_each_entry_rcu(tmp, &encl->mm_list, list) {
+		if (tmp->mm == mm) {
+			encl_mm = tmp;
+			break;
+		}
+	}
+
+	srcu_read_unlock(&encl->srcu, idx);
+
+	return encl_mm;
+}
+
+int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
+{
+	struct sgx_encl_mm *encl_mm;
+	int ret;
+
+	/* mm_list can be accessed only by a single thread at a time. */
+	mmap_assert_write_locked(mm);
+
+	if (atomic_read(&encl->flags) & SGX_ENCL_DEAD)
+		return -EINVAL;
+
+	/*
+	 * mm_structs are kept on mm_list until the mm or the enclave dies,
+	 * i.e. once an mm is off the list, it's gone for good, therefore it's
+	 * impossible to get a false positive on @mm due to a stale mm_list.
+	 */
+	if (sgx_encl_find_mm(encl, mm))
+		return 0;
+
+	encl_mm = kzalloc(sizeof(*encl_mm), GFP_KERNEL);
+	if (!encl_mm)
+		return -ENOMEM;
+
+	encl_mm->encl = encl;
+	encl_mm->mm = mm;
+	encl_mm->mmu_notifier.ops = &sgx_mmu_notifier_ops;
+
+	ret = __mmu_notifier_register(&encl_mm->mmu_notifier, mm);
+	if (ret) {
+		kfree(encl_mm);
+		return ret;
+	}
+
+	spin_lock(&encl->mm_lock);
+	list_add_rcu(&encl_mm->list, &encl->mm_list);
+	spin_unlock(&encl->mm_lock);
+
+	return 0;
+}
+
+static void sgx_vma_open(struct vm_area_struct *vma)
+{
+	struct sgx_encl *encl = vma->vm_private_data;
+
+	if (!encl)
+		return;
+
+	if (sgx_encl_mm_add(encl, vma->vm_mm))
+		vma->vm_private_data = NULL;
+}
+
+static unsigned int sgx_vma_fault(struct vm_fault *vmf)
+{
+	unsigned long addr = (unsigned long)vmf->address;
+	struct vm_area_struct *vma = vmf->vma;
+	struct sgx_encl *encl = vma->vm_private_data;
+	struct sgx_encl_page *entry;
+	int ret = VM_FAULT_NOPAGE;
+	unsigned long pfn;
+
+	if (!encl)
+		return VM_FAULT_SIGBUS;
+
+	mutex_lock(&encl->lock);
+
+	entry = sgx_encl_load_page(encl, addr);
+	if (IS_ERR(entry)) {
+		if (unlikely(PTR_ERR(entry) != -EBUSY))
+			ret = VM_FAULT_SIGBUS;
+
+		goto out;
+	}
+
+	if (!follow_pfn(vma, addr, &pfn))
+		goto out;
+
+	ret = vmf_insert_pfn(vma, addr, PFN_DOWN(entry->epc_page->desc));
+	if (ret != VM_FAULT_NOPAGE) {
+		ret = VM_FAULT_SIGBUS;
+		goto out;
+	}
+
+out:
+	mutex_unlock(&encl->lock);
+	return ret;
+}
+
+/**
+ * sgx_encl_may_map() - Check if a requested VMA mapping is allowed
+ * @encl:		an enclave pointer
+ * @start:		lower bound of the address range, inclusive
+ * @end:		upper bound of the address range, exclusive
+ * @vm_prot_bits:	requested protections of the address range
+ *
+ * Iterate through the enclave pages contained within [@start, @end) to verify
+ * the permissions requested by @vm_prot_bits do not exceed that of any enclave
+ * page to be mapped.
+ *
+ * Return:
+ *   0 on success,
+ *   -EACCES if VMA permissions exceed enclave page permissions
+ */
+int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
+		     unsigned long end, unsigned long vm_flags)
+{
+	unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
+	unsigned long idx_start = PFN_DOWN(start);
+	unsigned long idx_end = PFN_DOWN(end - 1);
+	struct sgx_encl_page *page;
+
+	XA_STATE(xas, &encl->page_array, idx_start);
+
+	/*
+	 * Disallow READ_IMPLIES_EXEC tasks as their VMA permissions might
+	 * conflict with the enclave page permissions.
+	 */
+	if (current->personality & READ_IMPLIES_EXEC)
+		return -EACCES;
+
+	xas_for_each(&xas, page, idx_end)
+		if (!page || (~page->vm_max_prot_bits & vm_prot_bits))
+			return -EACCES;
+
+	return 0;
+}
+
+static int sgx_vma_mprotect(struct vm_area_struct *vma,
+			    struct vm_area_struct **pprev, unsigned long start,
+			    unsigned long end, unsigned long newflags)
+{
+	int ret;
+
+	ret = sgx_encl_may_map(vma->vm_private_data, start, end, newflags);
+	if (ret)
+		return ret;
+
+	return mprotect_fixup(vma, pprev, start, end, newflags);
+}
+
+const struct vm_operations_struct sgx_vm_ops = {
+	.open = sgx_vma_open,
+	.fault = sgx_vma_fault,
+	.mprotect = sgx_vma_mprotect,
+};
+
+/**
+ * sgx_encl_find - find an enclave
+ * @mm:		mm struct of the current process
+ * @addr:	address in the ELRANGE
+ * @vma:	the resulting VMA
+ *
+ * Find an enclave identified by the given address. Give back a VMA that is
+ * part of the enclave and located in that address. The VMA is given back if it
+ * is a proper enclave VMA even if an &sgx_encl instance does not exist yet
+ * (enclave creation has not been performed).
+ *
+ * Return:
+ *   0 on success,
+ *   -EINVAL if an enclave was not found,
+ *   -ENOENT if the enclave has not been created yet
+ */
+int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
+		  struct vm_area_struct **vma)
+{
+	struct vm_area_struct *result;
+	struct sgx_encl *encl;
+
+	result = find_vma(mm, addr);
+	if (!result || result->vm_ops != &sgx_vm_ops || addr < result->vm_start)
+		return -EINVAL;
+
+	encl = result->vm_private_data;
+	*vma = result;
+
+	return encl ? 0 : -ENOENT;
+}
+
+/**
+ * sgx_encl_destroy() - destroy enclave resources
+ * @encl:	an enclave pointer
+ */
+void sgx_encl_destroy(struct sgx_encl *encl)
+{
+	struct sgx_encl_page *entry;
+	unsigned long index;
+
+	atomic_or(SGX_ENCL_DEAD, &encl->flags);
+
+	xa_for_each(&encl->page_array, index, entry) {
+		if (entry->epc_page) {
+			sgx_free_epc_page(entry->epc_page);
+			encl->secs_child_cnt--;
+			entry->epc_page = NULL;
+		}
+
+		kfree(entry);
+	}
+
+	xa_destroy(&encl->page_array);
+
+	if (!encl->secs_child_cnt && encl->secs.epc_page) {
+		sgx_free_epc_page(encl->secs.epc_page);
+		encl->secs.epc_page = NULL;
+	}
+}
+
+/**
+ * sgx_encl_release - Destroy an enclave instance
+ * @kref:	address of a kref inside &sgx_encl
+ *
+ * Used together with kref_put(). Frees all the resources associated with the
+ * enclave and the instance itself.
+ */
+void sgx_encl_release(struct kref *ref)
+{
+	struct sgx_encl *encl = container_of(ref, struct sgx_encl, refcount);
+
+	sgx_encl_destroy(encl);
+
+	if (encl->backing)
+		fput(encl->backing);
+
+	cleanup_srcu_struct(&encl->srcu);
+
+	WARN_ON_ONCE(!list_empty(&encl->mm_list));
+
+	/* Detect EPC page leak's. */
+	WARN_ON_ONCE(encl->secs_child_cnt);
+	WARN_ON_ONCE(encl->secs.epc_page);
+
+	kfree(encl);
+}
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
new file mode 100644
index 000000000000..8ff445476657
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -0,0 +1,85 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+/**
+ * Copyright(c) 2016-19 Intel Corporation.
+ */
+#ifndef _X86_ENCL_H
+#define _X86_ENCL_H
+
+#include <linux/cpumask.h>
+#include <linux/kref.h>
+#include <linux/list.h>
+#include <linux/mm_types.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/srcu.h>
+#include <linux/workqueue.h>
+#include <linux/xarray.h>
+#include "sgx.h"
+
+/**
+ * enum sgx_encl_page_desc - defines bits for an enclave page's descriptor
+ * %SGX_ENCL_PAGE_ADDR_MASK:		Holds the virtual address of the page.
+ *
+ * The page address for SECS is zero and is used by the subsystem to recognize
+ * the SECS page.
+ */
+enum sgx_encl_page_desc {
+	/* Bits 11:3 are available when the page is not swapped. */
+	SGX_ENCL_PAGE_ADDR_MASK		= PAGE_MASK,
+};
+
+#define SGX_ENCL_PAGE_ADDR(page) \
+	((page)->desc & SGX_ENCL_PAGE_ADDR_MASK)
+
+struct sgx_encl_page {
+	unsigned long desc;
+	unsigned long vm_max_prot_bits;
+	struct sgx_epc_page *epc_page;
+	struct sgx_encl *encl;
+};
+
+enum sgx_encl_flags {
+	SGX_ENCL_CREATED	= BIT(0),
+	SGX_ENCL_INITIALIZED	= BIT(1),
+	SGX_ENCL_DEBUG		= BIT(2),
+	SGX_ENCL_DEAD		= BIT(3),
+	SGX_ENCL_IOCTL		= BIT(4),
+};
+
+struct sgx_encl_mm {
+	struct sgx_encl *encl;
+	struct mm_struct *mm;
+	struct list_head list;
+	struct mmu_notifier mmu_notifier;
+};
+
+struct sgx_encl {
+	atomic_t flags;
+	unsigned int page_cnt;
+	unsigned int secs_child_cnt;
+	struct mutex lock;
+	struct list_head mm_list;
+	spinlock_t mm_lock;
+	struct file *backing;
+	struct kref refcount;
+	struct srcu_struct srcu;
+	unsigned long base;
+	unsigned long size;
+	unsigned long ssaframesize;
+	struct xarray page_array;
+	struct sgx_encl_page secs;
+	cpumask_t cpumask;
+};
+
+extern const struct vm_operations_struct sgx_vm_ops;
+
+int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
+		  struct vm_area_struct **vma);
+void sgx_encl_destroy(struct sgx_encl *encl);
+void sgx_encl_release(struct kref *ref);
+int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm);
+int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
+		     unsigned long end, unsigned long vm_flags);
+
+#endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 97c6895fb6c9..4137254fb29e 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -9,6 +9,8 @@
 #include <linux/sched/mm.h>
 #include <linux/sched/signal.h>
 #include <linux/slab.h>
+#include "driver.h"
+#include "encl.h"
 #include "encls.h"
 
 struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
@@ -260,6 +262,8 @@ static bool __init sgx_page_cache_init(void)
 
 static void __init sgx_init(void)
 {
+	int ret;
+
 	if (!boot_cpu_has(X86_FEATURE_SGX))
 		return;
 
@@ -269,8 +273,15 @@ static void __init sgx_init(void)
 	if (!sgx_page_reclaimer_init())
 		goto err_page_cache;
 
+	ret = sgx_drv_init();
+	if (ret)
+		goto err_kthread;
+
 	return;
 
+err_kthread:
+	kthread_stop(ksgxswapd_tsk);
+
 err_page_cache:
 	sgx_page_cache_teardown();
 }
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v39 16/24] x86/sgx: Add a page reclaimer
       [not found] <20201003045059.665934-1-jarkko.sakkinen@linux.intel.com>
  2020-10-03  4:50 ` [PATCH v39 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct Jarkko Sakkinen
  2020-10-03  4:50 ` [PATCH v39 11/24] x86/sgx: Add SGX enclave driver Jarkko Sakkinen
@ 2020-10-03  4:50 ` Jarkko Sakkinen
  2020-10-03  5:22   ` Haitao Huang
  2020-10-03  4:50 ` [PATCH v39 17/24] x86/sgx: Add ptrace() support for the SGX driver Jarkko Sakkinen
  3 siblings, 1 reply; 35+ messages in thread
From: Jarkko Sakkinen @ 2020-10-03  4:50 UTC (permalink / raw)
  To: x86, linux-sgx
  Cc: linux-kernel, Jarkko Sakkinen, linux-mm, Jethro Beekman,
	Jordan Hand, Nathaniel McCallum, Chunyang Hui, Seth Moore,
	Sean Christopherson, akpm, andriy.shevchenko, asapek, bp,
	cedric.xing, chenalexchen, conradparker, cyhanish, dave.hansen,
	haitao.huang, kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman,
	puiterwijk, rientjes, tglx, yaozhangx, mikko.ylinen

There is a limited amount of EPC available. Therefore, some of it must be
copied to the regular memory, and only subset kept in the SGX reserved
memory. While kernel cannot directly access enclave memory, SGX provides a
set of ENCLS leaf functions to perform reclaiming.

Implement a page reclaimer by using these leaf functions. It picks the
victim pages in LRU fashion from all the enclaves running in the system.
The thread ksgxswapd reclaims pages on the event when the number of free
EPC pages goes below SGX_NR_LOW_PAGES up until it reaches
SGX_NR_HIGH_PAGES.

sgx_alloc_epc_page() can optionally directly reclaim pages with @reclaim
set true. A caller must also supply owner for each page so that the
reclaimer can access the associated enclaves. This is needed for locking,
as most of the ENCLS leafs cannot be executed concurrently for an enclave.
The owner is also needed for accessing SECS, which is required to be
resident when its child pages are being reclaimed.

Cc: linux-mm@kvack.org
Acked-by: Jethro Beekman <jethro@fortanix.com>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Tested-by: Jordan Hand <jorhand@linux.microsoft.com>
Tested-by: Nathaniel McCallum <npmccallum@redhat.com>
Tested-by: Chunyang Hui <sanqian.hcy@antfin.com>
Tested-by: Seth Moore <sethmo@google.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
---
 arch/x86/kernel/cpu/sgx/driver.c |   1 +
 arch/x86/kernel/cpu/sgx/encl.c   | 344 +++++++++++++++++++++-
 arch/x86/kernel/cpu/sgx/encl.h   |  41 +++
 arch/x86/kernel/cpu/sgx/ioctl.c  |  78 ++++-
 arch/x86/kernel/cpu/sgx/main.c   | 481 +++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/sgx.h    |   9 +
 6 files changed, 947 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
index d01b28f7ce4a..0446781cc7a2 100644
--- a/arch/x86/kernel/cpu/sgx/driver.c
+++ b/arch/x86/kernel/cpu/sgx/driver.c
@@ -29,6 +29,7 @@ static int sgx_open(struct inode *inode, struct file *file)
 	atomic_set(&encl->flags, 0);
 	kref_init(&encl->refcount);
 	xa_init(&encl->page_array);
+	INIT_LIST_HEAD(&encl->va_pages);
 	mutex_init(&encl->lock);
 	INIT_LIST_HEAD(&encl->mm_list);
 	spin_lock_init(&encl->mm_lock);
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index c2c4a77af36b..54326efa6c2f 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -12,9 +12,88 @@
 #include "encls.h"
 #include "sgx.h"
 
+/*
+ * ELDU: Load an EPC page as unblocked. For more info, see "OS Management of EPC
+ * Pages" in the SDM.
+ */
+static int __sgx_encl_eldu(struct sgx_encl_page *encl_page,
+			   struct sgx_epc_page *epc_page,
+			   struct sgx_epc_page *secs_page)
+{
+	unsigned long va_offset = SGX_ENCL_PAGE_VA_OFFSET(encl_page);
+	struct sgx_encl *encl = encl_page->encl;
+	struct sgx_pageinfo pginfo;
+	struct sgx_backing b;
+	pgoff_t page_index;
+	int ret;
+
+	if (secs_page)
+		page_index = SGX_ENCL_PAGE_INDEX(encl_page);
+	else
+		page_index = PFN_DOWN(encl->size);
+
+	ret = sgx_encl_get_backing(encl, page_index, &b);
+	if (ret)
+		return ret;
+
+	pginfo.addr = SGX_ENCL_PAGE_ADDR(encl_page);
+	pginfo.contents = (unsigned long)kmap_atomic(b.contents);
+	pginfo.metadata = (unsigned long)kmap_atomic(b.pcmd) +
+			  b.pcmd_offset;
+
+	if (secs_page)
+		pginfo.secs = (u64)sgx_get_epc_addr(secs_page);
+	else
+		pginfo.secs = 0;
+
+	ret = __eldu(&pginfo, sgx_get_epc_addr(epc_page),
+		     sgx_get_epc_addr(encl_page->va_page->epc_page) +
+				      va_offset);
+	if (ret) {
+		if (encls_failed(ret))
+			ENCLS_WARN(ret, "ELDU");
+
+		ret = -EFAULT;
+	}
+
+	kunmap_atomic((void *)(unsigned long)(pginfo.metadata - b.pcmd_offset));
+	kunmap_atomic((void *)(unsigned long)pginfo.contents);
+
+	sgx_encl_put_backing(&b, false);
+
+	return ret;
+}
+
+static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page,
+					  struct sgx_epc_page *secs_page)
+{
+	unsigned long va_offset = SGX_ENCL_PAGE_VA_OFFSET(encl_page);
+	struct sgx_encl *encl = encl_page->encl;
+	struct sgx_epc_page *epc_page;
+	int ret;
+
+	epc_page = sgx_alloc_epc_page(encl_page, false);
+	if (IS_ERR(epc_page))
+		return epc_page;
+
+	ret = __sgx_encl_eldu(encl_page, epc_page, secs_page);
+	if (ret) {
+		sgx_free_epc_page(epc_page);
+		return ERR_PTR(ret);
+	}
+
+	sgx_free_va_slot(encl_page->va_page, va_offset);
+	list_move(&encl_page->va_page->list, &encl->va_pages);
+	encl_page->desc &= ~SGX_ENCL_PAGE_VA_OFFSET_MASK;
+	encl_page->epc_page = epc_page;
+
+	return epc_page;
+}
+
 static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
 						unsigned long addr)
 {
+	struct sgx_epc_page *epc_page;
 	struct sgx_encl_page *entry;
 	unsigned int flags;
 
@@ -33,10 +112,27 @@ static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
 		return ERR_PTR(-EFAULT);
 
 	/* Page is already resident in the EPC. */
-	if (entry->epc_page)
+	if (entry->epc_page) {
+		if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
+			return ERR_PTR(-EBUSY);
+
 		return entry;
+	}
+
+	if (!(encl->secs.epc_page)) {
+		epc_page = sgx_encl_eldu(&encl->secs, NULL);
+		if (IS_ERR(epc_page))
+			return ERR_CAST(epc_page);
+	}
 
-	return ERR_PTR(-EFAULT);
+	epc_page = sgx_encl_eldu(entry, encl->secs.epc_page);
+	if (IS_ERR(epc_page))
+		return ERR_CAST(epc_page);
+
+	encl->secs_child_cnt++;
+	sgx_mark_page_reclaimable(entry->epc_page);
+
+	return entry;
 }
 
 static void sgx_mmu_notifier_release(struct mmu_notifier *mn,
@@ -132,6 +228,9 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
 
 	spin_lock(&encl->mm_lock);
 	list_add_rcu(&encl_mm->list, &encl->mm_list);
+	/* Pairs with smp_rmb() in sgx_reclaimer_block(). */
+	smp_wmb();
+	encl->mm_list_version++;
 	spin_unlock(&encl->mm_lock);
 
 	return 0;
@@ -179,6 +278,8 @@ static unsigned int sgx_vma_fault(struct vm_fault *vmf)
 		goto out;
 	}
 
+	sgx_encl_test_and_clear_young(vma->vm_mm, entry);
+
 out:
 	mutex_unlock(&encl->lock);
 	return ret;
@@ -280,6 +381,7 @@ int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
  */
 void sgx_encl_destroy(struct sgx_encl *encl)
 {
+	struct sgx_va_page *va_page;
 	struct sgx_encl_page *entry;
 	unsigned long index;
 
@@ -287,6 +389,13 @@ void sgx_encl_destroy(struct sgx_encl *encl)
 
 	xa_for_each(&encl->page_array, index, entry) {
 		if (entry->epc_page) {
+			/*
+			 * The page and its radix tree entry cannot be freed
+			 * if the page is being held by the reclaimer.
+			 */
+			if (sgx_unmark_page_reclaimable(entry->epc_page))
+				continue;
+
 			sgx_free_epc_page(entry->epc_page);
 			encl->secs_child_cnt--;
 			entry->epc_page = NULL;
@@ -301,6 +410,19 @@ void sgx_encl_destroy(struct sgx_encl *encl)
 		sgx_free_epc_page(encl->secs.epc_page);
 		encl->secs.epc_page = NULL;
 	}
+
+	/*
+	 * The reclaimer is responsible for checking SGX_ENCL_DEAD before doing
+	 * EWB, thus it's safe to free VA pages even if the reclaimer holds a
+	 * reference to the enclave.
+	 */
+	while (!list_empty(&encl->va_pages)) {
+		va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
+					   list);
+		list_del(&va_page->list);
+		sgx_free_epc_page(va_page->epc_page);
+		kfree(va_page);
+	}
 }
 
 /**
@@ -329,3 +451,221 @@ void sgx_encl_release(struct kref *ref)
 
 	kfree(encl);
 }
+
+static struct page *sgx_encl_get_backing_page(struct sgx_encl *encl,
+					      pgoff_t index)
+{
+	struct inode *inode = encl->backing->f_path.dentry->d_inode;
+	struct address_space *mapping = inode->i_mapping;
+	gfp_t gfpmask = mapping_gfp_mask(mapping);
+
+	return shmem_read_mapping_page_gfp(mapping, index, gfpmask);
+}
+
+/**
+ * sgx_encl_get_backing() - Pin the backing storage
+ * @encl:	an enclave pointer
+ * @page_index:	enclave page index
+ * @backing:	data for accessing backing storage for the page
+ *
+ * Pin the backing storage pages for storing the encrypted contents and Paging
+ * Crypto MetaData (PCMD) of an enclave page.
+ *
+ * Return:
+ *   0 on success,
+ *   -errno otherwise.
+ */
+int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_index,
+			 struct sgx_backing *backing)
+{
+	pgoff_t pcmd_index = PFN_DOWN(encl->size) + 1 + (page_index >> 5);
+	struct page *contents;
+	struct page *pcmd;
+
+	contents = sgx_encl_get_backing_page(encl, page_index);
+	if (IS_ERR(contents))
+		return PTR_ERR(contents);
+
+	pcmd = sgx_encl_get_backing_page(encl, pcmd_index);
+	if (IS_ERR(pcmd)) {
+		put_page(contents);
+		return PTR_ERR(pcmd);
+	}
+
+	backing->page_index = page_index;
+	backing->contents = contents;
+	backing->pcmd = pcmd;
+	backing->pcmd_offset =
+		(page_index & (PAGE_SIZE / sizeof(struct sgx_pcmd) - 1)) *
+		sizeof(struct sgx_pcmd);
+
+	return 0;
+}
+
+/**
+ * sgx_encl_put_backing() - Unpin the backing storage
+ * @backing:	data for accessing backing storage for the page
+ * @do_write:	mark pages dirty
+ */
+void sgx_encl_put_backing(struct sgx_backing *backing, bool do_write)
+{
+	if (do_write) {
+		set_page_dirty(backing->pcmd);
+		set_page_dirty(backing->contents);
+	}
+
+	put_page(backing->pcmd);
+	put_page(backing->contents);
+}
+
+static int sgx_encl_test_and_clear_young_cb(pte_t *ptep, unsigned long addr,
+					    void *data)
+{
+	pte_t pte;
+	int ret;
+
+	ret = pte_young(*ptep);
+	if (ret) {
+		pte = pte_mkold(*ptep);
+		set_pte_at((struct mm_struct *)data, addr, ptep, pte);
+	}
+
+	return ret;
+}
+
+/**
+ * sgx_encl_test_and_clear_young() - Test and reset the accessed bit
+ * @mm:		mm_struct that is checked
+ * @page:	enclave page to be tested for recent access
+ *
+ * Checks the Access (A) bit from the PTE corresponding to the enclave page and
+ * clears it.
+ *
+ * Return: 1 if the page has been recently accessed and 0 if not.
+ */
+int sgx_encl_test_and_clear_young(struct mm_struct *mm,
+				  struct sgx_encl_page *page)
+{
+	unsigned long addr = SGX_ENCL_PAGE_ADDR(page);
+	struct sgx_encl *encl = page->encl;
+	struct vm_area_struct *vma;
+	int ret;
+
+	ret = sgx_encl_find(mm, addr, &vma);
+	if (ret)
+		return 0;
+
+	if (encl != vma->vm_private_data)
+		return 0;
+
+	ret = apply_to_page_range(vma->vm_mm, addr, PAGE_SIZE,
+				  sgx_encl_test_and_clear_young_cb, vma->vm_mm);
+	if (ret < 0)
+		return 0;
+
+	return ret;
+}
+
+/**
+ * sgx_encl_reserve_page() - Reserve an enclave page
+ * @encl:	an enclave pointer
+ * @addr:	a page address
+ *
+ * Load an enclave page and lock the enclave so that the page can be used by
+ * EDBG* and EMOD*.
+ *
+ * Return:
+ *   an enclave page on success
+ *   -EFAULT	if the load fails
+ */
+struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl,
+					    unsigned long addr)
+{
+	struct sgx_encl_page *entry;
+
+	for ( ; ; ) {
+		mutex_lock(&encl->lock);
+
+		entry = sgx_encl_load_page(encl, addr);
+		if (PTR_ERR(entry) != -EBUSY)
+			break;
+
+		mutex_unlock(&encl->lock);
+	}
+
+	if (IS_ERR(entry))
+		mutex_unlock(&encl->lock);
+
+	return entry;
+}
+
+/**
+ * sgx_alloc_va_page() - Allocate a Version Array (VA) page
+ *
+ * Allocate a free EPC page and convert it to a Version Array (VA) page.
+ *
+ * Return:
+ *   a VA page,
+ *   -errno otherwise
+ */
+struct sgx_epc_page *sgx_alloc_va_page(void)
+{
+	struct sgx_epc_page *epc_page;
+	int ret;
+
+	epc_page = sgx_alloc_epc_page(NULL, true);
+	if (IS_ERR(epc_page))
+		return ERR_CAST(epc_page);
+
+	ret = __epa(sgx_get_epc_addr(epc_page));
+	if (ret) {
+		WARN_ONCE(1, "EPA returned %d (0x%x)", ret, ret);
+		sgx_free_epc_page(epc_page);
+		return ERR_PTR(-EFAULT);
+	}
+
+	return epc_page;
+}
+
+/**
+ * sgx_alloc_va_slot - allocate a VA slot
+ * @va_page:	a &struct sgx_va_page instance
+ *
+ * Allocates a slot from a &struct sgx_va_page instance.
+ *
+ * Return: offset of the slot inside the VA page
+ */
+unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page)
+{
+	int slot = find_first_zero_bit(va_page->slots, SGX_VA_SLOT_COUNT);
+
+	if (slot < SGX_VA_SLOT_COUNT)
+		set_bit(slot, va_page->slots);
+
+	return slot << 3;
+}
+
+/**
+ * sgx_free_va_slot - free a VA slot
+ * @va_page:	a &struct sgx_va_page instance
+ * @offset:	offset of the slot inside the VA page
+ *
+ * Frees a slot from a &struct sgx_va_page instance.
+ */
+void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset)
+{
+	clear_bit(offset >> 3, va_page->slots);
+}
+
+/**
+ * sgx_va_page_full - is the VA page full?
+ * @va_page:	a &struct sgx_va_page instance
+ *
+ * Return: true if all slots have been taken
+ */
+bool sgx_va_page_full(struct sgx_va_page *va_page)
+{
+	int slot = find_first_zero_bit(va_page->slots, SGX_VA_SLOT_COUNT);
+
+	return slot == SGX_VA_SLOT_COUNT;
+}
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index 0448d22d3010..e8eb9e9a834e 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -19,6 +19,10 @@
 
 /**
  * enum sgx_encl_page_desc - defines bits for an enclave page's descriptor
+ * %SGX_ENCL_PAGE_BEING_RECLAIMED:	The page is in the process of being
+ *					reclaimed.
+ * %SGX_ENCL_PAGE_VA_OFFSET_MASK:	Holds the offset in the Version Array
+ *					(VA) page for a swapped page.
  * %SGX_ENCL_PAGE_ADDR_MASK:		Holds the virtual address of the page.
  *
  * The page address for SECS is zero and is used by the subsystem to recognize
@@ -26,16 +30,23 @@
  */
 enum sgx_encl_page_desc {
 	/* Bits 11:3 are available when the page is not swapped. */
+	SGX_ENCL_PAGE_BEING_RECLAIMED		= BIT(3),
+	SGX_ENCL_PAGE_VA_OFFSET_MASK	= GENMASK_ULL(11, 3),
 	SGX_ENCL_PAGE_ADDR_MASK		= PAGE_MASK,
 };
 
 #define SGX_ENCL_PAGE_ADDR(page) \
 	((page)->desc & SGX_ENCL_PAGE_ADDR_MASK)
+#define SGX_ENCL_PAGE_VA_OFFSET(page) \
+	((page)->desc & SGX_ENCL_PAGE_VA_OFFSET_MASK)
+#define SGX_ENCL_PAGE_INDEX(page) \
+	PFN_DOWN((page)->desc - (page)->encl->base)
 
 struct sgx_encl_page {
 	unsigned long desc;
 	unsigned long vm_max_prot_bits;
 	struct sgx_epc_page *epc_page;
+	struct sgx_va_page *va_page;
 	struct sgx_encl *encl;
 };
 
@@ -61,6 +72,7 @@ struct sgx_encl {
 	struct mutex lock;
 	struct list_head mm_list;
 	spinlock_t mm_lock;
+	unsigned long mm_list_version;
 	struct file *backing;
 	struct kref refcount;
 	struct srcu_struct srcu;
@@ -68,12 +80,21 @@ struct sgx_encl {
 	unsigned long size;
 	unsigned long ssaframesize;
 	struct xarray page_array;
+	struct list_head va_pages;
 	struct sgx_encl_page secs;
 	cpumask_t cpumask;
 	unsigned long attributes;
 	unsigned long attributes_mask;
 };
 
+#define SGX_VA_SLOT_COUNT 512
+
+struct sgx_va_page {
+	struct sgx_epc_page *epc_page;
+	DECLARE_BITMAP(slots, SGX_VA_SLOT_COUNT);
+	struct list_head list;
+};
+
 extern const struct vm_operations_struct sgx_vm_ops;
 
 int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
@@ -84,4 +105,24 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm);
 int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
 		     unsigned long end, unsigned long vm_flags);
 
+struct sgx_backing {
+	pgoff_t page_index;
+	struct page *contents;
+	struct page *pcmd;
+	unsigned long pcmd_offset;
+};
+
+int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_index,
+			 struct sgx_backing *backing);
+void sgx_encl_put_backing(struct sgx_backing *backing, bool do_write);
+int sgx_encl_test_and_clear_young(struct mm_struct *mm,
+				  struct sgx_encl_page *page);
+struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl,
+					    unsigned long addr);
+
+struct sgx_epc_page *sgx_alloc_va_page(void);
+unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page);
+void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset);
+bool sgx_va_page_full(struct sgx_va_page *va_page);
+
 #endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 3c04798e83e5..613f6c03598e 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -16,6 +16,43 @@
 #include "encl.h"
 #include "encls.h"
 
+static struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl)
+{
+	struct sgx_va_page *va_page = NULL;
+	void *err;
+
+	BUILD_BUG_ON(SGX_VA_SLOT_COUNT !=
+		(SGX_ENCL_PAGE_VA_OFFSET_MASK >> 3) + 1);
+
+	if (!(encl->page_cnt % SGX_VA_SLOT_COUNT)) {
+		va_page = kzalloc(sizeof(*va_page), GFP_KERNEL);
+		if (!va_page)
+			return ERR_PTR(-ENOMEM);
+
+		va_page->epc_page = sgx_alloc_va_page();
+		if (IS_ERR(va_page->epc_page)) {
+			err = ERR_CAST(va_page->epc_page);
+			kfree(va_page);
+			return err;
+		}
+
+		WARN_ON_ONCE(encl->page_cnt % SGX_VA_SLOT_COUNT);
+	}
+	encl->page_cnt++;
+	return va_page;
+}
+
+static void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page)
+{
+	encl->page_cnt--;
+
+	if (va_page) {
+		sgx_free_epc_page(va_page->epc_page);
+		list_del(&va_page->list);
+		kfree(va_page);
+	}
+}
+
 static u32 sgx_calc_ssa_frame_size(u32 miscselect, u64 xfrm)
 {
 	u32 size_max = PAGE_SIZE;
@@ -80,15 +117,24 @@ static int sgx_validate_secs(const struct sgx_secs *secs)
 static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
 {
 	struct sgx_epc_page *secs_epc;
+	struct sgx_va_page *va_page;
 	struct sgx_pageinfo pginfo;
 	struct sgx_secinfo secinfo;
 	unsigned long encl_size;
 	struct file *backing;
 	long ret;
 
+	va_page = sgx_encl_grow(encl);
+	if (IS_ERR(va_page))
+		return PTR_ERR(va_page);
+	else if (va_page)
+		list_add(&va_page->list, &encl->va_pages);
+	/* else the tail page of the VA page list had free slots. */
+
 	if (sgx_validate_secs(secs)) {
 		pr_debug("invalid SECS\n");
-		return -EINVAL;
+		ret = -EINVAL;
+		goto err_out_shrink;
 	}
 
 	/* The extra page goes to SECS. */
@@ -96,12 +142,14 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
 
 	backing = shmem_file_setup("SGX backing", encl_size + (encl_size >> 5),
 				   VM_NORESERVE);
-	if (IS_ERR(backing))
-		return PTR_ERR(backing);
+	if (IS_ERR(backing)) {
+		ret = PTR_ERR(backing);
+		goto err_out_shrink;
+	}
 
 	encl->backing = backing;
 
-	secs_epc = __sgx_alloc_epc_page();
+	secs_epc = sgx_alloc_epc_page(&encl->secs, true);
 	if (IS_ERR(secs_epc)) {
 		ret = PTR_ERR(secs_epc);
 		goto err_out_backing;
@@ -149,6 +197,9 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
 	fput(encl->backing);
 	encl->backing = NULL;
 
+err_out_shrink:
+	sgx_encl_shrink(encl, va_page);
+
 	return ret;
 }
 
@@ -321,21 +372,35 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
 {
 	struct sgx_encl_page *encl_page;
 	struct sgx_epc_page *epc_page;
+	struct sgx_va_page *va_page;
 	int ret;
 
 	encl_page = sgx_encl_page_alloc(encl, offset, secinfo->flags);
 	if (IS_ERR(encl_page))
 		return PTR_ERR(encl_page);
 
-	epc_page = __sgx_alloc_epc_page();
+	epc_page = sgx_alloc_epc_page(encl_page, true);
 	if (IS_ERR(epc_page)) {
 		kfree(encl_page);
 		return PTR_ERR(epc_page);
 	}
 
+	va_page = sgx_encl_grow(encl);
+	if (IS_ERR(va_page)) {
+		ret = PTR_ERR(va_page);
+		goto err_out_free;
+	}
+
 	mmap_read_lock(current->mm);
 	mutex_lock(&encl->lock);
 
+	/*
+	 * Adding to encl->va_pages must be done under encl->lock.  Ditto for
+	 * deleting (via sgx_encl_shrink()) in the error path.
+	 */
+	if (va_page)
+		list_add(&va_page->list, &encl->va_pages);
+
 	/*
 	 * Insert prior to EADD in case of OOM.  EADD modifies MRENCLAVE, i.e.
 	 * can't be gracefully unwound, while failure on EADD/EXTEND is limited
@@ -366,6 +431,7 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
 			goto err_out;
 	}
 
+	sgx_mark_page_reclaimable(encl_page->epc_page);
 	mutex_unlock(&encl->lock);
 	mmap_read_unlock(current->mm);
 	return ret;
@@ -374,9 +440,11 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
 	xa_erase(&encl->page_array, PFN_DOWN(encl_page->desc));
 
 err_out_unlock:
+	sgx_encl_shrink(encl, va_page);
 	mutex_unlock(&encl->lock);
 	mmap_read_unlock(current->mm);
 
+err_out_free:
 	sgx_free_epc_page(epc_page);
 	kfree(encl_page);
 
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 4137254fb29e..3f9130501370 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -16,6 +16,395 @@
 struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
 static int sgx_nr_epc_sections;
 static struct task_struct *ksgxswapd_tsk;
+static DECLARE_WAIT_QUEUE_HEAD(ksgxswapd_waitq);
+static LIST_HEAD(sgx_active_page_list);
+static DEFINE_SPINLOCK(sgx_active_page_list_lock);
+
+/**
+ * sgx_mark_page_reclaimable() - Mark a page as reclaimable
+ * @page:	EPC page
+ *
+ * Mark a page as reclaimable and add it to the active page list. Pages
+ * are automatically removed from the active list when freed.
+ */
+void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
+{
+	spin_lock(&sgx_active_page_list_lock);
+	page->desc |= SGX_EPC_PAGE_RECLAIMABLE;
+	list_add_tail(&page->list, &sgx_active_page_list);
+	spin_unlock(&sgx_active_page_list_lock);
+}
+
+/**
+ * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list
+ * @page:	EPC page
+ *
+ * Clear the reclaimable flag and remove the page from the active page list.
+ *
+ * Return:
+ *   0 on success,
+ *   -EBUSY if the page is in the process of being reclaimed
+ */
+int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
+{
+	/*
+	 * Remove the page from the active list if necessary.  If the page
+	 * is actively being reclaimed, i.e. RECLAIMABLE is set but the
+	 * page isn't on the active list, return -EBUSY as we can't free
+	 * the page at this time since it is "owned" by the reclaimer.
+	 */
+	spin_lock(&sgx_active_page_list_lock);
+	if (page->desc & SGX_EPC_PAGE_RECLAIMABLE) {
+		if (list_empty(&page->list)) {
+			spin_unlock(&sgx_active_page_list_lock);
+			return -EBUSY;
+		}
+		list_del(&page->list);
+		page->desc &= ~SGX_EPC_PAGE_RECLAIMABLE;
+	}
+	spin_unlock(&sgx_active_page_list_lock);
+
+	return 0;
+}
+
+static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page)
+{
+	struct sgx_encl_page *page = epc_page->owner;
+	struct sgx_encl *encl = page->encl;
+	struct sgx_encl_mm *encl_mm;
+	bool ret = true;
+	int idx;
+
+	idx = srcu_read_lock(&encl->srcu);
+
+	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
+		if (!mmget_not_zero(encl_mm->mm))
+			continue;
+
+		mmap_read_lock(encl_mm->mm);
+		ret = !sgx_encl_test_and_clear_young(encl_mm->mm, page);
+		mmap_read_unlock(encl_mm->mm);
+
+		mmput_async(encl_mm->mm);
+
+		if (!ret || (atomic_read(&encl->flags) & SGX_ENCL_DEAD))
+			break;
+	}
+
+	srcu_read_unlock(&encl->srcu, idx);
+
+	if (!ret && !(atomic_read(&encl->flags) & SGX_ENCL_DEAD))
+		return false;
+
+	return true;
+}
+
+static void sgx_reclaimer_block(struct sgx_epc_page *epc_page)
+{
+	struct sgx_encl_page *page = epc_page->owner;
+	unsigned long addr = SGX_ENCL_PAGE_ADDR(page);
+	struct sgx_encl *encl = page->encl;
+	unsigned long mm_list_version;
+	struct sgx_encl_mm *encl_mm;
+	struct vm_area_struct *vma;
+	int idx, ret;
+
+	do {
+		mm_list_version = encl->mm_list_version;
+
+		/* Pairs with smp_rmb() in sgx_encl_mm_add(). */
+		smp_rmb();
+
+		idx = srcu_read_lock(&encl->srcu);
+
+		list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
+			if (!mmget_not_zero(encl_mm->mm))
+				continue;
+
+			mmap_read_lock(encl_mm->mm);
+
+			ret = sgx_encl_find(encl_mm->mm, addr, &vma);
+			if (!ret && encl == vma->vm_private_data)
+				zap_vma_ptes(vma, addr, PAGE_SIZE);
+
+			mmap_read_unlock(encl_mm->mm);
+
+			mmput_async(encl_mm->mm);
+		}
+
+		srcu_read_unlock(&encl->srcu, idx);
+	} while (unlikely(encl->mm_list_version != mm_list_version));
+
+	mutex_lock(&encl->lock);
+
+	if (!(atomic_read(&encl->flags) & SGX_ENCL_DEAD)) {
+		ret = __eblock(sgx_get_epc_addr(epc_page));
+		if (encls_failed(ret))
+			ENCLS_WARN(ret, "EBLOCK");
+	}
+
+	mutex_unlock(&encl->lock);
+}
+
+static int __sgx_encl_ewb(struct sgx_epc_page *epc_page, void *va_slot,
+			  struct sgx_backing *backing)
+{
+	struct sgx_pageinfo pginfo;
+	int ret;
+
+	pginfo.addr = 0;
+	pginfo.secs = 0;
+
+	pginfo.contents = (unsigned long)kmap_atomic(backing->contents);
+	pginfo.metadata = (unsigned long)kmap_atomic(backing->pcmd) +
+			  backing->pcmd_offset;
+
+	ret = __ewb(&pginfo, sgx_get_epc_addr(epc_page), va_slot);
+
+	kunmap_atomic((void *)(unsigned long)(pginfo.metadata -
+					      backing->pcmd_offset));
+	kunmap_atomic((void *)(unsigned long)pginfo.contents);
+
+	return ret;
+}
+
+static void sgx_ipi_cb(void *info)
+{
+}
+
+static const cpumask_t *sgx_encl_ewb_cpumask(struct sgx_encl *encl)
+{
+	cpumask_t *cpumask = &encl->cpumask;
+	struct sgx_encl_mm *encl_mm;
+	int idx;
+
+	/*
+	 * Can race with sgx_encl_mm_add(), but ETRACK has already been
+	 * executed, which means that the CPUs running in the new mm will enter
+	 * into the enclave with a fresh epoch.
+	 */
+	cpumask_clear(cpumask);
+
+	idx = srcu_read_lock(&encl->srcu);
+
+	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
+		if (!mmget_not_zero(encl_mm->mm))
+			continue;
+
+		cpumask_or(cpumask, cpumask, mm_cpumask(encl_mm->mm));
+
+		mmput_async(encl_mm->mm);
+	}
+
+	srcu_read_unlock(&encl->srcu, idx);
+
+	return cpumask;
+}
+
+/*
+ * Swap page to the regular memory transformed to the blocked state by using
+ * EBLOCK, which means that it can no loger be referenced (no new TLB entries).
+ *
+ * The first trial just tries to write the page assuming that some other thread
+ * has reset the count for threads inside the enlave by using ETRACK, and
+ * previous thread count has been zeroed out. The second trial calls ETRACK
+ * before EWB. If that fails we kick all the HW threads out, and then do EWB,
+ * which should be guaranteed the succeed.
+ */
+static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
+			 struct sgx_backing *backing)
+{
+	struct sgx_encl_page *encl_page = epc_page->owner;
+	struct sgx_encl *encl = encl_page->encl;
+	struct sgx_va_page *va_page;
+	unsigned int va_offset;
+	void *va_slot;
+	int ret;
+
+	encl_page->desc &= ~SGX_ENCL_PAGE_BEING_RECLAIMED;
+
+	va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
+				   list);
+	va_offset = sgx_alloc_va_slot(va_page);
+	va_slot = sgx_get_epc_addr(va_page->epc_page) + va_offset;
+	if (sgx_va_page_full(va_page))
+		list_move_tail(&va_page->list, &encl->va_pages);
+
+	ret = __sgx_encl_ewb(epc_page, va_slot, backing);
+	if (ret == SGX_NOT_TRACKED) {
+		ret = __etrack(sgx_get_epc_addr(encl->secs.epc_page));
+		if (ret) {
+			if (encls_failed(ret))
+				ENCLS_WARN(ret, "ETRACK");
+		}
+
+		ret = __sgx_encl_ewb(epc_page, va_slot, backing);
+		if (ret == SGX_NOT_TRACKED) {
+			/*
+			 * Slow path, send IPIs to kick cpus out of the
+			 * enclave.  Note, it's imperative that the cpu
+			 * mask is generated *after* ETRACK, else we'll
+			 * miss cpus that entered the enclave between
+			 * generating the mask and incrementing epoch.
+			 */
+			on_each_cpu_mask(sgx_encl_ewb_cpumask(encl),
+					 sgx_ipi_cb, NULL, 1);
+			ret = __sgx_encl_ewb(epc_page, va_slot, backing);
+		}
+	}
+
+	if (ret) {
+		if (encls_failed(ret))
+			ENCLS_WARN(ret, "EWB");
+
+		sgx_free_va_slot(va_page, va_offset);
+	} else {
+		encl_page->desc |= va_offset;
+		encl_page->va_page = va_page;
+	}
+}
+
+static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
+				struct sgx_backing *backing)
+{
+	struct sgx_encl_page *encl_page = epc_page->owner;
+	struct sgx_encl *encl = encl_page->encl;
+	struct sgx_backing secs_backing;
+	int ret;
+
+	mutex_lock(&encl->lock);
+
+	if (atomic_read(&encl->flags) & SGX_ENCL_DEAD) {
+		ret = __eremove(sgx_get_epc_addr(epc_page));
+		ENCLS_WARN(ret, "EREMOVE");
+	} else {
+		sgx_encl_ewb(epc_page, backing);
+	}
+
+	encl_page->epc_page = NULL;
+	encl->secs_child_cnt--;
+
+	if (!encl->secs_child_cnt) {
+		if (atomic_read(&encl->flags) & SGX_ENCL_DEAD) {
+			sgx_free_epc_page(encl->secs.epc_page);
+			encl->secs.epc_page = NULL;
+		} else if (atomic_read(&encl->flags) & SGX_ENCL_INITIALIZED) {
+			ret = sgx_encl_get_backing(encl, PFN_DOWN(encl->size),
+						   &secs_backing);
+			if (ret)
+				goto out;
+
+			sgx_encl_ewb(encl->secs.epc_page, &secs_backing);
+
+			sgx_free_epc_page(encl->secs.epc_page);
+			encl->secs.epc_page = NULL;
+
+			sgx_encl_put_backing(&secs_backing, true);
+		}
+	}
+
+out:
+	mutex_unlock(&encl->lock);
+}
+
+/*
+ * Take a fixed number of pages from the head of the active page pool and
+ * reclaim them to the enclave's private shmem files. Skip the pages, which have
+ * been accessed since the last scan. Move those pages to the tail of active
+ * page pool so that the pages get scanned in LRU like fashion.
+ *
+ * Batch process a chunk of pages (at the moment 16) in order to degrade amount
+ * of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit
+ * among the HW threads with three stage EWB pipeline (EWB, ETRACK + EWB and IPI
+ * + EWB) but not sufficiently. Reclaiming one page at a time would also be
+ * problematic as it would increase the lock contention too much, which would
+ * halt forward progress.
+ */
+static void sgx_reclaim_pages(void)
+{
+	struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
+	struct sgx_backing backing[SGX_NR_TO_SCAN];
+	struct sgx_epc_section *section;
+	struct sgx_encl_page *encl_page;
+	struct sgx_epc_page *epc_page;
+	int cnt = 0;
+	int ret;
+	int i;
+
+	spin_lock(&sgx_active_page_list_lock);
+	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
+		if (list_empty(&sgx_active_page_list))
+			break;
+
+		epc_page = list_first_entry(&sgx_active_page_list,
+					    struct sgx_epc_page, list);
+		list_del_init(&epc_page->list);
+		encl_page = epc_page->owner;
+
+		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0)
+			chunk[cnt++] = epc_page;
+		else
+			/* The owner is freeing the page. No need to add the
+			 * page back to the list of reclaimable pages.
+			 */
+			epc_page->desc &= ~SGX_EPC_PAGE_RECLAIMABLE;
+	}
+	spin_unlock(&sgx_active_page_list_lock);
+
+	for (i = 0; i < cnt; i++) {
+		epc_page = chunk[i];
+		encl_page = epc_page->owner;
+
+		if (!sgx_reclaimer_age(epc_page))
+			goto skip;
+
+		ret = sgx_encl_get_backing(encl_page->encl,
+					   SGX_ENCL_PAGE_INDEX(encl_page),
+					   &backing[i]);
+		if (ret)
+			goto skip;
+
+		mutex_lock(&encl_page->encl->lock);
+		encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED;
+		mutex_unlock(&encl_page->encl->lock);
+		continue;
+
+skip:
+		spin_lock(&sgx_active_page_list_lock);
+		list_add_tail(&epc_page->list, &sgx_active_page_list);
+		spin_unlock(&sgx_active_page_list_lock);
+
+		kref_put(&encl_page->encl->refcount, sgx_encl_release);
+
+		chunk[i] = NULL;
+	}
+
+	for (i = 0; i < cnt; i++) {
+		epc_page = chunk[i];
+		if (epc_page)
+			sgx_reclaimer_block(epc_page);
+	}
+
+	for (i = 0; i < cnt; i++) {
+		epc_page = chunk[i];
+		if (!epc_page)
+			continue;
+
+		encl_page = epc_page->owner;
+		sgx_reclaimer_write(epc_page, &backing[i]);
+		sgx_encl_put_backing(&backing[i], true);
+
+		kref_put(&encl_page->encl->refcount, sgx_encl_release);
+		epc_page->desc &= ~SGX_EPC_PAGE_RECLAIMABLE;
+
+		section = sgx_get_epc_section(epc_page);
+		spin_lock(&section->lock);
+		list_add_tail(&epc_page->list, &section->page_list);
+		section->free_cnt++;
+		spin_unlock(&section->lock);
+	}
+}
+
 
 static void sgx_sanitize_section(struct sgx_epc_section *section)
 {
@@ -44,6 +433,23 @@ static void sgx_sanitize_section(struct sgx_epc_section *section)
 	}
 }
 
+static unsigned long sgx_nr_free_pages(void)
+{
+	unsigned long cnt = 0;
+	int i;
+
+	for (i = 0; i < sgx_nr_epc_sections; i++)
+		cnt += sgx_epc_sections[i].free_cnt;
+
+	return cnt;
+}
+
+static bool sgx_should_reclaim(unsigned long watermark)
+{
+	return sgx_nr_free_pages() < watermark &&
+	       !list_empty(&sgx_active_page_list);
+}
+
 static int ksgxswapd(void *p)
 {
 	int i;
@@ -69,6 +475,20 @@ static int ksgxswapd(void *p)
 			WARN(1, "EPC section %d has unsanitized pages.\n", i);
 	}
 
+	while (!kthread_should_stop()) {
+		if (try_to_freeze())
+			continue;
+
+		wait_event_freezable(ksgxswapd_waitq,
+				     kthread_should_stop() ||
+				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
+
+		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
+			sgx_reclaim_pages();
+
+		cond_resched();
+	}
+
 	return 0;
 }
 
@@ -94,6 +514,7 @@ static struct sgx_epc_page *__sgx_alloc_epc_page_from_section(struct sgx_epc_sec
 
 	page = list_first_entry(&section->page_list, struct sgx_epc_page, list);
 	list_del_init(&page->list);
+	section->free_cnt--;
 
 	return page;
 }
@@ -127,6 +548,57 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
 	return ERR_PTR(-ENOMEM);
 }
 
+/**
+ * sgx_alloc_epc_page() - Allocate an EPC page
+ * @owner:	the owner of the EPC page
+ * @reclaim:	reclaim pages if necessary
+ *
+ * Iterate through EPC sections and borrow a free EPC page to the caller. When a
+ * page is no longer needed it must be released with sgx_free_epc_page(). If
+ * @reclaim is set to true, directly reclaim pages when we are out of pages. No
+ * mm's can be locked when @reclaim is set to true.
+ *
+ * Finally, wake up ksgxswapd when the number of pages goes below the watermark
+ * before returning back to the caller.
+ *
+ * Return:
+ *   an EPC page,
+ *   -errno on error
+ */
+struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
+{
+	struct sgx_epc_page *entry;
+
+	for ( ; ; ) {
+		entry = __sgx_alloc_epc_page();
+		if (!IS_ERR(entry)) {
+			entry->owner = owner;
+			break;
+		}
+
+		if (list_empty(&sgx_active_page_list))
+			return ERR_PTR(-ENOMEM);
+
+		if (!reclaim) {
+			entry = ERR_PTR(-EBUSY);
+			break;
+		}
+
+		if (signal_pending(current)) {
+			entry = ERR_PTR(-ERESTARTSYS);
+			break;
+		}
+
+		sgx_reclaim_pages();
+		schedule();
+	}
+
+	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
+		wake_up(&ksgxswapd_waitq);
+
+	return entry;
+}
+
 /**
  * sgx_free_epc_page() - Free an EPC page
  * @page:	an EPC page
@@ -138,12 +610,20 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 	struct sgx_epc_section *section = sgx_get_epc_section(page);
 	int ret;
 
+	/*
+	 * Don't take sgx_active_page_list_lock when asserting the page isn't
+	 * reclaimable, missing a WARN in the very rare case is preferable to
+	 * unnecessarily taking a global lock in the common case.
+	 */
+	WARN_ON_ONCE(page->desc & SGX_EPC_PAGE_RECLAIMABLE);
+
 	ret = __eremove(sgx_get_epc_addr(page));
 	if (WARN_ONCE(ret, "EREMOVE returned %d (0x%x)", ret, ret))
 		return;
 
 	spin_lock(&section->lock);
 	list_add_tail(&page->list, &section->page_list);
+	section->free_cnt++;
 	spin_unlock(&section->lock);
 }
 
@@ -194,6 +674,7 @@ static bool __init sgx_setup_epc_section(u64 addr, u64 size,
 		list_add_tail(&page->list, &section->unsanitized_page_list);
 	}
 
+	section->free_cnt = nr_pages;
 	return true;
 
 err_out:
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 8d126070db1e..ec4f7b338dbe 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -15,6 +15,7 @@
 
 struct sgx_epc_page {
 	unsigned long desc;
+	struct sgx_encl_page *owner;
 	struct list_head list;
 };
 
@@ -27,6 +28,7 @@ struct sgx_epc_page {
 struct sgx_epc_section {
 	unsigned long pa;
 	void *va;
+	unsigned long free_cnt;
 	struct list_head page_list;
 	struct list_head unsanitized_page_list;
 	spinlock_t lock;
@@ -35,6 +37,10 @@ struct sgx_epc_section {
 #define SGX_EPC_SECTION_MASK		GENMASK(7, 0)
 #define SGX_MAX_EPC_SECTIONS		(SGX_EPC_SECTION_MASK + 1)
 #define SGX_MAX_ADD_PAGES_LENGTH	0x100000
+#define SGX_EPC_PAGE_RECLAIMABLE	BIT(8)
+#define SGX_NR_TO_SCAN			16
+#define SGX_NR_LOW_PAGES		32
+#define SGX_NR_HIGH_PAGES		64
 
 extern struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
 
@@ -50,7 +56,10 @@ static inline void *sgx_get_epc_addr(struct sgx_epc_page *page)
 	return section->va + (page->desc & PAGE_MASK) - section->pa;
 }
 
+void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
+int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
 struct sgx_epc_page *__sgx_alloc_epc_page(void);
+struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
 void sgx_free_epc_page(struct sgx_epc_page *page);
 
 #endif /* _X86_SGX_H */
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v39 17/24] x86/sgx: Add ptrace() support for the SGX driver
       [not found] <20201003045059.665934-1-jarkko.sakkinen@linux.intel.com>
                   ` (2 preceding siblings ...)
  2020-10-03  4:50 ` [PATCH v39 16/24] x86/sgx: Add a page reclaimer Jarkko Sakkinen
@ 2020-10-03  4:50 ` Jarkko Sakkinen
  3 siblings, 0 replies; 35+ messages in thread
From: Jarkko Sakkinen @ 2020-10-03  4:50 UTC (permalink / raw)
  To: x86, linux-sgx
  Cc: linux-kernel, Jarkko Sakkinen, linux-mm, Andrew Morton,
	Matthew Wilcox, Jethro Beekman, andriy.shevchenko, asapek, bp,
	cedric.xing, chenalexchen, conradparker, cyhanish, dave.hansen,
	haitao.huang, kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman,
	npmccallum, puiterwijk, rientjes, sean.j.christopherson, tglx,
	yaozhangx, mikko.ylinen

Intel Sofware Guard eXtensions (SGX) allows creation of executable blobs
called enclaves, which cannot be accessed by default when not executing
inside the enclave. Enclaves can be entered by only using predefined memory
addresses, which are defined when the enclave is loaded.

However, enclaves can defined as debug enclaves at the load time. In debug
enclaves data can be read and/or written a memory word at a time by by
using ENCLS[EDBGRD] and ENCLS[EDBGWR] leaf instructions.

Add sgx_vma_access() function that implements 'access' virtual function
of struct vm_operations_struct. Use aforementioned leaf instructions to
provide read and write primitives for the enclave memory.

Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Acked-by: Jethro Beekman <jethro@fortanix.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
---
 arch/x86/kernel/cpu/sgx/encl.c | 89 ++++++++++++++++++++++++++++++++++
 1 file changed, 89 insertions(+)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 54326efa6c2f..ae45f8f0951e 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -337,10 +337,99 @@ static int sgx_vma_mprotect(struct vm_area_struct *vma,
 	return mprotect_fixup(vma, pprev, start, end, newflags);
 }
 
+static int sgx_encl_debug_read(struct sgx_encl *encl, struct sgx_encl_page *page,
+			       unsigned long addr, void *data)
+{
+	unsigned long offset = addr & ~PAGE_MASK;
+	int ret;
+
+
+	ret = __edbgrd(sgx_get_epc_addr(page->epc_page) + offset, data);
+	if (ret)
+		return -EIO;
+
+	return 0;
+}
+
+static int sgx_encl_debug_write(struct sgx_encl *encl, struct sgx_encl_page *page,
+				unsigned long addr, void *data)
+{
+	unsigned long offset = addr & ~PAGE_MASK;
+	int ret;
+
+	ret = __edbgwr(sgx_get_epc_addr(page->epc_page) + offset, data);
+	if (ret)
+		return -EIO;
+
+	return 0;
+}
+
+static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
+			  void *buf, int len, int write)
+{
+	struct sgx_encl *encl = vma->vm_private_data;
+	struct sgx_encl_page *entry = NULL;
+	char data[sizeof(unsigned long)];
+	unsigned long align;
+	unsigned int flags;
+	int offset;
+	int cnt;
+	int ret = 0;
+	int i;
+
+	/*
+	 * If process was forked, VMA is still there but vm_private_data is set
+	 * to NULL.
+	 */
+	if (!encl)
+		return -EFAULT;
+
+	flags = atomic_read(&encl->flags);
+
+	if (!(flags & SGX_ENCL_DEBUG) || !(flags & SGX_ENCL_INITIALIZED) ||
+	    (flags & SGX_ENCL_DEAD))
+		return -EFAULT;
+
+	for (i = 0; i < len; i += cnt) {
+		entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK);
+		if (IS_ERR(entry)) {
+			ret = PTR_ERR(entry);
+			break;
+		}
+
+		align = ALIGN_DOWN(addr + i, sizeof(unsigned long));
+		offset = (addr + i) & (sizeof(unsigned long) - 1);
+		cnt = sizeof(unsigned long) - offset;
+		cnt = min(cnt, len - i);
+
+		ret = sgx_encl_debug_read(encl, entry, align, data);
+		if (ret)
+			goto out;
+
+		if (write) {
+			memcpy(data + offset, buf + i, cnt);
+			ret = sgx_encl_debug_write(encl, entry, align, data);
+			if (ret)
+				goto out;
+		} else {
+			memcpy(buf + i, data + offset, cnt);
+		}
+
+out:
+		mutex_unlock(&encl->lock);
+
+		if (ret)
+			break;
+	}
+
+	return ret < 0 ? ret : i;
+}
+
 const struct vm_operations_struct sgx_vm_ops = {
 	.open = sgx_vma_open,
 	.fault = sgx_vma_fault,
 	.mprotect = sgx_vma_mprotect,
+	.access = sgx_vma_access,
 };
 
 /**
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 16/24] x86/sgx: Add a page reclaimer
  2020-10-03  4:50 ` [PATCH v39 16/24] x86/sgx: Add a page reclaimer Jarkko Sakkinen
@ 2020-10-03  5:22   ` Haitao Huang
  2020-10-03 13:32     ` Jarkko Sakkinen
  0 siblings, 1 reply; 35+ messages in thread
From: Haitao Huang @ 2020-10-03  5:22 UTC (permalink / raw)
  To: x86, linux-sgx, Jarkko Sakkinen
  Cc: linux-kernel, linux-mm, Jethro Beekman, Jordan Hand,
	Nathaniel McCallum, Chunyang Hui, Seth Moore,
	Sean Christopherson, akpm, andriy.shevchenko, asapek, bp,
	cedric.xing, chenalexchen, conradparker, cyhanish, dave.hansen,
	haitao.huang, kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman,
	puiterwijk, rientjes, tglx, yaozhangx, mikko.ylinen

When I turn on CONFIG_PROVE_LOCKING, kernel reports following suspicious  
RCU usages. Not sure if it is an issue. Just reporting here:

[ +34.337095] =============================
[  +0.000001] WARNING: suspicious RCU usage
[  +0.000002] 5.9.0-rc6-lock-sgx39 #1 Not tainted
[  +0.000001] -----------------------------
[  +0.000001] ./include/linux/xarray.h:1165 suspicious  
rcu_dereference_check() usage!
[  +0.000001]
               other info that might help us debug this:

[  +0.000001]
               rcu_scheduler_active = 2, debug_locks = 1
[  +0.000001] 1 lock held by enclaveos-runne/4238:
[  +0.000001]  #0: ffff9cc6657e45e8 (&mm->mmap_lock#2){++++}-{3:3}, at:  
vm_mmap_pgoff+0xa1/0x120
[  +0.000005]
               stack backtrace:
[  +0.000002] CPU: 1 PID: 4238 Comm: enclaveos-runne Not tainted  
5.9.0-rc6-lock-sgx39 #1
[  +0.000001] Hardware name: Microsoft Corporation Virtual Machine/Virtual  
Machine, BIOS Hyper-V UEFI Release v4.1 04/02/2020
[  +0.000002] Call Trace:
[  +0.000003]  dump_stack+0x7d/0x9f
[  +0.000003]  lockdep_rcu_suspicious+0xce/0xf0
[  +0.000004]  xas_start+0x14c/0x1c0
[  +0.000003]  xas_load+0xf/0x50
[  +0.000002]  xas_find+0x25c/0x2c0
[  +0.000004]  sgx_encl_may_map+0x87/0x1c0
[  +0.000006]  sgx_mmap+0x29/0x70
[  +0.000003]  mmap_region+0x3ee/0x710
[  +0.000006]  do_mmap+0x3f1/0x5e0
[  +0.000004]  vm_mmap_pgoff+0xcd/0x120
[  +0.000007]  ksys_mmap_pgoff+0x1de/0x240
[  +0.000005]  __x64_sys_mmap+0x33/0x40
[  +0.000002]  do_syscall_64+0x37/0x80
[  +0.000003]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.000002] RIP: 0033:0x7fe34efe06ba
[  +0.000002] Code: 89 f5 41 54 49 89 fc 55 53 74 35 49 63 e8 48 63 da 4d  
89 f9 49 89 e8 4d 63 d6 48 89 da 4c 89 ee 4c 89 e7 b8 09 00 00 00 0f 05  
<48> 3d 00 f0 ff ff 77 56 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 00
[  +0.000001] RSP: 002b:00007ffee83eac08 EFLAGS: 00000206 ORIG_RAX:  
0000000000000009
[  +0.000001] RAX: ffffffffffffffda RBX: 0000000000000001 RCX:  
00007fe34efe06ba
[  +0.000001] RDX: 0000000000000001 RSI: 0000000000001000 RDI:  
0000000007fff000
[  +0.000001] RBP: 0000000000000004 R08: 0000000000000004 R09:  
0000000000000000
[  +0.000001] R10: 0000000000000011 R11: 0000000000000206 R12:  
0000000007fff000
[  +0.000001] R13: 0000000000001000 R14: 0000000000000011 R15:  
0000000000000000

[  +0.000010] =============================
[  +0.000001] WARNING: suspicious RCU usage
[  +0.000001] 5.9.0-rc6-lock-sgx39 #1 Not tainted
[  +0.000001] -----------------------------
[  +0.000001] ./include/linux/xarray.h:1181 suspicious  
rcu_dereference_check() usage!
[  +0.000001]
               other info that might help us debug this:

[  +0.000001]
               rcu_scheduler_active = 2, debug_locks = 1
[  +0.000001] 1 lock held by enclaveos-runne/4238:
[  +0.000001]  #0: ffff9cc6657e45e8 (&mm->mmap_lock#2){++++}-{3:3}, at:  
vm_mmap_pgoff+0xa1/0x120
[  +0.000003]
               stack backtrace:
[  +0.000001] CPU: 1 PID: 4238 Comm: enclaveos-runne Not tainted  
5.9.0-rc6-lock-sgx39 #1
[  +0.000001] Hardware name: Microsoft Corporation Virtual Machine/Virtual  
Machine, BIOS Hyper-V UEFI Release v4.1 04/02/2020
[  +0.000001] Call Trace:
[  +0.000001]  dump_stack+0x7d/0x9f
[  +0.000003]  lockdep_rcu_suspicious+0xce/0xf0
[  +0.000003]  xas_descend+0x116/0x120
[  +0.000004]  xas_load+0x42/0x50
[  +0.000002]  xas_find+0x25c/0x2c0
[  +0.000004]  sgx_encl_may_map+0x87/0x1c0
[  +0.000006]  sgx_mmap+0x29/0x70
[  +0.000002]  mmap_region+0x3ee/0x710
[  +0.000006]  do_mmap+0x3f1/0x5e0
[  +0.000004]  vm_mmap_pgoff+0xcd/0x120
[  +0.000007]  ksys_mmap_pgoff+0x1de/0x240
[  +0.000005]  __x64_sys_mmap+0x33/0x40
[  +0.000002]  do_syscall_64+0x37/0x80
[  +0.000002]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.000001] RIP: 0033:0x7fe34efe06ba
[  +0.000001] Code: 89 f5 41 54 49 89 fc 55 53 74 35 49 63 e8 48 63 da 4d  
89 f9 49 89 e8 4d 63 d6 48 89 da 4c 89 ee 4c 89 e7 b8 09 00 00 00 0f 05  
<48> 3d 00 f0 ff ff 77 56 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 00
[  +0.000001] RSP: 002b:00007ffee83eac08 EFLAGS: 00000206 ORIG_RAX:  
0000000000000009
[  +0.000001] RAX: ffffffffffffffda RBX: 0000000000000001 RCX:  
00007fe34efe06ba
[  +0.000001] RDX: 0000000000000001 RSI: 0000000000001000 RDI:  
0000000007fff000
[  +0.000001] RBP: 0000000000000004 R08: 0000000000000004 R09:  
0000000000000000
[  +0.000001] R10: 0000000000000011 R11: 0000000000000206 R12:  
0000000007fff000
[  +0.000001] R13: 0000000000001000 R14: 0000000000000011 R15:  
0000000000000000

[  +0.001117] =============================
[  +0.000001] WARNING: suspicious RCU usage
[  +0.000001] 5.9.0-rc6-lock-sgx39 #1 Not tainted
[  +0.000001] -----------------------------
[  +0.000001] ./include/linux/xarray.h:1181 suspicious  
rcu_dereference_check() usage!
[  +0.000001]
               other info that might help us debug this:

[  +0.000001]
               rcu_scheduler_active = 2, debug_locks = 1
[  +0.000001] 1 lock held by enclaveos-runne/4238:
[  +0.000001]  #0: ffff9cc6657e45e8 (&mm->mmap_lock#2){++++}-{3:3}, at:  
vm_mmap_pgoff+0xa1/0x120
[  +0.000003]
               stack backtrace:
[  +0.000002] CPU: 1 PID: 4238 Comm: enclaveos-runne Not tainted  
5.9.0-rc6-lock-sgx39 #1
[  +0.000001] Hardware name: Microsoft Corporation Virtual Machine/Virtual  
Machine, BIOS Hyper-V UEFI Release v4.1 04/02/2020
[  +0.000001] Call Trace:
[  +0.000002]  dump_stack+0x7d/0x9f
[  +0.000003]  lockdep_rcu_suspicious+0xce/0xf0
[  +0.000003]  sgx_encl_may_map+0x1b0/0x1c0
[  +0.000006]  sgx_mmap+0x29/0x70
[  +0.000002]  mmap_region+0x3ee/0x710
[  +0.000006]  do_mmap+0x3f1/0x5e0
[  +0.000005]  vm_mmap_pgoff+0xcd/0x120
[  +0.000006]  ksys_mmap_pgoff+0x1de/0x240
[  +0.000005]  __x64_sys_mmap+0x33/0x40
[  +0.000002]  do_syscall_64+0x37/0x80
[  +0.000002]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.000002] RIP: 0033:0x7fe34efe06ba
[  +0.000001] Code: 89 f5 41 54 49 89 fc 55 53 74 35 49 63 e8 48 63 da 4d  
89 f9 49 89 e8 4d 63 d6 48 89 da 4c 89 ee 4c 89 e7 b8 09 00 00 00 0f 05  
<48> 3d 00 f0 ff ff 77 56 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 00
[  +0.000001] RSP: 002b:00007ffee83eac08 EFLAGS: 00000206 ORIG_RAX:  
0000000000000009
[  +0.000001] RAX: ffffffffffffffda RBX: 0000000000000003 RCX:  
00007fe34efe06ba
[  +0.000001] RDX: 0000000000000003 RSI: 0000000000010000 RDI:  
0000000007fee000
[  +0.000001] RBP: 0000000000000004 R08: 0000000000000004 R09:  
0000000000000000
[  +0.000001] R10: 0000000000000011 R11: 0000000000000206 R12:  
0000000007fee000
[  +0.000001] R13: 0000000000010000 R14: 0000000000000011 R15:  
0000000000000000

[  +0.003197] =============================
[  +0.000001] WARNING: suspicious RCU usage
[  +0.000001] 5.9.0-rc6-lock-sgx39 #1 Not tainted
[  +0.000001] -----------------------------
[  +0.000001] ./include/linux/xarray.h:1198 suspicious  
rcu_dereference_check() usage!
[  +0.000001]
               other info that might help us debug this:

[  +0.000001]
               rcu_scheduler_active = 2, debug_locks = 1
[  +0.000001] 1 lock held by enclaveos-runne/4238:
[  +0.000001]  #0: ffff9cc6657e45e8 (&mm->mmap_lock#2){++++}-{3:3}, at:  
vm_mmap_pgoff+0xa1/0x120
[  +0.000003]
               stack backtrace:
[  +0.000002] CPU: 1 PID: 4238 Comm: enclaveos-runne Not tainted  
5.9.0-rc6-lock-sgx39 #1
[  +0.000001] Hardware name: Microsoft Corporation Virtual Machine/Virtual  
Machine, BIOS Hyper-V UEFI Release v4.1 04/02/2020
[  +0.000001] Call Trace:
[  +0.000002]  dump_stack+0x7d/0x9f
[  +0.000003]  lockdep_rcu_suspicious+0xce/0xf0
[  +0.000004]  xas_find+0x255/0x2c0
[  +0.000003]  sgx_encl_may_map+0xad/0x1c0
[  +0.000006]  sgx_mmap+0x29/0x70
[  +0.000003]  mmap_region+0x3ee/0x710
[  +0.000005]  do_mmap+0x3f1/0x5e0
[  +0.000005]  vm_mmap_pgoff+0xcd/0x120
[  +0.000007]  ksys_mmap_pgoff+0x1de/0x240
[  +0.000004]  __x64_sys_mmap+0x33/0x40
[  +0.000002]  do_syscall_64+0x37/0x80
[  +0.000002]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.000002] RIP: 0033:0x7fe34efe06ba
[  +0.000001] Code: 89 f5 41 54 49 89 fc 55 53 74 35 49 63 e8 48 63 da 4d  
89 f9 49 89 e8 4d 63 d6 48 89 da 4c 89 ee 4c 89 e7 b8 09 00 00 00 0f 05  
<48> 3d 00 f0 ff ff 77 56 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 00
[  +0.000001] RSP: 002b:00007ffee83eac08 EFLAGS: 00000206 ORIG_RAX:  
0000000000000009
[  +0.000002] RAX: ffffffffffffffda RBX: 0000000000000003 RCX:  
00007fe34efe06ba
[  +0.000001] RDX: 0000000000000003 RSI: 0000000000010000 RDI:  
0000000007fba000
[  +0.000001] RBP: 0000000000000004 R08: 0000000000000004 R09:  
0000000000000000
[  +0.000001] R10: 0000000000000011 R11: 0000000000000206 R12:  
0000000007fba000
[  +0.000001] R13: 0000000000010000 R14: 0000000000000011 R15:  
0000000000000000


On Fri, 02 Oct 2020 23:50:51 -0500, Jarkko Sakkinen  
<jarkko.sakkinen@linux.intel.com> wrote:

> There is a limited amount of EPC available. Therefore, some of it must be
> copied to the regular memory, and only subset kept in the SGX reserved
> memory. While kernel cannot directly access enclave memory, SGX provides  
> a
> set of ENCLS leaf functions to perform reclaiming.
>
> Implement a page reclaimer by using these leaf functions. It picks the
> victim pages in LRU fashion from all the enclaves running in the system.
> The thread ksgxswapd reclaims pages on the event when the number of free
> EPC pages goes below SGX_NR_LOW_PAGES up until it reaches
> SGX_NR_HIGH_PAGES.
>
> sgx_alloc_epc_page() can optionally directly reclaim pages with @reclaim
> set true. A caller must also supply owner for each page so that the
> reclaimer can access the associated enclaves. This is needed for locking,
> as most of the ENCLS leafs cannot be executed concurrently for an  
> enclave.
> The owner is also needed for accessing SECS, which is required to be
> resident when its child pages are being reclaimed.
>
> Cc: linux-mm@kvack.org
> Acked-by: Jethro Beekman <jethro@fortanix.com>
> Tested-by: Jethro Beekman <jethro@fortanix.com>
> Tested-by: Jordan Hand <jorhand@linux.microsoft.com>
> Tested-by: Nathaniel McCallum <npmccallum@redhat.com>
> Tested-by: Chunyang Hui <sanqian.hcy@antfin.com>
> Tested-by: Seth Moore <sethmo@google.com>
> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> ---
>  arch/x86/kernel/cpu/sgx/driver.c |   1 +
>  arch/x86/kernel/cpu/sgx/encl.c   | 344 +++++++++++++++++++++-
>  arch/x86/kernel/cpu/sgx/encl.h   |  41 +++
>  arch/x86/kernel/cpu/sgx/ioctl.c  |  78 ++++-
>  arch/x86/kernel/cpu/sgx/main.c   | 481 +++++++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/sgx/sgx.h    |   9 +
>  6 files changed, 947 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/driver.c  
> b/arch/x86/kernel/cpu/sgx/driver.c
> index d01b28f7ce4a..0446781cc7a2 100644
> --- a/arch/x86/kernel/cpu/sgx/driver.c
> +++ b/arch/x86/kernel/cpu/sgx/driver.c
> @@ -29,6 +29,7 @@ static int sgx_open(struct inode *inode, struct file  
> *file)
>  	atomic_set(&encl->flags, 0);
>  	kref_init(&encl->refcount);
>  	xa_init(&encl->page_array);
> +	INIT_LIST_HEAD(&encl->va_pages);
>  	mutex_init(&encl->lock);
>  	INIT_LIST_HEAD(&encl->mm_list);
>  	spin_lock_init(&encl->mm_lock);
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c  
> b/arch/x86/kernel/cpu/sgx/encl.c
> index c2c4a77af36b..54326efa6c2f 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -12,9 +12,88 @@
>  #include "encls.h"
>  #include "sgx.h"
> +/*
> + * ELDU: Load an EPC page as unblocked. For more info, see "OS  
> Management of EPC
> + * Pages" in the SDM.
> + */
> +static int __sgx_encl_eldu(struct sgx_encl_page *encl_page,
> +			   struct sgx_epc_page *epc_page,
> +			   struct sgx_epc_page *secs_page)
> +{
> +	unsigned long va_offset = SGX_ENCL_PAGE_VA_OFFSET(encl_page);
> +	struct sgx_encl *encl = encl_page->encl;
> +	struct sgx_pageinfo pginfo;
> +	struct sgx_backing b;
> +	pgoff_t page_index;
> +	int ret;
> +
> +	if (secs_page)
> +		page_index = SGX_ENCL_PAGE_INDEX(encl_page);
> +	else
> +		page_index = PFN_DOWN(encl->size);
> +
> +	ret = sgx_encl_get_backing(encl, page_index, &b);
> +	if (ret)
> +		return ret;
> +
> +	pginfo.addr = SGX_ENCL_PAGE_ADDR(encl_page);
> +	pginfo.contents = (unsigned long)kmap_atomic(b.contents);
> +	pginfo.metadata = (unsigned long)kmap_atomic(b.pcmd) +
> +			  b.pcmd_offset;
> +
> +	if (secs_page)
> +		pginfo.secs = (u64)sgx_get_epc_addr(secs_page);
> +	else
> +		pginfo.secs = 0;
> +
> +	ret = __eldu(&pginfo, sgx_get_epc_addr(epc_page),
> +		     sgx_get_epc_addr(encl_page->va_page->epc_page) +
> +				      va_offset);
> +	if (ret) {
> +		if (encls_failed(ret))
> +			ENCLS_WARN(ret, "ELDU");
> +
> +		ret = -EFAULT;
> +	}
> +
> +	kunmap_atomic((void *)(unsigned long)(pginfo.metadata -  
> b.pcmd_offset));
> +	kunmap_atomic((void *)(unsigned long)pginfo.contents);
> +
> +	sgx_encl_put_backing(&b, false);
> +
> +	return ret;
> +}
> +
> +static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page  
> *encl_page,
> +					  struct sgx_epc_page *secs_page)
> +{
> +	unsigned long va_offset = SGX_ENCL_PAGE_VA_OFFSET(encl_page);
> +	struct sgx_encl *encl = encl_page->encl;
> +	struct sgx_epc_page *epc_page;
> +	int ret;
> +
> +	epc_page = sgx_alloc_epc_page(encl_page, false);
> +	if (IS_ERR(epc_page))
> +		return epc_page;
> +
> +	ret = __sgx_encl_eldu(encl_page, epc_page, secs_page);
> +	if (ret) {
> +		sgx_free_epc_page(epc_page);
> +		return ERR_PTR(ret);
> +	}
> +
> +	sgx_free_va_slot(encl_page->va_page, va_offset);
> +	list_move(&encl_page->va_page->list, &encl->va_pages);
> +	encl_page->desc &= ~SGX_ENCL_PAGE_VA_OFFSET_MASK;
> +	encl_page->epc_page = epc_page;
> +
> +	return epc_page;
> +}
> +
>  static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
>  						unsigned long addr)
>  {
> +	struct sgx_epc_page *epc_page;
>  	struct sgx_encl_page *entry;
>  	unsigned int flags;
> @@ -33,10 +112,27 @@ static struct sgx_encl_page  
> *sgx_encl_load_page(struct sgx_encl *encl,
>  		return ERR_PTR(-EFAULT);
> 	/* Page is already resident in the EPC. */
> -	if (entry->epc_page)
> +	if (entry->epc_page) {
> +		if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
> +			return ERR_PTR(-EBUSY);
> +
>  		return entry;
> +	}
> +
> +	if (!(encl->secs.epc_page)) {
> +		epc_page = sgx_encl_eldu(&encl->secs, NULL);
> +		if (IS_ERR(epc_page))
> +			return ERR_CAST(epc_page);
> +	}
> -	return ERR_PTR(-EFAULT);
> +	epc_page = sgx_encl_eldu(entry, encl->secs.epc_page);
> +	if (IS_ERR(epc_page))
> +		return ERR_CAST(epc_page);
> +
> +	encl->secs_child_cnt++;
> +	sgx_mark_page_reclaimable(entry->epc_page);
> +
> +	return entry;
>  }
> static void sgx_mmu_notifier_release(struct mmu_notifier *mn,
> @@ -132,6 +228,9 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct  
> mm_struct *mm)
> 	spin_lock(&encl->mm_lock);
>  	list_add_rcu(&encl_mm->list, &encl->mm_list);
> +	/* Pairs with smp_rmb() in sgx_reclaimer_block(). */
> +	smp_wmb();
> +	encl->mm_list_version++;
>  	spin_unlock(&encl->mm_lock);
> 	return 0;
> @@ -179,6 +278,8 @@ static unsigned int sgx_vma_fault(struct vm_fault  
> *vmf)
>  		goto out;
>  	}
> +	sgx_encl_test_and_clear_young(vma->vm_mm, entry);
> +
>  out:
>  	mutex_unlock(&encl->lock);
>  	return ret;
> @@ -280,6 +381,7 @@ int sgx_encl_find(struct mm_struct *mm, unsigned  
> long addr,
>   */
>  void sgx_encl_destroy(struct sgx_encl *encl)
>  {
> +	struct sgx_va_page *va_page;
>  	struct sgx_encl_page *entry;
>  	unsigned long index;
> @@ -287,6 +389,13 @@ void sgx_encl_destroy(struct sgx_encl *encl)
> 	xa_for_each(&encl->page_array, index, entry) {
>  		if (entry->epc_page) {
> +			/*
> +			 * The page and its radix tree entry cannot be freed
> +			 * if the page is being held by the reclaimer.
> +			 */
> +			if (sgx_unmark_page_reclaimable(entry->epc_page))
> +				continue;
> +
>  			sgx_free_epc_page(entry->epc_page);
>  			encl->secs_child_cnt--;
>  			entry->epc_page = NULL;
> @@ -301,6 +410,19 @@ void sgx_encl_destroy(struct sgx_encl *encl)
>  		sgx_free_epc_page(encl->secs.epc_page);
>  		encl->secs.epc_page = NULL;
>  	}
> +
> +	/*
> +	 * The reclaimer is responsible for checking SGX_ENCL_DEAD before doing
> +	 * EWB, thus it's safe to free VA pages even if the reclaimer holds a
> +	 * reference to the enclave.
> +	 */
> +	while (!list_empty(&encl->va_pages)) {
> +		va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
> +					   list);
> +		list_del(&va_page->list);
> +		sgx_free_epc_page(va_page->epc_page);
> +		kfree(va_page);
> +	}
>  }
> /**
> @@ -329,3 +451,221 @@ void sgx_encl_release(struct kref *ref)
> 	kfree(encl);
>  }
> +
> +static struct page *sgx_encl_get_backing_page(struct sgx_encl *encl,
> +					      pgoff_t index)
> +{
> +	struct inode *inode = encl->backing->f_path.dentry->d_inode;
> +	struct address_space *mapping = inode->i_mapping;
> +	gfp_t gfpmask = mapping_gfp_mask(mapping);
> +
> +	return shmem_read_mapping_page_gfp(mapping, index, gfpmask);
> +}
> +
> +/**
> + * sgx_encl_get_backing() - Pin the backing storage
> + * @encl:	an enclave pointer
> + * @page_index:	enclave page index
> + * @backing:	data for accessing backing storage for the page
> + *
> + * Pin the backing storage pages for storing the encrypted contents and  
> Paging
> + * Crypto MetaData (PCMD) of an enclave page.
> + *
> + * Return:
> + *   0 on success,
> + *   -errno otherwise.
> + */
> +int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long  
> page_index,
> +			 struct sgx_backing *backing)
> +{
> +	pgoff_t pcmd_index = PFN_DOWN(encl->size) + 1 + (page_index >> 5);
> +	struct page *contents;
> +	struct page *pcmd;
> +
> +	contents = sgx_encl_get_backing_page(encl, page_index);
> +	if (IS_ERR(contents))
> +		return PTR_ERR(contents);
> +
> +	pcmd = sgx_encl_get_backing_page(encl, pcmd_index);
> +	if (IS_ERR(pcmd)) {
> +		put_page(contents);
> +		return PTR_ERR(pcmd);
> +	}
> +
> +	backing->page_index = page_index;
> +	backing->contents = contents;
> +	backing->pcmd = pcmd;
> +	backing->pcmd_offset =
> +		(page_index & (PAGE_SIZE / sizeof(struct sgx_pcmd) - 1)) *
> +		sizeof(struct sgx_pcmd);
> +
> +	return 0;
> +}
> +
> +/**
> + * sgx_encl_put_backing() - Unpin the backing storage
> + * @backing:	data for accessing backing storage for the page
> + * @do_write:	mark pages dirty
> + */
> +void sgx_encl_put_backing(struct sgx_backing *backing, bool do_write)
> +{
> +	if (do_write) {
> +		set_page_dirty(backing->pcmd);
> +		set_page_dirty(backing->contents);
> +	}
> +
> +	put_page(backing->pcmd);
> +	put_page(backing->contents);
> +}
> +
> +static int sgx_encl_test_and_clear_young_cb(pte_t *ptep, unsigned long  
> addr,
> +					    void *data)
> +{
> +	pte_t pte;
> +	int ret;
> +
> +	ret = pte_young(*ptep);
> +	if (ret) {
> +		pte = pte_mkold(*ptep);
> +		set_pte_at((struct mm_struct *)data, addr, ptep, pte);
> +	}
> +
> +	return ret;
> +}
> +
> +/**
> + * sgx_encl_test_and_clear_young() - Test and reset the accessed bit
> + * @mm:		mm_struct that is checked
> + * @page:	enclave page to be tested for recent access
> + *
> + * Checks the Access (A) bit from the PTE corresponding to the enclave  
> page and
> + * clears it.
> + *
> + * Return: 1 if the page has been recently accessed and 0 if not.
> + */
> +int sgx_encl_test_and_clear_young(struct mm_struct *mm,
> +				  struct sgx_encl_page *page)
> +{
> +	unsigned long addr = SGX_ENCL_PAGE_ADDR(page);
> +	struct sgx_encl *encl = page->encl;
> +	struct vm_area_struct *vma;
> +	int ret;
> +
> +	ret = sgx_encl_find(mm, addr, &vma);
> +	if (ret)
> +		return 0;
> +
> +	if (encl != vma->vm_private_data)
> +		return 0;
> +
> +	ret = apply_to_page_range(vma->vm_mm, addr, PAGE_SIZE,
> +				  sgx_encl_test_and_clear_young_cb, vma->vm_mm);
> +	if (ret < 0)
> +		return 0;
> +
> +	return ret;
> +}
> +
> +/**
> + * sgx_encl_reserve_page() - Reserve an enclave page
> + * @encl:	an enclave pointer
> + * @addr:	a page address
> + *
> + * Load an enclave page and lock the enclave so that the page can be  
> used by
> + * EDBG* and EMOD*.
> + *
> + * Return:
> + *   an enclave page on success
> + *   -EFAULT	if the load fails
> + */
> +struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl,
> +					    unsigned long addr)
> +{
> +	struct sgx_encl_page *entry;
> +
> +	for ( ; ; ) {
> +		mutex_lock(&encl->lock);
> +
> +		entry = sgx_encl_load_page(encl, addr);
> +		if (PTR_ERR(entry) != -EBUSY)
> +			break;
> +
> +		mutex_unlock(&encl->lock);
> +	}
> +
> +	if (IS_ERR(entry))
> +		mutex_unlock(&encl->lock);
> +
> +	return entry;
> +}
> +
> +/**
> + * sgx_alloc_va_page() - Allocate a Version Array (VA) page
> + *
> + * Allocate a free EPC page and convert it to a Version Array (VA) page.
> + *
> + * Return:
> + *   a VA page,
> + *   -errno otherwise
> + */
> +struct sgx_epc_page *sgx_alloc_va_page(void)
> +{
> +	struct sgx_epc_page *epc_page;
> +	int ret;
> +
> +	epc_page = sgx_alloc_epc_page(NULL, true);
> +	if (IS_ERR(epc_page))
> +		return ERR_CAST(epc_page);
> +
> +	ret = __epa(sgx_get_epc_addr(epc_page));
> +	if (ret) {
> +		WARN_ONCE(1, "EPA returned %d (0x%x)", ret, ret);
> +		sgx_free_epc_page(epc_page);
> +		return ERR_PTR(-EFAULT);
> +	}
> +
> +	return epc_page;
> +}
> +
> +/**
> + * sgx_alloc_va_slot - allocate a VA slot
> + * @va_page:	a &struct sgx_va_page instance
> + *
> + * Allocates a slot from a &struct sgx_va_page instance.
> + *
> + * Return: offset of the slot inside the VA page
> + */
> +unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page)
> +{
> +	int slot = find_first_zero_bit(va_page->slots, SGX_VA_SLOT_COUNT);
> +
> +	if (slot < SGX_VA_SLOT_COUNT)
> +		set_bit(slot, va_page->slots);
> +
> +	return slot << 3;
> +}
> +
> +/**
> + * sgx_free_va_slot - free a VA slot
> + * @va_page:	a &struct sgx_va_page instance
> + * @offset:	offset of the slot inside the VA page
> + *
> + * Frees a slot from a &struct sgx_va_page instance.
> + */
> +void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset)
> +{
> +	clear_bit(offset >> 3, va_page->slots);
> +}
> +
> +/**
> + * sgx_va_page_full - is the VA page full?
> + * @va_page:	a &struct sgx_va_page instance
> + *
> + * Return: true if all slots have been taken
> + */
> +bool sgx_va_page_full(struct sgx_va_page *va_page)
> +{
> +	int slot = find_first_zero_bit(va_page->slots, SGX_VA_SLOT_COUNT);
> +
> +	return slot == SGX_VA_SLOT_COUNT;
> +}
> diff --git a/arch/x86/kernel/cpu/sgx/encl.h  
> b/arch/x86/kernel/cpu/sgx/encl.h
> index 0448d22d3010..e8eb9e9a834e 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.h
> +++ b/arch/x86/kernel/cpu/sgx/encl.h
> @@ -19,6 +19,10 @@
> /**
>   * enum sgx_encl_page_desc - defines bits for an enclave page's  
> descriptor
> + * %SGX_ENCL_PAGE_BEING_RECLAIMED:	The page is in the process of being
> + *					reclaimed.
> + * %SGX_ENCL_PAGE_VA_OFFSET_MASK:	Holds the offset in the Version Array
> + *					(VA) page for a swapped page.
>   * %SGX_ENCL_PAGE_ADDR_MASK:		Holds the virtual address of the page.
>   *
>   * The page address for SECS is zero and is used by the subsystem to  
> recognize
> @@ -26,16 +30,23 @@
>   */
>  enum sgx_encl_page_desc {
>  	/* Bits 11:3 are available when the page is not swapped. */
> +	SGX_ENCL_PAGE_BEING_RECLAIMED		= BIT(3),
> +	SGX_ENCL_PAGE_VA_OFFSET_MASK	= GENMASK_ULL(11, 3),
>  	SGX_ENCL_PAGE_ADDR_MASK		= PAGE_MASK,
>  };
> #define SGX_ENCL_PAGE_ADDR(page) \
>  	((page)->desc & SGX_ENCL_PAGE_ADDR_MASK)
> +#define SGX_ENCL_PAGE_VA_OFFSET(page) \
> +	((page)->desc & SGX_ENCL_PAGE_VA_OFFSET_MASK)
> +#define SGX_ENCL_PAGE_INDEX(page) \
> +	PFN_DOWN((page)->desc - (page)->encl->base)
> struct sgx_encl_page {
>  	unsigned long desc;
>  	unsigned long vm_max_prot_bits;
>  	struct sgx_epc_page *epc_page;
> +	struct sgx_va_page *va_page;
>  	struct sgx_encl *encl;
>  };
> @@ -61,6 +72,7 @@ struct sgx_encl {
>  	struct mutex lock;
>  	struct list_head mm_list;
>  	spinlock_t mm_lock;
> +	unsigned long mm_list_version;
>  	struct file *backing;
>  	struct kref refcount;
>  	struct srcu_struct srcu;
> @@ -68,12 +80,21 @@ struct sgx_encl {
>  	unsigned long size;
>  	unsigned long ssaframesize;
>  	struct xarray page_array;
> +	struct list_head va_pages;
>  	struct sgx_encl_page secs;
>  	cpumask_t cpumask;
>  	unsigned long attributes;
>  	unsigned long attributes_mask;
>  };
> +#define SGX_VA_SLOT_COUNT 512
> +
> +struct sgx_va_page {
> +	struct sgx_epc_page *epc_page;
> +	DECLARE_BITMAP(slots, SGX_VA_SLOT_COUNT);
> +	struct list_head list;
> +};
> +
>  extern const struct vm_operations_struct sgx_vm_ops;
> int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
> @@ -84,4 +105,24 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct  
> mm_struct *mm);
>  int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
>  		     unsigned long end, unsigned long vm_flags);
> +struct sgx_backing {
> +	pgoff_t page_index;
> +	struct page *contents;
> +	struct page *pcmd;
> +	unsigned long pcmd_offset;
> +};
> +
> +int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long  
> page_index,
> +			 struct sgx_backing *backing);
> +void sgx_encl_put_backing(struct sgx_backing *backing, bool do_write);
> +int sgx_encl_test_and_clear_young(struct mm_struct *mm,
> +				  struct sgx_encl_page *page);
> +struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl,
> +					    unsigned long addr);
> +
> +struct sgx_epc_page *sgx_alloc_va_page(void);
> +unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page);
> +void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset);
> +bool sgx_va_page_full(struct sgx_va_page *va_page);
> +
>  #endif /* _X86_ENCL_H */
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c  
> b/arch/x86/kernel/cpu/sgx/ioctl.c
> index 3c04798e83e5..613f6c03598e 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -16,6 +16,43 @@
>  #include "encl.h"
>  #include "encls.h"
> +static struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl)
> +{
> +	struct sgx_va_page *va_page = NULL;
> +	void *err;
> +
> +	BUILD_BUG_ON(SGX_VA_SLOT_COUNT !=
> +		(SGX_ENCL_PAGE_VA_OFFSET_MASK >> 3) + 1);
> +
> +	if (!(encl->page_cnt % SGX_VA_SLOT_COUNT)) {
> +		va_page = kzalloc(sizeof(*va_page), GFP_KERNEL);
> +		if (!va_page)
> +			return ERR_PTR(-ENOMEM);
> +
> +		va_page->epc_page = sgx_alloc_va_page();
> +		if (IS_ERR(va_page->epc_page)) {
> +			err = ERR_CAST(va_page->epc_page);
> +			kfree(va_page);
> +			return err;
> +		}
> +
> +		WARN_ON_ONCE(encl->page_cnt % SGX_VA_SLOT_COUNT);
> +	}
> +	encl->page_cnt++;
> +	return va_page;
> +}
> +
> +static void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page  
> *va_page)
> +{
> +	encl->page_cnt--;
> +
> +	if (va_page) {
> +		sgx_free_epc_page(va_page->epc_page);
> +		list_del(&va_page->list);
> +		kfree(va_page);
> +	}
> +}
> +
>  static u32 sgx_calc_ssa_frame_size(u32 miscselect, u64 xfrm)
>  {
>  	u32 size_max = PAGE_SIZE;
> @@ -80,15 +117,24 @@ static int sgx_validate_secs(const struct sgx_secs  
> *secs)
>  static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
>  {
>  	struct sgx_epc_page *secs_epc;
> +	struct sgx_va_page *va_page;
>  	struct sgx_pageinfo pginfo;
>  	struct sgx_secinfo secinfo;
>  	unsigned long encl_size;
>  	struct file *backing;
>  	long ret;
> +	va_page = sgx_encl_grow(encl);
> +	if (IS_ERR(va_page))
> +		return PTR_ERR(va_page);
> +	else if (va_page)
> +		list_add(&va_page->list, &encl->va_pages);
> +	/* else the tail page of the VA page list had free slots. */
> +
>  	if (sgx_validate_secs(secs)) {
>  		pr_debug("invalid SECS\n");
> -		return -EINVAL;
> +		ret = -EINVAL;
> +		goto err_out_shrink;
>  	}
> 	/* The extra page goes to SECS. */
> @@ -96,12 +142,14 @@ static int sgx_encl_create(struct sgx_encl *encl,  
> struct sgx_secs *secs)
> 	backing = shmem_file_setup("SGX backing", encl_size + (encl_size >> 5),
>  				   VM_NORESERVE);
> -	if (IS_ERR(backing))
> -		return PTR_ERR(backing);
> +	if (IS_ERR(backing)) {
> +		ret = PTR_ERR(backing);
> +		goto err_out_shrink;
> +	}
> 	encl->backing = backing;
> -	secs_epc = __sgx_alloc_epc_page();
> +	secs_epc = sgx_alloc_epc_page(&encl->secs, true);
>  	if (IS_ERR(secs_epc)) {
>  		ret = PTR_ERR(secs_epc);
>  		goto err_out_backing;
> @@ -149,6 +197,9 @@ static int sgx_encl_create(struct sgx_encl *encl,  
> struct sgx_secs *secs)
>  	fput(encl->backing);
>  	encl->backing = NULL;
> +err_out_shrink:
> +	sgx_encl_shrink(encl, va_page);
> +
>  	return ret;
>  }
> @@ -321,21 +372,35 @@ static int sgx_encl_add_page(struct sgx_encl  
> *encl, unsigned long src,
>  {
>  	struct sgx_encl_page *encl_page;
>  	struct sgx_epc_page *epc_page;
> +	struct sgx_va_page *va_page;
>  	int ret;
> 	encl_page = sgx_encl_page_alloc(encl, offset, secinfo->flags);
>  	if (IS_ERR(encl_page))
>  		return PTR_ERR(encl_page);
> -	epc_page = __sgx_alloc_epc_page();
> +	epc_page = sgx_alloc_epc_page(encl_page, true);
>  	if (IS_ERR(epc_page)) {
>  		kfree(encl_page);
>  		return PTR_ERR(epc_page);
>  	}
> +	va_page = sgx_encl_grow(encl);
> +	if (IS_ERR(va_page)) {
> +		ret = PTR_ERR(va_page);
> +		goto err_out_free;
> +	}
> +
>  	mmap_read_lock(current->mm);
>  	mutex_lock(&encl->lock);
> +	/*
> +	 * Adding to encl->va_pages must be done under encl->lock.  Ditto for
> +	 * deleting (via sgx_encl_shrink()) in the error path.
> +	 */
> +	if (va_page)
> +		list_add(&va_page->list, &encl->va_pages);
> +
>  	/*
>  	 * Insert prior to EADD in case of OOM.  EADD modifies MRENCLAVE, i.e.
>  	 * can't be gracefully unwound, while failure on EADD/EXTEND is limited
> @@ -366,6 +431,7 @@ static int sgx_encl_add_page(struct sgx_encl *encl,  
> unsigned long src,
>  			goto err_out;
>  	}
> +	sgx_mark_page_reclaimable(encl_page->epc_page);
>  	mutex_unlock(&encl->lock);
>  	mmap_read_unlock(current->mm);
>  	return ret;
> @@ -374,9 +440,11 @@ static int sgx_encl_add_page(struct sgx_encl *encl,  
> unsigned long src,
>  	xa_erase(&encl->page_array, PFN_DOWN(encl_page->desc));
> err_out_unlock:
> +	sgx_encl_shrink(encl, va_page);
>  	mutex_unlock(&encl->lock);
>  	mmap_read_unlock(current->mm);
> +err_out_free:
>  	sgx_free_epc_page(epc_page);
>  	kfree(encl_page);
> diff --git a/arch/x86/kernel/cpu/sgx/main.c  
> b/arch/x86/kernel/cpu/sgx/main.c
> index 4137254fb29e..3f9130501370 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -16,6 +16,395 @@
>  struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
>  static int sgx_nr_epc_sections;
>  static struct task_struct *ksgxswapd_tsk;
> +static DECLARE_WAIT_QUEUE_HEAD(ksgxswapd_waitq);
> +static LIST_HEAD(sgx_active_page_list);
> +static DEFINE_SPINLOCK(sgx_active_page_list_lock);
> +
> +/**
> + * sgx_mark_page_reclaimable() - Mark a page as reclaimable
> + * @page:	EPC page
> + *
> + * Mark a page as reclaimable and add it to the active page list. Pages
> + * are automatically removed from the active list when freed.
> + */
> +void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
> +{
> +	spin_lock(&sgx_active_page_list_lock);
> +	page->desc |= SGX_EPC_PAGE_RECLAIMABLE;
> +	list_add_tail(&page->list, &sgx_active_page_list);
> +	spin_unlock(&sgx_active_page_list_lock);
> +}
> +
> +/**
> + * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list
> + * @page:	EPC page
> + *
> + * Clear the reclaimable flag and remove the page from the active page  
> list.
> + *
> + * Return:
> + *   0 on success,
> + *   -EBUSY if the page is in the process of being reclaimed
> + */
> +int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
> +{
> +	/*
> +	 * Remove the page from the active list if necessary.  If the page
> +	 * is actively being reclaimed, i.e. RECLAIMABLE is set but the
> +	 * page isn't on the active list, return -EBUSY as we can't free
> +	 * the page at this time since it is "owned" by the reclaimer.
> +	 */
> +	spin_lock(&sgx_active_page_list_lock);
> +	if (page->desc & SGX_EPC_PAGE_RECLAIMABLE) {
> +		if (list_empty(&page->list)) {
> +			spin_unlock(&sgx_active_page_list_lock);
> +			return -EBUSY;
> +		}
> +		list_del(&page->list);
> +		page->desc &= ~SGX_EPC_PAGE_RECLAIMABLE;
> +	}
> +	spin_unlock(&sgx_active_page_list_lock);
> +
> +	return 0;
> +}
> +
> +static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page)
> +{
> +	struct sgx_encl_page *page = epc_page->owner;
> +	struct sgx_encl *encl = page->encl;
> +	struct sgx_encl_mm *encl_mm;
> +	bool ret = true;
> +	int idx;
> +
> +	idx = srcu_read_lock(&encl->srcu);
> +
> +	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
> +		if (!mmget_not_zero(encl_mm->mm))
> +			continue;
> +
> +		mmap_read_lock(encl_mm->mm);
> +		ret = !sgx_encl_test_and_clear_young(encl_mm->mm, page);
> +		mmap_read_unlock(encl_mm->mm);
> +
> +		mmput_async(encl_mm->mm);
> +
> +		if (!ret || (atomic_read(&encl->flags) & SGX_ENCL_DEAD))
> +			break;
> +	}
> +
> +	srcu_read_unlock(&encl->srcu, idx);
> +
> +	if (!ret && !(atomic_read(&encl->flags) & SGX_ENCL_DEAD))
> +		return false;
> +
> +	return true;
> +}
> +
> +static void sgx_reclaimer_block(struct sgx_epc_page *epc_page)
> +{
> +	struct sgx_encl_page *page = epc_page->owner;
> +	unsigned long addr = SGX_ENCL_PAGE_ADDR(page);
> +	struct sgx_encl *encl = page->encl;
> +	unsigned long mm_list_version;
> +	struct sgx_encl_mm *encl_mm;
> +	struct vm_area_struct *vma;
> +	int idx, ret;
> +
> +	do {
> +		mm_list_version = encl->mm_list_version;
> +
> +		/* Pairs with smp_rmb() in sgx_encl_mm_add(). */
> +		smp_rmb();
> +
> +		idx = srcu_read_lock(&encl->srcu);
> +
> +		list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
> +			if (!mmget_not_zero(encl_mm->mm))
> +				continue;
> +
> +			mmap_read_lock(encl_mm->mm);
> +
> +			ret = sgx_encl_find(encl_mm->mm, addr, &vma);
> +			if (!ret && encl == vma->vm_private_data)
> +				zap_vma_ptes(vma, addr, PAGE_SIZE);
> +
> +			mmap_read_unlock(encl_mm->mm);
> +
> +			mmput_async(encl_mm->mm);
> +		}
> +
> +		srcu_read_unlock(&encl->srcu, idx);
> +	} while (unlikely(encl->mm_list_version != mm_list_version));
> +
> +	mutex_lock(&encl->lock);
> +
> +	if (!(atomic_read(&encl->flags) & SGX_ENCL_DEAD)) {
> +		ret = __eblock(sgx_get_epc_addr(epc_page));
> +		if (encls_failed(ret))
> +			ENCLS_WARN(ret, "EBLOCK");
> +	}
> +
> +	mutex_unlock(&encl->lock);
> +}
> +
> +static int __sgx_encl_ewb(struct sgx_epc_page *epc_page, void *va_slot,
> +			  struct sgx_backing *backing)
> +{
> +	struct sgx_pageinfo pginfo;
> +	int ret;
> +
> +	pginfo.addr = 0;
> +	pginfo.secs = 0;
> +
> +	pginfo.contents = (unsigned long)kmap_atomic(backing->contents);
> +	pginfo.metadata = (unsigned long)kmap_atomic(backing->pcmd) +
> +			  backing->pcmd_offset;
> +
> +	ret = __ewb(&pginfo, sgx_get_epc_addr(epc_page), va_slot);
> +
> +	kunmap_atomic((void *)(unsigned long)(pginfo.metadata -
> +					      backing->pcmd_offset));
> +	kunmap_atomic((void *)(unsigned long)pginfo.contents);
> +
> +	return ret;
> +}
> +
> +static void sgx_ipi_cb(void *info)
> +{
> +}
> +
> +static const cpumask_t *sgx_encl_ewb_cpumask(struct sgx_encl *encl)
> +{
> +	cpumask_t *cpumask = &encl->cpumask;
> +	struct sgx_encl_mm *encl_mm;
> +	int idx;
> +
> +	/*
> +	 * Can race with sgx_encl_mm_add(), but ETRACK has already been
> +	 * executed, which means that the CPUs running in the new mm will enter
> +	 * into the enclave with a fresh epoch.
> +	 */
> +	cpumask_clear(cpumask);
> +
> +	idx = srcu_read_lock(&encl->srcu);
> +
> +	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
> +		if (!mmget_not_zero(encl_mm->mm))
> +			continue;
> +
> +		cpumask_or(cpumask, cpumask, mm_cpumask(encl_mm->mm));
> +
> +		mmput_async(encl_mm->mm);
> +	}
> +
> +	srcu_read_unlock(&encl->srcu, idx);
> +
> +	return cpumask;
> +}
> +
> +/*
> + * Swap page to the regular memory transformed to the blocked state by  
> using
> + * EBLOCK, which means that it can no loger be referenced (no new TLB  
> entries).
> + *
> + * The first trial just tries to write the page assuming that some  
> other thread
> + * has reset the count for threads inside the enlave by using ETRACK,  
> and
> + * previous thread count has been zeroed out. The second trial calls  
> ETRACK
> + * before EWB. If that fails we kick all the HW threads out, and then  
> do EWB,
> + * which should be guaranteed the succeed.
> + */
> +static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
> +			 struct sgx_backing *backing)
> +{
> +	struct sgx_encl_page *encl_page = epc_page->owner;
> +	struct sgx_encl *encl = encl_page->encl;
> +	struct sgx_va_page *va_page;
> +	unsigned int va_offset;
> +	void *va_slot;
> +	int ret;
> +
> +	encl_page->desc &= ~SGX_ENCL_PAGE_BEING_RECLAIMED;
> +
> +	va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
> +				   list);
> +	va_offset = sgx_alloc_va_slot(va_page);
> +	va_slot = sgx_get_epc_addr(va_page->epc_page) + va_offset;
> +	if (sgx_va_page_full(va_page))
> +		list_move_tail(&va_page->list, &encl->va_pages);
> +
> +	ret = __sgx_encl_ewb(epc_page, va_slot, backing);
> +	if (ret == SGX_NOT_TRACKED) {
> +		ret = __etrack(sgx_get_epc_addr(encl->secs.epc_page));
> +		if (ret) {
> +			if (encls_failed(ret))
> +				ENCLS_WARN(ret, "ETRACK");
> +		}
> +
> +		ret = __sgx_encl_ewb(epc_page, va_slot, backing);
> +		if (ret == SGX_NOT_TRACKED) {
> +			/*
> +			 * Slow path, send IPIs to kick cpus out of the
> +			 * enclave.  Note, it's imperative that the cpu
> +			 * mask is generated *after* ETRACK, else we'll
> +			 * miss cpus that entered the enclave between
> +			 * generating the mask and incrementing epoch.
> +			 */
> +			on_each_cpu_mask(sgx_encl_ewb_cpumask(encl),
> +					 sgx_ipi_cb, NULL, 1);
> +			ret = __sgx_encl_ewb(epc_page, va_slot, backing);
> +		}
> +	}
> +
> +	if (ret) {
> +		if (encls_failed(ret))
> +			ENCLS_WARN(ret, "EWB");
> +
> +		sgx_free_va_slot(va_page, va_offset);
> +	} else {
> +		encl_page->desc |= va_offset;
> +		encl_page->va_page = va_page;
> +	}
> +}
> +
> +static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
> +				struct sgx_backing *backing)
> +{
> +	struct sgx_encl_page *encl_page = epc_page->owner;
> +	struct sgx_encl *encl = encl_page->encl;
> +	struct sgx_backing secs_backing;
> +	int ret;
> +
> +	mutex_lock(&encl->lock);
> +
> +	if (atomic_read(&encl->flags) & SGX_ENCL_DEAD) {
> +		ret = __eremove(sgx_get_epc_addr(epc_page));
> +		ENCLS_WARN(ret, "EREMOVE");
> +	} else {
> +		sgx_encl_ewb(epc_page, backing);
> +	}
> +
> +	encl_page->epc_page = NULL;
> +	encl->secs_child_cnt--;
> +
> +	if (!encl->secs_child_cnt) {
> +		if (atomic_read(&encl->flags) & SGX_ENCL_DEAD) {
> +			sgx_free_epc_page(encl->secs.epc_page);
> +			encl->secs.epc_page = NULL;
> +		} else if (atomic_read(&encl->flags) & SGX_ENCL_INITIALIZED) {
> +			ret = sgx_encl_get_backing(encl, PFN_DOWN(encl->size),
> +						   &secs_backing);
> +			if (ret)
> +				goto out;
> +
> +			sgx_encl_ewb(encl->secs.epc_page, &secs_backing);
> +
> +			sgx_free_epc_page(encl->secs.epc_page);
> +			encl->secs.epc_page = NULL;
> +
> +			sgx_encl_put_backing(&secs_backing, true);
> +		}
> +	}
> +
> +out:
> +	mutex_unlock(&encl->lock);
> +}
> +
> +/*
> + * Take a fixed number of pages from the head of the active page pool  
> and
> + * reclaim them to the enclave's private shmem files. Skip the pages,  
> which have
> + * been accessed since the last scan. Move those pages to the tail of  
> active
> + * page pool so that the pages get scanned in LRU like fashion.
> + *
> + * Batch process a chunk of pages (at the moment 16) in order to  
> degrade amount
> + * of IPI's and ETRACK's potentially required. sgx_encl_ewb() does  
> degrade a bit
> + * among the HW threads with three stage EWB pipeline (EWB, ETRACK +  
> EWB and IPI
> + * + EWB) but not sufficiently. Reclaiming one page at a time would  
> also be
> + * problematic as it would increase the lock contention too much, which  
> would
> + * halt forward progress.
> + */
> +static void sgx_reclaim_pages(void)
> +{
> +	struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
> +	struct sgx_backing backing[SGX_NR_TO_SCAN];
> +	struct sgx_epc_section *section;
> +	struct sgx_encl_page *encl_page;
> +	struct sgx_epc_page *epc_page;
> +	int cnt = 0;
> +	int ret;
> +	int i;
> +
> +	spin_lock(&sgx_active_page_list_lock);
> +	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
> +		if (list_empty(&sgx_active_page_list))
> +			break;
> +
> +		epc_page = list_first_entry(&sgx_active_page_list,
> +					    struct sgx_epc_page, list);
> +		list_del_init(&epc_page->list);
> +		encl_page = epc_page->owner;
> +
> +		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0)
> +			chunk[cnt++] = epc_page;
> +		else
> +			/* The owner is freeing the page. No need to add the
> +			 * page back to the list of reclaimable pages.
> +			 */
> +			epc_page->desc &= ~SGX_EPC_PAGE_RECLAIMABLE;
> +	}
> +	spin_unlock(&sgx_active_page_list_lock);
> +
> +	for (i = 0; i < cnt; i++) {
> +		epc_page = chunk[i];
> +		encl_page = epc_page->owner;
> +
> +		if (!sgx_reclaimer_age(epc_page))
> +			goto skip;
> +
> +		ret = sgx_encl_get_backing(encl_page->encl,
> +					   SGX_ENCL_PAGE_INDEX(encl_page),
> +					   &backing[i]);
> +		if (ret)
> +			goto skip;
> +
> +		mutex_lock(&encl_page->encl->lock);
> +		encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED;
> +		mutex_unlock(&encl_page->encl->lock);
> +		continue;
> +
> +skip:
> +		spin_lock(&sgx_active_page_list_lock);
> +		list_add_tail(&epc_page->list, &sgx_active_page_list);
> +		spin_unlock(&sgx_active_page_list_lock);
> +
> +		kref_put(&encl_page->encl->refcount, sgx_encl_release);
> +
> +		chunk[i] = NULL;
> +	}
> +
> +	for (i = 0; i < cnt; i++) {
> +		epc_page = chunk[i];
> +		if (epc_page)
> +			sgx_reclaimer_block(epc_page);
> +	}
> +
> +	for (i = 0; i < cnt; i++) {
> +		epc_page = chunk[i];
> +		if (!epc_page)
> +			continue;
> +
> +		encl_page = epc_page->owner;
> +		sgx_reclaimer_write(epc_page, &backing[i]);
> +		sgx_encl_put_backing(&backing[i], true);
> +
> +		kref_put(&encl_page->encl->refcount, sgx_encl_release);
> +		epc_page->desc &= ~SGX_EPC_PAGE_RECLAIMABLE;
> +
> +		section = sgx_get_epc_section(epc_page);
> +		spin_lock(&section->lock);
> +		list_add_tail(&epc_page->list, &section->page_list);
> +		section->free_cnt++;
> +		spin_unlock(&section->lock);
> +	}
> +}
> +
> static void sgx_sanitize_section(struct sgx_epc_section *section)
>  {
> @@ -44,6 +433,23 @@ static void sgx_sanitize_section(struct  
> sgx_epc_section *section)
>  	}
>  }
> +static unsigned long sgx_nr_free_pages(void)
> +{
> +	unsigned long cnt = 0;
> +	int i;
> +
> +	for (i = 0; i < sgx_nr_epc_sections; i++)
> +		cnt += sgx_epc_sections[i].free_cnt;
> +
> +	return cnt;
> +}
> +
> +static bool sgx_should_reclaim(unsigned long watermark)
> +{
> +	return sgx_nr_free_pages() < watermark &&
> +	       !list_empty(&sgx_active_page_list);
> +}
> +
>  static int ksgxswapd(void *p)
>  {
>  	int i;
> @@ -69,6 +475,20 @@ static int ksgxswapd(void *p)
>  			WARN(1, "EPC section %d has unsanitized pages.\n", i);
>  	}
> +	while (!kthread_should_stop()) {
> +		if (try_to_freeze())
> +			continue;
> +
> +		wait_event_freezable(ksgxswapd_waitq,
> +				     kthread_should_stop() ||
> +				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
> +
> +		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
> +			sgx_reclaim_pages();
> +
> +		cond_resched();
> +	}
> +
>  	return 0;
>  }
> @@ -94,6 +514,7 @@ static struct sgx_epc_page  
> *__sgx_alloc_epc_page_from_section(struct sgx_epc_sec
> 	page = list_first_entry(&section->page_list, struct sgx_epc_page, list);
>  	list_del_init(&page->list);
> +	section->free_cnt--;
> 	return page;
>  }
> @@ -127,6 +548,57 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
>  	return ERR_PTR(-ENOMEM);
>  }
> +/**
> + * sgx_alloc_epc_page() - Allocate an EPC page
> + * @owner:	the owner of the EPC page
> + * @reclaim:	reclaim pages if necessary
> + *
> + * Iterate through EPC sections and borrow a free EPC page to the  
> caller. When a
> + * page is no longer needed it must be released with  
> sgx_free_epc_page(). If
> + * @reclaim is set to true, directly reclaim pages when we are out of  
> pages. No
> + * mm's can be locked when @reclaim is set to true.
> + *
> + * Finally, wake up ksgxswapd when the number of pages goes below the  
> watermark
> + * before returning back to the caller.
> + *
> + * Return:
> + *   an EPC page,
> + *   -errno on error
> + */
> +struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> +{
> +	struct sgx_epc_page *entry;
> +
> +	for ( ; ; ) {
> +		entry = __sgx_alloc_epc_page();
> +		if (!IS_ERR(entry)) {
> +			entry->owner = owner;
> +			break;
> +		}
> +
> +		if (list_empty(&sgx_active_page_list))
> +			return ERR_PTR(-ENOMEM);
> +
> +		if (!reclaim) {
> +			entry = ERR_PTR(-EBUSY);
> +			break;
> +		}
> +
> +		if (signal_pending(current)) {
> +			entry = ERR_PTR(-ERESTARTSYS);
> +			break;
> +		}
> +
> +		sgx_reclaim_pages();
> +		schedule();
> +	}
> +
> +	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> +		wake_up(&ksgxswapd_waitq);
> +
> +	return entry;
> +}
> +
>  /**
>   * sgx_free_epc_page() - Free an EPC page
>   * @page:	an EPC page
> @@ -138,12 +610,20 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>  	struct sgx_epc_section *section = sgx_get_epc_section(page);
>  	int ret;
> +	/*
> +	 * Don't take sgx_active_page_list_lock when asserting the page isn't
> +	 * reclaimable, missing a WARN in the very rare case is preferable to
> +	 * unnecessarily taking a global lock in the common case.
> +	 */
> +	WARN_ON_ONCE(page->desc & SGX_EPC_PAGE_RECLAIMABLE);
> +
>  	ret = __eremove(sgx_get_epc_addr(page));
>  	if (WARN_ONCE(ret, "EREMOVE returned %d (0x%x)", ret, ret))
>  		return;
> 	spin_lock(&section->lock);
>  	list_add_tail(&page->list, &section->page_list);
> +	section->free_cnt++;
>  	spin_unlock(&section->lock);
>  }
> @@ -194,6 +674,7 @@ static bool __init sgx_setup_epc_section(u64 addr,  
> u64 size,
>  		list_add_tail(&page->list, &section->unsanitized_page_list);
>  	}
> +	section->free_cnt = nr_pages;
>  	return true;
> err_out:
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h  
> b/arch/x86/kernel/cpu/sgx/sgx.h
> index 8d126070db1e..ec4f7b338dbe 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -15,6 +15,7 @@
> struct sgx_epc_page {
>  	unsigned long desc;
> +	struct sgx_encl_page *owner;
>  	struct list_head list;
>  };
> @@ -27,6 +28,7 @@ struct sgx_epc_page {
>  struct sgx_epc_section {
>  	unsigned long pa;
>  	void *va;
> +	unsigned long free_cnt;
>  	struct list_head page_list;
>  	struct list_head unsanitized_page_list;
>  	spinlock_t lock;
> @@ -35,6 +37,10 @@ struct sgx_epc_section {
>  #define SGX_EPC_SECTION_MASK		GENMASK(7, 0)
>  #define SGX_MAX_EPC_SECTIONS		(SGX_EPC_SECTION_MASK + 1)
>  #define SGX_MAX_ADD_PAGES_LENGTH	0x100000
> +#define SGX_EPC_PAGE_RECLAIMABLE	BIT(8)
> +#define SGX_NR_TO_SCAN			16
> +#define SGX_NR_LOW_PAGES		32
> +#define SGX_NR_HIGH_PAGES		64
> extern struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
> @@ -50,7 +56,10 @@ static inline void *sgx_get_epc_addr(struct  
> sgx_epc_page *page)
>  	return section->va + (page->desc & PAGE_MASK) - section->pa;
>  }
> +void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
> +int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
>  struct sgx_epc_page *__sgx_alloc_epc_page(void);
> +struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
>  void sgx_free_epc_page(struct sgx_epc_page *page);
> #endif /* _X86_SGX_H */


-- 
Using Opera's mail client: http://www.opera.com/mail/


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 16/24] x86/sgx: Add a page reclaimer
  2020-10-03  5:22   ` Haitao Huang
@ 2020-10-03 13:32     ` Jarkko Sakkinen
  2020-10-03 18:23       ` Haitao Huang
  0 siblings, 1 reply; 35+ messages in thread
From: Jarkko Sakkinen @ 2020-10-03 13:32 UTC (permalink / raw)
  To: Haitao Huang
  Cc: x86, linux-sgx, linux-kernel, linux-mm, Jethro Beekman,
	Jordan Hand, Nathaniel McCallum, Chunyang Hui, Seth Moore,
	Sean Christopherson, akpm, andriy.shevchenko, asapek, bp,
	cedric.xing, chenalexchen, conradparker, cyhanish, dave.hansen,
	haitao.huang, kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman,
	puiterwijk, rientjes, tglx, yaozhangx, mikko.ylinen, willy

On Sat, Oct 03, 2020 at 12:22:47AM -0500, Haitao Huang wrote:
> When I turn on CONFIG_PROVE_LOCKING, kernel reports following suspicious RCU
> usages. Not sure if it is an issue. Just reporting here:

I'm glad to hear that my tip helped you to get us the data.

This does not look like an issue in the page reclaimer, which was not
obvious for me before. That's a good thing. I was really worried about
that because it has been very stable for a long period now. The last
bug fix for the reclaimer was done in June in v31 version of the patch
set and after that it has been unchanged (except possibly some renames
requested by Boris).

I wildly guess I have a bad usage pattern for xarray. I migrated to it
in v36, and it is entirely possible that I've misused it. It was the
first time that I ever used it. Before xarray we had radix_tree but
based Matthew Wilcox feedback I did a migration to xarray.

What I'd ask you to do next is to, if by any means possible, to try to
run the same test with v35 so we can verify this. That one still has
the radix tree.

Thank you.

/Jarkko

> 
> [ +34.337095] =============================
> [  +0.000001] WARNING: suspicious RCU usage
> [  +0.000002] 5.9.0-rc6-lock-sgx39 #1 Not tainted
> [  +0.000001] -----------------------------
> [  +0.000001] ./include/linux/xarray.h:1165 suspicious
> rcu_dereference_check() usage!
> [  +0.000001]
>               other info that might help us debug this:
> 
> [  +0.000001]
>               rcu_scheduler_active = 2, debug_locks = 1
> [  +0.000001] 1 lock held by enclaveos-runne/4238:
> [  +0.000001]  #0: ffff9cc6657e45e8 (&mm->mmap_lock#2){++++}-{3:3}, at:
> vm_mmap_pgoff+0xa1/0x120
> [  +0.000005]
>               stack backtrace:
> [  +0.000002] CPU: 1 PID: 4238 Comm: enclaveos-runne Not tainted
> 5.9.0-rc6-lock-sgx39 #1
> [  +0.000001] Hardware name: Microsoft Corporation Virtual Machine/Virtual
> Machine, BIOS Hyper-V UEFI Release v4.1 04/02/2020
> [  +0.000002] Call Trace:
> [  +0.000003]  dump_stack+0x7d/0x9f
> [  +0.000003]  lockdep_rcu_suspicious+0xce/0xf0
> [  +0.000004]  xas_start+0x14c/0x1c0
> [  +0.000003]  xas_load+0xf/0x50
> [  +0.000002]  xas_find+0x25c/0x2c0
> [  +0.000004]  sgx_encl_may_map+0x87/0x1c0
> [  +0.000006]  sgx_mmap+0x29/0x70
> [  +0.000003]  mmap_region+0x3ee/0x710
> [  +0.000006]  do_mmap+0x3f1/0x5e0
> [  +0.000004]  vm_mmap_pgoff+0xcd/0x120
> [  +0.000007]  ksys_mmap_pgoff+0x1de/0x240
> [  +0.000005]  __x64_sys_mmap+0x33/0x40
> [  +0.000002]  do_syscall_64+0x37/0x80
> [  +0.000003]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  +0.000002] RIP: 0033:0x7fe34efe06ba
> [  +0.000002] Code: 89 f5 41 54 49 89 fc 55 53 74 35 49 63 e8 48 63 da 4d 89
> f9 49 89 e8 4d 63 d6 48 89 da 4c 89 ee 4c 89 e7 b8 09 00 00 00 0f 05 <48> 3d
> 00 f0 ff ff 77 56 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 00
> [  +0.000001] RSP: 002b:00007ffee83eac08 EFLAGS: 00000206 ORIG_RAX:
> 0000000000000009
> [  +0.000001] RAX: ffffffffffffffda RBX: 0000000000000001 RCX:
> 00007fe34efe06ba
> [  +0.000001] RDX: 0000000000000001 RSI: 0000000000001000 RDI:
> 0000000007fff000
> [  +0.000001] RBP: 0000000000000004 R08: 0000000000000004 R09:
> 0000000000000000
> [  +0.000001] R10: 0000000000000011 R11: 0000000000000206 R12:
> 0000000007fff000
> [  +0.000001] R13: 0000000000001000 R14: 0000000000000011 R15:
> 0000000000000000
> 
> [  +0.000010] =============================
> [  +0.000001] WARNING: suspicious RCU usage
> [  +0.000001] 5.9.0-rc6-lock-sgx39 #1 Not tainted
> [  +0.000001] -----------------------------
> [  +0.000001] ./include/linux/xarray.h:1181 suspicious
> rcu_dereference_check() usage!
> [  +0.000001]
>               other info that might help us debug this:
> 
> [  +0.000001]
>               rcu_scheduler_active = 2, debug_locks = 1
> [  +0.000001] 1 lock held by enclaveos-runne/4238:
> [  +0.000001]  #0: ffff9cc6657e45e8 (&mm->mmap_lock#2){++++}-{3:3}, at:
> vm_mmap_pgoff+0xa1/0x120
> [  +0.000003]
>               stack backtrace:
> [  +0.000001] CPU: 1 PID: 4238 Comm: enclaveos-runne Not tainted
> 5.9.0-rc6-lock-sgx39 #1
> [  +0.000001] Hardware name: Microsoft Corporation Virtual Machine/Virtual
> Machine, BIOS Hyper-V UEFI Release v4.1 04/02/2020
> [  +0.000001] Call Trace:
> [  +0.000001]  dump_stack+0x7d/0x9f
> [  +0.000003]  lockdep_rcu_suspicious+0xce/0xf0
> [  +0.000003]  xas_descend+0x116/0x120
> [  +0.000004]  xas_load+0x42/0x50
> [  +0.000002]  xas_find+0x25c/0x2c0
> [  +0.000004]  sgx_encl_may_map+0x87/0x1c0
> [  +0.000006]  sgx_mmap+0x29/0x70
> [  +0.000002]  mmap_region+0x3ee/0x710
> [  +0.000006]  do_mmap+0x3f1/0x5e0
> [  +0.000004]  vm_mmap_pgoff+0xcd/0x120
> [  +0.000007]  ksys_mmap_pgoff+0x1de/0x240
> [  +0.000005]  __x64_sys_mmap+0x33/0x40
> [  +0.000002]  do_syscall_64+0x37/0x80
> [  +0.000002]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  +0.000001] RIP: 0033:0x7fe34efe06ba
> [  +0.000001] Code: 89 f5 41 54 49 89 fc 55 53 74 35 49 63 e8 48 63 da 4d 89
> f9 49 89 e8 4d 63 d6 48 89 da 4c 89 ee 4c 89 e7 b8 09 00 00 00 0f 05 <48> 3d
> 00 f0 ff ff 77 56 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 00
> [  +0.000001] RSP: 002b:00007ffee83eac08 EFLAGS: 00000206 ORIG_RAX:
> 0000000000000009
> [  +0.000001] RAX: ffffffffffffffda RBX: 0000000000000001 RCX:
> 00007fe34efe06ba
> [  +0.000001] RDX: 0000000000000001 RSI: 0000000000001000 RDI:
> 0000000007fff000
> [  +0.000001] RBP: 0000000000000004 R08: 0000000000000004 R09:
> 0000000000000000
> [  +0.000001] R10: 0000000000000011 R11: 0000000000000206 R12:
> 0000000007fff000
> [  +0.000001] R13: 0000000000001000 R14: 0000000000000011 R15:
> 0000000000000000
> 
> [  +0.001117] =============================
> [  +0.000001] WARNING: suspicious RCU usage
> [  +0.000001] 5.9.0-rc6-lock-sgx39 #1 Not tainted
> [  +0.000001] -----------------------------
> [  +0.000001] ./include/linux/xarray.h:1181 suspicious
> rcu_dereference_check() usage!
> [  +0.000001]
>               other info that might help us debug this:
> 
> [  +0.000001]
>               rcu_scheduler_active = 2, debug_locks = 1
> [  +0.000001] 1 lock held by enclaveos-runne/4238:
> [  +0.000001]  #0: ffff9cc6657e45e8 (&mm->mmap_lock#2){++++}-{3:3}, at:
> vm_mmap_pgoff+0xa1/0x120
> [  +0.000003]
>               stack backtrace:
> [  +0.000002] CPU: 1 PID: 4238 Comm: enclaveos-runne Not tainted
> 5.9.0-rc6-lock-sgx39 #1
> [  +0.000001] Hardware name: Microsoft Corporation Virtual Machine/Virtual
> Machine, BIOS Hyper-V UEFI Release v4.1 04/02/2020
> [  +0.000001] Call Trace:
> [  +0.000002]  dump_stack+0x7d/0x9f
> [  +0.000003]  lockdep_rcu_suspicious+0xce/0xf0
> [  +0.000003]  sgx_encl_may_map+0x1b0/0x1c0
> [  +0.000006]  sgx_mmap+0x29/0x70
> [  +0.000002]  mmap_region+0x3ee/0x710
> [  +0.000006]  do_mmap+0x3f1/0x5e0
> [  +0.000005]  vm_mmap_pgoff+0xcd/0x120
> [  +0.000006]  ksys_mmap_pgoff+0x1de/0x240
> [  +0.000005]  __x64_sys_mmap+0x33/0x40
> [  +0.000002]  do_syscall_64+0x37/0x80
> [  +0.000002]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  +0.000002] RIP: 0033:0x7fe34efe06ba
> [  +0.000001] Code: 89 f5 41 54 49 89 fc 55 53 74 35 49 63 e8 48 63 da 4d 89
> f9 49 89 e8 4d 63 d6 48 89 da 4c 89 ee 4c 89 e7 b8 09 00 00 00 0f 05 <48> 3d
> 00 f0 ff ff 77 56 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 00
> [  +0.000001] RSP: 002b:00007ffee83eac08 EFLAGS: 00000206 ORIG_RAX:
> 0000000000000009
> [  +0.000001] RAX: ffffffffffffffda RBX: 0000000000000003 RCX:
> 00007fe34efe06ba
> [  +0.000001] RDX: 0000000000000003 RSI: 0000000000010000 RDI:
> 0000000007fee000
> [  +0.000001] RBP: 0000000000000004 R08: 0000000000000004 R09:
> 0000000000000000
> [  +0.000001] R10: 0000000000000011 R11: 0000000000000206 R12:
> 0000000007fee000
> [  +0.000001] R13: 0000000000010000 R14: 0000000000000011 R15:
> 0000000000000000
> 
> [  +0.003197] =============================
> [  +0.000001] WARNING: suspicious RCU usage
> [  +0.000001] 5.9.0-rc6-lock-sgx39 #1 Not tainted
> [  +0.000001] -----------------------------
> [  +0.000001] ./include/linux/xarray.h:1198 suspicious
> rcu_dereference_check() usage!
> [  +0.000001]
>               other info that might help us debug this:
> 
> [  +0.000001]
>               rcu_scheduler_active = 2, debug_locks = 1
> [  +0.000001] 1 lock held by enclaveos-runne/4238:
> [  +0.000001]  #0: ffff9cc6657e45e8 (&mm->mmap_lock#2){++++}-{3:3}, at:
> vm_mmap_pgoff+0xa1/0x120
> [  +0.000003]
>               stack backtrace:
> [  +0.000002] CPU: 1 PID: 4238 Comm: enclaveos-runne Not tainted
> 5.9.0-rc6-lock-sgx39 #1
> [  +0.000001] Hardware name: Microsoft Corporation Virtual Machine/Virtual
> Machine, BIOS Hyper-V UEFI Release v4.1 04/02/2020
> [  +0.000001] Call Trace:
> [  +0.000002]  dump_stack+0x7d/0x9f
> [  +0.000003]  lockdep_rcu_suspicious+0xce/0xf0
> [  +0.000004]  xas_find+0x255/0x2c0
> [  +0.000003]  sgx_encl_may_map+0xad/0x1c0
> [  +0.000006]  sgx_mmap+0x29/0x70
> [  +0.000003]  mmap_region+0x3ee/0x710
> [  +0.000005]  do_mmap+0x3f1/0x5e0
> [  +0.000005]  vm_mmap_pgoff+0xcd/0x120
> [  +0.000007]  ksys_mmap_pgoff+0x1de/0x240
> [  +0.000004]  __x64_sys_mmap+0x33/0x40
> [  +0.000002]  do_syscall_64+0x37/0x80
> [  +0.000002]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  +0.000002] RIP: 0033:0x7fe34efe06ba
> [  +0.000001] Code: 89 f5 41 54 49 89 fc 55 53 74 35 49 63 e8 48 63 da 4d 89
> f9 49 89 e8 4d 63 d6 48 89 da 4c 89 ee 4c 89 e7 b8 09 00 00 00 0f 05 <48> 3d
> 00 f0 ff ff 77 56 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 00
> [  +0.000001] RSP: 002b:00007ffee83eac08 EFLAGS: 00000206 ORIG_RAX:
> 0000000000000009
> [  +0.000002] RAX: ffffffffffffffda RBX: 0000000000000003 RCX:
> 00007fe34efe06ba
> [  +0.000001] RDX: 0000000000000003 RSI: 0000000000010000 RDI:
> 0000000007fba000
> [  +0.000001] RBP: 0000000000000004 R08: 0000000000000004 R09:
> 0000000000000000
> [  +0.000001] R10: 0000000000000011 R11: 0000000000000206 R12:
> 0000000007fba000
> [  +0.000001] R13: 0000000000010000 R14: 0000000000000011 R15:
> 0000000000000000
> 
> 
> On Fri, 02 Oct 2020 23:50:51 -0500, Jarkko Sakkinen
> <jarkko.sakkinen@linux.intel.com> wrote:
> 
> > There is a limited amount of EPC available. Therefore, some of it must be
> > copied to the regular memory, and only subset kept in the SGX reserved
> > memory. While kernel cannot directly access enclave memory, SGX provides
> > a
> > set of ENCLS leaf functions to perform reclaiming.
> > 
> > Implement a page reclaimer by using these leaf functions. It picks the
> > victim pages in LRU fashion from all the enclaves running in the system.
> > The thread ksgxswapd reclaims pages on the event when the number of free
> > EPC pages goes below SGX_NR_LOW_PAGES up until it reaches
> > SGX_NR_HIGH_PAGES.
> > 
> > sgx_alloc_epc_page() can optionally directly reclaim pages with @reclaim
> > set true. A caller must also supply owner for each page so that the
> > reclaimer can access the associated enclaves. This is needed for locking,
> > as most of the ENCLS leafs cannot be executed concurrently for an
> > enclave.
> > The owner is also needed for accessing SECS, which is required to be
> > resident when its child pages are being reclaimed.
> > 
> > Cc: linux-mm@kvack.org
> > Acked-by: Jethro Beekman <jethro@fortanix.com>
> > Tested-by: Jethro Beekman <jethro@fortanix.com>
> > Tested-by: Jordan Hand <jorhand@linux.microsoft.com>
> > Tested-by: Nathaniel McCallum <npmccallum@redhat.com>
> > Tested-by: Chunyang Hui <sanqian.hcy@antfin.com>
> > Tested-by: Seth Moore <sethmo@google.com>
> > Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> > ---
> >  arch/x86/kernel/cpu/sgx/driver.c |   1 +
> >  arch/x86/kernel/cpu/sgx/encl.c   | 344 +++++++++++++++++++++-
> >  arch/x86/kernel/cpu/sgx/encl.h   |  41 +++
> >  arch/x86/kernel/cpu/sgx/ioctl.c  |  78 ++++-
> >  arch/x86/kernel/cpu/sgx/main.c   | 481 +++++++++++++++++++++++++++++++
> >  arch/x86/kernel/cpu/sgx/sgx.h    |   9 +
> >  6 files changed, 947 insertions(+), 7 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/cpu/sgx/driver.c
> > b/arch/x86/kernel/cpu/sgx/driver.c
> > index d01b28f7ce4a..0446781cc7a2 100644
> > --- a/arch/x86/kernel/cpu/sgx/driver.c
> > +++ b/arch/x86/kernel/cpu/sgx/driver.c
> > @@ -29,6 +29,7 @@ static int sgx_open(struct inode *inode, struct file
> > *file)
> >  	atomic_set(&encl->flags, 0);
> >  	kref_init(&encl->refcount);
> >  	xa_init(&encl->page_array);
> > +	INIT_LIST_HEAD(&encl->va_pages);
> >  	mutex_init(&encl->lock);
> >  	INIT_LIST_HEAD(&encl->mm_list);
> >  	spin_lock_init(&encl->mm_lock);
> > diff --git a/arch/x86/kernel/cpu/sgx/encl.c
> > b/arch/x86/kernel/cpu/sgx/encl.c
> > index c2c4a77af36b..54326efa6c2f 100644
> > --- a/arch/x86/kernel/cpu/sgx/encl.c
> > +++ b/arch/x86/kernel/cpu/sgx/encl.c
> > @@ -12,9 +12,88 @@
> >  #include "encls.h"
> >  #include "sgx.h"
> > +/*
> > + * ELDU: Load an EPC page as unblocked. For more info, see "OS
> > Management of EPC
> > + * Pages" in the SDM.
> > + */
> > +static int __sgx_encl_eldu(struct sgx_encl_page *encl_page,
> > +			   struct sgx_epc_page *epc_page,
> > +			   struct sgx_epc_page *secs_page)
> > +{
> > +	unsigned long va_offset = SGX_ENCL_PAGE_VA_OFFSET(encl_page);
> > +	struct sgx_encl *encl = encl_page->encl;
> > +	struct sgx_pageinfo pginfo;
> > +	struct sgx_backing b;
> > +	pgoff_t page_index;
> > +	int ret;
> > +
> > +	if (secs_page)
> > +		page_index = SGX_ENCL_PAGE_INDEX(encl_page);
> > +	else
> > +		page_index = PFN_DOWN(encl->size);
> > +
> > +	ret = sgx_encl_get_backing(encl, page_index, &b);
> > +	if (ret)
> > +		return ret;
> > +
> > +	pginfo.addr = SGX_ENCL_PAGE_ADDR(encl_page);
> > +	pginfo.contents = (unsigned long)kmap_atomic(b.contents);
> > +	pginfo.metadata = (unsigned long)kmap_atomic(b.pcmd) +
> > +			  b.pcmd_offset;
> > +
> > +	if (secs_page)
> > +		pginfo.secs = (u64)sgx_get_epc_addr(secs_page);
> > +	else
> > +		pginfo.secs = 0;
> > +
> > +	ret = __eldu(&pginfo, sgx_get_epc_addr(epc_page),
> > +		     sgx_get_epc_addr(encl_page->va_page->epc_page) +
> > +				      va_offset);
> > +	if (ret) {
> > +		if (encls_failed(ret))
> > +			ENCLS_WARN(ret, "ELDU");
> > +
> > +		ret = -EFAULT;
> > +	}
> > +
> > +	kunmap_atomic((void *)(unsigned long)(pginfo.metadata -
> > b.pcmd_offset));
> > +	kunmap_atomic((void *)(unsigned long)pginfo.contents);
> > +
> > +	sgx_encl_put_backing(&b, false);
> > +
> > +	return ret;
> > +}
> > +
> > +static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page
> > *encl_page,
> > +					  struct sgx_epc_page *secs_page)
> > +{
> > +	unsigned long va_offset = SGX_ENCL_PAGE_VA_OFFSET(encl_page);
> > +	struct sgx_encl *encl = encl_page->encl;
> > +	struct sgx_epc_page *epc_page;
> > +	int ret;
> > +
> > +	epc_page = sgx_alloc_epc_page(encl_page, false);
> > +	if (IS_ERR(epc_page))
> > +		return epc_page;
> > +
> > +	ret = __sgx_encl_eldu(encl_page, epc_page, secs_page);
> > +	if (ret) {
> > +		sgx_free_epc_page(epc_page);
> > +		return ERR_PTR(ret);
> > +	}
> > +
> > +	sgx_free_va_slot(encl_page->va_page, va_offset);
> > +	list_move(&encl_page->va_page->list, &encl->va_pages);
> > +	encl_page->desc &= ~SGX_ENCL_PAGE_VA_OFFSET_MASK;
> > +	encl_page->epc_page = epc_page;
> > +
> > +	return epc_page;
> > +}
> > +
> >  static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
> >  						unsigned long addr)
> >  {
> > +	struct sgx_epc_page *epc_page;
> >  	struct sgx_encl_page *entry;
> >  	unsigned int flags;
> > @@ -33,10 +112,27 @@ static struct sgx_encl_page
> > *sgx_encl_load_page(struct sgx_encl *encl,
> >  		return ERR_PTR(-EFAULT);
> > 	/* Page is already resident in the EPC. */
> > -	if (entry->epc_page)
> > +	if (entry->epc_page) {
> > +		if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
> > +			return ERR_PTR(-EBUSY);
> > +
> >  		return entry;
> > +	}
> > +
> > +	if (!(encl->secs.epc_page)) {
> > +		epc_page = sgx_encl_eldu(&encl->secs, NULL);
> > +		if (IS_ERR(epc_page))
> > +			return ERR_CAST(epc_page);
> > +	}
> > -	return ERR_PTR(-EFAULT);
> > +	epc_page = sgx_encl_eldu(entry, encl->secs.epc_page);
> > +	if (IS_ERR(epc_page))
> > +		return ERR_CAST(epc_page);
> > +
> > +	encl->secs_child_cnt++;
> > +	sgx_mark_page_reclaimable(entry->epc_page);
> > +
> > +	return entry;
> >  }
> > static void sgx_mmu_notifier_release(struct mmu_notifier *mn,
> > @@ -132,6 +228,9 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct
> > mm_struct *mm)
> > 	spin_lock(&encl->mm_lock);
> >  	list_add_rcu(&encl_mm->list, &encl->mm_list);
> > +	/* Pairs with smp_rmb() in sgx_reclaimer_block(). */
> > +	smp_wmb();
> > +	encl->mm_list_version++;
> >  	spin_unlock(&encl->mm_lock);
> > 	return 0;
> > @@ -179,6 +278,8 @@ static unsigned int sgx_vma_fault(struct vm_fault
> > *vmf)
> >  		goto out;
> >  	}
> > +	sgx_encl_test_and_clear_young(vma->vm_mm, entry);
> > +
> >  out:
> >  	mutex_unlock(&encl->lock);
> >  	return ret;
> > @@ -280,6 +381,7 @@ int sgx_encl_find(struct mm_struct *mm, unsigned
> > long addr,
> >   */
> >  void sgx_encl_destroy(struct sgx_encl *encl)
> >  {
> > +	struct sgx_va_page *va_page;
> >  	struct sgx_encl_page *entry;
> >  	unsigned long index;
> > @@ -287,6 +389,13 @@ void sgx_encl_destroy(struct sgx_encl *encl)
> > 	xa_for_each(&encl->page_array, index, entry) {
> >  		if (entry->epc_page) {
> > +			/*
> > +			 * The page and its radix tree entry cannot be freed
> > +			 * if the page is being held by the reclaimer.
> > +			 */
> > +			if (sgx_unmark_page_reclaimable(entry->epc_page))
> > +				continue;
> > +
> >  			sgx_free_epc_page(entry->epc_page);
> >  			encl->secs_child_cnt--;
> >  			entry->epc_page = NULL;
> > @@ -301,6 +410,19 @@ void sgx_encl_destroy(struct sgx_encl *encl)
> >  		sgx_free_epc_page(encl->secs.epc_page);
> >  		encl->secs.epc_page = NULL;
> >  	}
> > +
> > +	/*
> > +	 * The reclaimer is responsible for checking SGX_ENCL_DEAD before doing
> > +	 * EWB, thus it's safe to free VA pages even if the reclaimer holds a
> > +	 * reference to the enclave.
> > +	 */
> > +	while (!list_empty(&encl->va_pages)) {
> > +		va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
> > +					   list);
> > +		list_del(&va_page->list);
> > +		sgx_free_epc_page(va_page->epc_page);
> > +		kfree(va_page);
> > +	}
> >  }
> > /**
> > @@ -329,3 +451,221 @@ void sgx_encl_release(struct kref *ref)
> > 	kfree(encl);
> >  }
> > +
> > +static struct page *sgx_encl_get_backing_page(struct sgx_encl *encl,
> > +					      pgoff_t index)
> > +{
> > +	struct inode *inode = encl->backing->f_path.dentry->d_inode;
> > +	struct address_space *mapping = inode->i_mapping;
> > +	gfp_t gfpmask = mapping_gfp_mask(mapping);
> > +
> > +	return shmem_read_mapping_page_gfp(mapping, index, gfpmask);
> > +}
> > +
> > +/**
> > + * sgx_encl_get_backing() - Pin the backing storage
> > + * @encl:	an enclave pointer
> > + * @page_index:	enclave page index
> > + * @backing:	data for accessing backing storage for the page
> > + *
> > + * Pin the backing storage pages for storing the encrypted contents and
> > Paging
> > + * Crypto MetaData (PCMD) of an enclave page.
> > + *
> > + * Return:
> > + *   0 on success,
> > + *   -errno otherwise.
> > + */
> > +int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long
> > page_index,
> > +			 struct sgx_backing *backing)
> > +{
> > +	pgoff_t pcmd_index = PFN_DOWN(encl->size) + 1 + (page_index >> 5);
> > +	struct page *contents;
> > +	struct page *pcmd;
> > +
> > +	contents = sgx_encl_get_backing_page(encl, page_index);
> > +	if (IS_ERR(contents))
> > +		return PTR_ERR(contents);
> > +
> > +	pcmd = sgx_encl_get_backing_page(encl, pcmd_index);
> > +	if (IS_ERR(pcmd)) {
> > +		put_page(contents);
> > +		return PTR_ERR(pcmd);
> > +	}
> > +
> > +	backing->page_index = page_index;
> > +	backing->contents = contents;
> > +	backing->pcmd = pcmd;
> > +	backing->pcmd_offset =
> > +		(page_index & (PAGE_SIZE / sizeof(struct sgx_pcmd) - 1)) *
> > +		sizeof(struct sgx_pcmd);
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * sgx_encl_put_backing() - Unpin the backing storage
> > + * @backing:	data for accessing backing storage for the page
> > + * @do_write:	mark pages dirty
> > + */
> > +void sgx_encl_put_backing(struct sgx_backing *backing, bool do_write)
> > +{
> > +	if (do_write) {
> > +		set_page_dirty(backing->pcmd);
> > +		set_page_dirty(backing->contents);
> > +	}
> > +
> > +	put_page(backing->pcmd);
> > +	put_page(backing->contents);
> > +}
> > +
> > +static int sgx_encl_test_and_clear_young_cb(pte_t *ptep, unsigned long
> > addr,
> > +					    void *data)
> > +{
> > +	pte_t pte;
> > +	int ret;
> > +
> > +	ret = pte_young(*ptep);
> > +	if (ret) {
> > +		pte = pte_mkold(*ptep);
> > +		set_pte_at((struct mm_struct *)data, addr, ptep, pte);
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +/**
> > + * sgx_encl_test_and_clear_young() - Test and reset the accessed bit
> > + * @mm:		mm_struct that is checked
> > + * @page:	enclave page to be tested for recent access
> > + *
> > + * Checks the Access (A) bit from the PTE corresponding to the enclave
> > page and
> > + * clears it.
> > + *
> > + * Return: 1 if the page has been recently accessed and 0 if not.
> > + */
> > +int sgx_encl_test_and_clear_young(struct mm_struct *mm,
> > +				  struct sgx_encl_page *page)
> > +{
> > +	unsigned long addr = SGX_ENCL_PAGE_ADDR(page);
> > +	struct sgx_encl *encl = page->encl;
> > +	struct vm_area_struct *vma;
> > +	int ret;
> > +
> > +	ret = sgx_encl_find(mm, addr, &vma);
> > +	if (ret)
> > +		return 0;
> > +
> > +	if (encl != vma->vm_private_data)
> > +		return 0;
> > +
> > +	ret = apply_to_page_range(vma->vm_mm, addr, PAGE_SIZE,
> > +				  sgx_encl_test_and_clear_young_cb, vma->vm_mm);
> > +	if (ret < 0)
> > +		return 0;
> > +
> > +	return ret;
> > +}
> > +
> > +/**
> > + * sgx_encl_reserve_page() - Reserve an enclave page
> > + * @encl:	an enclave pointer
> > + * @addr:	a page address
> > + *
> > + * Load an enclave page and lock the enclave so that the page can be
> > used by
> > + * EDBG* and EMOD*.
> > + *
> > + * Return:
> > + *   an enclave page on success
> > + *   -EFAULT	if the load fails
> > + */
> > +struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl,
> > +					    unsigned long addr)
> > +{
> > +	struct sgx_encl_page *entry;
> > +
> > +	for ( ; ; ) {
> > +		mutex_lock(&encl->lock);
> > +
> > +		entry = sgx_encl_load_page(encl, addr);
> > +		if (PTR_ERR(entry) != -EBUSY)
> > +			break;
> > +
> > +		mutex_unlock(&encl->lock);
> > +	}
> > +
> > +	if (IS_ERR(entry))
> > +		mutex_unlock(&encl->lock);
> > +
> > +	return entry;
> > +}
> > +
> > +/**
> > + * sgx_alloc_va_page() - Allocate a Version Array (VA) page
> > + *
> > + * Allocate a free EPC page and convert it to a Version Array (VA) page.
> > + *
> > + * Return:
> > + *   a VA page,
> > + *   -errno otherwise
> > + */
> > +struct sgx_epc_page *sgx_alloc_va_page(void)
> > +{
> > +	struct sgx_epc_page *epc_page;
> > +	int ret;
> > +
> > +	epc_page = sgx_alloc_epc_page(NULL, true);
> > +	if (IS_ERR(epc_page))
> > +		return ERR_CAST(epc_page);
> > +
> > +	ret = __epa(sgx_get_epc_addr(epc_page));
> > +	if (ret) {
> > +		WARN_ONCE(1, "EPA returned %d (0x%x)", ret, ret);
> > +		sgx_free_epc_page(epc_page);
> > +		return ERR_PTR(-EFAULT);
> > +	}
> > +
> > +	return epc_page;
> > +}
> > +
> > +/**
> > + * sgx_alloc_va_slot - allocate a VA slot
> > + * @va_page:	a &struct sgx_va_page instance
> > + *
> > + * Allocates a slot from a &struct sgx_va_page instance.
> > + *
> > + * Return: offset of the slot inside the VA page
> > + */
> > +unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page)
> > +{
> > +	int slot = find_first_zero_bit(va_page->slots, SGX_VA_SLOT_COUNT);
> > +
> > +	if (slot < SGX_VA_SLOT_COUNT)
> > +		set_bit(slot, va_page->slots);
> > +
> > +	return slot << 3;
> > +}
> > +
> > +/**
> > + * sgx_free_va_slot - free a VA slot
> > + * @va_page:	a &struct sgx_va_page instance
> > + * @offset:	offset of the slot inside the VA page
> > + *
> > + * Frees a slot from a &struct sgx_va_page instance.
> > + */
> > +void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset)
> > +{
> > +	clear_bit(offset >> 3, va_page->slots);
> > +}
> > +
> > +/**
> > + * sgx_va_page_full - is the VA page full?
> > + * @va_page:	a &struct sgx_va_page instance
> > + *
> > + * Return: true if all slots have been taken
> > + */
> > +bool sgx_va_page_full(struct sgx_va_page *va_page)
> > +{
> > +	int slot = find_first_zero_bit(va_page->slots, SGX_VA_SLOT_COUNT);
> > +
> > +	return slot == SGX_VA_SLOT_COUNT;
> > +}
> > diff --git a/arch/x86/kernel/cpu/sgx/encl.h
> > b/arch/x86/kernel/cpu/sgx/encl.h
> > index 0448d22d3010..e8eb9e9a834e 100644
> > --- a/arch/x86/kernel/cpu/sgx/encl.h
> > +++ b/arch/x86/kernel/cpu/sgx/encl.h
> > @@ -19,6 +19,10 @@
> > /**
> >   * enum sgx_encl_page_desc - defines bits for an enclave page's
> > descriptor
> > + * %SGX_ENCL_PAGE_BEING_RECLAIMED:	The page is in the process of being
> > + *					reclaimed.
> > + * %SGX_ENCL_PAGE_VA_OFFSET_MASK:	Holds the offset in the Version Array
> > + *					(VA) page for a swapped page.
> >   * %SGX_ENCL_PAGE_ADDR_MASK:		Holds the virtual address of the page.
> >   *
> >   * The page address for SECS is zero and is used by the subsystem to
> > recognize
> > @@ -26,16 +30,23 @@
> >   */
> >  enum sgx_encl_page_desc {
> >  	/* Bits 11:3 are available when the page is not swapped. */
> > +	SGX_ENCL_PAGE_BEING_RECLAIMED		= BIT(3),
> > +	SGX_ENCL_PAGE_VA_OFFSET_MASK	= GENMASK_ULL(11, 3),
> >  	SGX_ENCL_PAGE_ADDR_MASK		= PAGE_MASK,
> >  };
> > #define SGX_ENCL_PAGE_ADDR(page) \
> >  	((page)->desc & SGX_ENCL_PAGE_ADDR_MASK)
> > +#define SGX_ENCL_PAGE_VA_OFFSET(page) \
> > +	((page)->desc & SGX_ENCL_PAGE_VA_OFFSET_MASK)
> > +#define SGX_ENCL_PAGE_INDEX(page) \
> > +	PFN_DOWN((page)->desc - (page)->encl->base)
> > struct sgx_encl_page {
> >  	unsigned long desc;
> >  	unsigned long vm_max_prot_bits;
> >  	struct sgx_epc_page *epc_page;
> > +	struct sgx_va_page *va_page;
> >  	struct sgx_encl *encl;
> >  };
> > @@ -61,6 +72,7 @@ struct sgx_encl {
> >  	struct mutex lock;
> >  	struct list_head mm_list;
> >  	spinlock_t mm_lock;
> > +	unsigned long mm_list_version;
> >  	struct file *backing;
> >  	struct kref refcount;
> >  	struct srcu_struct srcu;
> > @@ -68,12 +80,21 @@ struct sgx_encl {
> >  	unsigned long size;
> >  	unsigned long ssaframesize;
> >  	struct xarray page_array;
> > +	struct list_head va_pages;
> >  	struct sgx_encl_page secs;
> >  	cpumask_t cpumask;
> >  	unsigned long attributes;
> >  	unsigned long attributes_mask;
> >  };
> > +#define SGX_VA_SLOT_COUNT 512
> > +
> > +struct sgx_va_page {
> > +	struct sgx_epc_page *epc_page;
> > +	DECLARE_BITMAP(slots, SGX_VA_SLOT_COUNT);
> > +	struct list_head list;
> > +};
> > +
> >  extern const struct vm_operations_struct sgx_vm_ops;
> > int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
> > @@ -84,4 +105,24 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct
> > mm_struct *mm);
> >  int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
> >  		     unsigned long end, unsigned long vm_flags);
> > +struct sgx_backing {
> > +	pgoff_t page_index;
> > +	struct page *contents;
> > +	struct page *pcmd;
> > +	unsigned long pcmd_offset;
> > +};
> > +
> > +int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long
> > page_index,
> > +			 struct sgx_backing *backing);
> > +void sgx_encl_put_backing(struct sgx_backing *backing, bool do_write);
> > +int sgx_encl_test_and_clear_young(struct mm_struct *mm,
> > +				  struct sgx_encl_page *page);
> > +struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl,
> > +					    unsigned long addr);
> > +
> > +struct sgx_epc_page *sgx_alloc_va_page(void);
> > +unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page);
> > +void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset);
> > +bool sgx_va_page_full(struct sgx_va_page *va_page);
> > +
> >  #endif /* _X86_ENCL_H */
> > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c
> > b/arch/x86/kernel/cpu/sgx/ioctl.c
> > index 3c04798e83e5..613f6c03598e 100644
> > --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> > @@ -16,6 +16,43 @@
> >  #include "encl.h"
> >  #include "encls.h"
> > +static struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl)
> > +{
> > +	struct sgx_va_page *va_page = NULL;
> > +	void *err;
> > +
> > +	BUILD_BUG_ON(SGX_VA_SLOT_COUNT !=
> > +		(SGX_ENCL_PAGE_VA_OFFSET_MASK >> 3) + 1);
> > +
> > +	if (!(encl->page_cnt % SGX_VA_SLOT_COUNT)) {
> > +		va_page = kzalloc(sizeof(*va_page), GFP_KERNEL);
> > +		if (!va_page)
> > +			return ERR_PTR(-ENOMEM);
> > +
> > +		va_page->epc_page = sgx_alloc_va_page();
> > +		if (IS_ERR(va_page->epc_page)) {
> > +			err = ERR_CAST(va_page->epc_page);
> > +			kfree(va_page);
> > +			return err;
> > +		}
> > +
> > +		WARN_ON_ONCE(encl->page_cnt % SGX_VA_SLOT_COUNT);
> > +	}
> > +	encl->page_cnt++;
> > +	return va_page;
> > +}
> > +
> > +static void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page
> > *va_page)
> > +{
> > +	encl->page_cnt--;
> > +
> > +	if (va_page) {
> > +		sgx_free_epc_page(va_page->epc_page);
> > +		list_del(&va_page->list);
> > +		kfree(va_page);
> > +	}
> > +}
> > +
> >  static u32 sgx_calc_ssa_frame_size(u32 miscselect, u64 xfrm)
> >  {
> >  	u32 size_max = PAGE_SIZE;
> > @@ -80,15 +117,24 @@ static int sgx_validate_secs(const struct sgx_secs
> > *secs)
> >  static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
> >  {
> >  	struct sgx_epc_page *secs_epc;
> > +	struct sgx_va_page *va_page;
> >  	struct sgx_pageinfo pginfo;
> >  	struct sgx_secinfo secinfo;
> >  	unsigned long encl_size;
> >  	struct file *backing;
> >  	long ret;
> > +	va_page = sgx_encl_grow(encl);
> > +	if (IS_ERR(va_page))
> > +		return PTR_ERR(va_page);
> > +	else if (va_page)
> > +		list_add(&va_page->list, &encl->va_pages);
> > +	/* else the tail page of the VA page list had free slots. */
> > +
> >  	if (sgx_validate_secs(secs)) {
> >  		pr_debug("invalid SECS\n");
> > -		return -EINVAL;
> > +		ret = -EINVAL;
> > +		goto err_out_shrink;
> >  	}
> > 	/* The extra page goes to SECS. */
> > @@ -96,12 +142,14 @@ static int sgx_encl_create(struct sgx_encl *encl,
> > struct sgx_secs *secs)
> > 	backing = shmem_file_setup("SGX backing", encl_size + (encl_size >> 5),
> >  				   VM_NORESERVE);
> > -	if (IS_ERR(backing))
> > -		return PTR_ERR(backing);
> > +	if (IS_ERR(backing)) {
> > +		ret = PTR_ERR(backing);
> > +		goto err_out_shrink;
> > +	}
> > 	encl->backing = backing;
> > -	secs_epc = __sgx_alloc_epc_page();
> > +	secs_epc = sgx_alloc_epc_page(&encl->secs, true);
> >  	if (IS_ERR(secs_epc)) {
> >  		ret = PTR_ERR(secs_epc);
> >  		goto err_out_backing;
> > @@ -149,6 +197,9 @@ static int sgx_encl_create(struct sgx_encl *encl,
> > struct sgx_secs *secs)
> >  	fput(encl->backing);
> >  	encl->backing = NULL;
> > +err_out_shrink:
> > +	sgx_encl_shrink(encl, va_page);
> > +
> >  	return ret;
> >  }
> > @@ -321,21 +372,35 @@ static int sgx_encl_add_page(struct sgx_encl
> > *encl, unsigned long src,
> >  {
> >  	struct sgx_encl_page *encl_page;
> >  	struct sgx_epc_page *epc_page;
> > +	struct sgx_va_page *va_page;
> >  	int ret;
> > 	encl_page = sgx_encl_page_alloc(encl, offset, secinfo->flags);
> >  	if (IS_ERR(encl_page))
> >  		return PTR_ERR(encl_page);
> > -	epc_page = __sgx_alloc_epc_page();
> > +	epc_page = sgx_alloc_epc_page(encl_page, true);
> >  	if (IS_ERR(epc_page)) {
> >  		kfree(encl_page);
> >  		return PTR_ERR(epc_page);
> >  	}
> > +	va_page = sgx_encl_grow(encl);
> > +	if (IS_ERR(va_page)) {
> > +		ret = PTR_ERR(va_page);
> > +		goto err_out_free;
> > +	}
> > +
> >  	mmap_read_lock(current->mm);
> >  	mutex_lock(&encl->lock);
> > +	/*
> > +	 * Adding to encl->va_pages must be done under encl->lock.  Ditto for
> > +	 * deleting (via sgx_encl_shrink()) in the error path.
> > +	 */
> > +	if (va_page)
> > +		list_add(&va_page->list, &encl->va_pages);
> > +
> >  	/*
> >  	 * Insert prior to EADD in case of OOM.  EADD modifies MRENCLAVE, i.e.
> >  	 * can't be gracefully unwound, while failure on EADD/EXTEND is limited
> > @@ -366,6 +431,7 @@ static int sgx_encl_add_page(struct sgx_encl *encl,
> > unsigned long src,
> >  			goto err_out;
> >  	}
> > +	sgx_mark_page_reclaimable(encl_page->epc_page);
> >  	mutex_unlock(&encl->lock);
> >  	mmap_read_unlock(current->mm);
> >  	return ret;
> > @@ -374,9 +440,11 @@ static int sgx_encl_add_page(struct sgx_encl *encl,
> > unsigned long src,
> >  	xa_erase(&encl->page_array, PFN_DOWN(encl_page->desc));
> > err_out_unlock:
> > +	sgx_encl_shrink(encl, va_page);
> >  	mutex_unlock(&encl->lock);
> >  	mmap_read_unlock(current->mm);
> > +err_out_free:
> >  	sgx_free_epc_page(epc_page);
> >  	kfree(encl_page);
> > diff --git a/arch/x86/kernel/cpu/sgx/main.c
> > b/arch/x86/kernel/cpu/sgx/main.c
> > index 4137254fb29e..3f9130501370 100644
> > --- a/arch/x86/kernel/cpu/sgx/main.c
> > +++ b/arch/x86/kernel/cpu/sgx/main.c
> > @@ -16,6 +16,395 @@
> >  struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
> >  static int sgx_nr_epc_sections;
> >  static struct task_struct *ksgxswapd_tsk;
> > +static DECLARE_WAIT_QUEUE_HEAD(ksgxswapd_waitq);
> > +static LIST_HEAD(sgx_active_page_list);
> > +static DEFINE_SPINLOCK(sgx_active_page_list_lock);
> > +
> > +/**
> > + * sgx_mark_page_reclaimable() - Mark a page as reclaimable
> > + * @page:	EPC page
> > + *
> > + * Mark a page as reclaimable and add it to the active page list. Pages
> > + * are automatically removed from the active list when freed.
> > + */
> > +void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
> > +{
> > +	spin_lock(&sgx_active_page_list_lock);
> > +	page->desc |= SGX_EPC_PAGE_RECLAIMABLE;
> > +	list_add_tail(&page->list, &sgx_active_page_list);
> > +	spin_unlock(&sgx_active_page_list_lock);
> > +}
> > +
> > +/**
> > + * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list
> > + * @page:	EPC page
> > + *
> > + * Clear the reclaimable flag and remove the page from the active page
> > list.
> > + *
> > + * Return:
> > + *   0 on success,
> > + *   -EBUSY if the page is in the process of being reclaimed
> > + */
> > +int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
> > +{
> > +	/*
> > +	 * Remove the page from the active list if necessary.  If the page
> > +	 * is actively being reclaimed, i.e. RECLAIMABLE is set but the
> > +	 * page isn't on the active list, return -EBUSY as we can't free
> > +	 * the page at this time since it is "owned" by the reclaimer.
> > +	 */
> > +	spin_lock(&sgx_active_page_list_lock);
> > +	if (page->desc & SGX_EPC_PAGE_RECLAIMABLE) {
> > +		if (list_empty(&page->list)) {
> > +			spin_unlock(&sgx_active_page_list_lock);
> > +			return -EBUSY;
> > +		}
> > +		list_del(&page->list);
> > +		page->desc &= ~SGX_EPC_PAGE_RECLAIMABLE;
> > +	}
> > +	spin_unlock(&sgx_active_page_list_lock);
> > +
> > +	return 0;
> > +}
> > +
> > +static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page)
> > +{
> > +	struct sgx_encl_page *page = epc_page->owner;
> > +	struct sgx_encl *encl = page->encl;
> > +	struct sgx_encl_mm *encl_mm;
> > +	bool ret = true;
> > +	int idx;
> > +
> > +	idx = srcu_read_lock(&encl->srcu);
> > +
> > +	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
> > +		if (!mmget_not_zero(encl_mm->mm))
> > +			continue;
> > +
> > +		mmap_read_lock(encl_mm->mm);
> > +		ret = !sgx_encl_test_and_clear_young(encl_mm->mm, page);
> > +		mmap_read_unlock(encl_mm->mm);
> > +
> > +		mmput_async(encl_mm->mm);
> > +
> > +		if (!ret || (atomic_read(&encl->flags) & SGX_ENCL_DEAD))
> > +			break;
> > +	}
> > +
> > +	srcu_read_unlock(&encl->srcu, idx);
> > +
> > +	if (!ret && !(atomic_read(&encl->flags) & SGX_ENCL_DEAD))
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> > +static void sgx_reclaimer_block(struct sgx_epc_page *epc_page)
> > +{
> > +	struct sgx_encl_page *page = epc_page->owner;
> > +	unsigned long addr = SGX_ENCL_PAGE_ADDR(page);
> > +	struct sgx_encl *encl = page->encl;
> > +	unsigned long mm_list_version;
> > +	struct sgx_encl_mm *encl_mm;
> > +	struct vm_area_struct *vma;
> > +	int idx, ret;
> > +
> > +	do {
> > +		mm_list_version = encl->mm_list_version;
> > +
> > +		/* Pairs with smp_rmb() in sgx_encl_mm_add(). */
> > +		smp_rmb();
> > +
> > +		idx = srcu_read_lock(&encl->srcu);
> > +
> > +		list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
> > +			if (!mmget_not_zero(encl_mm->mm))
> > +				continue;
> > +
> > +			mmap_read_lock(encl_mm->mm);
> > +
> > +			ret = sgx_encl_find(encl_mm->mm, addr, &vma);
> > +			if (!ret && encl == vma->vm_private_data)
> > +				zap_vma_ptes(vma, addr, PAGE_SIZE);
> > +
> > +			mmap_read_unlock(encl_mm->mm);
> > +
> > +			mmput_async(encl_mm->mm);
> > +		}
> > +
> > +		srcu_read_unlock(&encl->srcu, idx);
> > +	} while (unlikely(encl->mm_list_version != mm_list_version));
> > +
> > +	mutex_lock(&encl->lock);
> > +
> > +	if (!(atomic_read(&encl->flags) & SGX_ENCL_DEAD)) {
> > +		ret = __eblock(sgx_get_epc_addr(epc_page));
> > +		if (encls_failed(ret))
> > +			ENCLS_WARN(ret, "EBLOCK");
> > +	}
> > +
> > +	mutex_unlock(&encl->lock);
> > +}
> > +
> > +static int __sgx_encl_ewb(struct sgx_epc_page *epc_page, void *va_slot,
> > +			  struct sgx_backing *backing)
> > +{
> > +	struct sgx_pageinfo pginfo;
> > +	int ret;
> > +
> > +	pginfo.addr = 0;
> > +	pginfo.secs = 0;
> > +
> > +	pginfo.contents = (unsigned long)kmap_atomic(backing->contents);
> > +	pginfo.metadata = (unsigned long)kmap_atomic(backing->pcmd) +
> > +			  backing->pcmd_offset;
> > +
> > +	ret = __ewb(&pginfo, sgx_get_epc_addr(epc_page), va_slot);
> > +
> > +	kunmap_atomic((void *)(unsigned long)(pginfo.metadata -
> > +					      backing->pcmd_offset));
> > +	kunmap_atomic((void *)(unsigned long)pginfo.contents);
> > +
> > +	return ret;
> > +}
> > +
> > +static void sgx_ipi_cb(void *info)
> > +{
> > +}
> > +
> > +static const cpumask_t *sgx_encl_ewb_cpumask(struct sgx_encl *encl)
> > +{
> > +	cpumask_t *cpumask = &encl->cpumask;
> > +	struct sgx_encl_mm *encl_mm;
> > +	int idx;
> > +
> > +	/*
> > +	 * Can race with sgx_encl_mm_add(), but ETRACK has already been
> > +	 * executed, which means that the CPUs running in the new mm will enter
> > +	 * into the enclave with a fresh epoch.
> > +	 */
> > +	cpumask_clear(cpumask);
> > +
> > +	idx = srcu_read_lock(&encl->srcu);
> > +
> > +	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
> > +		if (!mmget_not_zero(encl_mm->mm))
> > +			continue;
> > +
> > +		cpumask_or(cpumask, cpumask, mm_cpumask(encl_mm->mm));
> > +
> > +		mmput_async(encl_mm->mm);
> > +	}
> > +
> > +	srcu_read_unlock(&encl->srcu, idx);
> > +
> > +	return cpumask;
> > +}
> > +
> > +/*
> > + * Swap page to the regular memory transformed to the blocked state by
> > using
> > + * EBLOCK, which means that it can no loger be referenced (no new TLB
> > entries).
> > + *
> > + * The first trial just tries to write the page assuming that some
> > other thread
> > + * has reset the count for threads inside the enlave by using ETRACK,
> > and
> > + * previous thread count has been zeroed out. The second trial calls
> > ETRACK
> > + * before EWB. If that fails we kick all the HW threads out, and then
> > do EWB,
> > + * which should be guaranteed the succeed.
> > + */
> > +static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
> > +			 struct sgx_backing *backing)
> > +{
> > +	struct sgx_encl_page *encl_page = epc_page->owner;
> > +	struct sgx_encl *encl = encl_page->encl;
> > +	struct sgx_va_page *va_page;
> > +	unsigned int va_offset;
> > +	void *va_slot;
> > +	int ret;
> > +
> > +	encl_page->desc &= ~SGX_ENCL_PAGE_BEING_RECLAIMED;
> > +
> > +	va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
> > +				   list);
> > +	va_offset = sgx_alloc_va_slot(va_page);
> > +	va_slot = sgx_get_epc_addr(va_page->epc_page) + va_offset;
> > +	if (sgx_va_page_full(va_page))
> > +		list_move_tail(&va_page->list, &encl->va_pages);
> > +
> > +	ret = __sgx_encl_ewb(epc_page, va_slot, backing);
> > +	if (ret == SGX_NOT_TRACKED) {
> > +		ret = __etrack(sgx_get_epc_addr(encl->secs.epc_page));
> > +		if (ret) {
> > +			if (encls_failed(ret))
> > +				ENCLS_WARN(ret, "ETRACK");
> > +		}
> > +
> > +		ret = __sgx_encl_ewb(epc_page, va_slot, backing);
> > +		if (ret == SGX_NOT_TRACKED) {
> > +			/*
> > +			 * Slow path, send IPIs to kick cpus out of the
> > +			 * enclave.  Note, it's imperative that the cpu
> > +			 * mask is generated *after* ETRACK, else we'll
> > +			 * miss cpus that entered the enclave between
> > +			 * generating the mask and incrementing epoch.
> > +			 */
> > +			on_each_cpu_mask(sgx_encl_ewb_cpumask(encl),
> > +					 sgx_ipi_cb, NULL, 1);
> > +			ret = __sgx_encl_ewb(epc_page, va_slot, backing);
> > +		}
> > +	}
> > +
> > +	if (ret) {
> > +		if (encls_failed(ret))
> > +			ENCLS_WARN(ret, "EWB");
> > +
> > +		sgx_free_va_slot(va_page, va_offset);
> > +	} else {
> > +		encl_page->desc |= va_offset;
> > +		encl_page->va_page = va_page;
> > +	}
> > +}
> > +
> > +static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
> > +				struct sgx_backing *backing)
> > +{
> > +	struct sgx_encl_page *encl_page = epc_page->owner;
> > +	struct sgx_encl *encl = encl_page->encl;
> > +	struct sgx_backing secs_backing;
> > +	int ret;
> > +
> > +	mutex_lock(&encl->lock);
> > +
> > +	if (atomic_read(&encl->flags) & SGX_ENCL_DEAD) {
> > +		ret = __eremove(sgx_get_epc_addr(epc_page));
> > +		ENCLS_WARN(ret, "EREMOVE");
> > +	} else {
> > +		sgx_encl_ewb(epc_page, backing);
> > +	}
> > +
> > +	encl_page->epc_page = NULL;
> > +	encl->secs_child_cnt--;
> > +
> > +	if (!encl->secs_child_cnt) {
> > +		if (atomic_read(&encl->flags) & SGX_ENCL_DEAD) {
> > +			sgx_free_epc_page(encl->secs.epc_page);
> > +			encl->secs.epc_page = NULL;
> > +		} else if (atomic_read(&encl->flags) & SGX_ENCL_INITIALIZED) {
> > +			ret = sgx_encl_get_backing(encl, PFN_DOWN(encl->size),
> > +						   &secs_backing);
> > +			if (ret)
> > +				goto out;
> > +
> > +			sgx_encl_ewb(encl->secs.epc_page, &secs_backing);
> > +
> > +			sgx_free_epc_page(encl->secs.epc_page);
> > +			encl->secs.epc_page = NULL;
> > +
> > +			sgx_encl_put_backing(&secs_backing, true);
> > +		}
> > +	}
> > +
> > +out:
> > +	mutex_unlock(&encl->lock);
> > +}
> > +
> > +/*
> > + * Take a fixed number of pages from the head of the active page pool
> > and
> > + * reclaim them to the enclave's private shmem files. Skip the pages,
> > which have
> > + * been accessed since the last scan. Move those pages to the tail of
> > active
> > + * page pool so that the pages get scanned in LRU like fashion.
> > + *
> > + * Batch process a chunk of pages (at the moment 16) in order to
> > degrade amount
> > + * of IPI's and ETRACK's potentially required. sgx_encl_ewb() does
> > degrade a bit
> > + * among the HW threads with three stage EWB pipeline (EWB, ETRACK +
> > EWB and IPI
> > + * + EWB) but not sufficiently. Reclaiming one page at a time would
> > also be
> > + * problematic as it would increase the lock contention too much, which
> > would
> > + * halt forward progress.
> > + */
> > +static void sgx_reclaim_pages(void)
> > +{
> > +	struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
> > +	struct sgx_backing backing[SGX_NR_TO_SCAN];
> > +	struct sgx_epc_section *section;
> > +	struct sgx_encl_page *encl_page;
> > +	struct sgx_epc_page *epc_page;
> > +	int cnt = 0;
> > +	int ret;
> > +	int i;
> > +
> > +	spin_lock(&sgx_active_page_list_lock);
> > +	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
> > +		if (list_empty(&sgx_active_page_list))
> > +			break;
> > +
> > +		epc_page = list_first_entry(&sgx_active_page_list,
> > +					    struct sgx_epc_page, list);
> > +		list_del_init(&epc_page->list);
> > +		encl_page = epc_page->owner;
> > +
> > +		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0)
> > +			chunk[cnt++] = epc_page;
> > +		else
> > +			/* The owner is freeing the page. No need to add the
> > +			 * page back to the list of reclaimable pages.
> > +			 */
> > +			epc_page->desc &= ~SGX_EPC_PAGE_RECLAIMABLE;
> > +	}
> > +	spin_unlock(&sgx_active_page_list_lock);
> > +
> > +	for (i = 0; i < cnt; i++) {
> > +		epc_page = chunk[i];
> > +		encl_page = epc_page->owner;
> > +
> > +		if (!sgx_reclaimer_age(epc_page))
> > +			goto skip;
> > +
> > +		ret = sgx_encl_get_backing(encl_page->encl,
> > +					   SGX_ENCL_PAGE_INDEX(encl_page),
> > +					   &backing[i]);
> > +		if (ret)
> > +			goto skip;
> > +
> > +		mutex_lock(&encl_page->encl->lock);
> > +		encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED;
> > +		mutex_unlock(&encl_page->encl->lock);
> > +		continue;
> > +
> > +skip:
> > +		spin_lock(&sgx_active_page_list_lock);
> > +		list_add_tail(&epc_page->list, &sgx_active_page_list);
> > +		spin_unlock(&sgx_active_page_list_lock);
> > +
> > +		kref_put(&encl_page->encl->refcount, sgx_encl_release);
> > +
> > +		chunk[i] = NULL;
> > +	}
> > +
> > +	for (i = 0; i < cnt; i++) {
> > +		epc_page = chunk[i];
> > +		if (epc_page)
> > +			sgx_reclaimer_block(epc_page);
> > +	}
> > +
> > +	for (i = 0; i < cnt; i++) {
> > +		epc_page = chunk[i];
> > +		if (!epc_page)
> > +			continue;
> > +
> > +		encl_page = epc_page->owner;
> > +		sgx_reclaimer_write(epc_page, &backing[i]);
> > +		sgx_encl_put_backing(&backing[i], true);
> > +
> > +		kref_put(&encl_page->encl->refcount, sgx_encl_release);
> > +		epc_page->desc &= ~SGX_EPC_PAGE_RECLAIMABLE;
> > +
> > +		section = sgx_get_epc_section(epc_page);
> > +		spin_lock(&section->lock);
> > +		list_add_tail(&epc_page->list, &section->page_list);
> > +		section->free_cnt++;
> > +		spin_unlock(&section->lock);
> > +	}
> > +}
> > +
> > static void sgx_sanitize_section(struct sgx_epc_section *section)
> >  {
> > @@ -44,6 +433,23 @@ static void sgx_sanitize_section(struct
> > sgx_epc_section *section)
> >  	}
> >  }
> > +static unsigned long sgx_nr_free_pages(void)
> > +{
> > +	unsigned long cnt = 0;
> > +	int i;
> > +
> > +	for (i = 0; i < sgx_nr_epc_sections; i++)
> > +		cnt += sgx_epc_sections[i].free_cnt;
> > +
> > +	return cnt;
> > +}
> > +
> > +static bool sgx_should_reclaim(unsigned long watermark)
> > +{
> > +	return sgx_nr_free_pages() < watermark &&
> > +	       !list_empty(&sgx_active_page_list);
> > +}
> > +
> >  static int ksgxswapd(void *p)
> >  {
> >  	int i;
> > @@ -69,6 +475,20 @@ static int ksgxswapd(void *p)
> >  			WARN(1, "EPC section %d has unsanitized pages.\n", i);
> >  	}
> > +	while (!kthread_should_stop()) {
> > +		if (try_to_freeze())
> > +			continue;
> > +
> > +		wait_event_freezable(ksgxswapd_waitq,
> > +				     kthread_should_stop() ||
> > +				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
> > +
> > +		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
> > +			sgx_reclaim_pages();
> > +
> > +		cond_resched();
> > +	}
> > +
> >  	return 0;
> >  }
> > @@ -94,6 +514,7 @@ static struct sgx_epc_page
> > *__sgx_alloc_epc_page_from_section(struct sgx_epc_sec
> > 	page = list_first_entry(&section->page_list, struct sgx_epc_page, list);
> >  	list_del_init(&page->list);
> > +	section->free_cnt--;
> > 	return page;
> >  }
> > @@ -127,6 +548,57 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
> >  	return ERR_PTR(-ENOMEM);
> >  }
> > +/**
> > + * sgx_alloc_epc_page() - Allocate an EPC page
> > + * @owner:	the owner of the EPC page
> > + * @reclaim:	reclaim pages if necessary
> > + *
> > + * Iterate through EPC sections and borrow a free EPC page to the
> > caller. When a
> > + * page is no longer needed it must be released with
> > sgx_free_epc_page(). If
> > + * @reclaim is set to true, directly reclaim pages when we are out of
> > pages. No
> > + * mm's can be locked when @reclaim is set to true.
> > + *
> > + * Finally, wake up ksgxswapd when the number of pages goes below the
> > watermark
> > + * before returning back to the caller.
> > + *
> > + * Return:
> > + *   an EPC page,
> > + *   -errno on error
> > + */
> > +struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> > +{
> > +	struct sgx_epc_page *entry;
> > +
> > +	for ( ; ; ) {
> > +		entry = __sgx_alloc_epc_page();
> > +		if (!IS_ERR(entry)) {
> > +			entry->owner = owner;
> > +			break;
> > +		}
> > +
> > +		if (list_empty(&sgx_active_page_list))
> > +			return ERR_PTR(-ENOMEM);
> > +
> > +		if (!reclaim) {
> > +			entry = ERR_PTR(-EBUSY);
> > +			break;
> > +		}
> > +
> > +		if (signal_pending(current)) {
> > +			entry = ERR_PTR(-ERESTARTSYS);
> > +			break;
> > +		}
> > +
> > +		sgx_reclaim_pages();
> > +		schedule();
> > +	}
> > +
> > +	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> > +		wake_up(&ksgxswapd_waitq);
> > +
> > +	return entry;
> > +}
> > +
> >  /**
> >   * sgx_free_epc_page() - Free an EPC page
> >   * @page:	an EPC page
> > @@ -138,12 +610,20 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
> >  	struct sgx_epc_section *section = sgx_get_epc_section(page);
> >  	int ret;
> > +	/*
> > +	 * Don't take sgx_active_page_list_lock when asserting the page isn't
> > +	 * reclaimable, missing a WARN in the very rare case is preferable to
> > +	 * unnecessarily taking a global lock in the common case.
> > +	 */
> > +	WARN_ON_ONCE(page->desc & SGX_EPC_PAGE_RECLAIMABLE);
> > +
> >  	ret = __eremove(sgx_get_epc_addr(page));
> >  	if (WARN_ONCE(ret, "EREMOVE returned %d (0x%x)", ret, ret))
> >  		return;
> > 	spin_lock(&section->lock);
> >  	list_add_tail(&page->list, &section->page_list);
> > +	section->free_cnt++;
> >  	spin_unlock(&section->lock);
> >  }
> > @@ -194,6 +674,7 @@ static bool __init sgx_setup_epc_section(u64 addr,
> > u64 size,
> >  		list_add_tail(&page->list, &section->unsanitized_page_list);
> >  	}
> > +	section->free_cnt = nr_pages;
> >  	return true;
> > err_out:
> > diff --git a/arch/x86/kernel/cpu/sgx/sgx.h
> > b/arch/x86/kernel/cpu/sgx/sgx.h
> > index 8d126070db1e..ec4f7b338dbe 100644
> > --- a/arch/x86/kernel/cpu/sgx/sgx.h
> > +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> > @@ -15,6 +15,7 @@
> > struct sgx_epc_page {
> >  	unsigned long desc;
> > +	struct sgx_encl_page *owner;
> >  	struct list_head list;
> >  };
> > @@ -27,6 +28,7 @@ struct sgx_epc_page {
> >  struct sgx_epc_section {
> >  	unsigned long pa;
> >  	void *va;
> > +	unsigned long free_cnt;
> >  	struct list_head page_list;
> >  	struct list_head unsanitized_page_list;
> >  	spinlock_t lock;
> > @@ -35,6 +37,10 @@ struct sgx_epc_section {
> >  #define SGX_EPC_SECTION_MASK		GENMASK(7, 0)
> >  #define SGX_MAX_EPC_SECTIONS		(SGX_EPC_SECTION_MASK + 1)
> >  #define SGX_MAX_ADD_PAGES_LENGTH	0x100000
> > +#define SGX_EPC_PAGE_RECLAIMABLE	BIT(8)
> > +#define SGX_NR_TO_SCAN			16
> > +#define SGX_NR_LOW_PAGES		32
> > +#define SGX_NR_HIGH_PAGES		64
> > extern struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
> > @@ -50,7 +56,10 @@ static inline void *sgx_get_epc_addr(struct
> > sgx_epc_page *page)
> >  	return section->va + (page->desc & PAGE_MASK) - section->pa;
> >  }
> > +void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
> > +int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
> >  struct sgx_epc_page *__sgx_alloc_epc_page(void);
> > +struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
> >  void sgx_free_epc_page(struct sgx_epc_page *page);
> > #endif /* _X86_SGX_H */
> 
> 
> -- 
> Using Opera's mail client: http://www.opera.com/mail/


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-03  4:50 ` [PATCH v39 11/24] x86/sgx: Add SGX enclave driver Jarkko Sakkinen
@ 2020-10-03 14:39   ` Greg KH
  2020-10-04 14:32     ` Jarkko Sakkinen
                       ` (2 more replies)
  2020-10-03 19:54   ` Matthew Wilcox
  1 sibling, 3 replies; 35+ messages in thread
From: Greg KH @ 2020-10-03 14:39 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: x86, linux-sgx, linux-kernel, linux-security-module, linux-mm,
	Andrew Morton, Matthew Wilcox, Jethro Beekman, Haitao Huang,
	Chunyang Hui, Jordan Hand, Nathaniel McCallum, Seth Moore,
	Darren Kenny, Sean Christopherson, Suresh Siddha,
	andriy.shevchenko, asapek, bp, cedric.xing, chenalexchen,
	conradparker, cyhanish, dave.hansen, haitao.huang, kai.huang,
	kai.svahn, kmoy, ludloff, luto, nhorman, puiterwijk, rientjes,
	tglx, yaozhangx, mikko.ylinen

On Sat, Oct 03, 2020 at 07:50:46AM +0300, Jarkko Sakkinen wrote:
> Intel Software Guard eXtensions (SGX) is a set of CPU instructions that can
> be used by applications to set aside private regions of code and data. The
> code outside the SGX hosted software entity is prevented from accessing the
> memory inside the enclave by the CPU. We call these entities enclaves.
> 
> Add a driver that provides an ioctl API to construct and run enclaves.
> Enclaves are constructed from pages residing in reserved physical memory
> areas. The contents of these pages can only be accessed when they are
> mapped as part of an enclave, by a hardware thread running inside the
> enclave.
> 
> The starting state of an enclave consists of a fixed measured set of
> pages that are copied to the EPC during the construction process by
> using the opcode ENCLS leaf functions and Software Enclave Control
> Structure (SECS) that defines the enclave properties.
> 
> Enclaves are constructed by using ENCLS leaf functions ECREATE, EADD and
> EINIT. ECREATE initializes SECS, EADD copies pages from system memory to
> the EPC and EINIT checks a given signed measurement and moves the enclave
> into a state ready for execution.
> 
> An initialized enclave can only be accessed through special Thread Control
> Structure (TCS) pages by using ENCLU (ring-3 only) leaf EENTER.  This leaf
> function converts a thread into enclave mode and continues the execution in
> the offset defined by the TCS provided to EENTER. An enclave is exited
> through syscall, exception, interrupts or by explicitly calling another
> ENCLU leaf EEXIT.
> 
> The mmap() permissions are capped by the contained enclave page
> permissions. The mapped areas must also be populated, i.e. each page
> address must contain a page. This logic is implemented in
> sgx_encl_may_map().
> 
> Cc: linux-security-module@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Acked-by: Jethro Beekman <jethro@fortanix.com>
> Tested-by: Jethro Beekman <jethro@fortanix.com>
> Tested-by: Haitao Huang <haitao.huang@linux.intel.com>
> Tested-by: Chunyang Hui <sanqian.hcy@antfin.com>
> Tested-by: Jordan Hand <jorhand@linux.microsoft.com>
> Tested-by: Nathaniel McCallum <npmccallum@redhat.com>
> Tested-by: Seth Moore <sethmo@google.com>
> Tested-by: Darren Kenny <darren.kenny@oracle.com>
> Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Co-developed-by: Suresh Siddha <suresh.b.siddha@intel.com>
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> ---
>  arch/x86/kernel/cpu/sgx/Makefile |   2 +
>  arch/x86/kernel/cpu/sgx/driver.c | 173 ++++++++++++++++
>  arch/x86/kernel/cpu/sgx/driver.h |  29 +++
>  arch/x86/kernel/cpu/sgx/encl.c   | 331 +++++++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/sgx/encl.h   |  85 ++++++++
>  arch/x86/kernel/cpu/sgx/main.c   |  11 +
>  6 files changed, 631 insertions(+)
>  create mode 100644 arch/x86/kernel/cpu/sgx/driver.c
>  create mode 100644 arch/x86/kernel/cpu/sgx/driver.h
>  create mode 100644 arch/x86/kernel/cpu/sgx/encl.c
>  create mode 100644 arch/x86/kernel/cpu/sgx/encl.h
> 
> diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
> index 79510ce01b3b..3fc451120735 100644
> --- a/arch/x86/kernel/cpu/sgx/Makefile
> +++ b/arch/x86/kernel/cpu/sgx/Makefile
> @@ -1,2 +1,4 @@
>  obj-y += \
> +	driver.o \
> +	encl.o \
>  	main.o
> diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
> new file mode 100644
> index 000000000000..f54da5f19c2b
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/driver.c
> @@ -0,0 +1,173 @@
> +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)

You use gpl-only header files in this file, so how in the world can it
be bsd-3 licensed?

Please get your legal department to agree with this, after you explain
to them how you are mixing gpl2-only code in with this file.

> +// Copyright(c) 2016-18 Intel Corporation.

Dates are hard to get right :(

> +
> +#include <linux/acpi.h>
> +#include <linux/miscdevice.h>
> +#include <linux/mman.h>
> +#include <linux/security.h>
> +#include <linux/suspend.h>
> +#include <asm/traps.h>
> +#include "driver.h"
> +#include "encl.h"
> +
> +u64 sgx_encl_size_max_32;
> +u64 sgx_encl_size_max_64;
> +u32 sgx_misc_reserved_mask;
> +u64 sgx_attributes_reserved_mask;
> +u64 sgx_xfrm_reserved_mask = ~0x3;
> +u32 sgx_xsave_size_tbl[64];
> +
> +static int sgx_open(struct inode *inode, struct file *file)
> +{
> +	struct sgx_encl *encl;
> +	int ret;
> +
> +	encl = kzalloc(sizeof(*encl), GFP_KERNEL);
> +	if (!encl)
> +		return -ENOMEM;
> +
> +	atomic_set(&encl->flags, 0);
> +	kref_init(&encl->refcount);
> +	xa_init(&encl->page_array);
> +	mutex_init(&encl->lock);
> +	INIT_LIST_HEAD(&encl->mm_list);
> +	spin_lock_init(&encl->mm_lock);
> +
> +	ret = init_srcu_struct(&encl->srcu);
> +	if (ret) {
> +		kfree(encl);
> +		return ret;
> +	}
> +
> +	file->private_data = encl;
> +
> +	return 0;
> +}
> +
> +static int sgx_release(struct inode *inode, struct file *file)
> +{
> +	struct sgx_encl *encl = file->private_data;
> +	struct sgx_encl_mm *encl_mm;
> +
> +	for ( ; ; )  {
> +		spin_lock(&encl->mm_lock);
> +
> +		if (list_empty(&encl->mm_list)) {
> +			encl_mm = NULL;
> +		} else {
> +			encl_mm = list_first_entry(&encl->mm_list,
> +						   struct sgx_encl_mm, list);
> +			list_del_rcu(&encl_mm->list);
> +		}
> +
> +		spin_unlock(&encl->mm_lock);
> +
> +		/* The list is empty, ready to go. */
> +		if (!encl_mm)
> +			break;
> +
> +		synchronize_srcu(&encl->srcu);
> +		mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
> +		kfree(encl_mm);
> +	}
> +
> +	mutex_lock(&encl->lock);
> +	atomic_or(SGX_ENCL_DEAD, &encl->flags);

So you set a flag that this is dead, and then instantly delete it?  Why
does that matter?  I see you check for this flag elsewhere, but as you
are just about to delete this structure, how can this be an issue?

> +	mutex_unlock(&encl->lock);
> +
> +	kref_put(&encl->refcount, sgx_encl_release);

Don't you need to hold the lock across the put?  If not, what is
serializing this?

But an even larger comment, why is this reference count needed at all?

You never grab it except at init time, and you free it at close time.
Why not rely on the reference counting that the vfs ensures you?



> +	return 0;
> +}
> +
> +static int sgx_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct sgx_encl *encl = file->private_data;
> +	int ret;
> +
> +	ret = sgx_encl_may_map(encl, vma->vm_start, vma->vm_end, vma->vm_flags);
> +	if (ret)
> +		return ret;
> +
> +	ret = sgx_encl_mm_add(encl, vma->vm_mm);
> +	if (ret)
> +		return ret;
> +
> +	vma->vm_ops = &sgx_vm_ops;
> +	vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO;
> +	vma->vm_private_data = encl;
> +
> +	return 0;
> +}
> +
> +static unsigned long sgx_get_unmapped_area(struct file *file,
> +					   unsigned long addr,
> +					   unsigned long len,
> +					   unsigned long pgoff,
> +					   unsigned long flags)
> +{
> +	if ((flags & MAP_TYPE) == MAP_PRIVATE)
> +		return -EINVAL;
> +
> +	if (flags & MAP_FIXED)
> +		return addr;
> +
> +	return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
> +}
> +
> +static const struct file_operations sgx_encl_fops = {
> +	.owner			= THIS_MODULE,
> +	.open			= sgx_open,
> +	.release		= sgx_release,
> +	.mmap			= sgx_mmap,
> +	.get_unmapped_area	= sgx_get_unmapped_area,
> +};
> +
> +static struct miscdevice sgx_dev_enclave = {
> +	.minor = MISC_DYNAMIC_MINOR,
> +	.name = "enclave",
> +	.nodename = "sgx/enclave",

A subdir for a single device node?  Ok, odd, but why not just
"sgx_enclave"?  How "special" is this device node?

thanks,

greg k-h


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 16/24] x86/sgx: Add a page reclaimer
  2020-10-03 13:32     ` Jarkko Sakkinen
@ 2020-10-03 18:23       ` Haitao Huang
  2020-10-04 22:39         ` Jarkko Sakkinen
  0 siblings, 1 reply; 35+ messages in thread
From: Haitao Huang @ 2020-10-03 18:23 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: x86, linux-sgx, linux-kernel, linux-mm, Jethro Beekman,
	Jordan Hand, Nathaniel McCallum, Chunyang Hui, Seth Moore,
	Sean Christopherson, akpm, andriy.shevchenko, asapek, bp,
	cedric.xing, chenalexchen, conradparker, cyhanish, dave.hansen,
	kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman, puiterwijk,
	rientjes, tglx, yaozhangx, mikko.ylinen, willy

On Sat, 03 Oct 2020 08:32:45 -0500, Jarkko Sakkinen  
<jarkko.sakkinen@linux.intel.com> wrote:

> On Sat, Oct 03, 2020 at 12:22:47AM -0500, Haitao Huang wrote:
>> When I turn on CONFIG_PROVE_LOCKING, kernel reports following  
>> suspicious RCU
>> usages. Not sure if it is an issue. Just reporting here:
>
> I'm glad to hear that my tip helped you to get us the data.
>
> This does not look like an issue in the page reclaimer, which was not
> obvious for me before. That's a good thing. I was really worried about
> that because it has been very stable for a long period now. The last
> bug fix for the reclaimer was done in June in v31 version of the patch
> set and after that it has been unchanged (except possibly some renames
> requested by Boris).
>
> I wildly guess I have a bad usage pattern for xarray. I migrated to it
> in v36, and it is entirely possible that I've misused it. It was the
> first time that I ever used it. Before xarray we had radix_tree but
> based Matthew Wilcox feedback I did a migration to xarray.
>
> What I'd ask you to do next is to, if by any means possible, to try to
> run the same test with v35 so we can verify this. That one still has
> the radix tree.
>


v35 does not cause any such warning messages from kernel

> Thank you.
>
> /Jarkko
>
>>
>> [ +34.337095] =============================
>> [  +0.000001] WARNING: suspicious RCU usage
>> [  +0.000002] 5.9.0-rc6-lock-sgx39 #1 Not tainted
>> [  +0.000001] -----------------------------
>> [  +0.000001] ./include/linux/xarray.h:1165 suspicious
>> rcu_dereference_check() usage!
>> [  +0.000001]
>>               other info that might help us debug this:
>>
>> [  +0.000001]
>>               rcu_scheduler_active = 2, debug_locks = 1
>> [  +0.000001] 1 lock held by enclaveos-runne/4238:
>> [  +0.000001]  #0: ffff9cc6657e45e8 (&mm->mmap_lock#2){++++}-{3:3}, at:
>> vm_mmap_pgoff+0xa1/0x120
>> [  +0.000005]
>>               stack backtrace:
>> [  +0.000002] CPU: 1 PID: 4238 Comm: enclaveos-runne Not tainted
>> 5.9.0-rc6-lock-sgx39 #1
>> [  +0.000001] Hardware name: Microsoft Corporation Virtual  
>> Machine/Virtual
>> Machine, BIOS Hyper-V UEFI Release v4.1 04/02/2020
>> [  +0.000002] Call Trace:
>> [  +0.000003]  dump_stack+0x7d/0x9f
>> [  +0.000003]  lockdep_rcu_suspicious+0xce/0xf0
>> [  +0.000004]  xas_start+0x14c/0x1c0
>> [  +0.000003]  xas_load+0xf/0x50
>> [  +0.000002]  xas_find+0x25c/0x2c0
>> [  +0.000004]  sgx_encl_may_map+0x87/0x1c0
>> [  +0.000006]  sgx_mmap+0x29/0x70
>> [  +0.000003]  mmap_region+0x3ee/0x710
>> [  +0.000006]  do_mmap+0x3f1/0x5e0
>> [  +0.000004]  vm_mmap_pgoff+0xcd/0x120
>> [  +0.000007]  ksys_mmap_pgoff+0x1de/0x240
>> [  +0.000005]  __x64_sys_mmap+0x33/0x40
>> [  +0.000002]  do_syscall_64+0x37/0x80
>> [  +0.000003]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [  +0.000002] RIP: 0033:0x7fe34efe06ba
>> [  +0.000002] Code: 89 f5 41 54 49 89 fc 55 53 74 35 49 63 e8 48 63 da  
>> 4d 89
>> f9 49 89 e8 4d 63 d6 48 89 da 4c 89 ee 4c 89 e7 b8 09 00 00 00 0f 05  
>> <48> 3d
>> 00 f0 ff ff 77 56 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 00
>> [  +0.000001] RSP: 002b:00007ffee83eac08 EFLAGS: 00000206 ORIG_RAX:
>> 0000000000000009
>> [  +0.000001] RAX: ffffffffffffffda RBX: 0000000000000001 RCX:
>> 00007fe34efe06ba
>> [  +0.000001] RDX: 0000000000000001 RSI: 0000000000001000 RDI:
>> 0000000007fff000
>> [  +0.000001] RBP: 0000000000000004 R08: 0000000000000004 R09:
>> 0000000000000000
>> [  +0.000001] R10: 0000000000000011 R11: 0000000000000206 R12:
>> 0000000007fff000
>> [  +0.000001] R13: 0000000000001000 R14: 0000000000000011 R15:
>> 0000000000000000
>>
>> [  +0.000010] =============================
>> [  +0.000001] WARNING: suspicious RCU usage
>> [  +0.000001] 5.9.0-rc6-lock-sgx39 #1 Not tainted
>> [  +0.000001] -----------------------------
>> [  +0.000001] ./include/linux/xarray.h:1181 suspicious
>> rcu_dereference_check() usage!
>> [  +0.000001]
>>               other info that might help us debug this:
>>
>> [  +0.000001]
>>               rcu_scheduler_active = 2, debug_locks = 1
>> [  +0.000001] 1 lock held by enclaveos-runne/4238:
>> [  +0.000001]  #0: ffff9cc6657e45e8 (&mm->mmap_lock#2){++++}-{3:3}, at:
>> vm_mmap_pgoff+0xa1/0x120
>> [  +0.000003]
>>               stack backtrace:
>> [  +0.000001] CPU: 1 PID: 4238 Comm: enclaveos-runne Not tainted
>> 5.9.0-rc6-lock-sgx39 #1
>> [  +0.000001] Hardware name: Microsoft Corporation Virtual  
>> Machine/Virtual
>> Machine, BIOS Hyper-V UEFI Release v4.1 04/02/2020
>> [  +0.000001] Call Trace:
>> [  +0.000001]  dump_stack+0x7d/0x9f
>> [  +0.000003]  lockdep_rcu_suspicious+0xce/0xf0
>> [  +0.000003]  xas_descend+0x116/0x120
>> [  +0.000004]  xas_load+0x42/0x50
>> [  +0.000002]  xas_find+0x25c/0x2c0
>> [  +0.000004]  sgx_encl_may_map+0x87/0x1c0
>> [  +0.000006]  sgx_mmap+0x29/0x70
>> [  +0.000002]  mmap_region+0x3ee/0x710
>> [  +0.000006]  do_mmap+0x3f1/0x5e0
>> [  +0.000004]  vm_mmap_pgoff+0xcd/0x120
>> [  +0.000007]  ksys_mmap_pgoff+0x1de/0x240
>> [  +0.000005]  __x64_sys_mmap+0x33/0x40
>> [  +0.000002]  do_syscall_64+0x37/0x80
>> [  +0.000002]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [  +0.000001] RIP: 0033:0x7fe34efe06ba
>> [  +0.000001] Code: 89 f5 41 54 49 89 fc 55 53 74 35 49 63 e8 48 63 da  
>> 4d 89
>> f9 49 89 e8 4d 63 d6 48 89 da 4c 89 ee 4c 89 e7 b8 09 00 00 00 0f 05  
>> <48> 3d
>> 00 f0 ff ff 77 56 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 00
>> [  +0.000001] RSP: 002b:00007ffee83eac08 EFLAGS: 00000206 ORIG_RAX:
>> 0000000000000009
>> [  +0.000001] RAX: ffffffffffffffda RBX: 0000000000000001 RCX:
>> 00007fe34efe06ba
>> [  +0.000001] RDX: 0000000000000001 RSI: 0000000000001000 RDI:
>> 0000000007fff000
>> [  +0.000001] RBP: 0000000000000004 R08: 0000000000000004 R09:
>> 0000000000000000
>> [  +0.000001] R10: 0000000000000011 R11: 0000000000000206 R12:
>> 0000000007fff000
>> [  +0.000001] R13: 0000000000001000 R14: 0000000000000011 R15:
>> 0000000000000000
>>
>> [  +0.001117] =============================
>> [  +0.000001] WARNING: suspicious RCU usage
>> [  +0.000001] 5.9.0-rc6-lock-sgx39 #1 Not tainted
>> [  +0.000001] -----------------------------
>> [  +0.000001] ./include/linux/xarray.h:1181 suspicious
>> rcu_dereference_check() usage!
>> [  +0.000001]
>>               other info that might help us debug this:
>>
>> [  +0.000001]
>>               rcu_scheduler_active = 2, debug_locks = 1
>> [  +0.000001] 1 lock held by enclaveos-runne/4238:
>> [  +0.000001]  #0: ffff9cc6657e45e8 (&mm->mmap_lock#2){++++}-{3:3}, at:
>> vm_mmap_pgoff+0xa1/0x120
>> [  +0.000003]
>>               stack backtrace:
>> [  +0.000002] CPU: 1 PID: 4238 Comm: enclaveos-runne Not tainted
>> 5.9.0-rc6-lock-sgx39 #1
>> [  +0.000001] Hardware name: Microsoft Corporation Virtual  
>> Machine/Virtual
>> Machine, BIOS Hyper-V UEFI Release v4.1 04/02/2020
>> [  +0.000001] Call Trace:
>> [  +0.000002]  dump_stack+0x7d/0x9f
>> [  +0.000003]  lockdep_rcu_suspicious+0xce/0xf0
>> [  +0.000003]  sgx_encl_may_map+0x1b0/0x1c0
>> [  +0.000006]  sgx_mmap+0x29/0x70
>> [  +0.000002]  mmap_region+0x3ee/0x710
>> [  +0.000006]  do_mmap+0x3f1/0x5e0
>> [  +0.000005]  vm_mmap_pgoff+0xcd/0x120
>> [  +0.000006]  ksys_mmap_pgoff+0x1de/0x240
>> [  +0.000005]  __x64_sys_mmap+0x33/0x40
>> [  +0.000002]  do_syscall_64+0x37/0x80
>> [  +0.000002]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [  +0.000002] RIP: 0033:0x7fe34efe06ba
>> [  +0.000001] Code: 89 f5 41 54 49 89 fc 55 53 74 35 49 63 e8 48 63 da  
>> 4d 89
>> f9 49 89 e8 4d 63 d6 48 89 da 4c 89 ee 4c 89 e7 b8 09 00 00 00 0f 05  
>> <48> 3d
>> 00 f0 ff ff 77 56 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 00
>> [  +0.000001] RSP: 002b:00007ffee83eac08 EFLAGS: 00000206 ORIG_RAX:
>> 0000000000000009
>> [  +0.000001] RAX: ffffffffffffffda RBX: 0000000000000003 RCX:
>> 00007fe34efe06ba
>> [  +0.000001] RDX: 0000000000000003 RSI: 0000000000010000 RDI:
>> 0000000007fee000
>> [  +0.000001] RBP: 0000000000000004 R08: 0000000000000004 R09:
>> 0000000000000000
>> [  +0.000001] R10: 0000000000000011 R11: 0000000000000206 R12:
>> 0000000007fee000
>> [  +0.000001] R13: 0000000000010000 R14: 0000000000000011 R15:
>> 0000000000000000
>>
>> [  +0.003197] =============================
>> [  +0.000001] WARNING: suspicious RCU usage
>> [  +0.000001] 5.9.0-rc6-lock-sgx39 #1 Not tainted
>> [  +0.000001] -----------------------------
>> [  +0.000001] ./include/linux/xarray.h:1198 suspicious
>> rcu_dereference_check() usage!
>> [  +0.000001]
>>               other info that might help us debug this:
>>
>> [  +0.000001]
>>               rcu_scheduler_active = 2, debug_locks = 1
>> [  +0.000001] 1 lock held by enclaveos-runne/4238:
>> [  +0.000001]  #0: ffff9cc6657e45e8 (&mm->mmap_lock#2){++++}-{3:3}, at:
>> vm_mmap_pgoff+0xa1/0x120
>> [  +0.000003]
>>               stack backtrace:
>> [  +0.000002] CPU: 1 PID: 4238 Comm: enclaveos-runne Not tainted
>> 5.9.0-rc6-lock-sgx39 #1
>> [  +0.000001] Hardware name: Microsoft Corporation Virtual  
>> Machine/Virtual
>> Machine, BIOS Hyper-V UEFI Release v4.1 04/02/2020
>> [  +0.000001] Call Trace:
>> [  +0.000002]  dump_stack+0x7d/0x9f
>> [  +0.000003]  lockdep_rcu_suspicious+0xce/0xf0
>> [  +0.000004]  xas_find+0x255/0x2c0
>> [  +0.000003]  sgx_encl_may_map+0xad/0x1c0
>> [  +0.000006]  sgx_mmap+0x29/0x70
>> [  +0.000003]  mmap_region+0x3ee/0x710
>> [  +0.000005]  do_mmap+0x3f1/0x5e0
>> [  +0.000005]  vm_mmap_pgoff+0xcd/0x120
>> [  +0.000007]  ksys_mmap_pgoff+0x1de/0x240
>> [  +0.000004]  __x64_sys_mmap+0x33/0x40
>> [  +0.000002]  do_syscall_64+0x37/0x80
>> [  +0.000002]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [  +0.000002] RIP: 0033:0x7fe34efe06ba
>> [  +0.000001] Code: 89 f5 41 54 49 89 fc 55 53 74 35 49 63 e8 48 63 da  
>> 4d 89
>> f9 49 89 e8 4d 63 d6 48 89 da 4c 89 ee 4c 89 e7 b8 09 00 00 00 0f 05  
>> <48> 3d
>> 00 f0 ff ff 77 56 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 00
>> [  +0.000001] RSP: 002b:00007ffee83eac08 EFLAGS: 00000206 ORIG_RAX:
>> 0000000000000009
>> [  +0.000002] RAX: ffffffffffffffda RBX: 0000000000000003 RCX:
>> 00007fe34efe06ba
>> [  +0.000001] RDX: 0000000000000003 RSI: 0000000000010000 RDI:
>> 0000000007fba000
>> [  +0.000001] RBP: 0000000000000004 R08: 0000000000000004 R09:
>> 0000000000000000
>> [  +0.000001] R10: 0000000000000011 R11: 0000000000000206 R12:
>> 0000000007fba000
>> [  +0.000001] R13: 0000000000010000 R14: 0000000000000011 R15:
>> 0000000000000000
>>
>>
>> On Fri, 02 Oct 2020 23:50:51 -0500, Jarkko Sakkinen
>> <jarkko.sakkinen@linux.intel.com> wrote:
>>
>> > There is a limited amount of EPC available. Therefore, some of it  
>> must be
>> > copied to the regular memory, and only subset kept in the SGX reserved
>> > memory. While kernel cannot directly access enclave memory, SGX  
>> provides
>> > a
>> > set of ENCLS leaf functions to perform reclaiming.
>> >
>> > Implement a page reclaimer by using these leaf functions. It picks the
>> > victim pages in LRU fashion from all the enclaves running in the  
>> system.
>> > The thread ksgxswapd reclaims pages on the event when the number of  
>> free
>> > EPC pages goes below SGX_NR_LOW_PAGES up until it reaches
>> > SGX_NR_HIGH_PAGES.
>> >
>> > sgx_alloc_epc_page() can optionally directly reclaim pages with  
>> @reclaim
>> > set true. A caller must also supply owner for each page so that the
>> > reclaimer can access the associated enclaves. This is needed for  
>> locking,
>> > as most of the ENCLS leafs cannot be executed concurrently for an
>> > enclave.
>> > The owner is also needed for accessing SECS, which is required to be
>> > resident when its child pages are being reclaimed.
>> >
>> > Cc: linux-mm@kvack.org
>> > Acked-by: Jethro Beekman <jethro@fortanix.com>
>> > Tested-by: Jethro Beekman <jethro@fortanix.com>
>> > Tested-by: Jordan Hand <jorhand@linux.microsoft.com>
>> > Tested-by: Nathaniel McCallum <npmccallum@redhat.com>
>> > Tested-by: Chunyang Hui <sanqian.hcy@antfin.com>
>> > Tested-by: Seth Moore <sethmo@google.com>
>> > Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> > Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
>> > ---
>> >  arch/x86/kernel/cpu/sgx/driver.c |   1 +
>> >  arch/x86/kernel/cpu/sgx/encl.c   | 344 +++++++++++++++++++++-
>> >  arch/x86/kernel/cpu/sgx/encl.h   |  41 +++
>> >  arch/x86/kernel/cpu/sgx/ioctl.c  |  78 ++++-
>> >  arch/x86/kernel/cpu/sgx/main.c   | 481  
>> +++++++++++++++++++++++++++++++
>> >  arch/x86/kernel/cpu/sgx/sgx.h    |   9 +
>> >  6 files changed, 947 insertions(+), 7 deletions(-)
>> >
>> > diff --git a/arch/x86/kernel/cpu/sgx/driver.c
>> > b/arch/x86/kernel/cpu/sgx/driver.c
>> > index d01b28f7ce4a..0446781cc7a2 100644
>> > --- a/arch/x86/kernel/cpu/sgx/driver.c
>> > +++ b/arch/x86/kernel/cpu/sgx/driver.c
>> > @@ -29,6 +29,7 @@ static int sgx_open(struct inode *inode, struct file
>> > *file)
>> >  	atomic_set(&encl->flags, 0);
>> >  	kref_init(&encl->refcount);
>> >  	xa_init(&encl->page_array);
>> > +	INIT_LIST_HEAD(&encl->va_pages);
>> >  	mutex_init(&encl->lock);
>> >  	INIT_LIST_HEAD(&encl->mm_list);
>> >  	spin_lock_init(&encl->mm_lock);
>> > diff --git a/arch/x86/kernel/cpu/sgx/encl.c
>> > b/arch/x86/kernel/cpu/sgx/encl.c
>> > index c2c4a77af36b..54326efa6c2f 100644
>> > --- a/arch/x86/kernel/cpu/sgx/encl.c
>> > +++ b/arch/x86/kernel/cpu/sgx/encl.c
>> > @@ -12,9 +12,88 @@
>> >  #include "encls.h"
>> >  #include "sgx.h"
>> > +/*
>> > + * ELDU: Load an EPC page as unblocked. For more info, see "OS
>> > Management of EPC
>> > + * Pages" in the SDM.
>> > + */
>> > +static int __sgx_encl_eldu(struct sgx_encl_page *encl_page,
>> > +			   struct sgx_epc_page *epc_page,
>> > +			   struct sgx_epc_page *secs_page)
>> > +{
>> > +	unsigned long va_offset = SGX_ENCL_PAGE_VA_OFFSET(encl_page);
>> > +	struct sgx_encl *encl = encl_page->encl;
>> > +	struct sgx_pageinfo pginfo;
>> > +	struct sgx_backing b;
>> > +	pgoff_t page_index;
>> > +	int ret;
>> > +
>> > +	if (secs_page)
>> > +		page_index = SGX_ENCL_PAGE_INDEX(encl_page);
>> > +	else
>> > +		page_index = PFN_DOWN(encl->size);
>> > +
>> > +	ret = sgx_encl_get_backing(encl, page_index, &b);
>> > +	if (ret)
>> > +		return ret;
>> > +
>> > +	pginfo.addr = SGX_ENCL_PAGE_ADDR(encl_page);
>> > +	pginfo.contents = (unsigned long)kmap_atomic(b.contents);
>> > +	pginfo.metadata = (unsigned long)kmap_atomic(b.pcmd) +
>> > +			  b.pcmd_offset;
>> > +
>> > +	if (secs_page)
>> > +		pginfo.secs = (u64)sgx_get_epc_addr(secs_page);
>> > +	else
>> > +		pginfo.secs = 0;
>> > +
>> > +	ret = __eldu(&pginfo, sgx_get_epc_addr(epc_page),
>> > +		     sgx_get_epc_addr(encl_page->va_page->epc_page) +
>> > +				      va_offset);
>> > +	if (ret) {
>> > +		if (encls_failed(ret))
>> > +			ENCLS_WARN(ret, "ELDU");
>> > +
>> > +		ret = -EFAULT;
>> > +	}
>> > +
>> > +	kunmap_atomic((void *)(unsigned long)(pginfo.metadata -
>> > b.pcmd_offset));
>> > +	kunmap_atomic((void *)(unsigned long)pginfo.contents);
>> > +
>> > +	sgx_encl_put_backing(&b, false);
>> > +
>> > +	return ret;
>> > +}
>> > +
>> > +static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page
>> > *encl_page,
>> > +					  struct sgx_epc_page *secs_page)
>> > +{
>> > +	unsigned long va_offset = SGX_ENCL_PAGE_VA_OFFSET(encl_page);
>> > +	struct sgx_encl *encl = encl_page->encl;
>> > +	struct sgx_epc_page *epc_page;
>> > +	int ret;
>> > +
>> > +	epc_page = sgx_alloc_epc_page(encl_page, false);
>> > +	if (IS_ERR(epc_page))
>> > +		return epc_page;
>> > +
>> > +	ret = __sgx_encl_eldu(encl_page, epc_page, secs_page);
>> > +	if (ret) {
>> > +		sgx_free_epc_page(epc_page);
>> > +		return ERR_PTR(ret);
>> > +	}
>> > +
>> > +	sgx_free_va_slot(encl_page->va_page, va_offset);
>> > +	list_move(&encl_page->va_page->list, &encl->va_pages);
>> > +	encl_page->desc &= ~SGX_ENCL_PAGE_VA_OFFSET_MASK;
>> > +	encl_page->epc_page = epc_page;
>> > +
>> > +	return epc_page;
>> > +}
>> > +
>> >  static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl  
>> *encl,
>> >  						unsigned long addr)
>> >  {
>> > +	struct sgx_epc_page *epc_page;
>> >  	struct sgx_encl_page *entry;
>> >  	unsigned int flags;
>> > @@ -33,10 +112,27 @@ static struct sgx_encl_page
>> > *sgx_encl_load_page(struct sgx_encl *encl,
>> >  		return ERR_PTR(-EFAULT);
>> > 	/* Page is already resident in the EPC. */
>> > -	if (entry->epc_page)
>> > +	if (entry->epc_page) {
>> > +		if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
>> > +			return ERR_PTR(-EBUSY);
>> > +
>> >  		return entry;
>> > +	}
>> > +
>> > +	if (!(encl->secs.epc_page)) {
>> > +		epc_page = sgx_encl_eldu(&encl->secs, NULL);
>> > +		if (IS_ERR(epc_page))
>> > +			return ERR_CAST(epc_page);
>> > +	}
>> > -	return ERR_PTR(-EFAULT);
>> > +	epc_page = sgx_encl_eldu(entry, encl->secs.epc_page);
>> > +	if (IS_ERR(epc_page))
>> > +		return ERR_CAST(epc_page);
>> > +
>> > +	encl->secs_child_cnt++;
>> > +	sgx_mark_page_reclaimable(entry->epc_page);
>> > +
>> > +	return entry;
>> >  }
>> > static void sgx_mmu_notifier_release(struct mmu_notifier *mn,
>> > @@ -132,6 +228,9 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct
>> > mm_struct *mm)
>> > 	spin_lock(&encl->mm_lock);
>> >  	list_add_rcu(&encl_mm->list, &encl->mm_list);
>> > +	/* Pairs with smp_rmb() in sgx_reclaimer_block(). */
>> > +	smp_wmb();
>> > +	encl->mm_list_version++;
>> >  	spin_unlock(&encl->mm_lock);
>> > 	return 0;
>> > @@ -179,6 +278,8 @@ static unsigned int sgx_vma_fault(struct vm_fault
>> > *vmf)
>> >  		goto out;
>> >  	}
>> > +	sgx_encl_test_and_clear_young(vma->vm_mm, entry);
>> > +
>> >  out:
>> >  	mutex_unlock(&encl->lock);
>> >  	return ret;
>> > @@ -280,6 +381,7 @@ int sgx_encl_find(struct mm_struct *mm, unsigned
>> > long addr,
>> >   */
>> >  void sgx_encl_destroy(struct sgx_encl *encl)
>> >  {
>> > +	struct sgx_va_page *va_page;
>> >  	struct sgx_encl_page *entry;
>> >  	unsigned long index;
>> > @@ -287,6 +389,13 @@ void sgx_encl_destroy(struct sgx_encl *encl)
>> > 	xa_for_each(&encl->page_array, index, entry) {
>> >  		if (entry->epc_page) {
>> > +			/*
>> > +			 * The page and its radix tree entry cannot be freed
>> > +			 * if the page is being held by the reclaimer.
>> > +			 */
>> > +			if (sgx_unmark_page_reclaimable(entry->epc_page))
>> > +				continue;
>> > +
>> >  			sgx_free_epc_page(entry->epc_page);
>> >  			encl->secs_child_cnt--;
>> >  			entry->epc_page = NULL;
>> > @@ -301,6 +410,19 @@ void sgx_encl_destroy(struct sgx_encl *encl)
>> >  		sgx_free_epc_page(encl->secs.epc_page);
>> >  		encl->secs.epc_page = NULL;
>> >  	}
>> > +
>> > +	/*
>> > +	 * The reclaimer is responsible for checking SGX_ENCL_DEAD before  
>> doing
>> > +	 * EWB, thus it's safe to free VA pages even if the reclaimer holds  
>> a
>> > +	 * reference to the enclave.
>> > +	 */
>> > +	while (!list_empty(&encl->va_pages)) {
>> > +		va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
>> > +					   list);
>> > +		list_del(&va_page->list);
>> > +		sgx_free_epc_page(va_page->epc_page);
>> > +		kfree(va_page);
>> > +	}
>> >  }
>> > /**
>> > @@ -329,3 +451,221 @@ void sgx_encl_release(struct kref *ref)
>> > 	kfree(encl);
>> >  }
>> > +
>> > +static struct page *sgx_encl_get_backing_page(struct sgx_encl *encl,
>> > +					      pgoff_t index)
>> > +{
>> > +	struct inode *inode = encl->backing->f_path.dentry->d_inode;
>> > +	struct address_space *mapping = inode->i_mapping;
>> > +	gfp_t gfpmask = mapping_gfp_mask(mapping);
>> > +
>> > +	return shmem_read_mapping_page_gfp(mapping, index, gfpmask);
>> > +}
>> > +
>> > +/**
>> > + * sgx_encl_get_backing() - Pin the backing storage
>> > + * @encl:	an enclave pointer
>> > + * @page_index:	enclave page index
>> > + * @backing:	data for accessing backing storage for the page
>> > + *
>> > + * Pin the backing storage pages for storing the encrypted contents  
>> and
>> > Paging
>> > + * Crypto MetaData (PCMD) of an enclave page.
>> > + *
>> > + * Return:
>> > + *   0 on success,
>> > + *   -errno otherwise.
>> > + */
>> > +int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long
>> > page_index,
>> > +			 struct sgx_backing *backing)
>> > +{
>> > +	pgoff_t pcmd_index = PFN_DOWN(encl->size) + 1 + (page_index >> 5);
>> > +	struct page *contents;
>> > +	struct page *pcmd;
>> > +
>> > +	contents = sgx_encl_get_backing_page(encl, page_index);
>> > +	if (IS_ERR(contents))
>> > +		return PTR_ERR(contents);
>> > +
>> > +	pcmd = sgx_encl_get_backing_page(encl, pcmd_index);
>> > +	if (IS_ERR(pcmd)) {
>> > +		put_page(contents);
>> > +		return PTR_ERR(pcmd);
>> > +	}
>> > +
>> > +	backing->page_index = page_index;
>> > +	backing->contents = contents;
>> > +	backing->pcmd = pcmd;
>> > +	backing->pcmd_offset =
>> > +		(page_index & (PAGE_SIZE / sizeof(struct sgx_pcmd) - 1)) *
>> > +		sizeof(struct sgx_pcmd);
>> > +
>> > +	return 0;
>> > +}
>> > +
>> > +/**
>> > + * sgx_encl_put_backing() - Unpin the backing storage
>> > + * @backing:	data for accessing backing storage for the page
>> > + * @do_write:	mark pages dirty
>> > + */
>> > +void sgx_encl_put_backing(struct sgx_backing *backing, bool do_write)
>> > +{
>> > +	if (do_write) {
>> > +		set_page_dirty(backing->pcmd);
>> > +		set_page_dirty(backing->contents);
>> > +	}
>> > +
>> > +	put_page(backing->pcmd);
>> > +	put_page(backing->contents);
>> > +}
>> > +
>> > +static int sgx_encl_test_and_clear_young_cb(pte_t *ptep, unsigned  
>> long
>> > addr,
>> > +					    void *data)
>> > +{
>> > +	pte_t pte;
>> > +	int ret;
>> > +
>> > +	ret = pte_young(*ptep);
>> > +	if (ret) {
>> > +		pte = pte_mkold(*ptep);
>> > +		set_pte_at((struct mm_struct *)data, addr, ptep, pte);
>> > +	}
>> > +
>> > +	return ret;
>> > +}
>> > +
>> > +/**
>> > + * sgx_encl_test_and_clear_young() - Test and reset the accessed bit
>> > + * @mm:		mm_struct that is checked
>> > + * @page:	enclave page to be tested for recent access
>> > + *
>> > + * Checks the Access (A) bit from the PTE corresponding to the  
>> enclave
>> > page and
>> > + * clears it.
>> > + *
>> > + * Return: 1 if the page has been recently accessed and 0 if not.
>> > + */
>> > +int sgx_encl_test_and_clear_young(struct mm_struct *mm,
>> > +				  struct sgx_encl_page *page)
>> > +{
>> > +	unsigned long addr = SGX_ENCL_PAGE_ADDR(page);
>> > +	struct sgx_encl *encl = page->encl;
>> > +	struct vm_area_struct *vma;
>> > +	int ret;
>> > +
>> > +	ret = sgx_encl_find(mm, addr, &vma);
>> > +	if (ret)
>> > +		return 0;
>> > +
>> > +	if (encl != vma->vm_private_data)
>> > +		return 0;
>> > +
>> > +	ret = apply_to_page_range(vma->vm_mm, addr, PAGE_SIZE,
>> > +				  sgx_encl_test_and_clear_young_cb, vma->vm_mm);
>> > +	if (ret < 0)
>> > +		return 0;
>> > +
>> > +	return ret;
>> > +}
>> > +
>> > +/**
>> > + * sgx_encl_reserve_page() - Reserve an enclave page
>> > + * @encl:	an enclave pointer
>> > + * @addr:	a page address
>> > + *
>> > + * Load an enclave page and lock the enclave so that the page can be
>> > used by
>> > + * EDBG* and EMOD*.
>> > + *
>> > + * Return:
>> > + *   an enclave page on success
>> > + *   -EFAULT	if the load fails
>> > + */
>> > +struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl,
>> > +					    unsigned long addr)
>> > +{
>> > +	struct sgx_encl_page *entry;
>> > +
>> > +	for ( ; ; ) {
>> > +		mutex_lock(&encl->lock);
>> > +
>> > +		entry = sgx_encl_load_page(encl, addr);
>> > +		if (PTR_ERR(entry) != -EBUSY)
>> > +			break;
>> > +
>> > +		mutex_unlock(&encl->lock);
>> > +	}
>> > +
>> > +	if (IS_ERR(entry))
>> > +		mutex_unlock(&encl->lock);
>> > +
>> > +	return entry;
>> > +}
>> > +
>> > +/**
>> > + * sgx_alloc_va_page() - Allocate a Version Array (VA) page
>> > + *
>> > + * Allocate a free EPC page and convert it to a Version Array (VA)  
>> page.
>> > + *
>> > + * Return:
>> > + *   a VA page,
>> > + *   -errno otherwise
>> > + */
>> > +struct sgx_epc_page *sgx_alloc_va_page(void)
>> > +{
>> > +	struct sgx_epc_page *epc_page;
>> > +	int ret;
>> > +
>> > +	epc_page = sgx_alloc_epc_page(NULL, true);
>> > +	if (IS_ERR(epc_page))
>> > +		return ERR_CAST(epc_page);
>> > +
>> > +	ret = __epa(sgx_get_epc_addr(epc_page));
>> > +	if (ret) {
>> > +		WARN_ONCE(1, "EPA returned %d (0x%x)", ret, ret);
>> > +		sgx_free_epc_page(epc_page);
>> > +		return ERR_PTR(-EFAULT);
>> > +	}
>> > +
>> > +	return epc_page;
>> > +}
>> > +
>> > +/**
>> > + * sgx_alloc_va_slot - allocate a VA slot
>> > + * @va_page:	a &struct sgx_va_page instance
>> > + *
>> > + * Allocates a slot from a &struct sgx_va_page instance.
>> > + *
>> > + * Return: offset of the slot inside the VA page
>> > + */
>> > +unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page)
>> > +{
>> > +	int slot = find_first_zero_bit(va_page->slots, SGX_VA_SLOT_COUNT);
>> > +
>> > +	if (slot < SGX_VA_SLOT_COUNT)
>> > +		set_bit(slot, va_page->slots);
>> > +
>> > +	return slot << 3;
>> > +}
>> > +
>> > +/**
>> > + * sgx_free_va_slot - free a VA slot
>> > + * @va_page:	a &struct sgx_va_page instance
>> > + * @offset:	offset of the slot inside the VA page
>> > + *
>> > + * Frees a slot from a &struct sgx_va_page instance.
>> > + */
>> > +void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int  
>> offset)
>> > +{
>> > +	clear_bit(offset >> 3, va_page->slots);
>> > +}
>> > +
>> > +/**
>> > + * sgx_va_page_full - is the VA page full?
>> > + * @va_page:	a &struct sgx_va_page instance
>> > + *
>> > + * Return: true if all slots have been taken
>> > + */
>> > +bool sgx_va_page_full(struct sgx_va_page *va_page)
>> > +{
>> > +	int slot = find_first_zero_bit(va_page->slots, SGX_VA_SLOT_COUNT);
>> > +
>> > +	return slot == SGX_VA_SLOT_COUNT;
>> > +}
>> > diff --git a/arch/x86/kernel/cpu/sgx/encl.h
>> > b/arch/x86/kernel/cpu/sgx/encl.h
>> > index 0448d22d3010..e8eb9e9a834e 100644
>> > --- a/arch/x86/kernel/cpu/sgx/encl.h
>> > +++ b/arch/x86/kernel/cpu/sgx/encl.h
>> > @@ -19,6 +19,10 @@
>> > /**
>> >   * enum sgx_encl_page_desc - defines bits for an enclave page's
>> > descriptor
>> > + * %SGX_ENCL_PAGE_BEING_RECLAIMED:	The page is in the process of  
>> being
>> > + *					reclaimed.
>> > + * %SGX_ENCL_PAGE_VA_OFFSET_MASK:	Holds the offset in the Version  
>> Array
>> > + *					(VA) page for a swapped page.
>> >   * %SGX_ENCL_PAGE_ADDR_MASK:		Holds the virtual address of the page.
>> >   *
>> >   * The page address for SECS is zero and is used by the subsystem to
>> > recognize
>> > @@ -26,16 +30,23 @@
>> >   */
>> >  enum sgx_encl_page_desc {
>> >  	/* Bits 11:3 are available when the page is not swapped. */
>> > +	SGX_ENCL_PAGE_BEING_RECLAIMED		= BIT(3),
>> > +	SGX_ENCL_PAGE_VA_OFFSET_MASK	= GENMASK_ULL(11, 3),
>> >  	SGX_ENCL_PAGE_ADDR_MASK		= PAGE_MASK,
>> >  };
>> > #define SGX_ENCL_PAGE_ADDR(page) \
>> >  	((page)->desc & SGX_ENCL_PAGE_ADDR_MASK)
>> > +#define SGX_ENCL_PAGE_VA_OFFSET(page) \
>> > +	((page)->desc & SGX_ENCL_PAGE_VA_OFFSET_MASK)
>> > +#define SGX_ENCL_PAGE_INDEX(page) \
>> > +	PFN_DOWN((page)->desc - (page)->encl->base)
>> > struct sgx_encl_page {
>> >  	unsigned long desc;
>> >  	unsigned long vm_max_prot_bits;
>> >  	struct sgx_epc_page *epc_page;
>> > +	struct sgx_va_page *va_page;
>> >  	struct sgx_encl *encl;
>> >  };
>> > @@ -61,6 +72,7 @@ struct sgx_encl {
>> >  	struct mutex lock;
>> >  	struct list_head mm_list;
>> >  	spinlock_t mm_lock;
>> > +	unsigned long mm_list_version;
>> >  	struct file *backing;
>> >  	struct kref refcount;
>> >  	struct srcu_struct srcu;
>> > @@ -68,12 +80,21 @@ struct sgx_encl {
>> >  	unsigned long size;
>> >  	unsigned long ssaframesize;
>> >  	struct xarray page_array;
>> > +	struct list_head va_pages;
>> >  	struct sgx_encl_page secs;
>> >  	cpumask_t cpumask;
>> >  	unsigned long attributes;
>> >  	unsigned long attributes_mask;
>> >  };
>> > +#define SGX_VA_SLOT_COUNT 512
>> > +
>> > +struct sgx_va_page {
>> > +	struct sgx_epc_page *epc_page;
>> > +	DECLARE_BITMAP(slots, SGX_VA_SLOT_COUNT);
>> > +	struct list_head list;
>> > +};
>> > +
>> >  extern const struct vm_operations_struct sgx_vm_ops;
>> > int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
>> > @@ -84,4 +105,24 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct
>> > mm_struct *mm);
>> >  int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
>> >  		     unsigned long end, unsigned long vm_flags);
>> > +struct sgx_backing {
>> > +	pgoff_t page_index;
>> > +	struct page *contents;
>> > +	struct page *pcmd;
>> > +	unsigned long pcmd_offset;
>> > +};
>> > +
>> > +int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long
>> > page_index,
>> > +			 struct sgx_backing *backing);
>> > +void sgx_encl_put_backing(struct sgx_backing *backing, bool  
>> do_write);
>> > +int sgx_encl_test_and_clear_young(struct mm_struct *mm,
>> > +				  struct sgx_encl_page *page);
>> > +struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl,
>> > +					    unsigned long addr);
>> > +
>> > +struct sgx_epc_page *sgx_alloc_va_page(void);
>> > +unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page);
>> > +void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int  
>> offset);
>> > +bool sgx_va_page_full(struct sgx_va_page *va_page);
>> > +
>> >  #endif /* _X86_ENCL_H */
>> > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c
>> > b/arch/x86/kernel/cpu/sgx/ioctl.c
>> > index 3c04798e83e5..613f6c03598e 100644
>> > --- a/arch/x86/kernel/cpu/sgx/ioctl.c
>> > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
>> > @@ -16,6 +16,43 @@
>> >  #include "encl.h"
>> >  #include "encls.h"
>> > +static struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl)
>> > +{
>> > +	struct sgx_va_page *va_page = NULL;
>> > +	void *err;
>> > +
>> > +	BUILD_BUG_ON(SGX_VA_SLOT_COUNT !=
>> > +		(SGX_ENCL_PAGE_VA_OFFSET_MASK >> 3) + 1);
>> > +
>> > +	if (!(encl->page_cnt % SGX_VA_SLOT_COUNT)) {
>> > +		va_page = kzalloc(sizeof(*va_page), GFP_KERNEL);
>> > +		if (!va_page)
>> > +			return ERR_PTR(-ENOMEM);
>> > +
>> > +		va_page->epc_page = sgx_alloc_va_page();
>> > +		if (IS_ERR(va_page->epc_page)) {
>> > +			err = ERR_CAST(va_page->epc_page);
>> > +			kfree(va_page);
>> > +			return err;
>> > +		}
>> > +
>> > +		WARN_ON_ONCE(encl->page_cnt % SGX_VA_SLOT_COUNT);
>> > +	}
>> > +	encl->page_cnt++;
>> > +	return va_page;
>> > +}
>> > +
>> > +static void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page
>> > *va_page)
>> > +{
>> > +	encl->page_cnt--;
>> > +
>> > +	if (va_page) {
>> > +		sgx_free_epc_page(va_page->epc_page);
>> > +		list_del(&va_page->list);
>> > +		kfree(va_page);
>> > +	}
>> > +}
>> > +
>> >  static u32 sgx_calc_ssa_frame_size(u32 miscselect, u64 xfrm)
>> >  {
>> >  	u32 size_max = PAGE_SIZE;
>> > @@ -80,15 +117,24 @@ static int sgx_validate_secs(const struct  
>> sgx_secs
>> > *secs)
>> >  static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs  
>> *secs)
>> >  {
>> >  	struct sgx_epc_page *secs_epc;
>> > +	struct sgx_va_page *va_page;
>> >  	struct sgx_pageinfo pginfo;
>> >  	struct sgx_secinfo secinfo;
>> >  	unsigned long encl_size;
>> >  	struct file *backing;
>> >  	long ret;
>> > +	va_page = sgx_encl_grow(encl);
>> > +	if (IS_ERR(va_page))
>> > +		return PTR_ERR(va_page);
>> > +	else if (va_page)
>> > +		list_add(&va_page->list, &encl->va_pages);
>> > +	/* else the tail page of the VA page list had free slots. */
>> > +
>> >  	if (sgx_validate_secs(secs)) {
>> >  		pr_debug("invalid SECS\n");
>> > -		return -EINVAL;
>> > +		ret = -EINVAL;
>> > +		goto err_out_shrink;
>> >  	}
>> > 	/* The extra page goes to SECS. */
>> > @@ -96,12 +142,14 @@ static int sgx_encl_create(struct sgx_encl *encl,
>> > struct sgx_secs *secs)
>> > 	backing = shmem_file_setup("SGX backing", encl_size + (encl_size >>  
>> 5),
>> >  				   VM_NORESERVE);
>> > -	if (IS_ERR(backing))
>> > -		return PTR_ERR(backing);
>> > +	if (IS_ERR(backing)) {
>> > +		ret = PTR_ERR(backing);
>> > +		goto err_out_shrink;
>> > +	}
>> > 	encl->backing = backing;
>> > -	secs_epc = __sgx_alloc_epc_page();
>> > +	secs_epc = sgx_alloc_epc_page(&encl->secs, true);
>> >  	if (IS_ERR(secs_epc)) {
>> >  		ret = PTR_ERR(secs_epc);
>> >  		goto err_out_backing;
>> > @@ -149,6 +197,9 @@ static int sgx_encl_create(struct sgx_encl *encl,
>> > struct sgx_secs *secs)
>> >  	fput(encl->backing);
>> >  	encl->backing = NULL;
>> > +err_out_shrink:
>> > +	sgx_encl_shrink(encl, va_page);
>> > +
>> >  	return ret;
>> >  }
>> > @@ -321,21 +372,35 @@ static int sgx_encl_add_page(struct sgx_encl
>> > *encl, unsigned long src,
>> >  {
>> >  	struct sgx_encl_page *encl_page;
>> >  	struct sgx_epc_page *epc_page;
>> > +	struct sgx_va_page *va_page;
>> >  	int ret;
>> > 	encl_page = sgx_encl_page_alloc(encl, offset, secinfo->flags);
>> >  	if (IS_ERR(encl_page))
>> >  		return PTR_ERR(encl_page);
>> > -	epc_page = __sgx_alloc_epc_page();
>> > +	epc_page = sgx_alloc_epc_page(encl_page, true);
>> >  	if (IS_ERR(epc_page)) {
>> >  		kfree(encl_page);
>> >  		return PTR_ERR(epc_page);
>> >  	}
>> > +	va_page = sgx_encl_grow(encl);
>> > +	if (IS_ERR(va_page)) {
>> > +		ret = PTR_ERR(va_page);
>> > +		goto err_out_free;
>> > +	}
>> > +
>> >  	mmap_read_lock(current->mm);
>> >  	mutex_lock(&encl->lock);
>> > +	/*
>> > +	 * Adding to encl->va_pages must be done under encl->lock.  Ditto  
>> for
>> > +	 * deleting (via sgx_encl_shrink()) in the error path.
>> > +	 */
>> > +	if (va_page)
>> > +		list_add(&va_page->list, &encl->va_pages);
>> > +
>> >  	/*
>> >  	 * Insert prior to EADD in case of OOM.  EADD modifies MRENCLAVE,  
>> i.e.
>> >  	 * can't be gracefully unwound, while failure on EADD/EXTEND is  
>> limited
>> > @@ -366,6 +431,7 @@ static int sgx_encl_add_page(struct sgx_encl  
>> *encl,
>> > unsigned long src,
>> >  			goto err_out;
>> >  	}
>> > +	sgx_mark_page_reclaimable(encl_page->epc_page);
>> >  	mutex_unlock(&encl->lock);
>> >  	mmap_read_unlock(current->mm);
>> >  	return ret;
>> > @@ -374,9 +440,11 @@ static int sgx_encl_add_page(struct sgx_encl  
>> *encl,
>> > unsigned long src,
>> >  	xa_erase(&encl->page_array, PFN_DOWN(encl_page->desc));
>> > err_out_unlock:
>> > +	sgx_encl_shrink(encl, va_page);
>> >  	mutex_unlock(&encl->lock);
>> >  	mmap_read_unlock(current->mm);
>> > +err_out_free:
>> >  	sgx_free_epc_page(epc_page);
>> >  	kfree(encl_page);
>> > diff --git a/arch/x86/kernel/cpu/sgx/main.c
>> > b/arch/x86/kernel/cpu/sgx/main.c
>> > index 4137254fb29e..3f9130501370 100644
>> > --- a/arch/x86/kernel/cpu/sgx/main.c
>> > +++ b/arch/x86/kernel/cpu/sgx/main.c
>> > @@ -16,6 +16,395 @@
>> >  struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
>> >  static int sgx_nr_epc_sections;
>> >  static struct task_struct *ksgxswapd_tsk;
>> > +static DECLARE_WAIT_QUEUE_HEAD(ksgxswapd_waitq);
>> > +static LIST_HEAD(sgx_active_page_list);
>> > +static DEFINE_SPINLOCK(sgx_active_page_list_lock);
>> > +
>> > +/**
>> > + * sgx_mark_page_reclaimable() - Mark a page as reclaimable
>> > + * @page:	EPC page
>> > + *
>> > + * Mark a page as reclaimable and add it to the active page list.  
>> Pages
>> > + * are automatically removed from the active list when freed.
>> > + */
>> > +void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
>> > +{
>> > +	spin_lock(&sgx_active_page_list_lock);
>> > +	page->desc |= SGX_EPC_PAGE_RECLAIMABLE;
>> > +	list_add_tail(&page->list, &sgx_active_page_list);
>> > +	spin_unlock(&sgx_active_page_list_lock);
>> > +}
>> > +
>> > +/**
>> > + * sgx_unmark_page_reclaimable() - Remove a page from the reclaim  
>> list
>> > + * @page:	EPC page
>> > + *
>> > + * Clear the reclaimable flag and remove the page from the active  
>> page
>> > list.
>> > + *
>> > + * Return:
>> > + *   0 on success,
>> > + *   -EBUSY if the page is in the process of being reclaimed
>> > + */
>> > +int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
>> > +{
>> > +	/*
>> > +	 * Remove the page from the active list if necessary.  If the page
>> > +	 * is actively being reclaimed, i.e. RECLAIMABLE is set but the
>> > +	 * page isn't on the active list, return -EBUSY as we can't free
>> > +	 * the page at this time since it is "owned" by the reclaimer.
>> > +	 */
>> > +	spin_lock(&sgx_active_page_list_lock);
>> > +	if (page->desc & SGX_EPC_PAGE_RECLAIMABLE) {
>> > +		if (list_empty(&page->list)) {
>> > +			spin_unlock(&sgx_active_page_list_lock);
>> > +			return -EBUSY;
>> > +		}
>> > +		list_del(&page->list);
>> > +		page->desc &= ~SGX_EPC_PAGE_RECLAIMABLE;
>> > +	}
>> > +	spin_unlock(&sgx_active_page_list_lock);
>> > +
>> > +	return 0;
>> > +}
>> > +
>> > +static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page)
>> > +{
>> > +	struct sgx_encl_page *page = epc_page->owner;
>> > +	struct sgx_encl *encl = page->encl;
>> > +	struct sgx_encl_mm *encl_mm;
>> > +	bool ret = true;
>> > +	int idx;
>> > +
>> > +	idx = srcu_read_lock(&encl->srcu);
>> > +
>> > +	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
>> > +		if (!mmget_not_zero(encl_mm->mm))
>> > +			continue;
>> > +
>> > +		mmap_read_lock(encl_mm->mm);
>> > +		ret = !sgx_encl_test_and_clear_young(encl_mm->mm, page);
>> > +		mmap_read_unlock(encl_mm->mm);
>> > +
>> > +		mmput_async(encl_mm->mm);
>> > +
>> > +		if (!ret || (atomic_read(&encl->flags) & SGX_ENCL_DEAD))
>> > +			break;
>> > +	}
>> > +
>> > +	srcu_read_unlock(&encl->srcu, idx);
>> > +
>> > +	if (!ret && !(atomic_read(&encl->flags) & SGX_ENCL_DEAD))
>> > +		return false;
>> > +
>> > +	return true;
>> > +}
>> > +
>> > +static void sgx_reclaimer_block(struct sgx_epc_page *epc_page)
>> > +{
>> > +	struct sgx_encl_page *page = epc_page->owner;
>> > +	unsigned long addr = SGX_ENCL_PAGE_ADDR(page);
>> > +	struct sgx_encl *encl = page->encl;
>> > +	unsigned long mm_list_version;
>> > +	struct sgx_encl_mm *encl_mm;
>> > +	struct vm_area_struct *vma;
>> > +	int idx, ret;
>> > +
>> > +	do {
>> > +		mm_list_version = encl->mm_list_version;
>> > +
>> > +		/* Pairs with smp_rmb() in sgx_encl_mm_add(). */
>> > +		smp_rmb();
>> > +
>> > +		idx = srcu_read_lock(&encl->srcu);
>> > +
>> > +		list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
>> > +			if (!mmget_not_zero(encl_mm->mm))
>> > +				continue;
>> > +
>> > +			mmap_read_lock(encl_mm->mm);
>> > +
>> > +			ret = sgx_encl_find(encl_mm->mm, addr, &vma);
>> > +			if (!ret && encl == vma->vm_private_data)
>> > +				zap_vma_ptes(vma, addr, PAGE_SIZE);
>> > +
>> > +			mmap_read_unlock(encl_mm->mm);
>> > +
>> > +			mmput_async(encl_mm->mm);
>> > +		}
>> > +
>> > +		srcu_read_unlock(&encl->srcu, idx);
>> > +	} while (unlikely(encl->mm_list_version != mm_list_version));
>> > +
>> > +	mutex_lock(&encl->lock);
>> > +
>> > +	if (!(atomic_read(&encl->flags) & SGX_ENCL_DEAD)) {
>> > +		ret = __eblock(sgx_get_epc_addr(epc_page));
>> > +		if (encls_failed(ret))
>> > +			ENCLS_WARN(ret, "EBLOCK");
>> > +	}
>> > +
>> > +	mutex_unlock(&encl->lock);
>> > +}
>> > +
>> > +static int __sgx_encl_ewb(struct sgx_epc_page *epc_page, void  
>> *va_slot,
>> > +			  struct sgx_backing *backing)
>> > +{
>> > +	struct sgx_pageinfo pginfo;
>> > +	int ret;
>> > +
>> > +	pginfo.addr = 0;
>> > +	pginfo.secs = 0;
>> > +
>> > +	pginfo.contents = (unsigned long)kmap_atomic(backing->contents);
>> > +	pginfo.metadata = (unsigned long)kmap_atomic(backing->pcmd) +
>> > +			  backing->pcmd_offset;
>> > +
>> > +	ret = __ewb(&pginfo, sgx_get_epc_addr(epc_page), va_slot);
>> > +
>> > +	kunmap_atomic((void *)(unsigned long)(pginfo.metadata -
>> > +					      backing->pcmd_offset));
>> > +	kunmap_atomic((void *)(unsigned long)pginfo.contents);
>> > +
>> > +	return ret;
>> > +}
>> > +
>> > +static void sgx_ipi_cb(void *info)
>> > +{
>> > +}
>> > +
>> > +static const cpumask_t *sgx_encl_ewb_cpumask(struct sgx_encl *encl)
>> > +{
>> > +	cpumask_t *cpumask = &encl->cpumask;
>> > +	struct sgx_encl_mm *encl_mm;
>> > +	int idx;
>> > +
>> > +	/*
>> > +	 * Can race with sgx_encl_mm_add(), but ETRACK has already been
>> > +	 * executed, which means that the CPUs running in the new mm will  
>> enter
>> > +	 * into the enclave with a fresh epoch.
>> > +	 */
>> > +	cpumask_clear(cpumask);
>> > +
>> > +	idx = srcu_read_lock(&encl->srcu);
>> > +
>> > +	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
>> > +		if (!mmget_not_zero(encl_mm->mm))
>> > +			continue;
>> > +
>> > +		cpumask_or(cpumask, cpumask, mm_cpumask(encl_mm->mm));
>> > +
>> > +		mmput_async(encl_mm->mm);
>> > +	}
>> > +
>> > +	srcu_read_unlock(&encl->srcu, idx);
>> > +
>> > +	return cpumask;
>> > +}
>> > +
>> > +/*
>> > + * Swap page to the regular memory transformed to the blocked state  
>> by
>> > using
>> > + * EBLOCK, which means that it can no loger be referenced (no new TLB
>> > entries).
>> > + *
>> > + * The first trial just tries to write the page assuming that some
>> > other thread
>> > + * has reset the count for threads inside the enlave by using ETRACK,
>> > and
>> > + * previous thread count has been zeroed out. The second trial calls
>> > ETRACK
>> > + * before EWB. If that fails we kick all the HW threads out, and then
>> > do EWB,
>> > + * which should be guaranteed the succeed.
>> > + */
>> > +static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
>> > +			 struct sgx_backing *backing)
>> > +{
>> > +	struct sgx_encl_page *encl_page = epc_page->owner;
>> > +	struct sgx_encl *encl = encl_page->encl;
>> > +	struct sgx_va_page *va_page;
>> > +	unsigned int va_offset;
>> > +	void *va_slot;
>> > +	int ret;
>> > +
>> > +	encl_page->desc &= ~SGX_ENCL_PAGE_BEING_RECLAIMED;
>> > +
>> > +	va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
>> > +				   list);
>> > +	va_offset = sgx_alloc_va_slot(va_page);
>> > +	va_slot = sgx_get_epc_addr(va_page->epc_page) + va_offset;
>> > +	if (sgx_va_page_full(va_page))
>> > +		list_move_tail(&va_page->list, &encl->va_pages);
>> > +
>> > +	ret = __sgx_encl_ewb(epc_page, va_slot, backing);
>> > +	if (ret == SGX_NOT_TRACKED) {
>> > +		ret = __etrack(sgx_get_epc_addr(encl->secs.epc_page));
>> > +		if (ret) {
>> > +			if (encls_failed(ret))
>> > +				ENCLS_WARN(ret, "ETRACK");
>> > +		}
>> > +
>> > +		ret = __sgx_encl_ewb(epc_page, va_slot, backing);
>> > +		if (ret == SGX_NOT_TRACKED) {
>> > +			/*
>> > +			 * Slow path, send IPIs to kick cpus out of the
>> > +			 * enclave.  Note, it's imperative that the cpu
>> > +			 * mask is generated *after* ETRACK, else we'll
>> > +			 * miss cpus that entered the enclave between
>> > +			 * generating the mask and incrementing epoch.
>> > +			 */
>> > +			on_each_cpu_mask(sgx_encl_ewb_cpumask(encl),
>> > +					 sgx_ipi_cb, NULL, 1);
>> > +			ret = __sgx_encl_ewb(epc_page, va_slot, backing);
>> > +		}
>> > +	}
>> > +
>> > +	if (ret) {
>> > +		if (encls_failed(ret))
>> > +			ENCLS_WARN(ret, "EWB");
>> > +
>> > +		sgx_free_va_slot(va_page, va_offset);
>> > +	} else {
>> > +		encl_page->desc |= va_offset;
>> > +		encl_page->va_page = va_page;
>> > +	}
>> > +}
>> > +
>> > +static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
>> > +				struct sgx_backing *backing)
>> > +{
>> > +	struct sgx_encl_page *encl_page = epc_page->owner;
>> > +	struct sgx_encl *encl = encl_page->encl;
>> > +	struct sgx_backing secs_backing;
>> > +	int ret;
>> > +
>> > +	mutex_lock(&encl->lock);
>> > +
>> > +	if (atomic_read(&encl->flags) & SGX_ENCL_DEAD) {
>> > +		ret = __eremove(sgx_get_epc_addr(epc_page));
>> > +		ENCLS_WARN(ret, "EREMOVE");
>> > +	} else {
>> > +		sgx_encl_ewb(epc_page, backing);
>> > +	}
>> > +
>> > +	encl_page->epc_page = NULL;
>> > +	encl->secs_child_cnt--;
>> > +
>> > +	if (!encl->secs_child_cnt) {
>> > +		if (atomic_read(&encl->flags) & SGX_ENCL_DEAD) {
>> > +			sgx_free_epc_page(encl->secs.epc_page);
>> > +			encl->secs.epc_page = NULL;
>> > +		} else if (atomic_read(&encl->flags) & SGX_ENCL_INITIALIZED) {
>> > +			ret = sgx_encl_get_backing(encl, PFN_DOWN(encl->size),
>> > +						   &secs_backing);
>> > +			if (ret)
>> > +				goto out;
>> > +
>> > +			sgx_encl_ewb(encl->secs.epc_page, &secs_backing);
>> > +
>> > +			sgx_free_epc_page(encl->secs.epc_page);
>> > +			encl->secs.epc_page = NULL;
>> > +
>> > +			sgx_encl_put_backing(&secs_backing, true);
>> > +		}
>> > +	}
>> > +
>> > +out:
>> > +	mutex_unlock(&encl->lock);
>> > +}
>> > +
>> > +/*
>> > + * Take a fixed number of pages from the head of the active page pool
>> > and
>> > + * reclaim them to the enclave's private shmem files. Skip the pages,
>> > which have
>> > + * been accessed since the last scan. Move those pages to the tail of
>> > active
>> > + * page pool so that the pages get scanned in LRU like fashion.
>> > + *
>> > + * Batch process a chunk of pages (at the moment 16) in order to
>> > degrade amount
>> > + * of IPI's and ETRACK's potentially required. sgx_encl_ewb() does
>> > degrade a bit
>> > + * among the HW threads with three stage EWB pipeline (EWB, ETRACK +
>> > EWB and IPI
>> > + * + EWB) but not sufficiently. Reclaiming one page at a time would
>> > also be
>> > + * problematic as it would increase the lock contention too much,  
>> which
>> > would
>> > + * halt forward progress.
>> > + */
>> > +static void sgx_reclaim_pages(void)
>> > +{
>> > +	struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
>> > +	struct sgx_backing backing[SGX_NR_TO_SCAN];
>> > +	struct sgx_epc_section *section;
>> > +	struct sgx_encl_page *encl_page;
>> > +	struct sgx_epc_page *epc_page;
>> > +	int cnt = 0;
>> > +	int ret;
>> > +	int i;
>> > +
>> > +	spin_lock(&sgx_active_page_list_lock);
>> > +	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
>> > +		if (list_empty(&sgx_active_page_list))
>> > +			break;
>> > +
>> > +		epc_page = list_first_entry(&sgx_active_page_list,
>> > +					    struct sgx_epc_page, list);
>> > +		list_del_init(&epc_page->list);
>> > +		encl_page = epc_page->owner;
>> > +
>> > +		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0)
>> > +			chunk[cnt++] = epc_page;
>> > +		else
>> > +			/* The owner is freeing the page. No need to add the
>> > +			 * page back to the list of reclaimable pages.
>> > +			 */
>> > +			epc_page->desc &= ~SGX_EPC_PAGE_RECLAIMABLE;
>> > +	}
>> > +	spin_unlock(&sgx_active_page_list_lock);
>> > +
>> > +	for (i = 0; i < cnt; i++) {
>> > +		epc_page = chunk[i];
>> > +		encl_page = epc_page->owner;
>> > +
>> > +		if (!sgx_reclaimer_age(epc_page))
>> > +			goto skip;
>> > +
>> > +		ret = sgx_encl_get_backing(encl_page->encl,
>> > +					   SGX_ENCL_PAGE_INDEX(encl_page),
>> > +					   &backing[i]);
>> > +		if (ret)
>> > +			goto skip;
>> > +
>> > +		mutex_lock(&encl_page->encl->lock);
>> > +		encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED;
>> > +		mutex_unlock(&encl_page->encl->lock);
>> > +		continue;
>> > +
>> > +skip:
>> > +		spin_lock(&sgx_active_page_list_lock);
>> > +		list_add_tail(&epc_page->list, &sgx_active_page_list);
>> > +		spin_unlock(&sgx_active_page_list_lock);
>> > +
>> > +		kref_put(&encl_page->encl->refcount, sgx_encl_release);
>> > +
>> > +		chunk[i] = NULL;
>> > +	}
>> > +
>> > +	for (i = 0; i < cnt; i++) {
>> > +		epc_page = chunk[i];
>> > +		if (epc_page)
>> > +			sgx_reclaimer_block(epc_page);
>> > +	}
>> > +
>> > +	for (i = 0; i < cnt; i++) {
>> > +		epc_page = chunk[i];
>> > +		if (!epc_page)
>> > +			continue;
>> > +
>> > +		encl_page = epc_page->owner;
>> > +		sgx_reclaimer_write(epc_page, &backing[i]);
>> > +		sgx_encl_put_backing(&backing[i], true);
>> > +
>> > +		kref_put(&encl_page->encl->refcount, sgx_encl_release);
>> > +		epc_page->desc &= ~SGX_EPC_PAGE_RECLAIMABLE;
>> > +
>> > +		section = sgx_get_epc_section(epc_page);
>> > +		spin_lock(&section->lock);
>> > +		list_add_tail(&epc_page->list, &section->page_list);
>> > +		section->free_cnt++;
>> > +		spin_unlock(&section->lock);
>> > +	}
>> > +}
>> > +
>> > static void sgx_sanitize_section(struct sgx_epc_section *section)
>> >  {
>> > @@ -44,6 +433,23 @@ static void sgx_sanitize_section(struct
>> > sgx_epc_section *section)
>> >  	}
>> >  }
>> > +static unsigned long sgx_nr_free_pages(void)
>> > +{
>> > +	unsigned long cnt = 0;
>> > +	int i;
>> > +
>> > +	for (i = 0; i < sgx_nr_epc_sections; i++)
>> > +		cnt += sgx_epc_sections[i].free_cnt;
>> > +
>> > +	return cnt;
>> > +}
>> > +
>> > +static bool sgx_should_reclaim(unsigned long watermark)
>> > +{
>> > +	return sgx_nr_free_pages() < watermark &&
>> > +	       !list_empty(&sgx_active_page_list);
>> > +}
>> > +
>> >  static int ksgxswapd(void *p)
>> >  {
>> >  	int i;
>> > @@ -69,6 +475,20 @@ static int ksgxswapd(void *p)
>> >  			WARN(1, "EPC section %d has unsanitized pages.\n", i);
>> >  	}
>> > +	while (!kthread_should_stop()) {
>> > +		if (try_to_freeze())
>> > +			continue;
>> > +
>> > +		wait_event_freezable(ksgxswapd_waitq,
>> > +				     kthread_should_stop() ||
>> > +				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
>> > +
>> > +		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
>> > +			sgx_reclaim_pages();
>> > +
>> > +		cond_resched();
>> > +	}
>> > +
>> >  	return 0;
>> >  }
>> > @@ -94,6 +514,7 @@ static struct sgx_epc_page
>> > *__sgx_alloc_epc_page_from_section(struct sgx_epc_sec
>> > 	page = list_first_entry(&section->page_list, struct sgx_epc_page,  
>> list);
>> >  	list_del_init(&page->list);
>> > +	section->free_cnt--;
>> > 	return page;
>> >  }
>> > @@ -127,6 +548,57 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
>> >  	return ERR_PTR(-ENOMEM);
>> >  }
>> > +/**
>> > + * sgx_alloc_epc_page() - Allocate an EPC page
>> > + * @owner:	the owner of the EPC page
>> > + * @reclaim:	reclaim pages if necessary
>> > + *
>> > + * Iterate through EPC sections and borrow a free EPC page to the
>> > caller. When a
>> > + * page is no longer needed it must be released with
>> > sgx_free_epc_page(). If
>> > + * @reclaim is set to true, directly reclaim pages when we are out of
>> > pages. No
>> > + * mm's can be locked when @reclaim is set to true.
>> > + *
>> > + * Finally, wake up ksgxswapd when the number of pages goes below the
>> > watermark
>> > + * before returning back to the caller.
>> > + *
>> > + * Return:
>> > + *   an EPC page,
>> > + *   -errno on error
>> > + */
>> > +struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>> > +{
>> > +	struct sgx_epc_page *entry;
>> > +
>> > +	for ( ; ; ) {
>> > +		entry = __sgx_alloc_epc_page();
>> > +		if (!IS_ERR(entry)) {
>> > +			entry->owner = owner;
>> > +			break;
>> > +		}
>> > +
>> > +		if (list_empty(&sgx_active_page_list))
>> > +			return ERR_PTR(-ENOMEM);
>> > +
>> > +		if (!reclaim) {
>> > +			entry = ERR_PTR(-EBUSY);
>> > +			break;
>> > +		}
>> > +
>> > +		if (signal_pending(current)) {
>> > +			entry = ERR_PTR(-ERESTARTSYS);
>> > +			break;
>> > +		}
>> > +
>> > +		sgx_reclaim_pages();
>> > +		schedule();
>> > +	}
>> > +
>> > +	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
>> > +		wake_up(&ksgxswapd_waitq);
>> > +
>> > +	return entry;
>> > +}
>> > +
>> >  /**
>> >   * sgx_free_epc_page() - Free an EPC page
>> >   * @page:	an EPC page
>> > @@ -138,12 +610,20 @@ void sgx_free_epc_page(struct sgx_epc_page  
>> *page)
>> >  	struct sgx_epc_section *section = sgx_get_epc_section(page);
>> >  	int ret;
>> > +	/*
>> > +	 * Don't take sgx_active_page_list_lock when asserting the page  
>> isn't
>> > +	 * reclaimable, missing a WARN in the very rare case is preferable  
>> to
>> > +	 * unnecessarily taking a global lock in the common case.
>> > +	 */
>> > +	WARN_ON_ONCE(page->desc & SGX_EPC_PAGE_RECLAIMABLE);
>> > +
>> >  	ret = __eremove(sgx_get_epc_addr(page));
>> >  	if (WARN_ONCE(ret, "EREMOVE returned %d (0x%x)", ret, ret))
>> >  		return;
>> > 	spin_lock(&section->lock);
>> >  	list_add_tail(&page->list, &section->page_list);
>> > +	section->free_cnt++;
>> >  	spin_unlock(&section->lock);
>> >  }
>> > @@ -194,6 +674,7 @@ static bool __init sgx_setup_epc_section(u64 addr,
>> > u64 size,
>> >  		list_add_tail(&page->list, &section->unsanitized_page_list);
>> >  	}
>> > +	section->free_cnt = nr_pages;
>> >  	return true;
>> > err_out:
>> > diff --git a/arch/x86/kernel/cpu/sgx/sgx.h
>> > b/arch/x86/kernel/cpu/sgx/sgx.h
>> > index 8d126070db1e..ec4f7b338dbe 100644
>> > --- a/arch/x86/kernel/cpu/sgx/sgx.h
>> > +++ b/arch/x86/kernel/cpu/sgx/sgx.h
>> > @@ -15,6 +15,7 @@
>> > struct sgx_epc_page {
>> >  	unsigned long desc;
>> > +	struct sgx_encl_page *owner;
>> >  	struct list_head list;
>> >  };
>> > @@ -27,6 +28,7 @@ struct sgx_epc_page {
>> >  struct sgx_epc_section {
>> >  	unsigned long pa;
>> >  	void *va;
>> > +	unsigned long free_cnt;
>> >  	struct list_head page_list;
>> >  	struct list_head unsanitized_page_list;
>> >  	spinlock_t lock;
>> > @@ -35,6 +37,10 @@ struct sgx_epc_section {
>> >  #define SGX_EPC_SECTION_MASK		GENMASK(7, 0)
>> >  #define SGX_MAX_EPC_SECTIONS		(SGX_EPC_SECTION_MASK + 1)
>> >  #define SGX_MAX_ADD_PAGES_LENGTH	0x100000
>> > +#define SGX_EPC_PAGE_RECLAIMABLE	BIT(8)
>> > +#define SGX_NR_TO_SCAN			16
>> > +#define SGX_NR_LOW_PAGES		32
>> > +#define SGX_NR_HIGH_PAGES		64
>> > extern struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
>> > @@ -50,7 +56,10 @@ static inline void *sgx_get_epc_addr(struct
>> > sgx_epc_page *page)
>> >  	return section->va + (page->desc & PAGE_MASK) - section->pa;
>> >  }
>> > +void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
>> > +int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
>> >  struct sgx_epc_page *__sgx_alloc_epc_page(void);
>> > +struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
>> >  void sgx_free_epc_page(struct sgx_epc_page *page);
>> > #endif /* _X86_SGX_H */
>>
>>
>> --


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-03  4:50 ` [PATCH v39 11/24] x86/sgx: Add SGX enclave driver Jarkko Sakkinen
  2020-10-03 14:39   ` Greg KH
@ 2020-10-03 19:54   ` Matthew Wilcox
  2020-10-04 21:50     ` Jarkko Sakkinen
  1 sibling, 1 reply; 35+ messages in thread
From: Matthew Wilcox @ 2020-10-03 19:54 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: x86, linux-sgx, linux-kernel, linux-security-module, linux-mm,
	Andrew Morton, Jethro Beekman, Haitao Huang, Chunyang Hui,
	Jordan Hand, Nathaniel McCallum, Seth Moore, Darren Kenny,
	Sean Christopherson, Suresh Siddha, andriy.shevchenko, asapek,
	bp, cedric.xing, chenalexchen, conradparker, cyhanish,
	dave.hansen, haitao.huang, kai.huang, kai.svahn, kmoy, ludloff,
	luto, nhorman, puiterwijk, rientjes, tglx, yaozhangx,
	mikko.ylinen

On Sat, Oct 03, 2020 at 07:50:46AM +0300, Jarkko Sakkinen wrote:
> +	XA_STATE(xas, &encl->page_array, idx_start);
> +
> +	/*
> +	 * Disallow READ_IMPLIES_EXEC tasks as their VMA permissions might
> +	 * conflict with the enclave page permissions.
> +	 */
> +	if (current->personality & READ_IMPLIES_EXEC)
> +		return -EACCES;
> +
> +	xas_for_each(&xas, page, idx_end)
> +		if (!page || (~page->vm_max_prot_bits & vm_prot_bits))
> +			return -EACCES;

You're iterating the array without holding any lock that the XArray knows
about.  If you're OK with another thread adding/removing pages behind your
back, or there's a higher level lock (the mmap_sem?) protecting the XArray
from being modified while you walk it, then hold the rcu_read_lock()
while walking the array.  Otherwise you can prevent modification by
calling xas_lock(&xas) and xas_unlock()..

> +	return 0;
> +}
> +
> +static int sgx_vma_mprotect(struct vm_area_struct *vma,
> +			    struct vm_area_struct **pprev, unsigned long start,
> +			    unsigned long end, unsigned long newflags)
> +{
> +	int ret;
> +
> +	ret = sgx_encl_may_map(vma->vm_private_data, start, end, newflags);
> +	if (ret)
> +		return ret;
> +
> +	return mprotect_fixup(vma, pprev, start, end, newflags);
> +}
> +
> +const struct vm_operations_struct sgx_vm_ops = {
> +	.open = sgx_vma_open,
> +	.fault = sgx_vma_fault,
> +	.mprotect = sgx_vma_mprotect,
> +};
> +
> +/**
> + * sgx_encl_find - find an enclave
> + * @mm:		mm struct of the current process
> + * @addr:	address in the ELRANGE
> + * @vma:	the resulting VMA
> + *
> + * Find an enclave identified by the given address. Give back a VMA that is
> + * part of the enclave and located in that address. The VMA is given back if it
> + * is a proper enclave VMA even if an &sgx_encl instance does not exist yet
> + * (enclave creation has not been performed).
> + *
> + * Return:
> + *   0 on success,
> + *   -EINVAL if an enclave was not found,
> + *   -ENOENT if the enclave has not been created yet
> + */
> +int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
> +		  struct vm_area_struct **vma)
> +{
> +	struct vm_area_struct *result;
> +	struct sgx_encl *encl;
> +
> +	result = find_vma(mm, addr);
> +	if (!result || result->vm_ops != &sgx_vm_ops || addr < result->vm_start)
> +		return -EINVAL;
> +
> +	encl = result->vm_private_data;
> +	*vma = result;
> +
> +	return encl ? 0 : -ENOENT;
> +}
> +
> +/**
> + * sgx_encl_destroy() - destroy enclave resources
> + * @encl:	an enclave pointer
> + */
> +void sgx_encl_destroy(struct sgx_encl *encl)
> +{
> +	struct sgx_encl_page *entry;
> +	unsigned long index;
> +
> +	atomic_or(SGX_ENCL_DEAD, &encl->flags);
> +
> +	xa_for_each(&encl->page_array, index, entry) {
> +		if (entry->epc_page) {
> +			sgx_free_epc_page(entry->epc_page);
> +			encl->secs_child_cnt--;
> +			entry->epc_page = NULL;
> +		}
> +
> +		kfree(entry);
> +	}
> +
> +	xa_destroy(&encl->page_array);
> +
> +	if (!encl->secs_child_cnt && encl->secs.epc_page) {
> +		sgx_free_epc_page(encl->secs.epc_page);
> +		encl->secs.epc_page = NULL;
> +	}
> +}
> +
> +/**
> + * sgx_encl_release - Destroy an enclave instance
> + * @kref:	address of a kref inside &sgx_encl
> + *
> + * Used together with kref_put(). Frees all the resources associated with the
> + * enclave and the instance itself.
> + */
> +void sgx_encl_release(struct kref *ref)
> +{
> +	struct sgx_encl *encl = container_of(ref, struct sgx_encl, refcount);
> +
> +	sgx_encl_destroy(encl);
> +
> +	if (encl->backing)
> +		fput(encl->backing);
> +
> +	cleanup_srcu_struct(&encl->srcu);
> +
> +	WARN_ON_ONCE(!list_empty(&encl->mm_list));
> +
> +	/* Detect EPC page leak's. */
> +	WARN_ON_ONCE(encl->secs_child_cnt);
> +	WARN_ON_ONCE(encl->secs.epc_page);
> +
> +	kfree(encl);
> +}
> diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
> new file mode 100644
> index 000000000000..8ff445476657
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/encl.h
> @@ -0,0 +1,85 @@
> +/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
> +/**
> + * Copyright(c) 2016-19 Intel Corporation.
> + */
> +#ifndef _X86_ENCL_H
> +#define _X86_ENCL_H
> +
> +#include <linux/cpumask.h>
> +#include <linux/kref.h>
> +#include <linux/list.h>
> +#include <linux/mm_types.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/mutex.h>
> +#include <linux/notifier.h>
> +#include <linux/srcu.h>
> +#include <linux/workqueue.h>
> +#include <linux/xarray.h>
> +#include "sgx.h"
> +
> +/**
> + * enum sgx_encl_page_desc - defines bits for an enclave page's descriptor
> + * %SGX_ENCL_PAGE_ADDR_MASK:		Holds the virtual address of the page.
> + *
> + * The page address for SECS is zero and is used by the subsystem to recognize
> + * the SECS page.
> + */
> +enum sgx_encl_page_desc {
> +	/* Bits 11:3 are available when the page is not swapped. */
> +	SGX_ENCL_PAGE_ADDR_MASK		= PAGE_MASK,
> +};
> +
> +#define SGX_ENCL_PAGE_ADDR(page) \
> +	((page)->desc & SGX_ENCL_PAGE_ADDR_MASK)
> +
> +struct sgx_encl_page {
> +	unsigned long desc;
> +	unsigned long vm_max_prot_bits;
> +	struct sgx_epc_page *epc_page;
> +	struct sgx_encl *encl;
> +};
> +
> +enum sgx_encl_flags {
> +	SGX_ENCL_CREATED	= BIT(0),
> +	SGX_ENCL_INITIALIZED	= BIT(1),
> +	SGX_ENCL_DEBUG		= BIT(2),
> +	SGX_ENCL_DEAD		= BIT(3),
> +	SGX_ENCL_IOCTL		= BIT(4),
> +};
> +
> +struct sgx_encl_mm {
> +	struct sgx_encl *encl;
> +	struct mm_struct *mm;
> +	struct list_head list;
> +	struct mmu_notifier mmu_notifier;
> +};
> +
> +struct sgx_encl {
> +	atomic_t flags;
> +	unsigned int page_cnt;
> +	unsigned int secs_child_cnt;
> +	struct mutex lock;
> +	struct list_head mm_list;
> +	spinlock_t mm_lock;
> +	struct file *backing;
> +	struct kref refcount;
> +	struct srcu_struct srcu;
> +	unsigned long base;
> +	unsigned long size;
> +	unsigned long ssaframesize;
> +	struct xarray page_array;
> +	struct sgx_encl_page secs;
> +	cpumask_t cpumask;
> +};
> +
> +extern const struct vm_operations_struct sgx_vm_ops;
> +
> +int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
> +		  struct vm_area_struct **vma);
> +void sgx_encl_destroy(struct sgx_encl *encl);
> +void sgx_encl_release(struct kref *ref);
> +int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm);
> +int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
> +		     unsigned long end, unsigned long vm_flags);
> +
> +#endif /* _X86_ENCL_H */
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 97c6895fb6c9..4137254fb29e 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -9,6 +9,8 @@
>  #include <linux/sched/mm.h>
>  #include <linux/sched/signal.h>
>  #include <linux/slab.h>
> +#include "driver.h"
> +#include "encl.h"
>  #include "encls.h"
>  
>  struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
> @@ -260,6 +262,8 @@ static bool __init sgx_page_cache_init(void)
>  
>  static void __init sgx_init(void)
>  {
> +	int ret;
> +
>  	if (!boot_cpu_has(X86_FEATURE_SGX))
>  		return;
>  
> @@ -269,8 +273,15 @@ static void __init sgx_init(void)
>  	if (!sgx_page_reclaimer_init())
>  		goto err_page_cache;
>  
> +	ret = sgx_drv_init();
> +	if (ret)
> +		goto err_kthread;
> +
>  	return;
>  
> +err_kthread:
> +	kthread_stop(ksgxswapd_tsk);
> +
>  err_page_cache:
>  	sgx_page_cache_teardown();
>  }
> -- 
> 2.25.1
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-03 14:39   ` Greg KH
@ 2020-10-04 14:32     ` Jarkko Sakkinen
  2020-10-04 15:01       ` Jarkko Sakkinen
  2020-10-05  9:42       ` Greg KH
  2020-10-05  8:45     ` Christoph Hellwig
  2020-10-09  7:10     ` Pavel Machek
  2 siblings, 2 replies; 35+ messages in thread
From: Jarkko Sakkinen @ 2020-10-04 14:32 UTC (permalink / raw)
  To: Greg KH
  Cc: x86, linux-sgx, linux-kernel, linux-security-module, linux-mm,
	Andrew Morton, Matthew Wilcox, Jethro Beekman, Haitao Huang,
	Chunyang Hui, Jordan Hand, Nathaniel McCallum, Seth Moore,
	Darren Kenny, Sean Christopherson, Suresh Siddha,
	andriy.shevchenko, asapek, bp, cedric.xing, chenalexchen,
	conradparker, cyhanish, dave.hansen, haitao.huang, kai.huang,
	kai.svahn, kmoy, ludloff, luto, nhorman, puiterwijk, rientjes,
	tglx, yaozhangx, mikko.ylinen

On Sat, Oct 03, 2020 at 04:39:25PM +0200, Greg KH wrote:
> On Sat, Oct 03, 2020 at 07:50:46AM +0300, Jarkko Sakkinen wrote:
> > Intel Software Guard eXtensions (SGX) is a set of CPU instructions that can
> > be used by applications to set aside private regions of code and data. The
> > code outside the SGX hosted software entity is prevented from accessing the
> > memory inside the enclave by the CPU. We call these entities enclaves.
> > 
> > Add a driver that provides an ioctl API to construct and run enclaves.
> > Enclaves are constructed from pages residing in reserved physical memory
> > areas. The contents of these pages can only be accessed when they are
> > mapped as part of an enclave, by a hardware thread running inside the
> > enclave.
> > 
> > The starting state of an enclave consists of a fixed measured set of
> > pages that are copied to the EPC during the construction process by
> > using the opcode ENCLS leaf functions and Software Enclave Control
> > Structure (SECS) that defines the enclave properties.
> > 
> > Enclaves are constructed by using ENCLS leaf functions ECREATE, EADD and
> > EINIT. ECREATE initializes SECS, EADD copies pages from system memory to
> > the EPC and EINIT checks a given signed measurement and moves the enclave
> > into a state ready for execution.
> > 
> > An initialized enclave can only be accessed through special Thread Control
> > Structure (TCS) pages by using ENCLU (ring-3 only) leaf EENTER.  This leaf
> > function converts a thread into enclave mode and continues the execution in
> > the offset defined by the TCS provided to EENTER. An enclave is exited
> > through syscall, exception, interrupts or by explicitly calling another
> > ENCLU leaf EEXIT.
> > 
> > The mmap() permissions are capped by the contained enclave page
> > permissions. The mapped areas must also be populated, i.e. each page
> > address must contain a page. This logic is implemented in
> > sgx_encl_may_map().
> > 
> > Cc: linux-security-module@vger.kernel.org
> > Cc: linux-mm@kvack.org
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > Acked-by: Jethro Beekman <jethro@fortanix.com>
> > Tested-by: Jethro Beekman <jethro@fortanix.com>
> > Tested-by: Haitao Huang <haitao.huang@linux.intel.com>
> > Tested-by: Chunyang Hui <sanqian.hcy@antfin.com>
> > Tested-by: Jordan Hand <jorhand@linux.microsoft.com>
> > Tested-by: Nathaniel McCallum <npmccallum@redhat.com>
> > Tested-by: Seth Moore <sethmo@google.com>
> > Tested-by: Darren Kenny <darren.kenny@oracle.com>
> > Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
> > Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > Co-developed-by: Suresh Siddha <suresh.b.siddha@intel.com>
> > Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> > Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> > ---
> >  arch/x86/kernel/cpu/sgx/Makefile |   2 +
> >  arch/x86/kernel/cpu/sgx/driver.c | 173 ++++++++++++++++
> >  arch/x86/kernel/cpu/sgx/driver.h |  29 +++
> >  arch/x86/kernel/cpu/sgx/encl.c   | 331 +++++++++++++++++++++++++++++++
> >  arch/x86/kernel/cpu/sgx/encl.h   |  85 ++++++++
> >  arch/x86/kernel/cpu/sgx/main.c   |  11 +
> >  6 files changed, 631 insertions(+)
> >  create mode 100644 arch/x86/kernel/cpu/sgx/driver.c
> >  create mode 100644 arch/x86/kernel/cpu/sgx/driver.h
> >  create mode 100644 arch/x86/kernel/cpu/sgx/encl.c
> >  create mode 100644 arch/x86/kernel/cpu/sgx/encl.h
> > 
> > diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
> > index 79510ce01b3b..3fc451120735 100644
> > --- a/arch/x86/kernel/cpu/sgx/Makefile
> > +++ b/arch/x86/kernel/cpu/sgx/Makefile
> > @@ -1,2 +1,4 @@
> >  obj-y += \
> > +	driver.o \
> > +	encl.o \
> >  	main.o
> > diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
> > new file mode 100644
> > index 000000000000..f54da5f19c2b
> > --- /dev/null
> > +++ b/arch/x86/kernel/cpu/sgx/driver.c
> > @@ -0,0 +1,173 @@
> > +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> 
> You use gpl-only header files in this file, so how in the world can it
> be bsd-3 licensed?
> 
> Please get your legal department to agree with this, after you explain
> to them how you are mixing gpl2-only code in with this file.

I'll do what I already stated that I will do. Should I do something
more?

> > +// Copyright(c) 2016-18 Intel Corporation.
> 
> Dates are hard to get right :(

Will fix.

> 
> > +
> > +#include <linux/acpi.h>
> > +#include <linux/miscdevice.h>
> > +#include <linux/mman.h>
> > +#include <linux/security.h>
> > +#include <linux/suspend.h>
> > +#include <asm/traps.h>
> > +#include "driver.h"
> > +#include "encl.h"
> > +
> > +u64 sgx_encl_size_max_32;
> > +u64 sgx_encl_size_max_64;
> > +u32 sgx_misc_reserved_mask;
> > +u64 sgx_attributes_reserved_mask;
> > +u64 sgx_xfrm_reserved_mask = ~0x3;
> > +u32 sgx_xsave_size_tbl[64];
> > +
> > +static int sgx_open(struct inode *inode, struct file *file)
> > +{
> > +	struct sgx_encl *encl;
> > +	int ret;
> > +
> > +	encl = kzalloc(sizeof(*encl), GFP_KERNEL);
> > +	if (!encl)
> > +		return -ENOMEM;
> > +
> > +	atomic_set(&encl->flags, 0);
> > +	kref_init(&encl->refcount);
> > +	xa_init(&encl->page_array);
> > +	mutex_init(&encl->lock);
> > +	INIT_LIST_HEAD(&encl->mm_list);
> > +	spin_lock_init(&encl->mm_lock);
> > +
> > +	ret = init_srcu_struct(&encl->srcu);
> > +	if (ret) {
> > +		kfree(encl);
> > +		return ret;
> > +	}
> > +
> > +	file->private_data = encl;
> > +
> > +	return 0;
> > +}
> > +
> > +static int sgx_release(struct inode *inode, struct file *file)
> > +{
> > +	struct sgx_encl *encl = file->private_data;
> > +	struct sgx_encl_mm *encl_mm;
> > +
> > +	for ( ; ; )  {
> > +		spin_lock(&encl->mm_lock);
> > +
> > +		if (list_empty(&encl->mm_list)) {
> > +			encl_mm = NULL;
> > +		} else {
> > +			encl_mm = list_first_entry(&encl->mm_list,
> > +						   struct sgx_encl_mm, list);
> > +			list_del_rcu(&encl_mm->list);
> > +		}
> > +
> > +		spin_unlock(&encl->mm_lock);
> > +
> > +		/* The list is empty, ready to go. */
> > +		if (!encl_mm)
> > +			break;
> > +
> > +		synchronize_srcu(&encl->srcu);
> > +		mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
> > +		kfree(encl_mm);
> > +	}
> > +
> > +	mutex_lock(&encl->lock);
> > +	atomic_or(SGX_ENCL_DEAD, &encl->flags);
> 
> So you set a flag that this is dead, and then instantly delete it?  Why
> does that matter?  I see you check for this flag elsewhere, but as you
> are just about to delete this structure, how can this be an issue?

It matters because ksgxswapd (sgx_reclaimer_*) might be processing it.

It will use the flag to skip the operations that it would do to a victim
page, when the enclave is still alive.

> 
> > +	mutex_unlock(&encl->lock);
> > +
> > +	kref_put(&encl->refcount, sgx_encl_release);
> 
> Don't you need to hold the lock across the put?  If not, what is
> serializing this?
> 
> But an even larger comment, why is this reference count needed at all?
> 
> You never grab it except at init time, and you free it at close time.
> Why not rely on the reference counting that the vfs ensures you?

Because ksgxswapd needs the alive enclave instance while it is in the
process of swapping a victim page. The reason for this is the
hierarchical nature of the enclave pages.

As an example, a write operation to main memory, EWB (SDM vol 3D 40-79)
needs to access SGX Enclave Control Structure (SECS) page, which is
contains global data for an enclave, like the unswapped child count.


> > +	return 0;
> > +}
> > +
> > +static int sgx_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > +	struct sgx_encl *encl = file->private_data;
> > +	int ret;
> > +
> > +	ret = sgx_encl_may_map(encl, vma->vm_start, vma->vm_end, vma->vm_flags);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = sgx_encl_mm_add(encl, vma->vm_mm);
> > +	if (ret)
> > +		return ret;
> > +
> > +	vma->vm_ops = &sgx_vm_ops;
> > +	vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO;
> > +	vma->vm_private_data = encl;
> > +
> > +	return 0;
> > +}
> > +
> > +static unsigned long sgx_get_unmapped_area(struct file *file,
> > +					   unsigned long addr,
> > +					   unsigned long len,
> > +					   unsigned long pgoff,
> > +					   unsigned long flags)
> > +{
> > +	if ((flags & MAP_TYPE) == MAP_PRIVATE)
> > +		return -EINVAL;
> > +
> > +	if (flags & MAP_FIXED)
> > +		return addr;
> > +
> > +	return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
> > +}
> > +
> > +static const struct file_operations sgx_encl_fops = {
> > +	.owner			= THIS_MODULE,
> > +	.open			= sgx_open,
> > +	.release		= sgx_release,
> > +	.mmap			= sgx_mmap,
> > +	.get_unmapped_area	= sgx_get_unmapped_area,
> > +};
> > +
> > +static struct miscdevice sgx_dev_enclave = {
> > +	.minor = MISC_DYNAMIC_MINOR,
> > +	.name = "enclave",
> > +	.nodename = "sgx/enclave",
> 
> A subdir for a single device node?  Ok, odd, but why not just
> "sgx_enclave"?  How "special" is this device node?

There is a patch that adds "sgx/provision".

Either works for me. Should I flatten them to "sgx_enclave" and
"sgx_provision", or keep them as they are?

> thanks,
> 
> greg k-h

/Jarkko


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-04 14:32     ` Jarkko Sakkinen
@ 2020-10-04 15:01       ` Jarkko Sakkinen
  2020-10-05  9:42       ` Greg KH
  1 sibling, 0 replies; 35+ messages in thread
From: Jarkko Sakkinen @ 2020-10-04 15:01 UTC (permalink / raw)
  To: Greg KH
  Cc: x86, linux-sgx, linux-kernel, linux-security-module, linux-mm,
	Andrew Morton, Matthew Wilcox, Jethro Beekman, Haitao Huang,
	Chunyang Hui, Jordan Hand, Nathaniel McCallum, Seth Moore,
	Darren Kenny, Sean Christopherson, Suresh Siddha,
	andriy.shevchenko, asapek, bp, cedric.xing, chenalexchen,
	conradparker, cyhanish, dave.hansen, haitao.huang, kai.huang,
	kai.svahn, kmoy, ludloff, luto, nhorman, puiterwijk, rientjes,
	tglx, yaozhangx, mikko.ylinen

On Sun, Oct 04, 2020 at 05:32:57PM +0300, Jarkko Sakkinen wrote:
> On Sat, Oct 03, 2020 at 04:39:25PM +0200, Greg KH wrote:
> > You use gpl-only header files in this file, so how in the world can it
> > be bsd-3 licensed?
> > 
> > Please get your legal department to agree with this, after you explain
> > to them how you are mixing gpl2-only code in with this file.
> 
> I'll do what I already stated that I will do. Should I do something
> more?

And forward this message to the aformentioned entity.

/Jarkko


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-03 19:54   ` Matthew Wilcox
@ 2020-10-04 21:50     ` Jarkko Sakkinen
  2020-10-04 22:02       ` Jarkko Sakkinen
  2020-10-04 22:27       ` Matthew Wilcox
  0 siblings, 2 replies; 35+ messages in thread
From: Jarkko Sakkinen @ 2020-10-04 21:50 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: x86, linux-sgx, linux-kernel, linux-security-module, linux-mm,
	Andrew Morton, Jethro Beekman, Haitao Huang, Chunyang Hui,
	Jordan Hand, Nathaniel McCallum, Seth Moore, Darren Kenny,
	Sean Christopherson, Suresh Siddha, andriy.shevchenko, asapek,
	bp, cedric.xing, chenalexchen, conradparker, cyhanish,
	dave.hansen, haitao.huang, kai.huang, kai.svahn, kmoy, ludloff,
	luto, nhorman, puiterwijk, rientjes, tglx, yaozhangx,
	mikko.ylinen

On Sat, Oct 03, 2020 at 08:54:40PM +0100, Matthew Wilcox wrote:
> On Sat, Oct 03, 2020 at 07:50:46AM +0300, Jarkko Sakkinen wrote:
> > +	XA_STATE(xas, &encl->page_array, idx_start);
> > +
> > +	/*
> > +	 * Disallow READ_IMPLIES_EXEC tasks as their VMA permissions might
> > +	 * conflict with the enclave page permissions.
> > +	 */
> > +	if (current->personality & READ_IMPLIES_EXEC)
> > +		return -EACCES;
> > +
> > +	xas_for_each(&xas, page, idx_end)
> > +		if (!page || (~page->vm_max_prot_bits & vm_prot_bits))
> > +			return -EACCES;
> 
> You're iterating the array without holding any lock that the XArray knows
> about.  If you're OK with another thread adding/removing pages behind your
> back, or there's a higher level lock (the mmap_sem?) protecting the XArray
> from being modified while you walk it, then hold the rcu_read_lock()
> while walking the array.  Otherwise you can prevent modification by
> calling xas_lock(&xas) and xas_unlock()..

I backtracked this. The locks have been there from v21-v35. This is a
refactoring mistake in radix_tree to xarray migration happened in v36.
It's by no means intentional.

What is shoukd take is encl->lock.

The loop was pre-v36 like:

	idx_start = PFN_DOWN(start);
	idx_end = PFN_DOWN(end - 1);

	for (idx = idx_start; idx <= idx_end; ++idx) {
		mutex_lock(&encl->lock);
		page = radix_tree_lookup(&encl->page_tree, idx);
		mutex_unlock(&encl->lock);

		if (!page || (~page->vm_max_prot_bits & vm_prot_bits))
			return -EACCES;
	}

Looking at xarray.h and filemap.c, I'm thinking something along the
lines of:

	for (idx = idx_start; idx <= idx_end; ++idx) {
		mutex_lock(&encl->lock);
		page = xas_find(&xas, idx + 1);
		mutex_unlock(&encl->lock);

		if (!page || (~page->vm_max_prot_bits & vm_prot_bits))
			return -EACCES;
	}

Does this look about right?

/Jarkko


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-04 21:50     ` Jarkko Sakkinen
@ 2020-10-04 22:02       ` Jarkko Sakkinen
  2020-10-04 22:27       ` Matthew Wilcox
  1 sibling, 0 replies; 35+ messages in thread
From: Jarkko Sakkinen @ 2020-10-04 22:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: x86, linux-sgx, linux-kernel, linux-security-module, linux-mm,
	Andrew Morton, Jethro Beekman, Haitao Huang, Chunyang Hui,
	Jordan Hand, Nathaniel McCallum, Seth Moore, Darren Kenny,
	Sean Christopherson, Suresh Siddha, andriy.shevchenko, asapek,
	bp, cedric.xing, chenalexchen, conradparker, cyhanish,
	dave.hansen, haitao.huang, kai.huang, kai.svahn, kmoy, ludloff,
	luto, nhorman, puiterwijk, rientjes, tglx, yaozhangx,
	mikko.ylinen

On Mon, Oct 05, 2020 at 12:51:00AM +0300, Jarkko Sakkinen wrote:
> On Sat, Oct 03, 2020 at 08:54:40PM +0100, Matthew Wilcox wrote:
> > On Sat, Oct 03, 2020 at 07:50:46AM +0300, Jarkko Sakkinen wrote:
> > > +	XA_STATE(xas, &encl->page_array, idx_start);
> > > +
> > > +	/*
> > > +	 * Disallow READ_IMPLIES_EXEC tasks as their VMA permissions might
> > > +	 * conflict with the enclave page permissions.
> > > +	 */
> > > +	if (current->personality & READ_IMPLIES_EXEC)
> > > +		return -EACCES;
> > > +
> > > +	xas_for_each(&xas, page, idx_end)
> > > +		if (!page || (~page->vm_max_prot_bits & vm_prot_bits))
> > > +			return -EACCES;
> > 
> > You're iterating the array without holding any lock that the XArray knows
> > about.  If you're OK with another thread adding/removing pages behind your
> > back, or there's a higher level lock (the mmap_sem?) protecting the XArray
> > from being modified while you walk it, then hold the rcu_read_lock()
> > while walking the array.  Otherwise you can prevent modification by
> > calling xas_lock(&xas) and xas_unlock()..
> 
> I backtracked this. The locks have been there from v21-v35. This is a
> refactoring mistake in radix_tree to xarray migration happened in v36.
> It's by no means intentional.
> 
> What is shoukd take is encl->lock.
> 
> The loop was pre-v36 like:
> 
> 	idx_start = PFN_DOWN(start);
> 	idx_end = PFN_DOWN(end - 1);
> 
> 	for (idx = idx_start; idx <= idx_end; ++idx) {
> 		mutex_lock(&encl->lock);
> 		page = radix_tree_lookup(&encl->page_tree, idx);
> 		mutex_unlock(&encl->lock);
> 
> 		if (!page || (~page->vm_max_prot_bits & vm_prot_bits))
> 			return -EACCES;
> 	}
> 
> Looking at xarray.h and filemap.c, I'm thinking something along the
> lines of:
> 
> 	for (idx = idx_start; idx <= idx_end; ++idx) {
> 		mutex_lock(&encl->lock);
> 		page = xas_find(&xas, idx + 1);
                                      ~~~~~~~
				      idx

> 		mutex_unlock(&encl->lock);
> 
> 		if (!page || (~page->vm_max_prot_bits & vm_prot_bits))
> 			return -EACCES;
> 	}
> 
> Does this look about right?

/Jarkko


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-04 21:50     ` Jarkko Sakkinen
  2020-10-04 22:02       ` Jarkko Sakkinen
@ 2020-10-04 22:27       ` Matthew Wilcox
  2020-10-04 23:41         ` Jarkko Sakkinen
  1 sibling, 1 reply; 35+ messages in thread
From: Matthew Wilcox @ 2020-10-04 22:27 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: x86, linux-sgx, linux-kernel, linux-security-module, linux-mm,
	Andrew Morton, Jethro Beekman, Haitao Huang, Chunyang Hui,
	Jordan Hand, Nathaniel McCallum, Seth Moore, Darren Kenny,
	Sean Christopherson, Suresh Siddha, andriy.shevchenko, asapek,
	bp, cedric.xing, chenalexchen, conradparker, cyhanish,
	dave.hansen, haitao.huang, kai.huang, kai.svahn, kmoy, ludloff,
	luto, nhorman, puiterwijk, rientjes, tglx, yaozhangx,
	mikko.ylinen

On Mon, Oct 05, 2020 at 12:50:49AM +0300, Jarkko Sakkinen wrote:
> What is shoukd take is encl->lock.
> 
> The loop was pre-v36 like:
> 
> 	idx_start = PFN_DOWN(start);
> 	idx_end = PFN_DOWN(end - 1);
> 
> 	for (idx = idx_start; idx <= idx_end; ++idx) {
> 		mutex_lock(&encl->lock);
> 		page = radix_tree_lookup(&encl->page_tree, idx);
> 		mutex_unlock(&encl->lock);
> 
> 		if (!page || (~page->vm_max_prot_bits & vm_prot_bits))
> 			return -EACCES;
> 	}
> 
> Looking at xarray.h and filemap.c, I'm thinking something along the
> lines of:
> 
> 	for (idx = idx_start; idx <= idx_end; ++idx) {
> 		mutex_lock(&encl->lock);
> 		page = xas_find(&xas, idx + 1);
> 		mutex_unlock(&encl->lock);
> 
> 		if (!page || (~page->vm_max_prot_bits & vm_prot_bits))
> 			return -EACCES;
> 	}
> 
> Does this look about right?

Not really ...

	int ret = 0;

	mutex_lock(&encl->lock);
	rcu_read_lock();
	while (xas.index < idx_end) {
		page = xas_next(&xas);
		if (!page || (~page->vm_max_prot_bits & vm_prot_bits))
			ret = -EACCESS;
			break;
		}
	}
	rcu_read_unlock();
	mutex_unlock(&encl->lock);

	return ret;

... or you could rework to use the xa_lock instead of encl->lock.
I don't know how feasible that is for you.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 16/24] x86/sgx: Add a page reclaimer
  2020-10-03 18:23       ` Haitao Huang
@ 2020-10-04 22:39         ` Jarkko Sakkinen
  2020-10-07 17:25           ` Jarkko Sakkinen
  0 siblings, 1 reply; 35+ messages in thread
From: Jarkko Sakkinen @ 2020-10-04 22:39 UTC (permalink / raw)
  To: Haitao Huang
  Cc: x86, linux-sgx, linux-kernel, linux-mm, Jethro Beekman,
	Jordan Hand, Nathaniel McCallum, Chunyang Hui, Seth Moore,
	Sean Christopherson, akpm, andriy.shevchenko, asapek, bp,
	cedric.xing, chenalexchen, conradparker, cyhanish, dave.hansen,
	kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman, puiterwijk,
	rientjes, tglx, yaozhangx, mikko.ylinen, willy

On Sat, Oct 03, 2020 at 01:23:49PM -0500, Haitao Huang wrote:
> On Sat, 03 Oct 2020 08:32:45 -0500, Jarkko Sakkinen
> <jarkko.sakkinen@linux.intel.com> wrote:
> 
> > On Sat, Oct 03, 2020 at 12:22:47AM -0500, Haitao Huang wrote:
> > > When I turn on CONFIG_PROVE_LOCKING, kernel reports following
> > > suspicious RCU
> > > usages. Not sure if it is an issue. Just reporting here:
> > 
> > I'm glad to hear that my tip helped you to get us the data.
> > 
> > This does not look like an issue in the page reclaimer, which was not
> > obvious for me before. That's a good thing. I was really worried about
> > that because it has been very stable for a long period now. The last
> > bug fix for the reclaimer was done in June in v31 version of the patch
> > set and after that it has been unchanged (except possibly some renames
> > requested by Boris).
> > 
> > I wildly guess I have a bad usage pattern for xarray. I migrated to it
> > in v36, and it is entirely possible that I've misused it. It was the
> > first time that I ever used it. Before xarray we had radix_tree but
> > based Matthew Wilcox feedback I did a migration to xarray.
> > 
> > What I'd ask you to do next is to, if by any means possible, to try to
> > run the same test with v35 so we can verify this. That one still has
> > the radix tree.
> > 
> 
> 
> v35 does not cause any such warning messages from kernel

Thank you. Looks like Matthew already located the issue, a fix will
land soon.

/Jarkko


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-04 22:27       ` Matthew Wilcox
@ 2020-10-04 23:41         ` Jarkko Sakkinen
  2020-10-05  1:30           ` Matthew Wilcox
  0 siblings, 1 reply; 35+ messages in thread
From: Jarkko Sakkinen @ 2020-10-04 23:41 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: x86, linux-sgx, linux-kernel, linux-security-module, linux-mm,
	Andrew Morton, Jethro Beekman, Haitao Huang, Chunyang Hui,
	Jordan Hand, Nathaniel McCallum, Seth Moore, Darren Kenny,
	Sean Christopherson, Suresh Siddha, andriy.shevchenko, asapek,
	bp, cedric.xing, chenalexchen, conradparker, cyhanish,
	dave.hansen, haitao.huang, kai.huang, kai.svahn, kmoy, ludloff,
	luto, nhorman, puiterwijk, rientjes, tglx, yaozhangx,
	mikko.ylinen

On Sun, Oct 04, 2020 at 11:27:50PM +0100, Matthew Wilcox wrote:
> On Mon, Oct 05, 2020 at 12:50:49AM +0300, Jarkko Sakkinen wrote:
> > What is shoukd take is encl->lock.
> > 
> > The loop was pre-v36 like:
> > 
> > 	idx_start = PFN_DOWN(start);
> > 	idx_end = PFN_DOWN(end - 1);
> > 
> > 	for (idx = idx_start; idx <= idx_end; ++idx) {
> > 		mutex_lock(&encl->lock);
> > 		page = radix_tree_lookup(&encl->page_tree, idx);
> > 		mutex_unlock(&encl->lock);
> > 
> > 		if (!page || (~page->vm_max_prot_bits & vm_prot_bits))
> > 			return -EACCES;
> > 	}
> > 
> > Looking at xarray.h and filemap.c, I'm thinking something along the
> > lines of:
> > 
> > 	for (idx = idx_start; idx <= idx_end; ++idx) {
> > 		mutex_lock(&encl->lock);
> > 		page = xas_find(&xas, idx + 1);
> > 		mutex_unlock(&encl->lock);
> > 
> > 		if (!page || (~page->vm_max_prot_bits & vm_prot_bits))
> > 			return -EACCES;
> > 	}
> > 
> > Does this look about right?
> 
> Not really ...
> 
> 	int ret = 0;
> 
> 	mutex_lock(&encl->lock);
> 	rcu_read_lock();

Right, so xa_*() take RCU lock implicitly and xas_* do not.

> 	while (xas.index < idx_end) {
> 		page = xas_next(&xas);

It should iterate through every possible page index within the range,
even the ones that do not have an entry, i.e. this loop also checks
that there are no empty slots.

Does xas_next() go through every possible index, or skip the non-empty
ones?

> 		if (!page || (~page->vm_max_prot_bits & vm_prot_bits))
> 			ret = -EACCESS;
> 			break;
> 		}
> 	}
> 	rcu_read_unlock();
> 	mutex_unlock(&encl->lock);

In my Geminilake NUC the maximum size of the address space is 64GB for
an enclave, and it is not fixed but can grow in microarchitectures
beyond that.

That means that in (*artificial*) worst case the locks would be kept for
64*1024*1024*1024/4096 = 16777216 iterations.

I just realized that in sgx_encl_load_page ([1], the encl->lock is
acquired by the caller) I have used xa_load(), which more or less would
be compatible with the old radix_tree pattern, i.e.

for (idx = idx_start; idx <= idx_end; ++idx) {
	mutex_lock(&encl->lock);
	page = xas_load(&encl->page_array, idx);
	mutex_unlock(&encl->lock);

	if (!page || (~page->vm_max_prot_bits & vm_prot_bits))
		return -EACCES;
}

To make things stable again, I'll go with this for the immediate future.

> 	return ret;
> 
> ... or you could rework to use the xa_lock instead of encl->lock.
> I don't know how feasible that is for you.

encl->lock is used to protect enclave state but it is true that
page->vm_max_prort_bits is not modified through concurrent access, once
the page is added (e.g. by the reclaimer, which gets pages through
sgx_activate_page_list, not through xarray).

It's an interesting idea, but before even considering it I want to fix
the bug, even if the fix ought to be somehow unoptimal in terms of
performance.

Thanks for helping with this. xarray is still somewhat alien to me and
most of the code I see just use the iterator macros excep mm/*, but
I'm slowly adapting the concepts.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-sgx.git/tree/arch/x86/kernel/cpu/sgx/encl.c
[2] https://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-sgx.git/tree/arch/x86/kernel/cpu/sgx/main.c

/Jarkko


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-04 23:41         ` Jarkko Sakkinen
@ 2020-10-05  1:30           ` Matthew Wilcox
  2020-10-05  3:06             ` Jarkko Sakkinen
  0 siblings, 1 reply; 35+ messages in thread
From: Matthew Wilcox @ 2020-10-05  1:30 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: x86, linux-sgx, linux-kernel, linux-security-module, linux-mm,
	Andrew Morton, Jethro Beekman, Haitao Huang, Chunyang Hui,
	Jordan Hand, Nathaniel McCallum, Seth Moore, Darren Kenny,
	Sean Christopherson, Suresh Siddha, andriy.shevchenko, asapek,
	bp, cedric.xing, chenalexchen, conradparker, cyhanish,
	dave.hansen, haitao.huang, kai.huang, kai.svahn, kmoy, ludloff,
	luto, nhorman, puiterwijk, rientjes, tglx, yaozhangx,
	mikko.ylinen

On Mon, Oct 05, 2020 at 02:41:53AM +0300, Jarkko Sakkinen wrote:
> On Sun, Oct 04, 2020 at 11:27:50PM +0100, Matthew Wilcox wrote:
> > 	int ret = 0;
> > 
> > 	mutex_lock(&encl->lock);
> > 	rcu_read_lock();
> 
> Right, so xa_*() take RCU lock implicitly and xas_* do not.

Not necessarily the RCU lock ... I did document all this in xarray.rst:

https://www.kernel.org/doc/html/latest/core-api/xarray.html

> > 	while (xas.index < idx_end) {
> > 		page = xas_next(&xas);
> 
> It should iterate through every possible page index within the range,
> even the ones that do not have an entry, i.e. this loop also checks
> that there are no empty slots.
> 
> Does xas_next() go through every possible index, or skip the non-empty
> ones?

xas_next(), as its documentation says, will move to the next array
index:

https://www.kernel.org/doc/html/latest/core-api/xarray.html#c.xas_next

> > 		if (!page || (~page->vm_max_prot_bits & vm_prot_bits))
> > 			ret = -EACCESS;
> > 			break;
> > 		}
> > 	}
> > 	rcu_read_unlock();
> > 	mutex_unlock(&encl->lock);
> 
> In my Geminilake NUC the maximum size of the address space is 64GB for
> an enclave, and it is not fixed but can grow in microarchitectures
> beyond that.
> 
> That means that in (*artificial*) worst case the locks would be kept for
> 64*1024*1024*1024/4096 = 16777216 iterations.

Oh, there's support for that on the XArray API too.

        xas_lock_irq(&xas);
        xas_for_each_marked(&xas, page, end, PAGECACHE_TAG_DIRTY) {
                xas_set_mark(&xas, PAGECACHE_TAG_TOWRITE);
                if (++tagged % XA_CHECK_SCHED)
                        continue;

                xas_pause(&xas);
                xas_unlock_irq(&xas);
                cond_resched();
                xas_lock_irq(&xas);
        }
        xas_unlock_irq(&xas);



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-05  1:30           ` Matthew Wilcox
@ 2020-10-05  3:06             ` Jarkko Sakkinen
  0 siblings, 0 replies; 35+ messages in thread
From: Jarkko Sakkinen @ 2020-10-05  3:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: x86, linux-sgx, linux-kernel, linux-security-module, linux-mm,
	Andrew Morton, Jethro Beekman, Haitao Huang, Chunyang Hui,
	Jordan Hand, Nathaniel McCallum, Seth Moore, Darren Kenny,
	Sean Christopherson, Suresh Siddha, andriy.shevchenko, asapek,
	bp, cedric.xing, chenalexchen, conradparker, cyhanish,
	dave.hansen, haitao.huang, kai.huang, kai.svahn, kmoy, ludloff,
	luto, nhorman, puiterwijk, rientjes, tglx, yaozhangx,
	mikko.ylinen

On Mon, Oct 05, 2020 at 02:30:53AM +0100, Matthew Wilcox wrote:
> > In my Geminilake NUC the maximum size of the address space is 64GB for
> > an enclave, and it is not fixed but can grow in microarchitectures
> > beyond that.
> > 
> > That means that in (*artificial*) worst case the locks would be kept for
> > 64*1024*1024*1024/4096 = 16777216 iterations.
> 
> Oh, there's support for that on the XArray API too.
> 
>         xas_lock_irq(&xas);
>         xas_for_each_marked(&xas, page, end, PAGECACHE_TAG_DIRTY) {
>                 xas_set_mark(&xas, PAGECACHE_TAG_TOWRITE);
>                 if (++tagged % XA_CHECK_SCHED)
>                         continue;
> 
>                 xas_pause(&xas);
>                 xas_unlock_irq(&xas);
>                 cond_resched();
>                 xas_lock_irq(&xas);
>         }
>         xas_unlock_irq(&xas);

Assuming we can iterate the array without encl->lock, I think this
would translate to:

/*
 * Not taking encl->lock because:
 * 1. page attributes are not written.
 * 2. the only page attribute read is set before it is put to the array
 *    and stays constant throughout the enclave life-cycle.
 */
xas_lock(&xas);
xas_for_each_marked(&xas, page, idx_end) {
	if (++tagged % XA_CHECK_SCHED)
		continue;

	xas_pause(&xas);
	xas_unlock(&xas);

	/*
	 * Attributes are not protected by the xa_lock, so I'm assuming
	 * that this is the legit place for the check.
	 */
	if (!page || (~page->vm_max_prot_bits & vm_prot_bits))
		return -EACCES;

	cond_resched();
 	xas_lock(&xas);
}
xas_unlock(&xas);

Obviously, we cannot use this pattern by taking the encl->lock inside
the loop (ABBA and encl->lock is a mutex).

Let's enumerate:

A. sgx_encl_add_page(): uses xa_insert() and xa_erase().
B. sgx_encl_load_page(): uses xa_load().
C. sgx_encl_may_map(): is broken (for the moment).

A and B implicitly the lock and if a page exist at all we only access
a pure constant.

Also, since the open file keeps the instance alive, nobody is going
to pull carpet under our feet.

OK, I've just concluded tha we don't need to take encl->lock in this
case. Great.

/Jarkko


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-03 14:39   ` Greg KH
  2020-10-04 14:32     ` Jarkko Sakkinen
@ 2020-10-05  8:45     ` Christoph Hellwig
  2020-10-05 11:42       ` Jarkko Sakkinen
  2020-10-09  7:10     ` Pavel Machek
  2 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2020-10-05  8:45 UTC (permalink / raw)
  To: Greg KH
  Cc: Jarkko Sakkinen, x86, linux-sgx, linux-kernel,
	linux-security-module, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Haitao Huang, Chunyang Hui, Jordan Hand,
	Nathaniel McCallum, Seth Moore, Darren Kenny,
	Sean Christopherson, Suresh Siddha, andriy.shevchenko, asapek,
	bp, cedric.xing, chenalexchen, conradparker, cyhanish,
	dave.hansen, haitao.huang, kai.huang, kai.svahn, kmoy, ludloff,
	luto, nhorman, puiterwijk, rientjes, tglx, yaozhangx,
	mikko.ylinen

On Sat, Oct 03, 2020 at 04:39:25PM +0200, Greg KH wrote:
> > @@ -0,0 +1,173 @@
> > +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> 
> You use gpl-only header files in this file, so how in the world can it
> be bsd-3 licensed?
> 
> Please get your legal department to agree with this, after you explain
> to them how you are mixing gpl2-only code in with this file.
> 
> > +// Copyright(c) 2016-18 Intel Corporation.
> 
> Dates are hard to get right :(

As is comment formatting apparently.  Don't use // comments for anything
but the SPDX header, please.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-04 14:32     ` Jarkko Sakkinen
  2020-10-04 15:01       ` Jarkko Sakkinen
@ 2020-10-05  9:42       ` Greg KH
  2020-10-05 12:42         ` Jarkko Sakkinen
  1 sibling, 1 reply; 35+ messages in thread
From: Greg KH @ 2020-10-05  9:42 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: x86, linux-sgx, linux-kernel, linux-security-module, linux-mm,
	Andrew Morton, Matthew Wilcox, Jethro Beekman, Haitao Huang,
	Chunyang Hui, Jordan Hand, Nathaniel McCallum, Seth Moore,
	Darren Kenny, Sean Christopherson, Suresh Siddha,
	andriy.shevchenko, asapek, bp, cedric.xing, chenalexchen,
	conradparker, cyhanish, dave.hansen, haitao.huang, kai.huang,
	kai.svahn, kmoy, ludloff, luto, nhorman, puiterwijk, rientjes,
	tglx, yaozhangx, mikko.ylinen

On Sun, Oct 04, 2020 at 05:32:46PM +0300, Jarkko Sakkinen wrote:
> On Sat, Oct 03, 2020 at 04:39:25PM +0200, Greg KH wrote:
> > On Sat, Oct 03, 2020 at 07:50:46AM +0300, Jarkko Sakkinen wrote:
> > > Intel Software Guard eXtensions (SGX) is a set of CPU instructions that can
> > > be used by applications to set aside private regions of code and data. The
> > > code outside the SGX hosted software entity is prevented from accessing the
> > > memory inside the enclave by the CPU. We call these entities enclaves.
> > > 
> > > Add a driver that provides an ioctl API to construct and run enclaves.
> > > Enclaves are constructed from pages residing in reserved physical memory
> > > areas. The contents of these pages can only be accessed when they are
> > > mapped as part of an enclave, by a hardware thread running inside the
> > > enclave.
> > > 
> > > The starting state of an enclave consists of a fixed measured set of
> > > pages that are copied to the EPC during the construction process by
> > > using the opcode ENCLS leaf functions and Software Enclave Control
> > > Structure (SECS) that defines the enclave properties.
> > > 
> > > Enclaves are constructed by using ENCLS leaf functions ECREATE, EADD and
> > > EINIT. ECREATE initializes SECS, EADD copies pages from system memory to
> > > the EPC and EINIT checks a given signed measurement and moves the enclave
> > > into a state ready for execution.
> > > 
> > > An initialized enclave can only be accessed through special Thread Control
> > > Structure (TCS) pages by using ENCLU (ring-3 only) leaf EENTER.  This leaf
> > > function converts a thread into enclave mode and continues the execution in
> > > the offset defined by the TCS provided to EENTER. An enclave is exited
> > > through syscall, exception, interrupts or by explicitly calling another
> > > ENCLU leaf EEXIT.
> > > 
> > > The mmap() permissions are capped by the contained enclave page
> > > permissions. The mapped areas must also be populated, i.e. each page
> > > address must contain a page. This logic is implemented in
> > > sgx_encl_may_map().
> > > 
> > > Cc: linux-security-module@vger.kernel.org
> > > Cc: linux-mm@kvack.org
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: Matthew Wilcox <willy@infradead.org>
> > > Acked-by: Jethro Beekman <jethro@fortanix.com>
> > > Tested-by: Jethro Beekman <jethro@fortanix.com>
> > > Tested-by: Haitao Huang <haitao.huang@linux.intel.com>
> > > Tested-by: Chunyang Hui <sanqian.hcy@antfin.com>
> > > Tested-by: Jordan Hand <jorhand@linux.microsoft.com>
> > > Tested-by: Nathaniel McCallum <npmccallum@redhat.com>
> > > Tested-by: Seth Moore <sethmo@google.com>
> > > Tested-by: Darren Kenny <darren.kenny@oracle.com>
> > > Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
> > > Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > > Co-developed-by: Suresh Siddha <suresh.b.siddha@intel.com>
> > > Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> > > Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> > > ---
> > >  arch/x86/kernel/cpu/sgx/Makefile |   2 +
> > >  arch/x86/kernel/cpu/sgx/driver.c | 173 ++++++++++++++++
> > >  arch/x86/kernel/cpu/sgx/driver.h |  29 +++
> > >  arch/x86/kernel/cpu/sgx/encl.c   | 331 +++++++++++++++++++++++++++++++
> > >  arch/x86/kernel/cpu/sgx/encl.h   |  85 ++++++++
> > >  arch/x86/kernel/cpu/sgx/main.c   |  11 +
> > >  6 files changed, 631 insertions(+)
> > >  create mode 100644 arch/x86/kernel/cpu/sgx/driver.c
> > >  create mode 100644 arch/x86/kernel/cpu/sgx/driver.h
> > >  create mode 100644 arch/x86/kernel/cpu/sgx/encl.c
> > >  create mode 100644 arch/x86/kernel/cpu/sgx/encl.h
> > > 
> > > diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
> > > index 79510ce01b3b..3fc451120735 100644
> > > --- a/arch/x86/kernel/cpu/sgx/Makefile
> > > +++ b/arch/x86/kernel/cpu/sgx/Makefile
> > > @@ -1,2 +1,4 @@
> > >  obj-y += \
> > > +	driver.o \
> > > +	encl.o \
> > >  	main.o
> > > diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
> > > new file mode 100644
> > > index 000000000000..f54da5f19c2b
> > > --- /dev/null
> > > +++ b/arch/x86/kernel/cpu/sgx/driver.c
> > > @@ -0,0 +1,173 @@
> > > +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> > 
> > You use gpl-only header files in this file, so how in the world can it
> > be bsd-3 licensed?
> > 
> > Please get your legal department to agree with this, after you explain
> > to them how you are mixing gpl2-only code in with this file.
> 
> I'll do what I already stated that I will do. Should I do something
> more?

This was written before your previous response.

> > > +	mutex_lock(&encl->lock);
> > > +	atomic_or(SGX_ENCL_DEAD, &encl->flags);
> > 
> > So you set a flag that this is dead, and then instantly delete it?  Why
> > does that matter?  I see you check for this flag elsewhere, but as you
> > are just about to delete this structure, how can this be an issue?
> 
> It matters because ksgxswapd (sgx_reclaimer_*) might be processing it.

I don't see that happening in this patch, did I miss it?

> It will use the flag to skip the operations that it would do to a victim
> page, when the enclave is still alive.

Again, why are you adding flags when the patch does not use them?
Please put new functionality in the specific patch that uses it.

And can you really rely on this?  How did sgx_reclaimer_* (whatever that
is), get the reference on this object in the first place?  Again, I
don't see that happening at all in here, and at a quick glance in the
other patches I don't see it there either.  What am I missing?

> > > +	mutex_unlock(&encl->lock);
> > > +
> > > +	kref_put(&encl->refcount, sgx_encl_release);
> > 
> > Don't you need to hold the lock across the put?  If not, what is
> > serializing this?
> > 
> > But an even larger comment, why is this reference count needed at all?
> > 
> > You never grab it except at init time, and you free it at close time.
> > Why not rely on the reference counting that the vfs ensures you?
> 
> Because ksgxswapd needs the alive enclave instance while it is in the
> process of swapping a victim page. The reason for this is the
> hierarchical nature of the enclave pages.
> 
> As an example, a write operation to main memory, EWB (SDM vol 3D 40-79)

What is that referencing?

> needs to access SGX Enclave Control Structure (SECS) page, which is
> contains global data for an enclave, like the unswapped child count.

Ok, but how did it get access to this structure in the first place, like
I ask above?

> > > +	return 0;
> > > +}
> > > +
> > > +static int sgx_mmap(struct file *file, struct vm_area_struct *vma)
> > > +{
> > > +	struct sgx_encl *encl = file->private_data;
> > > +	int ret;
> > > +
> > > +	ret = sgx_encl_may_map(encl, vma->vm_start, vma->vm_end, vma->vm_flags);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	ret = sgx_encl_mm_add(encl, vma->vm_mm);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	vma->vm_ops = &sgx_vm_ops;
> > > +	vma->vm_flags |= VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO;
> > > +	vma->vm_private_data = encl;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static unsigned long sgx_get_unmapped_area(struct file *file,
> > > +					   unsigned long addr,
> > > +					   unsigned long len,
> > > +					   unsigned long pgoff,
> > > +					   unsigned long flags)
> > > +{
> > > +	if ((flags & MAP_TYPE) == MAP_PRIVATE)
> > > +		return -EINVAL;
> > > +
> > > +	if (flags & MAP_FIXED)
> > > +		return addr;
> > > +
> > > +	return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
> > > +}
> > > +
> > > +static const struct file_operations sgx_encl_fops = {
> > > +	.owner			= THIS_MODULE,
> > > +	.open			= sgx_open,
> > > +	.release		= sgx_release,
> > > +	.mmap			= sgx_mmap,
> > > +	.get_unmapped_area	= sgx_get_unmapped_area,
> > > +};
> > > +
> > > +static struct miscdevice sgx_dev_enclave = {
> > > +	.minor = MISC_DYNAMIC_MINOR,
> > > +	.name = "enclave",
> > > +	.nodename = "sgx/enclave",
> > 
> > A subdir for a single device node?  Ok, odd, but why not just
> > "sgx_enclave"?  How "special" is this device node?
> 
> There is a patch that adds "sgx/provision".

What number in this series?

> Either works for me. Should I flatten them to "sgx_enclave" and
> "sgx_provision", or keep them as they are?

Having 2 char nodes in a subdir is better than one, I will give you
that.  But none is even better, don't you think?

thanks,

greg k-h


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-05  8:45     ` Christoph Hellwig
@ 2020-10-05 11:42       ` Jarkko Sakkinen
  2020-10-05 11:50         ` Greg KH
  0 siblings, 1 reply; 35+ messages in thread
From: Jarkko Sakkinen @ 2020-10-05 11:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Greg KH, x86, linux-sgx, linux-kernel, linux-security-module,
	linux-mm, Andrew Morton, Matthew Wilcox, Jethro Beekman,
	Haitao Huang, Chunyang Hui, Jordan Hand, Nathaniel McCallum,
	Seth Moore, Darren Kenny, Sean Christopherson, Suresh Siddha,
	andriy.shevchenko, asapek, bp, cedric.xing, chenalexchen,
	conradparker, cyhanish, dave.hansen, haitao.huang, kai.huang,
	kai.svahn, kmoy, ludloff, luto, nhorman, puiterwijk, rientjes,
	tglx, yaozhangx, mikko.ylinen

On Mon, Oct 05, 2020 at 09:45:54AM +0100, Christoph Hellwig wrote:
> On Sat, Oct 03, 2020 at 04:39:25PM +0200, Greg KH wrote:
> > > @@ -0,0 +1,173 @@
> > > +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> > 
> > You use gpl-only header files in this file, so how in the world can it
> > be bsd-3 licensed?
> > 
> > Please get your legal department to agree with this, after you explain
> > to them how you are mixing gpl2-only code in with this file.
> > 
> > > +// Copyright(c) 2016-18 Intel Corporation.
> > 
> > Dates are hard to get right :(
> 
> As is comment formatting apparently.  Don't use // comments for anything
> but the SPDX header, please.

I'll bring some context to this.

When I moved into using SPDX, I took the example from places where I saw
also the copyright using "//". That's the reason for the choice.

I.e.

$ git grep "// Copyright" | wc -l
2123

I don't care, which one to use, just wondering is it done in the wrong
way in all these sites?

/Jarkko


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-05 11:42       ` Jarkko Sakkinen
@ 2020-10-05 11:50         ` Greg KH
  2020-10-05 14:23           ` Jarkko Sakkinen
  0 siblings, 1 reply; 35+ messages in thread
From: Greg KH @ 2020-10-05 11:50 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Christoph Hellwig, x86, linux-sgx, linux-kernel,
	linux-security-module, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Haitao Huang, Chunyang Hui, Jordan Hand,
	Nathaniel McCallum, Seth Moore, Darren Kenny,
	Sean Christopherson, Suresh Siddha, andriy.shevchenko, asapek,
	bp, cedric.xing, chenalexchen, conradparker, cyhanish,
	dave.hansen, haitao.huang, kai.huang, kai.svahn, kmoy, ludloff,
	luto, nhorman, puiterwijk, rientjes, tglx, yaozhangx,
	mikko.ylinen

On Mon, Oct 05, 2020 at 02:42:50PM +0300, Jarkko Sakkinen wrote:
> On Mon, Oct 05, 2020 at 09:45:54AM +0100, Christoph Hellwig wrote:
> > On Sat, Oct 03, 2020 at 04:39:25PM +0200, Greg KH wrote:
> > > > @@ -0,0 +1,173 @@
> > > > +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> > > 
> > > You use gpl-only header files in this file, so how in the world can it
> > > be bsd-3 licensed?
> > > 
> > > Please get your legal department to agree with this, after you explain
> > > to them how you are mixing gpl2-only code in with this file.
> > > 
> > > > +// Copyright(c) 2016-18 Intel Corporation.
> > > 
> > > Dates are hard to get right :(
> > 
> > As is comment formatting apparently.  Don't use // comments for anything
> > but the SPDX header, please.
> 
> I'll bring some context to this.
> 
> When I moved into using SPDX, I took the example from places where I saw
> also the copyright using "//". That's the reason for the choice.
> 
> I.e.
> 
> $ git grep "// Copyright" | wc -l
> 2123
> 
> I don't care, which one to use, just wondering is it done in the wrong
> way in all these sites?

Probably, but I know at least one subsystem requires their headers to be
in this manner.  There's no accounting for taste :)

thanks,

greg k-h


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-05  9:42       ` Greg KH
@ 2020-10-05 12:42         ` Jarkko Sakkinen
  2020-10-07 18:09           ` Haitao Huang
  0 siblings, 1 reply; 35+ messages in thread
From: Jarkko Sakkinen @ 2020-10-05 12:42 UTC (permalink / raw)
  To: Greg KH
  Cc: x86, linux-sgx, linux-kernel, linux-security-module, linux-mm,
	Andrew Morton, Matthew Wilcox, Jethro Beekman, Haitao Huang,
	Chunyang Hui, Jordan Hand, Nathaniel McCallum, Seth Moore,
	Darren Kenny, Sean Christopherson, Suresh Siddha,
	andriy.shevchenko, asapek, bp, cedric.xing, chenalexchen,
	conradparker, cyhanish, dave.hansen, haitao.huang, kai.huang,
	kai.svahn, kmoy, ludloff, luto, nhorman, puiterwijk, rientjes,
	tglx, yaozhangx, mikko.ylinen

On Mon, Oct 05, 2020 at 11:42:46AM +0200, Greg KH wrote:
> > > You use gpl-only header files in this file, so how in the world can it
> > > be bsd-3 licensed?
> > > 
> > > Please get your legal department to agree with this, after you explain
> > > to them how you are mixing gpl2-only code in with this file.
> > 
> > I'll do what I already stated that I will do. Should I do something
> > more?
> 
> This was written before your previous response.

OK, that is weird, I got this one some time later.

> > > > +	mutex_lock(&encl->lock);
> > > > +	atomic_or(SGX_ENCL_DEAD, &encl->flags);
> > > 
> > > So you set a flag that this is dead, and then instantly delete it?  Why
> > > does that matter?  I see you check for this flag elsewhere, but as you
> > > are just about to delete this structure, how can this be an issue?
> > 
> > It matters because ksgxswapd (sgx_reclaimer_*) might be processing it.
> 
> I don't see that happening in this patch, did I miss it?

It's implemented in 16/24:

https://lore.kernel.org/linux-sgx/20201004223921.GA48517@linux.intel.com/T/#u

> > It will use the flag to skip the operations that it would do to a victim
> > page, when the enclave is still alive.
> 
> Again, why are you adding flags when the patch does not use them?
> Please put new functionality in the specific patch that uses it.
> 
> And can you really rely on this?  How did sgx_reclaimer_* (whatever that
> is), get the reference on this object in the first place?  Again, I
> don't see that happening at all in here, and at a quick glance in the
> other patches I don't see it there either.  What am I missing?

I went through the patch, and yes, they can be migrated to 16/24.
I agree with this, no excuses.

In 16/24 pages are added to sgx_active_page_list from which they are
swapped by the reclaimer to the main memory when Enclave Page Cache
(EPC), the memory where enclave pages reside, gets full.

When a reclaimer thread takes a victim page from that list, it will also
get a kref to the enclave so that struct sgx_encl instance does not
get wiped while it's doing its job.

> > Because ksgxswapd needs the alive enclave instance while it is in the
> > process of swapping a victim page. The reason for this is the
> > hierarchical nature of the enclave pages.
> > 
> > As an example, a write operation to main memory, EWB (SDM vol 3D 40-79)
> 
> What is that referencing?

https://software.intel.com/content/dam/develop/public/us/en/documents/332831-sdm-vol-3d.pdf

> > needs to access SGX Enclave Control Structure (SECS) page, which is
> > contains global data for an enclave, like the unswapped child count.
> 
> Ok, but how did it get access to this structure in the first place, like
> I ask above?

I guess I answered that, and I also fully agree with your suggestions.

It used to be many iterations ago that enclaves were not file based but
just memory mappings (long story short: was not great way to make them
multiprocess, that's why file centered now), and then refcount played a
bigger role. Having those "extras" in this patch is by no means
intentional but more like cruft of many iterations of refactoring.

Sometimes when you work long with this kind of pile of code, which has
converged through many iterations, you really need someone else to point
some of the simple and obvious things out.

> > There is a patch that adds "sgx/provision".
> 
> What number in this series?

It's 15/24.

> 
> > Either works for me. Should I flatten them to "sgx_enclave" and
> > "sgx_provision", or keep them as they are?
> 
> Having 2 char nodes in a subdir is better than one, I will give you
> that.  But none is even better, don't you think?

I think that having just "sgx_enclave" and "sgx_provision" would be
better.

I've been thinking about this for a while but at the same time try not
to be too proactive without feedback. One reason would be that "enclave"
and "provision" without the subdir are not good identifiers.

I also recalled this discussion:

https://lkml.org/lkml/2019/12/23/158

and was wondering how that subdir would even play with /sys/class/misc,
if we decide to add attributes? Not enough knowledge to answer this.

Anyway, I'll put a note to my backlog on this, and also to move the
previously discussed cruft to the correct patch.

> thanks,
> 
> greg k-h

Thank you.

/Jarkko


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-05 11:50         ` Greg KH
@ 2020-10-05 14:23           ` Jarkko Sakkinen
  2020-10-05 15:02             ` Greg KH
  0 siblings, 1 reply; 35+ messages in thread
From: Jarkko Sakkinen @ 2020-10-05 14:23 UTC (permalink / raw)
  To: Greg KH
  Cc: Christoph Hellwig, x86, linux-sgx, linux-kernel,
	linux-security-module, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Haitao Huang, Chunyang Hui, Jordan Hand,
	Nathaniel McCallum, Seth Moore, Darren Kenny,
	Sean Christopherson, Suresh Siddha, andriy.shevchenko, asapek,
	bp, cedric.xing, chenalexchen, conradparker, cyhanish,
	dave.hansen, haitao.huang, kai.huang, kai.svahn, kmoy, ludloff,
	luto, nhorman, puiterwijk, rientjes, tglx, yaozhangx,
	mikko.ylinen

On Mon, Oct 05, 2020 at 01:50:30PM +0200, Greg KH wrote:
> On Mon, Oct 05, 2020 at 02:42:50PM +0300, Jarkko Sakkinen wrote:
> > On Mon, Oct 05, 2020 at 09:45:54AM +0100, Christoph Hellwig wrote:
> > > On Sat, Oct 03, 2020 at 04:39:25PM +0200, Greg KH wrote:
> > > > > @@ -0,0 +1,173 @@
> > > > > +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> > > > 
> > > > You use gpl-only header files in this file, so how in the world can it
> > > > be bsd-3 licensed?
> > > > 
> > > > Please get your legal department to agree with this, after you explain
> > > > to them how you are mixing gpl2-only code in with this file.
> > > > 
> > > > > +// Copyright(c) 2016-18 Intel Corporation.
> > > > 
> > > > Dates are hard to get right :(
> > > 
> > > As is comment formatting apparently.  Don't use // comments for anything
> > > but the SPDX header, please.
> > 
> > I'll bring some context to this.
> > 
> > When I moved into using SPDX, I took the example from places where I saw
> > also the copyright using "//". That's the reason for the choice.
> > 
> > I.e.
> > 
> > $ git grep "// Copyright" | wc -l
> > 2123
> > 
> > I don't care, which one to use, just wondering is it done in the wrong
> > way in all these sites?
> 
> Probably, but I know at least one subsystem requires their headers to be
> in this manner.  There's no accounting for taste :)

This discussion is a bit confusing [*], so I'll just ask from Git:

➜  linux-sgx (master) ✔ git --no-pager grep "\/\/ Copyright" arch/x86
arch/x86/kernel/cpu/sgx/driver.c:// Copyright(c) 2016-20 Intel Corporation.
arch/x86/kernel/cpu/sgx/encl.c:// Copyright(c) 2016-20 Intel Corporation.
arch/x86/kernel/cpu/sgx/ioctl.c:// Copyright(c) 2016-20 Intel Corporation.
arch/x86/kernel/cpu/sgx/main.c:// Copyright(c) 2016-20 Intel Corporation.

OK, now I think I know what to do :-)

> thanks,
> 
> greg k-h

[*] One thing I've been wondering for a long time is that, why new code
should have the copyright platters in the first place? I get it for
pre-Git era but now there is a cryptographic log of authority.

Copyright platters, remarking the authors to the header and
MODULE_AUTHOR() macro are the three things that I just do not get in the
modern times.

/Jarkko


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-05 14:23           ` Jarkko Sakkinen
@ 2020-10-05 15:02             ` Greg KH
  2020-10-05 16:40               ` Dave Hansen
  0 siblings, 1 reply; 35+ messages in thread
From: Greg KH @ 2020-10-05 15:02 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Christoph Hellwig, x86, linux-sgx, linux-kernel,
	linux-security-module, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Haitao Huang, Chunyang Hui, Jordan Hand,
	Nathaniel McCallum, Seth Moore, Darren Kenny,
	Sean Christopherson, Suresh Siddha, andriy.shevchenko, asapek,
	bp, cedric.xing, chenalexchen, conradparker, cyhanish,
	dave.hansen, haitao.huang, kai.huang, kai.svahn, kmoy, ludloff,
	luto, nhorman, puiterwijk, rientjes, tglx, yaozhangx,
	mikko.ylinen

On Mon, Oct 05, 2020 at 05:23:45PM +0300, Jarkko Sakkinen wrote:
> [*] One thing I've been wondering for a long time is that, why new code
> should have the copyright platters in the first place? I get it for
> pre-Git era but now there is a cryptographic log of authority.

Go talk to your corporate lawyers about this, it is one of the most
common cargo-cult patterns around :)

good luck!

greg k-h


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-05 15:02             ` Greg KH
@ 2020-10-05 16:40               ` Dave Hansen
  2020-10-05 20:02                 ` Jarkko Sakkinen
  0 siblings, 1 reply; 35+ messages in thread
From: Dave Hansen @ 2020-10-05 16:40 UTC (permalink / raw)
  To: Greg KH, Jarkko Sakkinen
  Cc: Christoph Hellwig, x86, linux-sgx, linux-kernel,
	linux-security-module, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Haitao Huang, Chunyang Hui, Jordan Hand,
	Nathaniel McCallum, Seth Moore, Darren Kenny,
	Sean Christopherson, Suresh Siddha, andriy.shevchenko, asapek,
	bp, cedric.xing, chenalexchen, conradparker, cyhanish,
	haitao.huang, kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman,
	puiterwijk, rientjes, tglx, yaozhangx, mikko.ylinen

On 10/5/20 8:02 AM, Greg KH wrote:
> On Mon, Oct 05, 2020 at 05:23:45PM +0300, Jarkko Sakkinen wrote:
>> [*] One thing I've been wondering for a long time is that, why new code
>> should have the copyright platters in the first place? I get it for
>> pre-Git era but now there is a cryptographic log of authority.
> Go talk to your corporate lawyers about this, it is one of the most
> common cargo-cult patterns around :)

For this patch, though, it seems like we should just update the dates
instead of removing them.

If I look at the last 1000 "^+.*Copyright" lines added to the kernel,
997 of them have a year.  So, weird or not, it's a pretty standard
convention.  We'd need a slightly more broad conversation before we
decide to nix these dates.

Pure speculation: Copyright protection, at least in the US, is not
forever.  I _think_ it's 75 years or something.  That protection starts
when the work is created and is independent of when it gets merged into
Linux.  So, if we did something weird like merge a driver written 10
years ago, it would only be protected for 65 more years after we merge
it.  In other words, git history _might_ be irrelevant for copyright
protection.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-05 16:40               ` Dave Hansen
@ 2020-10-05 20:02                 ` Jarkko Sakkinen
  0 siblings, 0 replies; 35+ messages in thread
From: Jarkko Sakkinen @ 2020-10-05 20:02 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Greg KH, Christoph Hellwig, x86, linux-sgx, linux-kernel,
	linux-security-module, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Haitao Huang, Chunyang Hui, Jordan Hand,
	Nathaniel McCallum, Seth Moore, Darren Kenny,
	Sean Christopherson, Suresh Siddha, andriy.shevchenko, asapek,
	bp, cedric.xing, chenalexchen, conradparker, cyhanish,
	haitao.huang, kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman,
	puiterwijk, rientjes, tglx, yaozhangx, mikko.ylinen

On Mon, Oct 05, 2020 at 09:40:52AM -0700, Dave Hansen wrote:
> On 10/5/20 8:02 AM, Greg KH wrote:
> > On Mon, Oct 05, 2020 at 05:23:45PM +0300, Jarkko Sakkinen wrote:
> >> [*] One thing I've been wondering for a long time is that, why new code
> >> should have the copyright platters in the first place? I get it for
> >> pre-Git era but now there is a cryptographic log of authority.
> > Go talk to your corporate lawyers about this, it is one of the most
> > common cargo-cult patterns around :)
> 
> For this patch, though, it seems like we should just update the dates
> instead of removing them.

Already done. I updated them yesterday as:

  Copyright(c) 2016-20 Intel Corporation.

Changing from '//' to '/* ... */' is not yet.

> If I look at the last 1000 "^+.*Copyright" lines added to the kernel,
> 997 of them have a year.  So, weird or not, it's a pretty standard
> convention.  We'd need a slightly more broad conversation before we
> decide to nix these dates.
> 
> Pure speculation: Copyright protection, at least in the US, is not
> forever.  I _think_ it's 75 years or something.  That protection starts
> when the work is created and is independent of when it gets merged into
> Linux.  So, if we did something weird like merge a driver written 10
> years ago, it would only be protected for 65 more years after we merge
> it.  In other words, git history _might_ be irrelevant for copyright
> protection.

/Jarkko


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 16/24] x86/sgx: Add a page reclaimer
  2020-10-04 22:39         ` Jarkko Sakkinen
@ 2020-10-07 17:25           ` Jarkko Sakkinen
  0 siblings, 0 replies; 35+ messages in thread
From: Jarkko Sakkinen @ 2020-10-07 17:25 UTC (permalink / raw)
  To: Haitao Huang
  Cc: x86, linux-sgx, linux-kernel, linux-mm, Jethro Beekman,
	Jordan Hand, Nathaniel McCallum, Chunyang Hui, Seth Moore,
	Sean Christopherson, akpm, andriy.shevchenko, asapek, bp,
	cedric.xing, chenalexchen, conradparker, cyhanish, dave.hansen,
	kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman, puiterwijk,
	rientjes, tglx, yaozhangx, mikko.ylinen, willy

On Mon, Oct 05, 2020 at 01:39:21AM +0300, Jarkko Sakkinen wrote:
> On Sat, Oct 03, 2020 at 01:23:49PM -0500, Haitao Huang wrote:
> > On Sat, 03 Oct 2020 08:32:45 -0500, Jarkko Sakkinen
> > <jarkko.sakkinen@linux.intel.com> wrote:
> > 
> > > On Sat, Oct 03, 2020 at 12:22:47AM -0500, Haitao Huang wrote:
> > > > When I turn on CONFIG_PROVE_LOCKING, kernel reports following
> > > > suspicious RCU
> > > > usages. Not sure if it is an issue. Just reporting here:
> > > 
> > > I'm glad to hear that my tip helped you to get us the data.
> > > 
> > > This does not look like an issue in the page reclaimer, which was not
> > > obvious for me before. That's a good thing. I was really worried about
> > > that because it has been very stable for a long period now. The last
> > > bug fix for the reclaimer was done in June in v31 version of the patch
> > > set and after that it has been unchanged (except possibly some renames
> > > requested by Boris).
> > > 
> > > I wildly guess I have a bad usage pattern for xarray. I migrated to it
> > > in v36, and it is entirely possible that I've misused it. It was the
> > > first time that I ever used it. Before xarray we had radix_tree but
> > > based Matthew Wilcox feedback I did a migration to xarray.
> > > 
> > > What I'd ask you to do next is to, if by any means possible, to try to
> > > run the same test with v35 so we can verify this. That one still has
> > > the radix tree.
> > > 
> > 
> > 
> > v35 does not cause any such warning messages from kernel
> 
> Thank you. Looks like Matthew already located the issue, a fix will
> land soon.

Just acknowledging that this should be fixed in my master branch now.

/Jarkko


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-05 12:42         ` Jarkko Sakkinen
@ 2020-10-07 18:09           ` Haitao Huang
  2020-10-07 19:26             ` Greg KH
  0 siblings, 1 reply; 35+ messages in thread
From: Haitao Huang @ 2020-10-07 18:09 UTC (permalink / raw)
  To: Greg KH, Jarkko Sakkinen
  Cc: x86, linux-sgx, linux-kernel, linux-security-module, linux-mm,
	Andrew Morton, Matthew Wilcox, Jethro Beekman, Chunyang Hui,
	Jordan Hand, Nathaniel McCallum, Seth Moore, Darren Kenny,
	Sean Christopherson, Suresh Siddha, andriy.shevchenko, asapek,
	bp, cedric.xing, chenalexchen, conradparker, cyhanish,
	dave.hansen, haitao.huang, kai.huang, kai.svahn, kmoy, ludloff,
	luto, nhorman, puiterwijk, rientjes, tglx, yaozhangx,
	mikko.ylinen

On Mon, 05 Oct 2020 07:42:21 -0500, Jarkko Sakkinen  
<jarkko.sakkinen@linux.intel.com> wrote:

> On Mon, Oct 05, 2020 at 11:42:46AM +0200, Greg KH wrote:
>> > > You use gpl-only header files in this file, so how in the world can  
>> it
>> > > be bsd-3 licensed?
>> > >
>> > > Please get your legal department to agree with this, after you  
>> explain
>> > > to them how you are mixing gpl2-only code in with this file.
>> >
>> > I'll do what I already stated that I will do. Should I do something
>> > more?
>>
>> This was written before your previous response.
>
> OK, that is weird, I got this one some time later.
>
>> > > > +	mutex_lock(&encl->lock);
>> > > > +	atomic_or(SGX_ENCL_DEAD, &encl->flags);
>> > >
>> > > So you set a flag that this is dead, and then instantly delete it?   
>> Why
>> > > does that matter?  I see you check for this flag elsewhere, but as  
>> you
>> > > are just about to delete this structure, how can this be an issue?
>> >
>> > It matters because ksgxswapd (sgx_reclaimer_*) might be processing it.
>>
>> I don't see that happening in this patch, did I miss it?
>
> It's implemented in 16/24:
>
> https://lore.kernel.org/linux-sgx/20201004223921.GA48517@linux.intel.com/T/#u
>
>> > It will use the flag to skip the operations that it would do to a  
>> victim
>> > page, when the enclave is still alive.
>>
>> Again, why are you adding flags when the patch does not use them?
>> Please put new functionality in the specific patch that uses it.
>>
>> And can you really rely on this?  How did sgx_reclaimer_* (whatever that
>> is), get the reference on this object in the first place?  Again, I
>> don't see that happening at all in here, and at a quick glance in the
>> other patches I don't see it there either.  What am I missing?
>
> I went through the patch, and yes, they can be migrated to 16/24.
> I agree with this, no excuses.
>
> In 16/24 pages are added to sgx_active_page_list from which they are
> swapped by the reclaimer to the main memory when Enclave Page Cache
> (EPC), the memory where enclave pages reside, gets full.
>
> When a reclaimer thread takes a victim page from that list, it will also
> get a kref to the enclave so that struct sgx_encl instance does not
> get wiped while it's doing its job.
>
>> > Because ksgxswapd needs the alive enclave instance while it is in the
>> > process of swapping a victim page. The reason for this is the
>> > hierarchical nature of the enclave pages.
>> >
>> > As an example, a write operation to main memory, EWB (SDM vol 3D  
>> 40-79)
>>
>> What is that referencing?
>
> https://software.intel.com/content/dam/develop/public/us/en/documents/332831-sdm-vol-3d.pdf
>
>> > needs to access SGX Enclave Control Structure (SECS) page, which is
>> > contains global data for an enclave, like the unswapped child count.
>>
>> Ok, but how did it get access to this structure in the first place, like
>> I ask above?
>
> I guess I answered that, and I also fully agree with your suggestions.
>
> It used to be many iterations ago that enclaves were not file based but
> just memory mappings (long story short: was not great way to make them
> multiprocess, that's why file centered now), and then refcount played a
> bigger role. Having those "extras" in this patch is by no means
> intentional but more like cruft of many iterations of refactoring.
>
> Sometimes when you work long with this kind of pile of code, which has
> converged through many iterations, you really need someone else to point
> some of the simple and obvious things out.
>
>> > There is a patch that adds "sgx/provision".
>>
>> What number in this series?
>
> It's 15/24.
>

Don't know if this is critical. I'd prefer to keep them as is. Directory  
seems natural to me and makes sense to add more under the same dir in case  
there are more to come.

Thanks
Haitao


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-07 18:09           ` Haitao Huang
@ 2020-10-07 19:26             ` Greg KH
  2020-10-09  6:44               ` Jarkko Sakkinen
  0 siblings, 1 reply; 35+ messages in thread
From: Greg KH @ 2020-10-07 19:26 UTC (permalink / raw)
  To: Haitao Huang
  Cc: Jarkko Sakkinen, x86, linux-sgx, linux-kernel,
	linux-security-module, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Chunyang Hui, Jordan Hand, Nathaniel McCallum,
	Seth Moore, Darren Kenny, Sean Christopherson, Suresh Siddha,
	andriy.shevchenko, asapek, bp, cedric.xing, chenalexchen,
	conradparker, cyhanish, dave.hansen, haitao.huang, kai.huang,
	kai.svahn, kmoy, ludloff, luto, nhorman, puiterwijk, rientjes,
	tglx, yaozhangx, mikko.ylinen

On Wed, Oct 07, 2020 at 01:09:01PM -0500, Haitao Huang wrote:
> > > > There is a patch that adds "sgx/provision".
> > > 
> > > What number in this series?
> > 
> > It's 15/24.
> > 
> 
> Don't know if this is critical. I'd prefer to keep them as is. Directory
> seems natural to me and makes sense to add more under the same dir in case
> there are more to come.

Why is this so special that you need a subdirectory for a single driver
with a mere 2 device nodes?  Do any other misc drivers have a new
subdirectory in /dev/ for them?

thanks,

greg k-h


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-07 19:26             ` Greg KH
@ 2020-10-09  6:44               ` Jarkko Sakkinen
  2020-10-14 20:16                 ` Dave Hansen
  0 siblings, 1 reply; 35+ messages in thread
From: Jarkko Sakkinen @ 2020-10-09  6:44 UTC (permalink / raw)
  To: Greg KH
  Cc: Haitao Huang, x86, linux-sgx, linux-kernel,
	linux-security-module, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Chunyang Hui, Jordan Hand, Nathaniel McCallum,
	Seth Moore, Darren Kenny, Sean Christopherson, Suresh Siddha,
	andriy.shevchenko, asapek, bp, cedric.xing, chenalexchen,
	conradparker, cyhanish, dave.hansen, haitao.huang, kai.huang,
	kai.svahn, kmoy, ludloff, luto, nhorman, puiterwijk, rientjes,
	tglx, yaozhangx, mikko.ylinen

On Wed, Oct 07, 2020 at 09:26:55PM +0200, Greg KH wrote:
> On Wed, Oct 07, 2020 at 01:09:01PM -0500, Haitao Huang wrote:
> > > > > There is a patch that adds "sgx/provision".
> > > > 
> > > > What number in this series?
> > > 
> > > It's 15/24.
> > > 
> > 
> > Don't know if this is critical. I'd prefer to keep them as is. Directory
> > seems natural to me and makes sense to add more under the same dir in case
> > there are more to come.
> 
> Why is this so special that you need a subdirectory for a single driver
> with a mere 2 device nodes?  Do any other misc drivers have a new
> subdirectory in /dev/ for them?

Absolutely nothing as far as I'm concerned. Should have done that
already at the time when I switched to misc based on your feedback. I
was acting too reactive I guess. For sure I'll rename.

I also looked at encl->refcount with time. Instead of just "moving the
garbage up to the correct waste pit", I'll address that one by
refactoring it out and making the reclaimer thread to do the reaper's
job.

> thanks,
> 
> greg k-h

/Jarkko


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-03 14:39   ` Greg KH
  2020-10-04 14:32     ` Jarkko Sakkinen
  2020-10-05  8:45     ` Christoph Hellwig
@ 2020-10-09  7:10     ` Pavel Machek
  2020-10-09  7:21       ` Greg KH
  2 siblings, 1 reply; 35+ messages in thread
From: Pavel Machek @ 2020-10-09  7:10 UTC (permalink / raw)
  To: Greg KH
  Cc: Jarkko Sakkinen, x86, linux-sgx, linux-kernel,
	linux-security-module, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Haitao Huang, Chunyang Hui, Jordan Hand,
	Nathaniel McCallum, Seth Moore, Darren Kenny,
	Sean Christopherson, Suresh Siddha, andriy.shevchenko, asapek,
	bp, cedric.xing, chenalexchen, conradparker, cyhanish,
	dave.hansen, haitao.huang, kai.huang, kai.svahn, kmoy, ludloff,
	luto, nhorman, puiterwijk, rientjes, tglx, yaozhangx,
	mikko.ylinen

[-- Attachment #1: Type: text/plain, Size: 851 bytes --]

Hi!

> > new file mode 100644
> > index 000000000000..f54da5f19c2b
> > --- /dev/null
> > +++ b/arch/x86/kernel/cpu/sgx/driver.c
> > @@ -0,0 +1,173 @@
> > +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> 
> You use gpl-only header files in this file, so how in the world can it
> be bsd-3 licensed?
> 
> Please get your legal department to agree with this, after you explain
> to them how you are mixing gpl2-only code in with this file.

This specifies license of driver.c, not of the headers included. Are
you saying that it is impossible to have a kernel driver with anything
else than GPL-2? That would be news to many, and that's not what
current consensus is.

									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-09  7:10     ` Pavel Machek
@ 2020-10-09  7:21       ` Greg KH
  2020-10-09  8:21         ` Pavel Machek
  0 siblings, 1 reply; 35+ messages in thread
From: Greg KH @ 2020-10-09  7:21 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Jarkko Sakkinen, x86, linux-sgx, linux-kernel,
	linux-security-module, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Haitao Huang, Chunyang Hui, Jordan Hand,
	Nathaniel McCallum, Seth Moore, Darren Kenny,
	Sean Christopherson, Suresh Siddha, andriy.shevchenko, asapek,
	bp, cedric.xing, chenalexchen, conradparker, cyhanish,
	dave.hansen, haitao.huang, kai.huang, kai.svahn, kmoy, ludloff,
	luto, nhorman, puiterwijk, rientjes, tglx, yaozhangx,
	mikko.ylinen

On Fri, Oct 09, 2020 at 09:10:45AM +0200, Pavel Machek wrote:
> Hi!
> 
> > > new file mode 100644
> > > index 000000000000..f54da5f19c2b
> > > --- /dev/null
> > > +++ b/arch/x86/kernel/cpu/sgx/driver.c
> > > @@ -0,0 +1,173 @@
> > > +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> > 
> > You use gpl-only header files in this file, so how in the world can it
> > be bsd-3 licensed?
> > 
> > Please get your legal department to agree with this, after you explain
> > to them how you are mixing gpl2-only code in with this file.
> 
> This specifies license of driver.c, not of the headers included. Are
> you saying that it is impossible to have a kernel driver with anything
> else than GPL-2? That would be news to many, and that's not what
> current consensus is.

If you want to write any non-GPL-2-only kernel code, you had better be
consulting your lawyers and get very explicit instructions on how to do
this in a way that does not violate any licenses.

I am not a lawyer, and will not be giving you any such advice, as I
think it's not something that people should be doing.

greg k-h


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-09  7:21       ` Greg KH
@ 2020-10-09  8:21         ` Pavel Machek
  0 siblings, 0 replies; 35+ messages in thread
From: Pavel Machek @ 2020-10-09  8:21 UTC (permalink / raw)
  To: Greg KH
  Cc: Pavel Machek, Jarkko Sakkinen, x86, linux-sgx, linux-kernel,
	linux-security-module, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Haitao Huang, Chunyang Hui, Jordan Hand,
	Nathaniel McCallum, Seth Moore, Darren Kenny,
	Sean Christopherson, Suresh Siddha, andriy.shevchenko, asapek,
	bp, cedric.xing, chenalexchen, conradparker, cyhanish,
	dave.hansen, haitao.huang, kai.huang, kai.svahn, kmoy, ludloff,
	luto, nhorman, puiterwijk, rientjes, tglx, yaozhangx,
	mikko.ylinen

[-- Attachment #1: Type: text/plain, Size: 1671 bytes --]

On Fri 2020-10-09 09:21:41, Greg KH wrote:
> On Fri, Oct 09, 2020 at 09:10:45AM +0200, Pavel Machek wrote:
> > Hi!
> > 
> > > > new file mode 100644
> > > > index 000000000000..f54da5f19c2b
> > > > --- /dev/null
> > > > +++ b/arch/x86/kernel/cpu/sgx/driver.c
> > > > @@ -0,0 +1,173 @@
> > > > +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> > > 
> > > You use gpl-only header files in this file, so how in the world can it
> > > be bsd-3 licensed?
> > > 
> > > Please get your legal department to agree with this, after you explain
> > > to them how you are mixing gpl2-only code in with this file.
> > 
> > This specifies license of driver.c, not of the headers included. Are
> > you saying that it is impossible to have a kernel driver with anything
> > else than GPL-2? That would be news to many, and that's not what
> > current consensus is.
> 
> If you want to write any non-GPL-2-only kernel code, you had better be
> consulting your lawyers and get very explicit instructions on how to do
> this in a way that does not violate any licenses.
> 
> I am not a lawyer, and will not be giving you any such advice, as I
> think it's not something that people should be doing.

You are pushing view that is well outside accepted community
consensus, then try to hide it by claiming that you are not a lawyer.

Stop it.

Dual licensed drivers are common in the kernel, and are considered
okay by everyone but you. Author is free to select license for his
work.

								Pavel
-- 
DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v39 11/24] x86/sgx: Add SGX enclave driver
  2020-10-09  6:44               ` Jarkko Sakkinen
@ 2020-10-14 20:16                 ` Dave Hansen
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Hansen @ 2020-10-14 20:16 UTC (permalink / raw)
  To: Jarkko Sakkinen, Greg KH
  Cc: Haitao Huang, x86, linux-sgx, linux-kernel,
	linux-security-module, linux-mm, Andrew Morton, Matthew Wilcox,
	Jethro Beekman, Chunyang Hui, Jordan Hand, Nathaniel McCallum,
	Seth Moore, Darren Kenny, Sean Christopherson, Suresh Siddha,
	andriy.shevchenko, asapek, bp, cedric.xing, chenalexchen,
	conradparker, cyhanish, haitao.huang, kai.huang, kai.svahn, kmoy,
	ludloff, luto, nhorman, puiterwijk, rientjes, tglx, yaozhangx,
	mikko.ylinen

On 10/8/20 11:44 PM, Jarkko Sakkinen wrote:
>> Why is this so special that you need a subdirectory for a single driver
>> with a mere 2 device nodes?  Do any other misc drivers have a new
>> subdirectory in /dev/ for them?
> Absolutely nothing as far as I'm concerned. Should have done that
> already at the time when I switched to misc based on your feedback. I
> was acting too reactive I guess. For sure I'll rename.

Plus, if anyone *REALLY* cares, they can get their precious directory
back with a couple of lines of udev rules, I believe:

KERNEL=="sgx_provision", SYMLINK+="sgx/provision"
KERNEL=="sgx_enclave", SYMLINK+="sgx/enclave"


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2020-10-14 20:16 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20201003045059.665934-1-jarkko.sakkinen@linux.intel.com>
2020-10-03  4:50 ` [PATCH v39 10/24] mm: Add 'mprotect' hook to struct vm_operations_struct Jarkko Sakkinen
2020-10-03  4:50 ` [PATCH v39 11/24] x86/sgx: Add SGX enclave driver Jarkko Sakkinen
2020-10-03 14:39   ` Greg KH
2020-10-04 14:32     ` Jarkko Sakkinen
2020-10-04 15:01       ` Jarkko Sakkinen
2020-10-05  9:42       ` Greg KH
2020-10-05 12:42         ` Jarkko Sakkinen
2020-10-07 18:09           ` Haitao Huang
2020-10-07 19:26             ` Greg KH
2020-10-09  6:44               ` Jarkko Sakkinen
2020-10-14 20:16                 ` Dave Hansen
2020-10-05  8:45     ` Christoph Hellwig
2020-10-05 11:42       ` Jarkko Sakkinen
2020-10-05 11:50         ` Greg KH
2020-10-05 14:23           ` Jarkko Sakkinen
2020-10-05 15:02             ` Greg KH
2020-10-05 16:40               ` Dave Hansen
2020-10-05 20:02                 ` Jarkko Sakkinen
2020-10-09  7:10     ` Pavel Machek
2020-10-09  7:21       ` Greg KH
2020-10-09  8:21         ` Pavel Machek
2020-10-03 19:54   ` Matthew Wilcox
2020-10-04 21:50     ` Jarkko Sakkinen
2020-10-04 22:02       ` Jarkko Sakkinen
2020-10-04 22:27       ` Matthew Wilcox
2020-10-04 23:41         ` Jarkko Sakkinen
2020-10-05  1:30           ` Matthew Wilcox
2020-10-05  3:06             ` Jarkko Sakkinen
2020-10-03  4:50 ` [PATCH v39 16/24] x86/sgx: Add a page reclaimer Jarkko Sakkinen
2020-10-03  5:22   ` Haitao Huang
2020-10-03 13:32     ` Jarkko Sakkinen
2020-10-03 18:23       ` Haitao Huang
2020-10-04 22:39         ` Jarkko Sakkinen
2020-10-07 17:25           ` Jarkko Sakkinen
2020-10-03  4:50 ` [PATCH v39 17/24] x86/sgx: Add ptrace() support for the SGX driver Jarkko Sakkinen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).