* [PATCH v4 0/3] x86/sgx: NUMA
@ 2021-03-13 16:01 Jarkko Sakkinen
  2021-03-13 16:01 ` [PATCH v4 1/3] x86/sgx: Use sgx_free_epc_page() in sgx_reclaim_pages() Jarkko Sakkinen
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Jarkko Sakkinen @ 2021-03-13 16:01 UTC (permalink / raw)
  To: linux-sgx; +Cc: Jarkko Sakkinen, Dave Hansen, Borislav Petkov, Haitao Huang

Make sgx_alloc_epc_page() more balanced among the NUMA nodes, refactoring
the hierarchical lists into a single global free list managed in FIFO fashion.

Introduce NUMA node-local free EPC page lists. In sgx_alloc_epc_page(), first
try to allocate from the local free list and fall back to the global free
list if the local list is empty.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Haitao Huang <haitao.huang@linux.intel.com>

v3:
* Expanded into a full patch set, which also implements more balanced
  EPC allocation among the NUMA nodes. Since none of the patches fully
  matches the previous single-patch entry, the earlier change log is
  included in the cover letter.

v2:
* s/section_list/numa_section_list/
* s/MAX_NUMNODES/num_possible_nodes()/
* Add more verbose inline comments that Dave provided.
* If NUMA mapping fails, print a warning and description of the fallback.
  The physical address is already printed in pr_info(), just before the
  mapping happens.

Jarkko Sakkinen (3):
  x86/sgx: Use sgx_free_epc_page() in sgx_reclaim_pages()
  x86/sgx: Replace section local dirty page lists with a global list
  x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()

 arch/x86/Kconfig               |   1 +
 arch/x86/kernel/cpu/sgx/main.c | 203 ++++++++++++++++++---------------
 arch/x86/kernel/cpu/sgx/sgx.h  |  23 ++--
 3 files changed, 121 insertions(+), 106 deletions(-)

-- 
2.30.2



* [PATCH v4 1/3] x86/sgx: Use sgx_free_epc_page() in sgx_reclaim_pages()
  2021-03-13 16:01 [PATCH v4 0/3] x86/sgx: NUMA Jarkko Sakkinen
@ 2021-03-13 16:01 ` Jarkko Sakkinen
  2021-03-15 15:32   ` Dave Hansen
  2021-03-13 16:01 ` [PATCH v4 2/3] x86/sgx: Replace section local dirty page lists with a global list Jarkko Sakkinen
  2021-03-13 16:01 ` [PATCH v4 3/3] x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page() Jarkko Sakkinen
  2 siblings, 1 reply; 14+ messages in thread
From: Jarkko Sakkinen @ 2021-03-13 16:01 UTC (permalink / raw)
  To: linux-sgx
  Cc: Jarkko Sakkinen, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, linux-kernel

Replace the ad-hoc code with a sgx_free_epc_page(), in order to make sure
that all the relevant checks and book keeping is done, while freeing a
borrowed EPC page, and remove redundant code. EREMOVE inside
sgx_free_epc_page() does not change the semantics, as EREMOVE to an
uninitialize pages is a nop.

Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
---

v4:
* Rewrote the commit message.

 arch/x86/kernel/cpu/sgx/main.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 8df81a3ed945..65004fb8a91f 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -305,7 +305,6 @@ static void sgx_reclaim_pages(void)
 {
 	struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
 	struct sgx_backing backing[SGX_NR_TO_SCAN];
-	struct sgx_epc_section *section;
 	struct sgx_encl_page *encl_page;
 	struct sgx_epc_page *epc_page;
 	pgoff_t page_index;
@@ -378,11 +377,7 @@ static void sgx_reclaim_pages(void)
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
 		epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
 
-		section = &sgx_epc_sections[epc_page->section];
-		spin_lock(&section->lock);
-		list_add_tail(&epc_page->list, &section->page_list);
-		section->free_cnt++;
-		spin_unlock(&section->lock);
+		sgx_free_epc_page(epc_page);
 	}
 }
 
-- 
2.30.2



* [PATCH v4 2/3] x86/sgx: Replace section local dirty page lists with a global list
  2021-03-13 16:01 [PATCH v4 0/3] x86/sgx: NUMA Jarkko Sakkinen
  2021-03-13 16:01 ` [PATCH v4 1/3] x86/sgx: Use sgx_free_epc_page() in sgx_reclaim_pages() Jarkko Sakkinen
@ 2021-03-13 16:01 ` Jarkko Sakkinen
  2021-03-15 16:03   ` Dave Hansen
  2021-03-13 16:01 ` [PATCH v4 3/3] x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page() Jarkko Sakkinen
  2 siblings, 1 reply; 14+ messages in thread
From: Jarkko Sakkinen @ 2021-03-13 16:01 UTC (permalink / raw)
  To: linux-sgx
  Cc: Jarkko Sakkinen, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, linux-kernel

Reset initialized EPC pages in sgx_dirty_page_list to uninitialized state,
and free them using sgx_free_epc_page(). Do two passes, as for SECS pages
the first round can fail, if all child pages have not yet been removed.
The driver puts all pages on startup first to sgx_dirty_page_list, as the
initialization could be triggered by kexec(), meaning that pages have been
reserved for active enclaves before the operation.

The section-local lists are redundant, as sgx_free_epc_page() figures
out the correct section by using epc_page->section.

Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
---

v4:
* Open coded sgx_sanitize_section() into ksgxd().
* Rewrote the commit message.

 arch/x86/kernel/cpu/sgx/main.c | 81 ++++++++++++++++------------------
 arch/x86/kernel/cpu/sgx/sgx.h  |  7 ---
 2 files changed, 37 insertions(+), 51 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 65004fb8a91f..cb4561444b96 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -27,39 +27,10 @@ static LIST_HEAD(sgx_active_page_list);
 static DEFINE_SPINLOCK(sgx_reclaimer_lock);
 
 /*
- * Reset dirty EPC pages to uninitialized state. Laundry can be left with SECS
- * pages whose child pages blocked EREMOVE.
+ * When the driver initialized, EPC pages go first here, as they could be
+ * initialized to an active enclave, on kexec entry.
  */
-static void sgx_sanitize_section(struct sgx_epc_section *section)
-{
-	struct sgx_epc_page *page;
-	LIST_HEAD(dirty);
-	int ret;
-
-	/* init_laundry_list is thread-local, no need for a lock: */
-	while (!list_empty(&section->init_laundry_list)) {
-		if (kthread_should_stop())
-			return;
-
-		/* needed for access to ->page_list: */
-		spin_lock(&section->lock);
-
-		page = list_first_entry(&section->init_laundry_list,
-					struct sgx_epc_page, list);
-
-		ret = __eremove(sgx_get_epc_virt_addr(page));
-		if (!ret)
-			list_move(&page->list, &section->page_list);
-		else
-			list_move_tail(&page->list, &dirty);
-
-		spin_unlock(&section->lock);
-
-		cond_resched();
-	}
-
-	list_splice(&dirty, &section->init_laundry_list);
-}
+static LIST_HEAD(sgx_dirty_page_list);
 
 static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page)
 {
@@ -400,25 +371,48 @@ static bool sgx_should_reclaim(unsigned long watermark)
 
 static int ksgxd(void *p)
 {
-	int i;
+	struct sgx_epc_page *page;
+	LIST_HEAD(dirty);
+	int i, ret;
 
 	set_freezable();
 
 	/*
-	 * Sanitize pages in order to recover from kexec(). The 2nd pass is
-	 * required for SECS pages, whose child pages blocked EREMOVE.
+	 * Reset initialized EPC pages in sgx_dirty_page_list to uninitialized state,
+	 * and free them using sgx_free_epc_page(). Do two passes, as for SECS pages the
+	 * first round can fail, if all child pages have not yet been removed.  The
+	 * driver puts all pages on startup first to sgx_dirty_page_list, as the
+	 * initialization could be triggered by kexec(), meaning that pages have been
+	 * reserved for active enclaves before the operation.
 	 */
-	for (i = 0; i < sgx_nr_epc_sections; i++)
-		sgx_sanitize_section(&sgx_epc_sections[i]);
 
-	for (i = 0; i < sgx_nr_epc_sections; i++) {
-		sgx_sanitize_section(&sgx_epc_sections[i]);
+	/* sgx_dirty_page_list is thread-local to ksgxd, no need for a lock: */
+	for (i = 0; i < 2 && !list_empty(&sgx_dirty_page_list); i++) {
+		while (!list_empty(&sgx_dirty_page_list)) {
+			if (kthread_should_stop())
+				return 0;
+
+			page = list_first_entry(&sgx_dirty_page_list, struct sgx_epc_page, list);
+
+			ret = __eremove(sgx_get_epc_virt_addr(page));
+			if (!ret) {
+				/* The page is clean - move to the free list. */
+				list_del(&page->list);
+				sgx_free_epc_page(page);
+			} else {
+				/* The page is not yet clean - move to the dirty list. */
+				list_move_tail(&page->list, &dirty);
+			}
+
+			cond_resched();
+		}
 
-		/* Should never happen. */
-		if (!list_empty(&sgx_epc_sections[i].init_laundry_list))
-			WARN(1, "EPC section %d has unsanitized pages.\n", i);
+		list_splice(&dirty, &sgx_dirty_page_list);
 	}
 
+	if (!list_empty(&sgx_dirty_page_list))
+		WARN(1, "EPC section %d has unsanitized pages.\n", i);
+
 	while (!kthread_should_stop()) {
 		if (try_to_freeze())
 			continue;
@@ -632,13 +626,12 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 	section->phys_addr = phys_addr;
 	spin_lock_init(&section->lock);
 	INIT_LIST_HEAD(&section->page_list);
-	INIT_LIST_HEAD(&section->init_laundry_list);
 
 	for (i = 0; i < nr_pages; i++) {
 		section->pages[i].section = index;
 		section->pages[i].flags = 0;
 		section->pages[i].owner = NULL;
-		list_add_tail(&section->pages[i].list, &section->init_laundry_list);
+		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
 	}
 
 	section->free_cnt = nr_pages;
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 5fa42d143feb..bc8af0428640 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -45,13 +45,6 @@ struct sgx_epc_section {
 	spinlock_t lock;
 	struct list_head page_list;
 	unsigned long free_cnt;
-
-	/*
-	 * Pages which need EREMOVE run on them before they can be
-	 * used.  Only safe to be accessed in ksgxd and init code.
-	 * Not protected by locks.
-	 */
-	struct list_head init_laundry_list;
 };
 
 extern struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
-- 
2.30.2



* [PATCH v4 3/3] x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()
  2021-03-13 16:01 [PATCH v4 0/3] x86/sgx: NUMA Jarkko Sakkinen
  2021-03-13 16:01 ` [PATCH v4 1/3] x86/sgx: Use sgx_free_epc_page() in sgx_reclaim_pages() Jarkko Sakkinen
  2021-03-13 16:01 ` [PATCH v4 2/3] x86/sgx: Replace section local dirty page lists with a global list Jarkko Sakkinen
@ 2021-03-13 16:01 ` Jarkko Sakkinen
  2021-03-14 11:56   ` Jarkko Sakkinen
  2021-03-15 16:35   ` Dave Hansen
  2 siblings, 2 replies; 14+ messages in thread
From: Jarkko Sakkinen @ 2021-03-13 16:01 UTC (permalink / raw)
  To: linux-sgx
  Cc: Jarkko Sakkinen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H. Peter Anvin, Dave Hansen, linux-kernel

Background
==========

EPC section is covered by one or more SRAT entries that are associated with
one and only one PXM (NUMA node). The motivation behind this patch is to
provide basic elements of building allocation scheme based on this premise.

Just like normal RAM, enclave memory (EPC) should be covered by entries
in the ACPI SRAT table.  These entries allow each EPC section to be
associated with a NUMA node.

Use this information to implement a simple NUMA-aware allocator for
enclave memory.

Solution
========

Use phys_to_target_node() to associate each NUMA node with the EPC
sections contained within its range. In sgx_alloc_epc_page(), first try
to allocate from the NUMA node, where the CPU is executing. If that
fails, allocate from other nodes, iterating them from the current node
in order.

Other
=====

NUMA_KEEP_MEMINFO dependency is required for phys_to_target_node().

Link: https://lore.kernel.org/lkml/158188326978.894464.217282995221175417.stgit@dwillia2-desk3.amr.corp.intel.com/
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
---

v4:
* Cycle nodes instead of a global page list, starting from the node
  of the current thread.
* Documented NUMA_KEEP_MEMINFO dependency to the commit message.
* Added NUMA node pointer to struct sgx_epc_section. An EPC page should
  reference a section, since potentially a node could have multiple
  sections (the Intel SDM does not say anything explicit about this).
  This is the safest play.
* Remove nodes_clear(sgx_numa_node_mask).
* Appended Dave's additions to the commit message for the background
  section.

 arch/x86/Kconfig               |   1 +
 arch/x86/kernel/cpu/sgx/main.c | 117 ++++++++++++++++++++-------------
 arch/x86/kernel/cpu/sgx/sgx.h  |  16 +++--
 3 files changed, 84 insertions(+), 50 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 513895af8ee7..3e6152a8dd2b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1930,6 +1930,7 @@ config X86_SGX
 	depends on CRYPTO_SHA256=y
 	select SRCU
 	select MMU_NOTIFIER
+	select NUMA_KEEP_MEMINFO if NUMA
 	help
 	  Intel(R) Software Guard eXtensions (SGX) is a set of CPU instructions
 	  that can be used by applications to set aside private regions of code
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index cb4561444b96..3b524a1361d6 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -18,14 +18,23 @@ static int sgx_nr_epc_sections;
 static struct task_struct *ksgxd_tsk;
 static DECLARE_WAIT_QUEUE_HEAD(ksgxd_waitq);
 
-/*
- * These variables are part of the state of the reclaimer, and must be accessed
- * with sgx_reclaimer_lock acquired.
- */
+/* The reclaimer lock protected variables prepend the lock. */
 static LIST_HEAD(sgx_active_page_list);
-
 static DEFINE_SPINLOCK(sgx_reclaimer_lock);
 
+/* The free page list lock protected variables prepend the lock. */
+static unsigned long sgx_nr_free_pages;
+
+/* Nodes with one or more EPC sections. */
+static nodemask_t sgx_numa_mask;
+
+/*
+ * Array with one list_head for each possible NUMA node.  Each
+ * list contains all the sgx_epc_section's which are on that
+ * node.
+ */
+static struct sgx_numa_node *sgx_numa_nodes;
+
 /*
  * When the driver initialized, EPC pages go first here, as they could be
  * initialized to an active enclave, on kexec entry.
@@ -352,21 +361,9 @@ static void sgx_reclaim_pages(void)
 	}
 }
 
-static unsigned long sgx_nr_free_pages(void)
-{
-	unsigned long cnt = 0;
-	int i;
-
-	for (i = 0; i < sgx_nr_epc_sections; i++)
-		cnt += sgx_epc_sections[i].free_cnt;
-
-	return cnt;
-}
-
 static bool sgx_should_reclaim(unsigned long watermark)
 {
-	return sgx_nr_free_pages() < watermark &&
-	       !list_empty(&sgx_active_page_list);
+	return sgx_nr_free_pages < watermark && !list_empty(&sgx_active_page_list);
 }
 
 static int ksgxd(void *p)
@@ -443,50 +440,63 @@ static bool __init sgx_page_reclaimer_init(void)
 	return true;
 }
 
-static struct sgx_epc_page *__sgx_alloc_epc_page_from_section(struct sgx_epc_section *section)
+static struct sgx_epc_page *__sgx_alloc_epc_page_from_node(int nid)
 {
-	struct sgx_epc_page *page;
+	struct sgx_numa_node *node = &sgx_numa_nodes[nid];
+	struct sgx_epc_page *page = NULL;
+
+	if (!node_isset(nid, sgx_numa_mask))
+		return NULL;
 
-	spin_lock(&section->lock);
+	spin_lock(&node->lock);
 
-	if (list_empty(&section->page_list)) {
-		spin_unlock(&section->lock);
+	if (list_empty(&node->free_page_list)) {
+		spin_unlock(&node->lock);
 		return NULL;
 	}
 
-	page = list_first_entry(&section->page_list, struct sgx_epc_page, list);
+	page = list_first_entry(&node->free_page_list, struct sgx_epc_page, list);
 	list_del_init(&page->list);
-	section->free_cnt--;
+	sgx_nr_free_pages--;
+
+	spin_unlock(&node->lock);
 
-	spin_unlock(&section->lock);
 	return page;
 }
 
 /**
  * __sgx_alloc_epc_page() - Allocate an EPC page
  *
- * Iterate through EPC sections and borrow a free EPC page to the caller. When a
- * page is no longer needed it must be released with sgx_free_epc_page().
+ * Iterate through NUMA nodes and borrow a free EPC page to the caller. When a
+ * page is no longer needed it must be released with sgx_free_epc_page(). Start
+ * from the NUMA node, where the caller is executing.
  *
  * Return:
- *   an EPC page,
- *   -errno on error
+ * - an EPC page:	Free EPC pages were available.
+ * - ERR_PTR(-ENOMEM):	Run out of EPC pages.
  */
 struct sgx_epc_page *__sgx_alloc_epc_page(void)
 {
-	struct sgx_epc_section *section;
 	struct sgx_epc_page *page;
-	int i;
+	int nid = numa_node_id();
 
-	for (i = 0; i < sgx_nr_epc_sections; i++) {
-		section = &sgx_epc_sections[i];
+	/* Try to allocate EPC from the current node, first: */
+	page = __sgx_alloc_epc_page_from_node(nid);
+	if (page)
+		return page;
 
-		page = __sgx_alloc_epc_page_from_section(section);
+	/* Then, go through the other nodes: */
+	while (true) {
+		nid = next_node_in(nid, sgx_numa_mask);
+		if (nid == numa_node_id())
+			break;
+
+		page = __sgx_alloc_epc_page_from_node(nid);
 		if (page)
-			return page;
+			break;
 	}
 
-	return ERR_PTR(-ENOMEM);
+	return page;
 }
 
 /**
@@ -592,6 +602,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 void sgx_free_epc_page(struct sgx_epc_page *page)
 {
 	struct sgx_epc_section *section = &sgx_epc_sections[page->section];
+	struct sgx_numa_node *node = section->node;
 	int ret;
 
 	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
@@ -600,10 +611,12 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 	if (WARN_ONCE(ret, "EREMOVE returned %d (0x%x)", ret, ret))
 		return;
 
-	spin_lock(&section->lock);
-	list_add_tail(&page->list, &section->page_list);
-	section->free_cnt++;
-	spin_unlock(&section->lock);
+	spin_lock(&node->lock);
+
+	list_add_tail(&page->list, &node->free_page_list);
+	sgx_nr_free_pages++;
+
+	spin_unlock(&node->lock);
 }
 
 static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
@@ -624,8 +637,6 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 	}
 
 	section->phys_addr = phys_addr;
-	spin_lock_init(&section->lock);
-	INIT_LIST_HEAD(&section->page_list);
 
 	for (i = 0; i < nr_pages; i++) {
 		section->pages[i].section = index;
@@ -634,7 +645,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
 	}
 
-	section->free_cnt = nr_pages;
+	sgx_nr_free_pages += nr_pages;
 	return true;
 }
 
@@ -653,8 +664,11 @@ static bool __init sgx_page_cache_init(void)
 {
 	u32 eax, ebx, ecx, edx, type;
 	u64 pa, size;
+	int nid;
 	int i;
 
+	sgx_numa_nodes = kmalloc_array(num_possible_nodes(), sizeof(*sgx_numa_nodes), GFP_KERNEL);
+
 	for (i = 0; i < ARRAY_SIZE(sgx_epc_sections); i++) {
 		cpuid_count(SGX_CPUID, i + SGX_CPUID_EPC, &eax, &ebx, &ecx, &edx);
 
@@ -677,6 +691,21 @@ static bool __init sgx_page_cache_init(void)
 			break;
 		}
 
+		nid = numa_map_to_online_node(phys_to_target_node(pa));
+		if (nid == NUMA_NO_NODE) {
+			/* The physical address is already printed above. */
+			pr_warn(FW_BUG "Unable to map EPC section to online node. Fallback to the NUMA node 0.\n");
+			nid = 0;
+		}
+
+		if (!node_isset(nid, sgx_numa_mask)) {
+			spin_lock_init(&sgx_numa_nodes[nid].lock);
+			INIT_LIST_HEAD(&sgx_numa_nodes[nid].free_page_list);
+			node_set(nid, sgx_numa_mask);
+		}
+
+		sgx_epc_sections[i].node =  &sgx_numa_nodes[nid];
+
 		sgx_nr_epc_sections++;
 	}
 
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index bc8af0428640..653af8ca1a25 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -29,22 +29,26 @@ struct sgx_epc_page {
 	struct list_head list;
 };
 
+/*
+ * Contains the tracking data for NUMA nodes having EPC pages. Most importantly,
+ * the free page list local to the node is stored here.
+ */
+struct sgx_numa_node {
+	struct list_head free_page_list;
+	spinlock_t lock;
+};
+
 /*
  * The firmware can define multiple chunks of EPC to the different areas of the
  * physical memory e.g. for memory areas of the each node. This structure is
  * used to store EPC pages for one EPC section and virtual memory area where
  * the pages have been mapped.
- *
- * 'lock' must be held before accessing 'page_list' or 'free_cnt'.
  */
 struct sgx_epc_section {
 	unsigned long phys_addr;
 	void *virt_addr;
 	struct sgx_epc_page *pages;
-
-	spinlock_t lock;
-	struct list_head page_list;
-	unsigned long free_cnt;
+	struct sgx_numa_node *node;
 };
 
 extern struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
-- 
2.30.2



* Re: [PATCH v4 3/3] x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()
  2021-03-13 16:01 ` [PATCH v4 3/3] x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page() Jarkko Sakkinen
@ 2021-03-14 11:56   ` Jarkko Sakkinen
  2021-03-15 16:35   ` Dave Hansen
  1 sibling, 0 replies; 14+ messages in thread
From: Jarkko Sakkinen @ 2021-03-14 11:56 UTC (permalink / raw)
  To: linux-sgx
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Dave Hansen, linux-kernel

On Sat, Mar 13, 2021 at 06:01:19PM +0200, Jarkko Sakkinen wrote:
> Background
> ==========
> 
> EPC section is covered by one or more SRAT entries that are associated with
> one and only one PXM (NUMA node). The motivation behind this patch is to
> provide basic elements of building allocation scheme based on this premise.
> 
> Just like normal RAM, enclave memory (EPC) should be covered by entries
> in the ACPI SRAT table.  These entries allow each EPC section to be
> associated with a NUMA node.
> 
> Use this information to implement a simple NUMA-aware allocator for
> enclave memory.
> 
> Solution
> ========
> 
> Use phys_to_target_node() to associate each NUMA node with the EPC
> sections contained within its range. In sgx_alloc_epc_page(), first try
> to allocate from the NUMA node, where the CPU is executing. If that
> fails, allocate from other nodes, iterating them from the current node
> in order.
> 
> Other
> =====
> 
> NUMA_KEEP_MEMINFO dependency is required for phys_to_target_node().
> 
> Link: https://lore.kernel.org/lkml/158188326978.894464.217282995221175417.stgit@dwillia2-desk3.amr.corp.intel.com/
> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>

We *can* also have epc_page->node by:

- Considering the first section as the EPC of that node.
- Printing a warning if more sections hit the same node, and
  ignoring them.
- Merging sgx_numa_node and sgx_epc_section

I think this would be a decent idea. I think it's a sane assumption that
a node has a single EPC section, but it's good to have that warning just
in case.

I did not want to do this in this version, because it's faster for me
to refactor into this assumption later than to revert back.
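
For illustration, the merged structure could look roughly like this
(just a sketch of the idea, not code from this series):

struct sgx_epc_section {
	unsigned long phys_addr;
	void *virt_addr;
	struct sgx_epc_page *pages;

	/* Former struct sgx_numa_node fields, assuming one section per node: */
	struct list_head free_page_list;
	spinlock_t lock;
};

Then epc_page->section would effectively give epc_page->node for free.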

/Jarkko



* Re: [PATCH v4 1/3] x86/sgx: Use sgx_free_epc_page() in sgx_reclaim_pages()
  2021-03-13 16:01 ` [PATCH v4 1/3] x86/sgx: Use sgx_free_epc_page() in sgx_reclaim_pages() Jarkko Sakkinen
@ 2021-03-15 15:32   ` Dave Hansen
  2021-03-15 19:06     ` Jarkko Sakkinen
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Hansen @ 2021-03-15 15:32 UTC (permalink / raw)
  To: Jarkko Sakkinen, linux-sgx
  Cc: Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, linux-kernel

On 3/13/21 8:01 AM, Jarkko Sakkinen wrote:
> Replace the ad-hoc code with a sgx_free_epc_page(), in order to make sure
> that all the relevant checks and book keeping is done, while freeing a
> borrowed EPC page, and remove redundant code. EREMOVE inside
> sgx_free_epc_page() does not change the semantics, as EREMOVE to an
> uninitialize pages is a nop.

  ^ uninitialized

I know this is a short patch, but this changelog still falls a bit short
for me.

Why is this patch a part of _this_ series?  What *problem* does it
solve, related to this series?

It would also be nice to remind me why the EREMOVE is redundant.  Why
didn't we need one before?  What put the page in the uninitialized
state?  Is EREMOVE guaranteed to do nothing?  How expensive is it?



> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 8df81a3ed945..65004fb8a91f 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -305,7 +305,6 @@ static void sgx_reclaim_pages(void)
>  {
>  	struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
>  	struct sgx_backing backing[SGX_NR_TO_SCAN];
> -	struct sgx_epc_section *section;
>  	struct sgx_encl_page *encl_page;
>  	struct sgx_epc_page *epc_page;
>  	pgoff_t page_index;
> @@ -378,11 +377,7 @@ static void sgx_reclaim_pages(void)
>  		kref_put(&encl_page->encl->refcount, sgx_encl_release);
>  		epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
>  
> -		section = &sgx_epc_sections[epc_page->section];
> -		spin_lock(&section->lock);
> -		list_add_tail(&epc_page->list, &section->page_list);
> -		section->free_cnt++;
> -		spin_unlock(&section->lock);
> +		sgx_free_epc_page(epc_page);
>  	}
>  }
>  
> 



* Re: [PATCH v4 2/3] x86/sgx: Replace section local dirty page lists with a global list
  2021-03-13 16:01 ` [PATCH v4 2/3] x86/sgx: Replace section local dirty page lists with a global list Jarkko Sakkinen
@ 2021-03-15 16:03   ` Dave Hansen
  2021-03-15 19:14     ` Jarkko Sakkinen
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Hansen @ 2021-03-15 16:03 UTC (permalink / raw)
  To: Jarkko Sakkinen, linux-sgx
  Cc: Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, linux-kernel

On 3/13/21 8:01 AM, Jarkko Sakkinen wrote:
> Reset initialized EPC pages in sgx_dirty_page_list to uninitialized state,
> and free them using sgx_free_epc_page(). Do two passes, as for SECS pages
> the first round can fail, if all child pages have not yet been removed.
> The driver puts all pages on startup first to sgx_dirty_page_list, as the
> initialization could be triggered by kexec(), meaning that pages have been
> reserved for active enclaves before the operation.
> 
> The section-local lists are redundant, as sgx_free_epc_page() figures
> out the correct section by using epc_page->section.

During normal runtime, the "ksgxd" daemon behaves like a version of
kswapd just for SGX.  But, its first job is to initialize enclave
memory.  This is done in a separate thread because this initialization
can be quite slow.

Currently, the SGX boot code places each enclave page on a
sgx_section-local list (init_laundry_list).  Once it starts up, the
ksgxd code walks over that list and populates the actual SGX page allocator.

However, the per-section structures are going away to make way for the
SGX NUMA allocator.  There's also little need to have a per-section
structure; the enclave pages are all treated identically, and they can
be placed on the correct allocator list from metadata stored in the
enclave page itself.

> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 65004fb8a91f..cb4561444b96 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -27,39 +27,10 @@ static LIST_HEAD(sgx_active_page_list);
>  static DEFINE_SPINLOCK(sgx_reclaimer_lock);
>  
>  /*
> - * Reset dirty EPC pages to uninitialized state. Laundry can be left with SECS
> - * pages whose child pages blocked EREMOVE.
> + * When the driver initialized, EPC pages go first here, as they could be
> + * initialized to an active enclave, on kexec entry.
>   */
> -static void sgx_sanitize_section(struct sgx_epc_section *section)
> -{
> -	struct sgx_epc_page *page;
> -	LIST_HEAD(dirty);
> -	int ret;
> -
> -	/* init_laundry_list is thread-local, no need for a lock: */
> -	while (!list_empty(&section->init_laundry_list)) {
> -		if (kthread_should_stop())
> -			return;
> -
> -		/* needed for access to ->page_list: */
> -		spin_lock(&section->lock);
> -
> -		page = list_first_entry(&section->init_laundry_list,
> -					struct sgx_epc_page, list);
> -
> -		ret = __eremove(sgx_get_epc_virt_addr(page));
> -		if (!ret)
> -			list_move(&page->list, &section->page_list);
> -		else
> -			list_move_tail(&page->list, &dirty);
> -
> -		spin_unlock(&section->lock);
> -
> -		cond_resched();
> -	}
> -
> -	list_splice(&dirty, &section->init_laundry_list);
> -}
> +static LIST_HEAD(sgx_dirty_page_list);
>  
>  static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page)
>  {
> @@ -400,25 +371,48 @@ static bool sgx_should_reclaim(unsigned long watermark)
>  
>  static int ksgxd(void *p)
>  {
> -	int i;
> +	struct sgx_epc_page *page;
> +	LIST_HEAD(dirty);
> +	int i, ret;
>  
>  	set_freezable();
>  
>  	/*
> -	 * Sanitize pages in order to recover from kexec(). The 2nd pass is
> -	 * required for SECS pages, whose child pages blocked EREMOVE.
> +	 * Reset initialized EPC pages in sgx_dirty_page_list to uninitialized state,
> +	 * and free them using sgx_free_epc_page(). 

I'm not a fan of comments that tell us verbatim what the code does.

>							Do two passes, as for SECS pages the
> +	 * first round can fail, if all child pages have not yet been removed.  The
> +	 * driver puts all pages on startup first to sgx_dirty_page_list, as the
> +	 * initialization could be triggered by kexec(), meaning that pages have been
> +	 * reserved for active enclaves before the operation.
>  	 */



> -	for (i = 0; i < sgx_nr_epc_sections; i++)
> -		sgx_sanitize_section(&sgx_epc_sections[i]);
>  
> -	for (i = 0; i < sgx_nr_epc_sections; i++) {
> -		sgx_sanitize_section(&sgx_epc_sections[i]);

FWIW, I don't like the removal of the helper here.  I really like kernel
threads' top-level function to be very understandable and clean.  This
makes it quite a bit harder to figure out what is going on.

For instance, we could just have a sgx_sanitize_pages() which has a
local dirty list and just calls:

void sgx_sanitize_pages(void)
{
	LIST_HEAD(dirty);

	/*
	 * Comment about two passes
	 */
	__sgx_sanitize_pages(&dirty);
	__sgx_sanitize_pages(&dirty);
}
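
And __sgx_sanitize_pages() could be carved out of the loop below, roughly
like this (an untested sketch, helper name hypothetical):

static void __sgx_sanitize_pages(struct list_head *dirty)
{
	struct sgx_epc_page *page;
	int ret;

	/* sgx_dirty_page_list is thread-local to ksgxd, no need for a lock: */
	while (!list_empty(&sgx_dirty_page_list)) {
		if (kthread_should_stop())
			return;

		page = list_first_entry(&sgx_dirty_page_list, struct sgx_epc_page, list);

		ret = __eremove(sgx_get_epc_virt_addr(page));
		if (!ret) {
			/* Sanitized - hand the page to the SGX page allocator: */
			list_del(&page->list);
			sgx_free_epc_page(page);
		} else {
			/* Not yet clean (SECS with children) - retry next pass: */
			list_move_tail(&page->list, dirty);
		}

		cond_resched();
	}

	/* Queue the failures back up for the next pass: */
	list_splice_init(dirty, &sgx_dirty_page_list);
}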

> +	/* sgx_dirty_page_list is thread-local to ksgxd, no need for a lock: */
> +	for (i = 0; i < 2 && !list_empty(&sgx_dirty_page_list); i++) {
> +		while (!list_empty(&sgx_dirty_page_list)) {
> +			if (kthread_should_stop())
> +				return 0;
> +
> +			page = list_first_entry(&sgx_dirty_page_list, struct sgx_epc_page, list);
> +
> +			ret = __eremove(sgx_get_epc_virt_addr(page));
> +			if (!ret) {
> +				/* The page is clean - move to the free list. */

I'd even say:
				/*
				 * page is now sanitized.  Make it
				 * available via the SGX page allocator:
				 */

See what that does?  It actually links the "cleaning" to the freeing.

> +				list_del(&page->list);
> +				sgx_free_epc_page(page);
> +			} else {
> +				/* The page is not yet clean - move to the dirty list. */
> +				list_move_tail(&page->list, &dirty);
> +			}
> +
> +			cond_resched();
> +		}
>  
> -		/* Should never happen. */
> -		if (!list_empty(&sgx_epc_sections[i].init_laundry_list))
> -			WARN(1, "EPC section %d has unsanitized pages.\n", i);
> +		list_splice(&dirty, &sgx_dirty_page_list);
>  	}
>  
> +	if (!list_empty(&sgx_dirty_page_list))
> +		WARN(1, "EPC section %d has unsanitized pages.\n", i);
> +
>  	while (!kthread_should_stop()) {
>  		if (try_to_freeze())
>  			continue;
> @@ -632,13 +626,12 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
>  	section->phys_addr = phys_addr;
>  	spin_lock_init(&section->lock);
>  	INIT_LIST_HEAD(&section->page_list);
> -	INIT_LIST_HEAD(&section->init_laundry_list);
>  
>  	for (i = 0; i < nr_pages; i++) {
>  		section->pages[i].section = index;
>  		section->pages[i].flags = 0;
>  		section->pages[i].owner = NULL;
> -		list_add_tail(&section->pages[i].list, &section->init_laundry_list);
> +		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
>  	}
...


* Re: [PATCH v4 3/3] x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()
  2021-03-13 16:01 ` [PATCH v4 3/3] x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page() Jarkko Sakkinen
  2021-03-14 11:56   ` Jarkko Sakkinen
@ 2021-03-15 16:35   ` Dave Hansen
  2021-03-15 19:23     ` Jarkko Sakkinen
  1 sibling, 1 reply; 14+ messages in thread
From: Dave Hansen @ 2021-03-15 16:35 UTC (permalink / raw)
  To: Jarkko Sakkinen, linux-sgx
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Dave Hansen, linux-kernel

On 3/13/21 8:01 AM, Jarkko Sakkinen wrote:
> Background
> ==========
> 
> EPC section is covered by one or more SRAT entries that are associated with
> one and only one PXM (NUMA node). The motivation behind this patch is to
> provide basic elements of building allocation scheme based on this premise.

A better set of background information here would also remind folks what
an 'EPC section' is.

> Just like normal RAM, enclave memory (EPC) should be covered by entries
> in the ACPI SRAT table.  These entries allow each EPC section to be
> associated with a NUMA node.
> 
> Use this information to implement a simple NUMA-aware allocator for
> enclave memory.

SGX enclave memory is enumerated by the processor in contiguous physical
ranges called "EPC sections".  Currently, there is a free list per
section, but allocations simply target the lowest-numbered sections.
This is functional, but has no NUMA awareness.

Fortunately, EPC sections are covered by entries in the ACPI SRAT table.
These entries allow each EPC section to be associated with a NUMA node,
just like normal RAM.


> Solution
> ========
> 
> Use phys_to_target_node() to associate each NUMA node with the EPC
> sections contained within its range. In sgx_alloc_epc_page(), first try
> to allocate from the NUMA node, where the CPU is executing. If that
> fails, allocate from other nodes, iterating them from the current node
> in order.

To me, this is just telling us what the code does.  It's not very
useful.  I'd say:

Implement a NUMA-aware enclave page allocator.  Mirror the buddy
allocator and maintain a list of enclave pages for each NUMA node.
Attempt to allocate enclave memory first from local nodes, then fall
back to other nodes.

Note that the fallback is not as sophisticated as the buddy allocator
and is itself not aware of NUMA distances.  When a node's free list is
empty, it searches for the next-highest node with enclave pages (and
will wrap if necessary).  This could be improved in the future.

> Other
> =====
> 
> NUMA_KEEP_MEMINFO dependency is required for phys_to_target_node().

Reading the changelog, it's not obvious that you're talking about a
Kconfig variable here.

I was also thinking, this says:

	# Keep arch NUMA mapping infrastructure post-init.
	config NUMA_KEEP_MEMINFO
	        bool

But we only need it during SGX init.  Could you explain a bit why it's
needed here specifically?  Superficially it seems like we only need this
info *during* init.

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 513895af8ee7..3e6152a8dd2b 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1930,6 +1930,7 @@ config X86_SGX
>  	depends on CRYPTO_SHA256=y
>  	select SRCU
>  	select MMU_NOTIFIER
> +	select NUMA_KEEP_MEMINFO if NUMA
>  	help
>  	  Intel(R) Software Guard eXtensions (SGX) is a set of CPU instructions
>  	  that can be used by applications to set aside private regions of code
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index cb4561444b96..3b524a1361d6 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -18,14 +18,23 @@ static int sgx_nr_epc_sections;
>  static struct task_struct *ksgxd_tsk;
>  static DECLARE_WAIT_QUEUE_HEAD(ksgxd_waitq);
>  
> -/*
> - * These variables are part of the state of the reclaimer, and must be accessed
> - * with sgx_reclaimer_lock acquired.
> - */
> +/* The reclaimer lock protected variables prepend the lock. */
>  static LIST_HEAD(sgx_active_page_list);
> -
>  static DEFINE_SPINLOCK(sgx_reclaimer_lock);
>  
> +/* The free page list lock protected variables prepend the lock. */
> +static unsigned long sgx_nr_free_pages;
> +
> +/* Nodes with one or more EPC sections. */
> +static nodemask_t sgx_numa_mask;
> +
> +/*
> + * Array with one list_head for each possible NUMA node.  Each
> + * list contains all the sgx_epc_section's which are on that
> + * node.
> + */
> +static struct sgx_numa_node *sgx_numa_nodes;
> +
>  /*
>   * When the driver initialized, EPC pages go first here, as they could be
>   * initialized to an active enclave, on kexec entry.
> @@ -352,21 +361,9 @@ static void sgx_reclaim_pages(void)
>  	}
>  }
>  
> -static unsigned long sgx_nr_free_pages(void)
> -{
> -	unsigned long cnt = 0;
> -	int i;
> -
> -	for (i = 0; i < sgx_nr_epc_sections; i++)
> -		cnt += sgx_epc_sections[i].free_cnt;
> -
> -	return cnt;
> -}
> -
>  static bool sgx_should_reclaim(unsigned long watermark)
>  {
> -	return sgx_nr_free_pages() < watermark &&
> -	       !list_empty(&sgx_active_page_list);
> +	return sgx_nr_free_pages < watermark && !list_empty(&sgx_active_page_list);
>  }
>  
>  static int ksgxd(void *p)
> @@ -443,50 +440,63 @@ static bool __init sgx_page_reclaimer_init(void)
>  	return true;
>  }
>  
> -static struct sgx_epc_page *__sgx_alloc_epc_page_from_section(struct sgx_epc_section *section)
> +static struct sgx_epc_page *__sgx_alloc_epc_page_from_node(int nid)
>  {
> -	struct sgx_epc_page *page;
> +	struct sgx_numa_node *node = &sgx_numa_nodes[nid];
> +	struct sgx_epc_page *page = NULL;
> +
> +	if (!node_isset(nid, sgx_numa_mask))
> +		return NULL;

I don't think this mask check is super necessary.  It won't ever trigger
for the "fallback" path.  For the node-local allocation, I guess it
saves a cacheline (node->free_page_list).

> -	spin_lock(&section->lock);
> +	spin_lock(&node->lock);
>  
> -	if (list_empty(&section->page_list)) {
> -		spin_unlock(&section->lock);
> +	if (list_empty(&node->free_page_list)) {
> +		spin_unlock(&node->lock);
>  		return NULL;
>  	}
>  
> -	page = list_first_entry(&section->page_list, struct sgx_epc_page, list);
> +	page = list_first_entry(&node->free_page_list, struct sgx_epc_page, list);
>  	list_del_init(&page->list);
> -	section->free_cnt--;
> +	sgx_nr_free_pages--;
> +
> +	spin_unlock(&node->lock);
>  
> -	spin_unlock(&section->lock);
>  	return page;
>  }
>  
>  /**
>   * __sgx_alloc_epc_page() - Allocate an EPC page
>   *
> - * Iterate through EPC sections and borrow a free EPC page to the caller. When a
> - * page is no longer needed it must be released with sgx_free_epc_page().
> + * Iterate through NUMA nodes and borrow a free EPC page to the caller. When a

"borrow" is a weird word to use here.

> + * page is no longer needed it must be released with sgx_free_epc_page(). Start
> + * from the NUMA node, where the caller is executing.

I don't think we need to tell folks how allocators work: that "free" and
"alloc" are opposites.

>   * Return:
> - *   an EPC page,
> - *   -errno on error
> + * - an EPC page:	Free EPC pages were available.
> + * - ERR_PTR(-ENOMEM):	Run out of EPC pages.
>   */

I think returning NULL from an allocator is probably OK.  Folks
understand that NULL means ENOMEM.

>  struct sgx_epc_page *__sgx_alloc_epc_page(void)
>  {
> -	struct sgx_epc_section *section;
>  	struct sgx_epc_page *page;
> -	int i;
> +	int nid = numa_node_id();
>  
> -	for (i = 0; i < sgx_nr_epc_sections; i++) {
> -		section = &sgx_epc_sections[i];
> +	/* Try to allocate EPC from the current node, first: */
> +	page = __sgx_alloc_epc_page_from_node(nid);
> +	if (page)
> +		return page;
>  
> -		page = __sgx_alloc_epc_page_from_section(section);
> +	/* Then, go through the other nodes: */
> +	while (true) {
> +		nid = next_node_in(nid, sgx_numa_mask);
> +		if (nid == numa_node_id())
> +			break;

One nit: This can _theoretically_ livelock.  Preemption is enabled here
and numa_node_id() can change.  It's theoretically possible that
numa_node_id()!=nid forever.

The way to prevent this is to read numa_node_id() once and then compare
'nid' against that single read.

> +		page = __sgx_alloc_epc_page_from_node(nid);
>  		if (page)
> -			return page;
> +			break;
>  	}
>  
> -	return ERR_PTR(-ENOMEM);
> +	return page;
>  }

I guess you probably wrote it this way because you prefer it.  But, this
can be written with a single call to __sgx_alloc_epc_page_from_node().

>  /**
> @@ -592,6 +602,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  void sgx_free_epc_page(struct sgx_epc_page *page)
>  {
>  	struct sgx_epc_section *section = &sgx_epc_sections[page->section];
> +	struct sgx_numa_node *node = section->node;
>  	int ret;
>  
>  	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
> @@ -600,10 +611,12 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>  	if (WARN_ONCE(ret, "EREMOVE returned %d (0x%x)", ret, ret))
>  		return;
>  
> -	spin_lock(&section->lock);
> -	list_add_tail(&page->list, &section->page_list);
> -	section->free_cnt++;
> -	spin_unlock(&section->lock);
> +	spin_lock(&node->lock);
> +
> +	list_add_tail(&page->list, &node->free_page_list);
> +	sgx_nr_free_pages++;
> +
> +	spin_unlock(&node->lock);
>  }
>  
>  static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
> @@ -624,8 +637,6 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
>  	}
>  
>  	section->phys_addr = phys_addr;
> -	spin_lock_init(&section->lock);
> -	INIT_LIST_HEAD(&section->page_list);
>  
>  	for (i = 0; i < nr_pages; i++) {
>  		section->pages[i].section = index;
> @@ -634,7 +645,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
>  		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
>  	}
>  
> -	section->free_cnt = nr_pages;
> +	sgx_nr_free_pages += nr_pages;
>  	return true;
>  }
>  
> @@ -653,8 +664,11 @@ static bool __init sgx_page_cache_init(void)
>  {
>  	u32 eax, ebx, ecx, edx, type;
>  	u64 pa, size;
> +	int nid;
>  	int i;
>  
> +	sgx_numa_nodes = kmalloc_array(num_possible_nodes(), sizeof(*sgx_numa_nodes), GFP_KERNEL);

Needs a NULL check.
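
E.g. something like (a sketch; sgx_page_cache_init() already returns
false on failure):

	sgx_numa_nodes = kmalloc_array(num_possible_nodes(),
				       sizeof(*sgx_numa_nodes), GFP_KERNEL);
	if (!sgx_numa_nodes)
		return false;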

>  	for (i = 0; i < ARRAY_SIZE(sgx_epc_sections); i++) {
>  		cpuid_count(SGX_CPUID, i + SGX_CPUID_EPC, &eax, &ebx, &ecx, &edx);
>  
> @@ -677,6 +691,21 @@ static bool __init sgx_page_cache_init(void)
>  			break;
>  		}
>  
> +		nid = numa_map_to_online_node(phys_to_target_node(pa));
> +		if (nid == NUMA_NO_NODE) {
> +			/* The physical address is already printed above. */
> +			pr_warn(FW_BUG "Unable to map EPC section to online node. Fallback to the NUMA node 0.\n");
> +			nid = 0;
> +		}
> +
> +		if (!node_isset(nid, sgx_numa_mask)) {
> +			spin_lock_init(&sgx_numa_nodes[nid].lock);
> +			INIT_LIST_HEAD(&sgx_numa_nodes[nid].free_page_list);
> +			node_set(nid, sgx_numa_mask);
> +		}
> +
> +		sgx_epc_sections[i].node =  &sgx_numa_nodes[nid];
> +
>  		sgx_nr_epc_sections++;
>  	}

The rest looks OK.


* Re: [PATCH v4 1/3] x86/sgx: Use sgx_free_epc_page() in sgx_reclaim_pages()
  2021-03-15 15:32   ` Dave Hansen
@ 2021-03-15 19:06     ` Jarkko Sakkinen
  2021-03-15 19:27       ` Jarkko Sakkinen
  0 siblings, 1 reply; 14+ messages in thread
From: Jarkko Sakkinen @ 2021-03-15 19:06 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-sgx, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, linux-kernel

On Mon, Mar 15, 2021 at 08:32:13AM -0700, Dave Hansen wrote:
> On 3/13/21 8:01 AM, Jarkko Sakkinen wrote:
> > Replace the ad-hoc code with a sgx_free_epc_page(), in order to make sure
> > that all the relevant checks and book keeping is done, while freeing a
> > borrowed EPC page, and remove redundant code. EREMOVE inside
> > sgx_free_epc_page() does not change the semantics, as EREMOVE to an
> > uninitialize pages is a nop.
> 
>   ^ uninitialized
> 
> I know this is a short patch, but this changelog still falls a bit short
> for me.
> 
> Why is this patch a part of _this_ series?  What *problem* does it
> solve, related to this series?

I'm thinking of merging sgx_epc_section and sgx_numa_node. That's why I
kept it as part of the series. 

Also, in any case it's better to clean up duplicate functionality. The
code is essentially an open-coded implementation of sgx_free_epc_page()
without EREMOVE.

> It would also be nice to remind me why the EREMOVE is redundant.  Why
> didn't we need one before?  What put the page in the uninitialized
> state?  Is EREMOVE guaranteed to do nothing?  How expensive is it?

EREMOVE gets removed from sgx_free_epc_page() by the KVM series anyway.

Maybe I should re-send this patch, or the series, after the KVM series
is merged. Then there is no need to explain EREMOVE, as
sgx_free_epc_page() won't contain it.

/Jarkko


* Re: [PATCH v4 2/3] x86/sgx: Replace section local dirty page lists with a global list
  2021-03-15 16:03   ` Dave Hansen
@ 2021-03-15 19:14     ` Jarkko Sakkinen
  2021-03-15 19:47       ` Dave Hansen
  0 siblings, 1 reply; 14+ messages in thread
From: Jarkko Sakkinen @ 2021-03-15 19:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-sgx, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, linux-kernel

On Mon, Mar 15, 2021 at 09:03:21AM -0700, Dave Hansen wrote:
> On 3/13/21 8:01 AM, Jarkko Sakkinen wrote:
> > Reset initialized EPC pages in sgx_dirty_page_list to uninitialized state,
> > and free them using sgx_free_epc_page(). Do two passes, as for SECS pages
> > the first round can fail, if all child pages have not yet been removed.
> > The driver puts all pages on startup first to sgx_dirty_page_list, as the
> > initialization could be triggered by kexec(), meaning that pages have been
> > reserved for active enclaves before the operation.
> > 
> > The section-local lists are redundant, as sgx_free_epc_page() figures
> > out the correct section by using epc_page->section.
> 
> During normal runtime, the "ksgxd" daemon behaves like a version of
> kswapd just for SGX.  But, its first job is to initialize enclave
> memory.  This is done in a separate thread because this initialization
> can be quite slow.
> 
> Currently, the SGX boot code places each enclave page on a
> sgx_section-local list (init_laundry_list).  Once it starts up, the
> ksgxd code walks over that list and populates the actual SGX page allocator.
> 
> However, the per-section structures are going away to make way for the
> SGX NUMA allocator.  There's also little need to have a per-section
> structure; the enclave pages are all treated identically, and they can
> be placed on the correct allocator list from metadata stored in the
> enclave page itself.
> 
Is this a suggestion for how to rephrase the commit message? :-)

> > diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> > index 65004fb8a91f..cb4561444b96 100644
> > --- a/arch/x86/kernel/cpu/sgx/main.c
> > +++ b/arch/x86/kernel/cpu/sgx/main.c
> > @@ -27,39 +27,10 @@ static LIST_HEAD(sgx_active_page_list);
> >  static DEFINE_SPINLOCK(sgx_reclaimer_lock);
> >  
> >  /*
> > - * Reset dirty EPC pages to uninitialized state. Laundry can be left with SECS
> > - * pages whose child pages blocked EREMOVE.
> > + * When the driver initialized, EPC pages go first here, as they could be
> > + * initialized to an active enclave, on kexec entry.
> >   */
> > -static void sgx_sanitize_section(struct sgx_epc_section *section)
> > -{
> > -	struct sgx_epc_page *page;
> > -	LIST_HEAD(dirty);
> > -	int ret;
> > -
> > -	/* init_laundry_list is thread-local, no need for a lock: */
> > -	while (!list_empty(&section->init_laundry_list)) {
> > -		if (kthread_should_stop())
> > -			return;
> > -
> > -		/* needed for access to ->page_list: */
> > -		spin_lock(&section->lock);
> > -
> > -		page = list_first_entry(&section->init_laundry_list,
> > -					struct sgx_epc_page, list);
> > -
> > -		ret = __eremove(sgx_get_epc_virt_addr(page));
> > -		if (!ret)
> > -			list_move(&page->list, &section->page_list);
> > -		else
> > -			list_move_tail(&page->list, &dirty);
> > -
> > -		spin_unlock(&section->lock);
> > -
> > -		cond_resched();
> > -	}
> > -
> > -	list_splice(&dirty, &section->init_laundry_list);
> > -}
> > +static LIST_HEAD(sgx_dirty_page_list);
> >  
> >  static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page)
> >  {
> > @@ -400,25 +371,48 @@ static bool sgx_should_reclaim(unsigned long watermark)
> >  
> >  static int ksgxd(void *p)
> >  {
> > -	int i;
> > +	struct sgx_epc_page *page;
> > +	LIST_HEAD(dirty);
> > +	int i, ret;
> >  
> >  	set_freezable();
> >  
> >  	/*
> > -	 * Sanitize pages in order to recover from kexec(). The 2nd pass is
> > -	 * required for SECS pages, whose child pages blocked EREMOVE.
> > +	 * Reset initialized EPC pages in sgx_dirty_page_list to uninitialized state,
> > +	 * and free them using sgx_free_epc_page(). 
> 
> I'm not a fan of comments that tell us verbatim what the code does.

So, what are you suggesting? Remove the whole comment or parts of it? I'm
presuming that you suggest removing the "top-level" part.

> 
> >							Do two passes, as for SECS pages the
> > +	 * first round can fail, if all child pages have not yet been removed.  The
> > +	 * driver puts all pages on startup first to sgx_dirty_page_list, as the
> > +	 * initialization could be triggered by kexec(), meaning that pages have been
> > +	 * reserved for active enclaves before the operation.
> >  	 */
> 
> 
> 
> > -	for (i = 0; i < sgx_nr_epc_sections; i++)
> > -		sgx_sanitize_section(&sgx_epc_sections[i]);
> >  
> > -	for (i = 0; i < sgx_nr_epc_sections; i++) {
> > -		sgx_sanitize_section(&sgx_epc_sections[i]);
> 
> FWIW, I don't like the removal of the helper here.  I really like kernel
> threads' top-level function to be very understandable and clean.  This
> makes it quite a bit harder to figure out what is going on.
> 
> For instance, we could just have a sgx_sanitize_pages() which has a
> local dirty list and just calls:
> 
> void sgx_sanitize_pages(void)
> {
> 	LIST_HEAD(dirty);
> 
> 	/*
> 	 * Comment about two passes
> 	 */
> 	__sgx_sanitize_pages(&dirty)
> 	__sgx_sanitize_pages(&dirty)

OK.

> }
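
So I suppose the helper would be something like this (completely
untested, just to check that I read you right):

static void __sgx_sanitize_pages(struct list_head *dirty)
{
	struct sgx_epc_page *page;
	int ret;

	/* sgx_dirty_page_list is thread-local to ksgxd, no need for a lock: */
	while (!list_empty(&sgx_dirty_page_list)) {
		if (kthread_should_stop())
			return;

		page = list_first_entry(&sgx_dirty_page_list, struct sgx_epc_page, list);

		ret = __eremove(sgx_get_epc_virt_addr(page));
		if (!ret) {
			/*
			 * page is now sanitized.  Make it available via the
			 * SGX page allocator:
			 */
			list_del(&page->list);
			sgx_free_epc_page(page);
		} else {
			/* The page is not yet clean - retry on the next pass: */
			list_move_tail(&page->list, dirty);
		}

		cond_resched();
	}

	/* Put the failures back for the next pass: */
	list_splice_init(dirty, &sgx_dirty_page_list);
}

With list_splice_init(), the same 'dirty' list can then be reused for
the second pass.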
> 
> > +	/* sgx_dirty_page_list is thread-local to ksgxd, no need for a lock: */
> > +	for (i = 0; i < 2 && !list_empty(&sgx_dirty_page_list); i++) {
> > +		while (!list_empty(&sgx_dirty_page_list)) {
> > +			if (kthread_should_stop())
> > +				return 0;
> > +
> > +			page = list_first_entry(&sgx_dirty_page_list, struct sgx_epc_page, list);
> > +
> > +			ret = __eremove(sgx_get_epc_virt_addr(page));
> > +			if (!ret) {
> > +				/* The page is clean - move to the free list. */
> 
> I'd even say:
> 				/*
> 				 * page is now sanitized.  Make it
> 				 * available via the SGX page allocator:
> 				 */
> 
> See what that does?  It actually links the "cleaning" to the freeing.

I agree, thanks.

> > +				list_del(&page->list);
> > +				sgx_free_epc_page(page);
> > +			} else {
> > +				/* The page is not yet clean - move to the dirty list. */
> > +				list_move_tail(&page->list, &dirty);
> > +			}
> > +
> > +			cond_resched();
> > +		}
> >  
> > -		/* Should never happen. */
> > -		if (!list_empty(&sgx_epc_sections[i].init_laundry_list))
> > -			WARN(1, "EPC section %d has unsanitized pages.\n", i);
> > +		list_splice(&dirty, &sgx_dirty_page_list);
> >  	}
> >  
> > +	if (!list_empty(&sgx_dirty_page_list))
> > +		WARN(1, "EPC section %d has unsanitized pages.\n", i);
> > +
> >  	while (!kthread_should_stop()) {
> >  		if (try_to_freeze())
> >  			continue;
> > @@ -632,13 +626,12 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
> >  	section->phys_addr = phys_addr;
> >  	spin_lock_init(&section->lock);
> >  	INIT_LIST_HEAD(&section->page_list);
> > -	INIT_LIST_HEAD(&section->init_laundry_list);
> >  
> >  	for (i = 0; i < nr_pages; i++) {
> >  		section->pages[i].section = index;
> >  		section->pages[i].flags = 0;
> >  		section->pages[i].owner = NULL;
> > -		list_add_tail(&section->pages[i].list, &section->init_laundry_list);
> > +		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
> >  	}
> ...
> 

/Jarkko

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 3/3] x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()
  2021-03-15 16:35   ` Dave Hansen
@ 2021-03-15 19:23     ` Jarkko Sakkinen
  0 siblings, 0 replies; 14+ messages in thread
From: Jarkko Sakkinen @ 2021-03-15 19:23 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-sgx, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Dave Hansen, linux-kernel

On Mon, Mar 15, 2021 at 09:35:03AM -0700, Dave Hansen wrote:
> On 3/13/21 8:01 AM, Jarkko Sakkinen wrote:
> > Background
> > ==========
> > 
> > EPC section is covered by one or more SRAT entries that are associated with
> > one and only one PXM (NUMA node). The motivation behind this patch is to
> > provide basic elements of building allocation scheme based on this premise.
> 
> A better set of background information here would also remind folks what
> an 'EPC section' is.

I don't actually know what exactly they are, because the SDM's definition
is not very rigorous ;-)

I.e. it is not too precise about the relationship between NUMA nodes and
sections.

> 
> > Just like normal RAM, enclave memory (EPC) should be covered by entries
> > in the ACPI SRAT table.  These entries allow each EPC section to be
> > associated with a NUMA node.
> > 
> > Use this information to implement a simple NUMA-aware allocator for
> > enclave memory.
> 
> SGX enclave memory is enumerated by the processor in contiguous physical
> ranges called "EPC sections".  Currently, there is a free list per
> section, but allocations simply target the lowest-numbered sections.
> This is functional, but has no NUMA awareness.
> 
> Fortunately, EPC sections are covered by entries in the ACPI SRAT table.
>  These entries allow each EPC section to be associated with a NUMA node,
> just like normal RAM.

Thanks!

> > Solution
> > ========
> > 
> > Use phys_to_target_node() to associate each NUMA node with the EPC
> > sections contained within its range. In sgx_alloc_epc_page(), first try
> > to allocate from the NUMA node, where the CPU is executing. If that
> > fails, allocate from other nodes, iterating them from the current node
> > in order.
> 
> To me, this is just telling us what the code does.  It's not very
> useful.  I'd say:
> 
> Implement a NUMA-aware enclave page allocator.  Mirror the buddy
> allocator and maintain a list of enclave pages for each NUMA node.
> Attempt to allocate enclave memory first from local nodes, then fall
> back to other nodes.
> 
> Note that the fallback is not as sophisticated as the buddy allocator
> and is itself not aware of NUMA distances.  When a node's free list is
> empty, it searches for the next-highest node with enclave pages (and
> will wrap if necessary).  This could be improved in the future.

Thanks.

> > Other
> > =====
> > 
> > NUMA_KEEP_MEMINFO dependency is required for phys_to_target_node().
> 
> Reading the changelog, it's not obvious that you're talking about a
> Kconfig variable here.
> 
> I was also thinking, this says:
> 
> 	# Keep arch NUMA mapping infrastructure post-init.
> 	config NUMA_KEEP_MEMINFO
> 	        bool
> 
> But we only need it during SGX init.  Could you explain a bit why it's
> needed here specifically?  Superficially it seems like we only need this
> info *during* init.

Well, numa_meminfo is a static variable in arch/x86/mm/numa.c, so we
cannot directly access it. And phys_to_target_node(), which accesses its
data, does not exist unless NUMA_KEEP_MEMINFO is defined.

So, yes, theoretically we could pull the data without NUMA_KEEP_MEMINFO,
but this would require making numa_meminfo a global variable.

We do not care about the post-init part; in that sense you are right.

> 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 513895af8ee7..3e6152a8dd2b 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1930,6 +1930,7 @@ config X86_SGX
> >  	depends on CRYPTO_SHA256=y
> >  	select SRCU
> >  	select MMU_NOTIFIER
> > +	select NUMA_KEEP_MEMINFO if NUMA
> >  	help
> >  	  Intel(R) Software Guard eXtensions (SGX) is a set of CPU instructions
> >  	  that can be used by applications to set aside private regions of code
> > diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> > index cb4561444b96..3b524a1361d6 100644
> > --- a/arch/x86/kernel/cpu/sgx/main.c
> > +++ b/arch/x86/kernel/cpu/sgx/main.c
> > @@ -18,14 +18,23 @@ static int sgx_nr_epc_sections;
> >  static struct task_struct *ksgxd_tsk;
> >  static DECLARE_WAIT_QUEUE_HEAD(ksgxd_waitq);
> >  
> > -/*
> > - * These variables are part of the state of the reclaimer, and must be accessed
> > - * with sgx_reclaimer_lock acquired.
> > - */
> > +/* The reclaimer lock protected variables prepend the lock. */
> >  static LIST_HEAD(sgx_active_page_list);
> > -
> >  static DEFINE_SPINLOCK(sgx_reclaimer_lock);
> >  
> > +/* The free page list lock protected variables prepend the lock. */
> > +static unsigned long sgx_nr_free_pages;
> > +
> > +/* Nodes with one or more EPC sections. */
> > +static nodemask_t sgx_numa_mask;
> > +
> > +/*
> > + * Array with one list_head for each possible NUMA node.  Each
> > + * list contains all the sgx_epc_section's which are on that
> > + * node.
> > + */
> > +static struct sgx_numa_node *sgx_numa_nodes;
> > +
> >  /*
> >   * When the driver initialized, EPC pages go first here, as they could be
> >   * initialized to an active enclave, on kexec entry.
> > @@ -352,21 +361,9 @@ static void sgx_reclaim_pages(void)
> >  	}
> >  }
> >  
> > -static unsigned long sgx_nr_free_pages(void)
> > -{
> > -	unsigned long cnt = 0;
> > -	int i;
> > -
> > -	for (i = 0; i < sgx_nr_epc_sections; i++)
> > -		cnt += sgx_epc_sections[i].free_cnt;
> > -
> > -	return cnt;
> > -}
> > -
> >  static bool sgx_should_reclaim(unsigned long watermark)
> >  {
> > -	return sgx_nr_free_pages() < watermark &&
> > -	       !list_empty(&sgx_active_page_list);
> > +	return sgx_nr_free_pages < watermark && !list_empty(&sgx_active_page_list);
> >  }
> >  
> >  static int ksgxd(void *p)
> > @@ -443,50 +440,63 @@ static bool __init sgx_page_reclaimer_init(void)
> >  	return true;
> >  }
> >  
> > -static struct sgx_epc_page *__sgx_alloc_epc_page_from_section(struct sgx_epc_section *section)
> > +static struct sgx_epc_page *__sgx_alloc_epc_page_from_node(int nid)
> >  {
> > -	struct sgx_epc_page *page;
> > +	struct sgx_numa_node *node = &sgx_numa_nodes[nid];
> > +	struct sgx_epc_page *page = NULL;
> > +
> > +	if (!node_isset(nid, sgx_numa_mask))
> > +		return NULL;
> 
> I don't think this mask check is super necessary.  It won't ever trigger
> for the "fallback" path.  For the node-local allocation, I guess it
> saves a cacheline (node->free_page_list).
> 
> > -	spin_lock(&section->lock);
> > +	spin_lock(&node->lock);
> >  
> > -	if (list_empty(&section->page_list)) {
> > -		spin_unlock(&section->lock);
> > +	if (list_empty(&node->free_page_list)) {
> > +		spin_unlock(&node->lock);
> >  		return NULL;
> >  	}
> >  
> > -	page = list_first_entry(&section->page_list, struct sgx_epc_page, list);
> > +	page = list_first_entry(&node->free_page_list, struct sgx_epc_page, list);
> >  	list_del_init(&page->list);
> > -	section->free_cnt--;
> > +	sgx_nr_free_pages--;
> > +
> > +	spin_unlock(&node->lock);
> >  
> > -	spin_unlock(&section->lock);
> >  	return page;
> >  }
> >  
> >  /**
> >   * __sgx_alloc_epc_page() - Allocate an EPC page
> >   *
> > - * Iterate through EPC sections and borrow a free EPC page to the caller. When a
> > - * page is no longer needed it must be released with sgx_free_epc_page().
> > + * Iterate through NUMA nodes and borrow a free EPC page to the caller. When a
> 
> "borrow" is a weird word to use here.
> 
> > + * page is no longer needed it must be released with sgx_free_epc_page(). Start
> > + * from the NUMA node, where the caller is executing.
> 
> I don't think we need to tell folks how allocators work: that "free" and
> "alloc" are opposites.
> 
> >   * Return:
> > - *   an EPC page,
> > - *   -errno on error
> > + * - an EPC page:	Free EPC pages were available.
> > + * - ERR_PTR(-ENOMEM):	Run out of EPC pages.
> >   */
> 
> I think returning NULL from an allocator is probably OK.  Folks
> understand that NULL means ENOMEM.
> 
> >  struct sgx_epc_page *__sgx_alloc_epc_page(void)
> >  {
> > -	struct sgx_epc_section *section;
> >  	struct sgx_epc_page *page;
> > -	int i;
> > +	int nid = numa_node_id();
> >  
> > -	for (i = 0; i < sgx_nr_epc_sections; i++) {
> > -		section = &sgx_epc_sections[i];
> > +	/* Try to allocate EPC from the current node, first: */
> > +	page = __sgx_alloc_epc_page_from_node(nid);
> > +	if (page)
> > +		return page;
> >  
> > -		page = __sgx_alloc_epc_page_from_section(section);
> > +	/* Then, go through the other nodes: */
> > +	while (true) {
> > +		nid = next_node_in(nid, sgx_numa_mask);
> > +		if (nid == numa_node_id())
> > +			break;
> 
> One nit: This can _theoretically_ livelock.  Preemption is enabled here
> and numa_node_id() can change.  It's theoretically possible that
> numa_node_id()!=nid forever.
> 
> The way to prevent this is to read numa_node_id() once and then compare
> 'nid' against that single read.
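> 
> Something like this, perhaps (completely untested sketch, variable
> names just for illustration):
> 
> 	struct sgx_epc_page *page;
> 	int nid_of_current = numa_node_id();	/* read only once */
> 	int nid_start, nid;
> 
> 	/* Try to allocate EPC from the current node, first: */
> 	page = __sgx_alloc_epc_page_from_node(nid_of_current);
> 	if (page)
> 		return page;
> 
> 	/* Then, go once around the other nodes with EPC: */
> 	nid_start = next_node_in(nid_of_current, sgx_numa_mask);
> 	nid = nid_start;
> 
> 	do {
> 		page = __sgx_alloc_epc_page_from_node(nid);
> 		if (page)
> 			break;
> 
> 		nid = next_node_in(nid, sgx_numa_mask);
> 	} while (nid != nid_start);
> 
> 	return page;
> 
> This also terminates when the current node has no EPC and is thus not
> in sgx_numa_mask.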
> 
> > +		page = __sgx_alloc_epc_page_from_node(nid);
> >  		if (page)
> > -			return page;
> > +			break;
> >  	}
> >  
> > -	return ERR_PTR(-ENOMEM);
> > +	return page;
> >  }
> 
> I guess you probably wrote it this way because you prefer it.  But, this
> can be written with a single call to __sgx_alloc_epc_page_from_node().
> 
> >  /**
> > @@ -592,6 +602,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> >  void sgx_free_epc_page(struct sgx_epc_page *page)
> >  {
> >  	struct sgx_epc_section *section = &sgx_epc_sections[page->section];
> > +	struct sgx_numa_node *node = section->node;
> >  	int ret;
> >  
> >  	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
> > @@ -600,10 +611,12 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
> >  	if (WARN_ONCE(ret, "EREMOVE returned %d (0x%x)", ret, ret))
> >  		return;
> >  
> > -	spin_lock(&section->lock);
> > -	list_add_tail(&page->list, &section->page_list);
> > -	section->free_cnt++;
> > -	spin_unlock(&section->lock);
> > +	spin_lock(&node->lock);
> > +
> > +	list_add_tail(&page->list, &node->free_page_list);
> > +	sgx_nr_free_pages++;
> > +
> > +	spin_unlock(&node->lock);
> >  }
> >  
> >  static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
> > @@ -624,8 +637,6 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
> >  	}
> >  
> >  	section->phys_addr = phys_addr;
> > -	spin_lock_init(&section->lock);
> > -	INIT_LIST_HEAD(&section->page_list);
> >  
> >  	for (i = 0; i < nr_pages; i++) {
> >  		section->pages[i].section = index;
> > @@ -634,7 +645,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
> >  		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
> >  	}
> >  
> > -	section->free_cnt = nr_pages;
> > +	sgx_nr_free_pages += nr_pages;
> >  	return true;
> >  }
> >  
> > @@ -653,8 +664,11 @@ static bool __init sgx_page_cache_init(void)
> >  {
> >  	u32 eax, ebx, ecx, edx, type;
> >  	u64 pa, size;
> > +	int nid;
> >  	int i;
> >  
> > +	sgx_numa_nodes = kmalloc_array(num_possible_nodes(), sizeof(*sgx_numa_nodes), GFP_KERNEL);
> 
> Needs a NULL check.

Ugh, lol, I did spot this but forgot to fix it. Darn, well, I'll fix it for
the next version :-)
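
Probably just something like this (untested):

	sgx_numa_nodes = kmalloc_array(num_possible_nodes(),
				       sizeof(*sgx_numa_nodes), GFP_KERNEL);
	if (!sgx_numa_nodes)
		return false;

sgx_page_cache_init() returns bool, so that should be enough.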

> 
> >  	for (i = 0; i < ARRAY_SIZE(sgx_epc_sections); i++) {
> >  		cpuid_count(SGX_CPUID, i + SGX_CPUID_EPC, &eax, &ebx, &ecx, &edx);
> >  
> > @@ -677,6 +691,21 @@ static bool __init sgx_page_cache_init(void)
> >  			break;
> >  		}
> >  
> > +		nid = numa_map_to_online_node(phys_to_target_node(pa));
> > +		if (nid == NUMA_NO_NODE) {
> > +			/* The physical address is already printed above. */
> > +			pr_warn(FW_BUG "Unable to map EPC section to online node. Fallback to the NUMA node 0.\n");
> > +			nid = 0;
> > +		}
> > +
> > +		if (!node_isset(nid, sgx_numa_mask)) {
> > +			spin_lock_init(&sgx_numa_nodes[nid].lock);
> > +			INIT_LIST_HEAD(&sgx_numa_nodes[nid].free_page_list);
> > +			node_set(nid, sgx_numa_mask);
> > +		}
> > +
> > +		sgx_epc_sections[i].node =  &sgx_numa_nodes[nid];
> > +
> >  		sgx_nr_epc_sections++;
> >  	}
> 
> The rest looks OK.
> 

/Jarkko

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 1/3] x86/sgx: Use sgx_free_epc_page() in sgx_reclaim_pages()
  2021-03-15 19:06     ` Jarkko Sakkinen
@ 2021-03-15 19:27       ` Jarkko Sakkinen
  2021-03-16 12:50         ` Jarkko Sakkinen
  0 siblings, 1 reply; 14+ messages in thread
From: Jarkko Sakkinen @ 2021-03-15 19:27 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-sgx, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, linux-kernel

On Mon, Mar 15, 2021 at 09:06:29PM +0200, Jarkko Sakkinen wrote:
> On Mon, Mar 15, 2021 at 08:32:13AM -0700, Dave Hansen wrote:
> > On 3/13/21 8:01 AM, Jarkko Sakkinen wrote:
> > > Replace the ad-hoc code with a sgx_free_epc_page(), in order to make sure
> > > that all the relevant checks and book keeping is done, while freeing a
> > > borrowed EPC page, and remove redundant code. EREMOVE inside
> > > sgx_free_epc_page() does not change the semantics, as EREMOVE to an
> > > uninitialize pages is a nop.
> > 
> >   ^ uninitialized
> > 
> > I know this is a short patch, but this changelog still falls a bit short
> > for me.
> > 
> > Why is this patch a part of _this_ series?  What *problem* does it
> > solve, related to this series?
> 
> I'm thinking of merging sgx_epc_section and sgx_numa_node. That's why I
> kept it as part of the series. 
> 
> Also, in any case it's better to clean up duplicate functionality. The
> code is essentially an open-coded implementation of sgx_free_epc_page()
> without EREMOVE.
> 
> > It would also be nice to remind me why the EREMOVE is redundant.  Why
> > didn't we need one before?  What put the page in the uninitialized
> > state?  Is EREMOVE guaranteed to do nothing?  How expensive is it?
> 
> EREMOVE gets removed from sgx_free_epc_page() by the KVM series anyway.
> 
> Maybe I should re-send this patch, or the series, after the KVM series is
> merged. Then there is no need to explain EREMOVE, as sgx_free_epc_page()
> won't contain it.

Anyway, I forgot to state my conclusion: I'm cool with dropping this,
but if I drop it now, I'll send it as a separate patch right after the
KVM SGX series has landed.

/Jarkko

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 2/3] x86/sgx: Replace section local dirty page lists with a global list
  2021-03-15 19:14     ` Jarkko Sakkinen
@ 2021-03-15 19:47       ` Dave Hansen
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Hansen @ 2021-03-15 19:47 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: linux-sgx, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, linux-kernel

On 3/15/21 12:14 PM, Jarkko Sakkinen wrote:
> On Mon, Mar 15, 2021 at 09:03:21AM -0700, Dave Hansen wrote:
>> On 3/13/21 8:01 AM, Jarkko Sakkinen wrote:
>>> Reset initialized EPC pages in sgx_dirty_page_list to uninitialized state,
>>> and free them using sgx_free_epc_page(). Do two passes, as for SECS pages
>>> the first round can fail, if all child pages have not yet been removed.
>>> The driver puts all pages on startup first to sgx_dirty_page_list, as the
>>> initialization could be triggered by kexec(), meaning that pages have been
>>> reserved for active enclaves before the operation.
>>>
>>> The section local lists are redundant, as sgx_free_epc_page() figures
>>> out the correct section by using epc_page->section.
>>
>> During normal runtime, the "ksgxd" daemon behaves like a version of
>> kswapd just for SGX.  But, its first job is to initialize enclave
>> memory.  This is done in a separate thread because this initialization
>> can be quite slow.
>>
>> Currently, the SGX boot code places each enclave page on a
>> sgx_section-local list (init_laundry_list).  Once it starts up, the
>> ksgxd code walks over that list and populates the actual SGX page allocator.
>>
>> However, the per-section structures are going away to make way for the
>> SGX NUMA allocator.  There's also little need to have a per-section
>> structure; the enclave pages are all treated identically, and they can
>> be placed on the correct allocator list from metadata stored in the
>> enclave page itself.
>>
> Is this a suggestion for how to rephrase the commit message? :-)

Yep.

>>>  static int ksgxd(void *p)
>>>  {
>>> -	int i;
>>> +	struct sgx_epc_page *page;
>>> +	LIST_HEAD(dirty);
>>> +	int i, ret;
>>>  
>>>  	set_freezable();
>>>  
>>>  	/*
>>> -	 * Sanitize pages in order to recover from kexec(). The 2nd pass is
>>> -	 * required for SECS pages, whose child pages blocked EREMOVE.
>>> +	 * Reset initialized EPC pages in sgx_dirty_page_list to uninitialized state,
>>> +	 * and free them using sgx_free_epc_page(). 
>>
>> I'm not a fan of comments that tell us verbatim what the code does.
> 
> So, what are you suggesting? Remove the whole comment or parts of it? I'm
> presuming that you suggest removing the "top-level" part.

Just please make it more relevant and informative.  I can tell this is
processing 'sgx_dirty_page_list' and calling sgx_free_epc_page() from
the code.  I don't need a comment to tell me that.

A better comment would say:

	Take enclave pages from the init code, ensure they are clean,
	and free them back into the main SGX page allocator

That tells what *role* this code plays (connecting init code to the
allocator) in a way that just verbatim telling what code is called does not.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 1/3] x86/sgx: Use sgx_free_epc_page() in sgx_reclaim_pages()
  2021-03-15 19:27       ` Jarkko Sakkinen
@ 2021-03-16 12:50         ` Jarkko Sakkinen
  0 siblings, 0 replies; 14+ messages in thread
From: Jarkko Sakkinen @ 2021-03-16 12:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-sgx, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, linux-kernel

On Mon, Mar 15, 2021 at 09:27:00PM +0200, Jarkko Sakkinen wrote:
> On Mon, Mar 15, 2021 at 09:06:29PM +0200, Jarkko Sakkinen wrote:
> > On Mon, Mar 15, 2021 at 08:32:13AM -0700, Dave Hansen wrote:
> > > On 3/13/21 8:01 AM, Jarkko Sakkinen wrote:
> > > > Replace the ad-hoc code with a sgx_free_epc_page(), in order to make sure
> > > > that all the relevant checks and book keeping is done, while freeing a
> > > > borrowed EPC page, and remove redundant code. EREMOVE inside
> > > > sgx_free_epc_page() does not change the semantics, as EREMOVE to an
> > > > uninitialize pages is a nop.
> > > 
> > >   ^ uninitialized
> > > 
> > > I know this is a short patch, but this changelog still falls a bit short
> > > for me.
> > > 
> > > Why is this patch a part of _this_ series?  What *problem* does it
> > > solve, related to this series?
> > 
> > I'm thinking of merging sgx_epc_section and sgx_numa_node. That's why I
> > kept it as part of the series. 
> > 
> > Also, in any case it's better to clean up duplicate functionality. The
> > code is essentially an open-coded implementation of sgx_free_epc_page()
> > without EREMOVE.
> > 
> > > It would also be nice to remind me why the EREMOVE is redundant.  Why
> > > didn't we need one before?  What put the page in the uninitialized
> > > state?  Is EREMOVE guaranteed to do nothing?  How expensive is it?
> > 
> > EREMOVE gets removed from sgx_free_epc_page() by the KVM series anyway.
> > 
> > Maybe I should re-send this patch, or the series, after the KVM series is
> > merged. Then there is no need to explain EREMOVE, as sgx_free_epc_page()
> > won't contain it.
> 
> Anyway, I forgot to state my conclusion: I'm cool with dropping this,
> but if I drop it now, I'll send it as a separate patch right after the
> KVM SGX series has landed.

HOLD ON :-)

I recalled why I added this patch to this patch set. I had a reason for
it.

It's because of the NUMA patch. I'd have to duplicate all the NUMA
changes here if I didn't refactor this somewhat redundant code out.

So, what if I add a note about this to the commit message? IMHO, this is
a good enough reason to carry the patch.

/Jarkko

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-03-16 12:54 UTC | newest]

Thread overview: 14+ messages
2021-03-13 16:01 [PATCH v4 0/3] x86/sgx: NUMA Jarkko Sakkinen
2021-03-13 16:01 ` [PATCH v4 1/3] x86/sgx: Use sgx_free_epc_page() in sgx_reclaim_pages() Jarkko Sakkinen
2021-03-15 15:32   ` Dave Hansen
2021-03-15 19:06     ` Jarkko Sakkinen
2021-03-15 19:27       ` Jarkko Sakkinen
2021-03-16 12:50         ` Jarkko Sakkinen
2021-03-13 16:01 ` [PATCH v4 2/3] x86/sgx: Replace section local dirty page lists with a global list Jarkko Sakkinen
2021-03-15 16:03   ` Dave Hansen
2021-03-15 19:14     ` Jarkko Sakkinen
2021-03-15 19:47       ` Dave Hansen
2021-03-13 16:01 ` [PATCH v4 3/3] x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page() Jarkko Sakkinen
2021-03-14 11:56   ` Jarkko Sakkinen
2021-03-15 16:35   ` Dave Hansen
2021-03-15 19:23     ` Jarkko Sakkinen
