linux-sgx.vger.kernel.org archive mirror
* [PATCH v3 00/28]  Add Cgroup support for SGX EPC memory
@ 2023-07-12 23:01 Haitao Huang
  2023-07-12 23:01 ` [PATCH v3 01/28] x86/sgx: Store struct sgx_encl when allocating new VA pages Haitao Huang
                   ` (30 more replies)
  0 siblings, 31 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups
  Cc: kai.huang, reinette.chatre, zhiquan1.li, kristen, seanjc, zhanb,
	anakrish, mikko.ylinen

SGX EPC memory allocations are separate from normal RAM allocations, and are
managed solely by the SGX subsystem. The existing cgroup memory controller
cannot be used to limit or account for SGX EPC memory, which is a desirable
feature in some environments, e.g., support for pod level control in a
Kubernetes cluster on a VM or bare-metal host [1,2].

This patchset implements the support for sgx_epc memory within the misc
cgroup controller. The user can use the misc cgroup controller to set and
enforce a max limit on total EPC usage per cgroup. The implementation
reports current usage and events of reaching the limit per cgroup as well
as the total system capacity.
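
As a rough illustration of the accounting described above, here is a
minimal userspace sketch in C. It is a toy model, not the kernel's misc
cgroup code: the toy_* names, the struct layout and the rollback strategy
are all assumptions made for this example. It shows a charge that must fit
under the max of a cgroup and of every ancestor, an event counter that is
bumped when a limit is hit, and a comparison written so that usage + nr
cannot wrap around.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; not the kernel's misc cgroup API. */
struct toy_cg {
	struct toy_cg *parent;	/* NULL for the root */
	unsigned long usage;	/* pages currently charged */
	unsigned long max;	/* limit, in pages */
	unsigned long events;	/* number of charges that hit this limit */
};

/* Charge nr pages to cg and every ancestor, or charge nothing at all. */
static bool toy_try_charge(struct toy_cg *cg, unsigned long nr)
{
	struct toy_cg *i, *j;

	assert(cg != NULL);
	for (i = cg; i; i = i->parent) {
		/* Equivalent to usage + nr > max, but cannot overflow. */
		if (nr > i->max || i->usage > i->max - nr) {
			i->events++;
			/* Undo the charges already made below this level. */
			for (j = cg; j != i; j = j->parent)
				j->usage -= nr;
			return false;
		}
		i->usage += nr;
	}
	return true;
}

/* Release nr pages from cg and every ancestor. */
static void toy_uncharge(struct toy_cg *cg, unsigned long nr)
{
	for (; cg; cg = cg->parent) {
		assert(cg->usage >= nr);
		cg->usage -= nr;
	}
}
```

In this model a failed charge leaves every usage counter unchanged and only
increments the events counter of the level whose limit was hit. What the
real series does on a failed charge (reclaim, OOM) is described in the
individual patches.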

This work was originally authored by Sean Christopherson a few years ago,
and was later modified by Kristen C. Accardi to work with more recent
kernels and to use the misc cgroup controller rather than a custom
controller. I have updated the patches based on review comments on the V2
series [3], simplified a few aspects of the implementation/design, and fixed
some stability issues found in testing, while keeping the same user space
facing interfaces.

The patchset adds support for multiple LRUs to track both reclaimable EPC
pages (i.e. pages the reclaimer knows about) and unreclaimable EPC
pages (i.e. pages which the reclaimer isn't aware of, such as VA pages).
These pages are assigned to an LRU, as well as an enclave, so that an
enclave's full EPC usage can be tracked and limited to a max value. During
OOM events, an enclave can have its memory zapped, and all the EPC pages
not tracked by the reclaimer can be freed.

I appreciate your comments and feedback.

Summary of changes from v2: (more details in commit logs)

* Added EPC states to replace flags in sgx_epc_page struct. (Jarkko)
* Unrolled wrappers for cond_resched, list (Dave)
* Separate patches for adding reclaimable and unreclaimable lists. (Dave)
* Other improvements on patch flow, commit messages, styles. (Dave, Jarkko)
* Simplified the cgroup tree walking with plain
  css_for_each_descendant_pre.
* Fixed race conditions and crashes.
* Made the OOM killer wait for the victim enclave's pages to be reclaimed.
* Unblock the user by handling misc_max_write callback asynchronously.
* Rebased onto 6.4 and no longer base this series on the MCA patchset.
* Fix an overflow in misc_try_charge.
* Fix a NULL pointer in SGX PF handler.
* Updated and included the SGX selftest patches previously reviewed. Those
  patches fix issues triggered in high EPC pressure required for cgroup
  testing.
* Added test scripts to help setup and test SGX EPC cgroups.

[1]https://lore.kernel.org/all/DM6PR21MB11772A6ED915825854B419D6C4989@DM6PR21MB1177.namprd21.prod.outlook.com/
[2]https://lore.kernel.org/all/ZD7Iutppjj+muH4p@himmelriiki/
[3]https://lore.kernel.org/all/20221202183655.3767674-1-kristen@linux.intel.com/
[4]Documentation/arch/x86/sgx.rst, Section "Virtual EPC"

Haitao Huang (6):
  x86/sgx: Store struct sgx_encl when allocating new VA pages
  x86/sgx: Introduce EPC page states
  x86/sgx: fix a NULL pointer
  cgroup/misc: Fix an overflow
  selftests/sgx: Retry the ioctl()'s returned with EAGAIN
  selftests/sgx: Add scripts for epc cgroup testing

Jarkko Sakkinen (3):
  selftests/sgx: Move ENCL_HEAP_SIZE_DEFAULT to main.c
  selftests/sgx: Use encl->encl_size in sigstruct.c
  selftests/sgx: Include the dynamic heap size to the ELRANGE
    calculation

Kristen Carlson Accardi (9):
  x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s)
  x86/sgx: Use sgx_epc_lru_lists for existing active page list
  x86/sgx: Store reclaimable epc pages in sgx_epc_lru_lists
  x86/sgx: store unreclaimable EPC pages in sgx_epc_lru_lists
  x86/sgx: Use a list to track to-be-reclaimed pages
  cgroup/misc: Add per resource callbacks for CSS events
  cgroup/misc: Add SGX EPC resource type and export APIs for SGX driver
  x86/sgx: Limit process EPC usage with misc cgroup controller
  Docs/x86/sgx: Add description for cgroup support

Sean Christopherson (9):
  x86/sgx: Add EPC page flags to identify owner type
  x86/sgx: Introduce RECLAIM_IN_PROGRESS state
  x86/sgx: Allow reclaiming up to 32 pages, but scan 16 by default
  x86/sgx: Return the number of EPC pages that were successfully
    reclaimed
  x86/sgx: Add option to ignore age of page during EPC reclaim
  x86/sgx: Prepare for multiple LRUs
  x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup
  x86/sgx: Add helper to grab pages from an arbitrary EPC LRU
  x86/sgx: Add EPC OOM path to forcefully reclaim EPC

Vijay Dhanraj (1):
  selftests/sgx: Add SGX selftest augment_via_eaccept_long

 Documentation/arch/x86/sgx.rst                |  77 ++++
 arch/x86/Kconfig                              |  13 +
 arch/x86/kernel/cpu/sgx/Makefile              |   1 +
 arch/x86/kernel/cpu/sgx/driver.c              |  27 +-
 arch/x86/kernel/cpu/sgx/encl.c                |  95 +++-
 arch/x86/kernel/cpu/sgx/encl.h                |   4 +-
 arch/x86/kernel/cpu/sgx/epc_cgroup.c          | 406 ++++++++++++++++++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h          |  60 +++
 arch/x86/kernel/cpu/sgx/ioctl.c               |  25 +-
 arch/x86/kernel/cpu/sgx/main.c                | 406 ++++++++++++++----
 arch/x86/kernel/cpu/sgx/sgx.h                 | 113 ++++-
 include/linux/misc_cgroup.h                   |  34 ++
 kernel/cgroup/misc.c                          |  63 ++-
 tools/testing/selftests/sgx/load.c            |   8 +-
 tools/testing/selftests/sgx/main.c            | 177 +++++++-
 tools/testing/selftests/sgx/main.h            |   6 +-
 .../selftests/sgx/run_tests_in_misc_cg.sh     |  68 +++
 tools/testing/selftests/sgx/setup_epc_cg.sh   |  29 ++
 tools/testing/selftests/sgx/sigstruct.c       |   8 +-
 .../selftests/sgx/watch_misc_for_tests.sh     |  13 +
 20 files changed, 1446 insertions(+), 187 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
 create mode 100755 tools/testing/selftests/sgx/run_tests_in_misc_cg.sh
 create mode 100755 tools/testing/selftests/sgx/setup_epc_cg.sh
 create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh

-- 
2.25.1



* [PATCH v3 01/28] x86/sgx: Store struct sgx_encl when allocating new VA pages
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-17 11:14   ` Jarkko Sakkinen
  2023-07-12 23:01 ` [PATCH v3 02/28] x86/sgx: Add EPC page flags to identify owner type Haitao Huang
                   ` (29 subsequent siblings)
  30 siblings, 1 reply; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, zhiquan1.li, kristen, seanjc

In a later patch, when a cgroup has exceeded the max capacity for EPC pages
and there are no more Enclave EPC pages associated with the cgroup that can
be reclaimed, the only pages still associated with an enclave will be the
unreclaimable Version Array (VA) pages or SECS pages, and the entire
enclave will need to be killed to free up those pages.

Currently, given an enclave pointer, it is easy to find the associated VA
pages and free them. However, OOM killing an enclave based on cgroup limits
will require examining a cgroup's unreclaimable page list and finding an
enclave given a SECS page or a VA page. This will require a backpointer
from a page to an enclave, including for VA pages.

When allocating new Version Array (VA) pages, pass the struct sgx_encl of
the enclave that is allocating the page. sgx_alloc_epc_page() will store
this value in the owner field of the struct sgx_epc_page. In a later
patch, VA pages will be placed in an unreclaimable queue. When the cgroup
max limit is reached, no reclaimable pages remain, and the enclave must be
OOM killed, all the VA pages associated with that enclave can then be
uncharged and freed.

To avoid the casts otherwise needed to access the two types of owners
(sgx_encl for VA pages, sgx_encl_page for other pages), replace the 'owner'
field in sgx_epc_page with a union of the two types.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>

V3:
- rename encl_owner to encl_page.
- revise commit messages
---
 arch/x86/kernel/cpu/sgx/encl.c  |  5 +++--
 arch/x86/kernel/cpu/sgx/encl.h  |  2 +-
 arch/x86/kernel/cpu/sgx/ioctl.c |  2 +-
 arch/x86/kernel/cpu/sgx/main.c  | 20 ++++++++++----------
 arch/x86/kernel/cpu/sgx/sgx.h   |  5 ++++-
 5 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 2a0e90fe2abc..98e1086eab07 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -1210,6 +1210,7 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr)
 
 /**
  * sgx_alloc_va_page() - Allocate a Version Array (VA) page
+ * @encl:    The enclave that this page is allocated to.
  * @reclaim: Reclaim EPC pages directly if none available. Enclave
  *           mutex should not be held if this is set.
  *
@@ -1219,12 +1220,12 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr)
  *   a VA page,
  *   -errno otherwise
  */
-struct sgx_epc_page *sgx_alloc_va_page(bool reclaim)
+struct sgx_epc_page *sgx_alloc_va_page(struct sgx_encl *encl, bool reclaim)
 {
 	struct sgx_epc_page *epc_page;
 	int ret;
 
-	epc_page = sgx_alloc_epc_page(NULL, reclaim);
+	epc_page = sgx_alloc_epc_page(encl, reclaim);
 	if (IS_ERR(epc_page))
 		return ERR_CAST(epc_page);
 
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index f94ff14c9486..831d63f80f5a 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -116,7 +116,7 @@ struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl,
 					  unsigned long offset,
 					  u64 secinfo_flags);
 void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr);
-struct sgx_epc_page *sgx_alloc_va_page(bool reclaim);
+struct sgx_epc_page *sgx_alloc_va_page(struct sgx_encl *encl, bool reclaim);
 unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page);
 void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset);
 bool sgx_va_page_full(struct sgx_va_page *va_page);
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 21ca0a831b70..fa8c3f32ccf6 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -30,7 +30,7 @@ struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim)
 		if (!va_page)
 			return ERR_PTR(-ENOMEM);
 
-		va_page->epc_page = sgx_alloc_va_page(reclaim);
+		va_page->epc_page = sgx_alloc_va_page(encl, reclaim);
 		if (IS_ERR(va_page->epc_page)) {
 			err = ERR_CAST(va_page->epc_page);
 			kfree(va_page);
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 166692f2d501..39939b7496b0 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -108,7 +108,7 @@ static unsigned long __sgx_sanitize_pages(struct list_head *dirty_page_list)
 
 static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page)
 {
-	struct sgx_encl_page *page = epc_page->owner;
+	struct sgx_encl_page *page = epc_page->encl_page;
 	struct sgx_encl *encl = page->encl;
 	struct sgx_encl_mm *encl_mm;
 	bool ret = true;
@@ -140,7 +140,7 @@ static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page)
 
 static void sgx_reclaimer_block(struct sgx_epc_page *epc_page)
 {
-	struct sgx_encl_page *page = epc_page->owner;
+	struct sgx_encl_page *page = epc_page->encl_page;
 	unsigned long addr = page->desc & PAGE_MASK;
 	struct sgx_encl *encl = page->encl;
 	int ret;
@@ -197,7 +197,7 @@ void sgx_ipi_cb(void *info)
 static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
 			 struct sgx_backing *backing)
 {
-	struct sgx_encl_page *encl_page = epc_page->owner;
+	struct sgx_encl_page *encl_page = epc_page->encl_page;
 	struct sgx_encl *encl = encl_page->encl;
 	struct sgx_va_page *va_page;
 	unsigned int va_offset;
@@ -250,7 +250,7 @@ static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
 static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 				struct sgx_backing *backing)
 {
-	struct sgx_encl_page *encl_page = epc_page->owner;
+	struct sgx_encl_page *encl_page = epc_page->encl_page;
 	struct sgx_encl *encl = encl_page->encl;
 	struct sgx_backing secs_backing;
 	int ret;
@@ -312,7 +312,7 @@ static void sgx_reclaim_pages(void)
 		epc_page = list_first_entry(&sgx_active_page_list,
 					    struct sgx_epc_page, list);
 		list_del_init(&epc_page->list);
-		encl_page = epc_page->owner;
+		encl_page = epc_page->encl_page;
 
 		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0)
 			chunk[cnt++] = epc_page;
@@ -326,7 +326,7 @@ static void sgx_reclaim_pages(void)
 
 	for (i = 0; i < cnt; i++) {
 		epc_page = chunk[i];
-		encl_page = epc_page->owner;
+		encl_page = epc_page->encl_page;
 
 		if (!sgx_reclaimer_age(epc_page))
 			goto skip;
@@ -365,7 +365,7 @@ static void sgx_reclaim_pages(void)
 		if (!epc_page)
 			continue;
 
-		encl_page = epc_page->owner;
+		encl_page = epc_page->encl_page;
 		sgx_reclaimer_write(epc_page, &backing[i]);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
@@ -563,7 +563,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 	for ( ; ; ) {
 		page = __sgx_alloc_epc_page();
 		if (!IS_ERR(page)) {
-			page->owner = owner;
+			page->encl_page = owner;
 			break;
 		}
 
@@ -606,7 +606,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 
 	spin_lock(&node->lock);
 
-	page->owner = NULL;
+	page->encl_page = NULL;
 	if (page->poison)
 		list_add(&page->list, &node->sgx_poison_page_list);
 	else
@@ -641,7 +641,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 	for (i = 0; i < nr_pages; i++) {
 		section->pages[i].section = index;
 		section->pages[i].flags = 0;
-		section->pages[i].owner = NULL;
+		section->pages[i].encl_page = NULL;
 		section->pages[i].poison = 0;
 		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
 	}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index d2dad21259a8..dc1cbcfcf2d4 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -33,7 +33,10 @@ struct sgx_epc_page {
 	unsigned int section;
 	u16 flags;
 	u16 poison;
-	struct sgx_encl_page *owner;
+	union {
+		struct sgx_encl_page *encl_page;
+		struct sgx_encl *encl;
+	};
 	struct list_head list;
 };
 
-- 
2.25.1



* [PATCH v3 02/28] x86/sgx: Add EPC page flags to identify owner type
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
  2023-07-12 23:01 ` [PATCH v3 01/28] x86/sgx: Store struct sgx_encl when allocating new VA pages Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-17 12:41   ` Jarkko Sakkinen
  2023-07-12 23:01 ` [PATCH v3 03/28] x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s) Haitao Huang
                   ` (28 subsequent siblings)
  30 siblings, 1 reply; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, Sean Christopherson, zhiquan1.li,
	kristen, seanjc

From: Sean Christopherson <sean.j.christopherson@intel.com>

Two types of owners, 'sgx_encl' for VA pages and 'sgx_encl_page' for other
pages, can be stored in the union field in the sgx_epc_page struct introduced
in the previous patch.

When cgroup OOM support is added in a later patch, the owning enclave of a
page will need to be identified. Retrieving the sgx_encl struct from a
sgx_epc_page will be different if the page is a VA page vs. other enclave
pages.

Add 2 flags which will identify the type of the owner and apply them
accordingly to newly allocated pages.
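
To illustrate why the flags are needed, here is a minimal userspace sketch
in C of how the owning enclave could be looked up from a page once the
owner type is recorded. The toy_* types and the helper toy_epc_page_encl()
are hypothetical and exist only for this example; the bit positions mirror
the two flags added by this patch, but the lookup helper itself is not part
of it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two owner-type flags. */
#define TOY_OWNER_ENCL_PAGE	(1u << 3)
#define TOY_OWNER_ENCL		(1u << 4)

struct toy_encl {
	int id;
};

struct toy_encl_page {
	struct toy_encl *encl;	/* back pointer to the owning enclave */
};

struct toy_epc_page {
	uint16_t flags;
	union {
		struct toy_encl_page *encl_page; /* regular and SECS pages */
		struct toy_encl *encl;		 /* VA pages */
	};
};

/* Return the owning enclave, or NULL if no owner type is recorded. */
static struct toy_encl *toy_epc_page_encl(const struct toy_epc_page *page)
{
	assert(page != NULL);
	if (page->flags & TOY_OWNER_ENCL)
		return page->encl;
	if (page->flags & TOY_OWNER_ENCL_PAGE)
		return page->encl_page->encl;
	return NULL;
}
```

Without the flags, the union member to read would be ambiguous, since both
members occupy the same storage.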

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>

V3:
- Renamed the flags to clarify they are used to identify the type
of the owner.
---
 arch/x86/kernel/cpu/sgx/encl.c  | 4 ++++
 arch/x86/kernel/cpu/sgx/ioctl.c | 4 ++++
 arch/x86/kernel/cpu/sgx/sgx.h   | 6 ++++++
 3 files changed, 14 insertions(+)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 98e1086eab07..3bc2f95b1da2 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -252,6 +252,7 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
 		epc_page = sgx_encl_eldu(&encl->secs, NULL);
 		if (IS_ERR(epc_page))
 			return ERR_CAST(epc_page);
+		epc_page->flags |= SGX_EPC_OWNER_ENCL_PAGE;
 	}
 
 	epc_page = sgx_encl_eldu(entry, encl->secs.epc_page);
@@ -260,6 +261,7 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
 
 	encl->secs_child_cnt++;
 	sgx_mark_page_reclaimable(entry->epc_page);
+	entry->epc_page->flags |= SGX_EPC_OWNER_ENCL_PAGE;
 
 	return entry;
 }
@@ -379,6 +381,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
 	encl->secs_child_cnt++;
 
 	sgx_mark_page_reclaimable(encl_page->epc_page);
+	encl_page->epc_page->flags |= SGX_EPC_OWNER_ENCL_PAGE;
 
 	phys_addr = sgx_get_epc_phys_addr(epc_page);
 	/*
@@ -1235,6 +1238,7 @@ struct sgx_epc_page *sgx_alloc_va_page(struct sgx_encl *encl, bool reclaim)
 		sgx_encl_free_epc_page(epc_page);
 		return ERR_PTR(-EFAULT);
 	}
+	epc_page->flags |= SGX_EPC_OWNER_ENCL;
 
 	return epc_page;
 }
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index fa8c3f32ccf6..fe3e89cf013f 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -113,6 +113,8 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
 	encl->attributes = secs->attributes;
 	encl->attributes_mask = SGX_ATTR_UNPRIV_MASK;
 
+	encl->secs.epc_page->flags |= SGX_EPC_OWNER_ENCL_PAGE;
+
 	/* Set only after completion, as encl->lock has not been taken. */
 	set_bit(SGX_ENCL_CREATED, &encl->flags);
 
@@ -323,6 +325,7 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
 	}
 
 	sgx_mark_page_reclaimable(encl_page->epc_page);
+	encl_page->epc_page->flags |= SGX_EPC_OWNER_ENCL_PAGE;
 	mutex_unlock(&encl->lock);
 	mmap_read_unlock(current->mm);
 	return ret;
@@ -977,6 +980,7 @@ static long sgx_enclave_modify_types(struct sgx_encl *encl,
 			mutex_lock(&encl->lock);
 
 			sgx_mark_page_reclaimable(entry->epc_page);
+			entry->epc_page->flags |= SGX_EPC_OWNER_ENCL_PAGE;
 		}
 
 		/* Change EPC type */
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index dc1cbcfcf2d4..f6e3c5810eef 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -29,6 +29,12 @@
 /* Pages on free list */
 #define SGX_EPC_PAGE_IS_FREE		BIT(1)
 
+/* flag for pages owned by a sgx_encl_page */
+#define SGX_EPC_OWNER_ENCL_PAGE		BIT(3)
+
+/* flag for pages owned by a sgx_encl struct */
+#define SGX_EPC_OWNER_ENCL		BIT(4)
+
 struct sgx_epc_page {
 	unsigned int section;
 	u16 flags;
-- 
2.25.1



* [PATCH v3 03/28] x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s)
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
  2023-07-12 23:01 ` [PATCH v3 01/28] x86/sgx: Store struct sgx_encl when allocating new VA pages Haitao Huang
  2023-07-12 23:01 ` [PATCH v3 02/28] x86/sgx: Add EPC page flags to identify owner type Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-17 12:45   ` Jarkko Sakkinen
  2023-07-12 23:01 ` [PATCH v3 04/28] x86/sgx: Use sgx_epc_lru_lists for existing active page list Haitao Huang
                   ` (27 subsequent siblings)
  30 siblings, 1 reply; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, Kristen Carlson Accardi, zhiquan1.li, seanjc

From: Kristen Carlson Accardi <kristen@linux.intel.com>

Introduce a data structure to wrap the existing reclaimable list
and its spinlock in a struct to minimize the code changes needed
to handle multiple LRUs as well as reclaimable and non-reclaimable
lists. The new structure will be used in a following set of patches to
implement SGX EPC cgroups.

The changes to the structure needed for unreclaimable lists will be
added in later patches.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>

V3:
Removed the helper functions and revised commit messages
---
 arch/x86/kernel/cpu/sgx/sgx.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index f6e3c5810eef..77fceba73a25 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -92,6 +92,23 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
 	return section->virt_addr + index * PAGE_SIZE;
 }
 
+/*
+ * This data structure wraps a list of reclaimable EPC pages, and a list of
+ * non-reclaimable EPC pages and is used to implement a LRU policy during
+ * reclamation.
+ */
+struct sgx_epc_lru_lists {
+	/* Must acquire this lock to access */
+	spinlock_t lock;
+	struct list_head reclaimable;
+};
+
+static inline void sgx_lru_init(struct sgx_epc_lru_lists *lrus)
+{
+	spin_lock_init(&lrus->lock);
+	INIT_LIST_HEAD(&lrus->reclaimable);
+}
+
 struct sgx_epc_page *__sgx_alloc_epc_page(void);
 void sgx_free_epc_page(struct sgx_epc_page *page);
 
-- 
2.25.1



* [PATCH v3 04/28] x86/sgx: Use sgx_epc_lru_lists for existing active page list
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (2 preceding siblings ...)
  2023-07-12 23:01 ` [PATCH v3 03/28] x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s) Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-17 12:47   ` Jarkko Sakkinen
  2023-07-12 23:01 ` [PATCH v3 05/28] x86/sgx: Store reclaimable epc pages in sgx_epc_lru_lists Haitao Huang
                   ` (26 subsequent siblings)
  30 siblings, 1 reply; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, Kristen Carlson Accardi, zhiquan1.li, seanjc

From: Kristen Carlson Accardi <kristen@linux.intel.com>

Replace the existing sgx_active_page_list and its spinlock with
a global sgx_epc_lru_lists struct.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>

V3:
- Remove usage of list wrapper
---
 arch/x86/kernel/cpu/sgx/main.c | 39 +++++++++++++++++-----------------
 1 file changed, 20 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 39939b7496b0..71c3386ccf23 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -26,10 +26,9 @@ static DEFINE_XARRAY(sgx_epc_address_space);
 
 /*
  * These variables are part of the state of the reclaimer, and must be accessed
- * with sgx_reclaimer_lock acquired.
+ * with sgx_global_lru.lock acquired.
  */
-static LIST_HEAD(sgx_active_page_list);
-static DEFINE_SPINLOCK(sgx_reclaimer_lock);
+static struct sgx_epc_lru_lists sgx_global_lru;
 
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
@@ -304,13 +303,13 @@ static void sgx_reclaim_pages(void)
 	int ret;
 	int i;
 
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
 	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
-		if (list_empty(&sgx_active_page_list))
+		epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
+						    struct sgx_epc_page, list);
+		if (!epc_page)
 			break;
 
-		epc_page = list_first_entry(&sgx_active_page_list,
-					    struct sgx_epc_page, list);
 		list_del_init(&epc_page->list);
 		encl_page = epc_page->encl_page;
 
@@ -322,7 +321,7 @@ static void sgx_reclaim_pages(void)
 			 */
 			epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
 	}
-	spin_unlock(&sgx_reclaimer_lock);
+	spin_unlock(&sgx_global_lru.lock);
 
 	for (i = 0; i < cnt; i++) {
 		epc_page = chunk[i];
@@ -345,9 +344,9 @@ static void sgx_reclaim_pages(void)
 		continue;
 
 skip:
-		spin_lock(&sgx_reclaimer_lock);
-		list_add_tail(&epc_page->list, &sgx_active_page_list);
-		spin_unlock(&sgx_reclaimer_lock);
+		spin_lock(&sgx_global_lru.lock);
+		list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
+		spin_unlock(&sgx_global_lru.lock);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
 
@@ -378,7 +377,7 @@ static void sgx_reclaim_pages(void)
 static bool sgx_should_reclaim(unsigned long watermark)
 {
 	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
-	       !list_empty(&sgx_active_page_list);
+	       !list_empty(&sgx_global_lru.reclaimable);
 }
 
 /*
@@ -430,6 +429,8 @@ static bool __init sgx_page_reclaimer_init(void)
 
 	ksgxd_tsk = tsk;
 
+	sgx_lru_init(&sgx_global_lru);
+
 	return true;
 }
 
@@ -505,10 +506,10 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
  */
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
 	page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-	list_add_tail(&page->list, &sgx_active_page_list);
-	spin_unlock(&sgx_reclaimer_lock);
+	list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+	spin_unlock(&sgx_global_lru.lock);
 }
 
 /**
@@ -523,18 +524,18 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
  */
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
 	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
 		/* The page is being reclaimed. */
 		if (list_empty(&page->list)) {
-			spin_unlock(&sgx_reclaimer_lock);
+			spin_unlock(&sgx_global_lru.lock);
 			return -EBUSY;
 		}
 
 		list_del(&page->list);
 		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
 	}
-	spin_unlock(&sgx_reclaimer_lock);
+	spin_unlock(&sgx_global_lru.lock);
 
 	return 0;
 }
@@ -567,7 +568,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		if (list_empty(&sgx_active_page_list))
+		if (list_empty(&sgx_global_lru.reclaimable))
 			return ERR_PTR(-ENOMEM);
 
 		if (!reclaim) {
-- 
2.25.1



* [PATCH v3 05/28] x86/sgx: Store reclaimable epc pages in sgx_epc_lru_lists
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (3 preceding siblings ...)
  2023-07-12 23:01 ` [PATCH v3 04/28] x86/sgx: Use sgx_epc_lru_lists for existing active page list Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-12 23:01 ` [PATCH v3 06/28] x86/sgx: store unreclaimable EPC " Haitao Huang
                   ` (25 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, Kristen Carlson Accardi, zhiquan1.li, seanjc

From: Kristen Carlson Accardi <kristen@linux.intel.com>

When an OOM event occurs, it becomes necessary to free all pages
associated with an enclave, including those not currently tracked by the
reclaimer. As a result, each page must eventually be added to the
cgroup's LRU list struct, regardless of whether it is tracked by the
reclaimer or not.

This patch prepares for the inclusion of currently untracked pages by
replacing the functions sgx_mark_page_reclaimable() and
sgx_unmark_page_reclaimable() with sgx_record_epc_page() and
sgx_drop_epc_page(). The sgx_record_epc_page() function adds the
epc_page to the "reclaimable" list in the sgx_epc_lru_lists struct,
while sgx_drop_epc_page() removes the page from the LRU list.

For now, this change serves as a straightforward replacement of the two
functions for pages tracked by the reclaimer. A subsequent patch will
introduce the capability to track unreclaimable pages using these same
functions.
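
The record/drop semantics can be sketched in userspace C as below. This is
a single-threaded toy model under stated assumptions: all toy_* names are
hypothetical, the spinlock is omitted, and the -EBUSY behaviour is modelled
on sgx_unmark_page_reclaimable() as it appears in the previous patch rather
than on the final sgx_drop_epc_page() implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace model; every toy_* name is illustrative only. */
#define TOY_RECLAIMER_TRACKED	(1u << 0)

struct toy_list {
	struct toy_list *prev, *next;
};

struct toy_page {
	unsigned int flags;
	struct toy_list list;
};

/* The real structure also carries a spinlock; this model has no locking. */
struct toy_lru {
	struct toy_list reclaimable;
};

static void toy_list_init(struct toy_list *node)
{
	node->prev = node;
	node->next = node;
}

static int toy_list_empty(const struct toy_list *node)
{
	return node->next == node;
}

static void toy_list_add_tail(struct toy_list *node, struct toy_list *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

static void toy_list_del_init(struct toy_list *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	toy_list_init(node);
}

/* Set the given flags and, for tracked pages, queue at the LRU tail. */
static void toy_record_page(struct toy_lru *lru, struct toy_page *page,
			    unsigned int flags)
{
	assert(toy_list_empty(&page->list));
	page->flags |= flags;
	if (flags & TOY_RECLAIMER_TRACKED)
		toy_list_add_tail(&page->list, &lru->reclaimable);
}

/* Reclaimer side: take the oldest page off the list but keep it tracked. */
static struct toy_page *toy_isolate_oldest(struct toy_lru *lru)
{
	struct toy_list *node = lru->reclaimable.next;

	if (toy_list_empty(&lru->reclaimable))
		return NULL;
	toy_list_del_init(node);
	return (struct toy_page *)((char *)node - offsetof(struct toy_page, list));
}

/* Returns 0 on success, -EBUSY if the reclaimer currently holds the page. */
static int toy_drop_page(struct toy_page *page)
{
	if (page->flags & TOY_RECLAIMER_TRACKED) {
		if (toy_list_empty(&page->list))
			return -EBUSY;
		toy_list_del_init(&page->list);
		page->flags &= ~TOY_RECLAIMER_TRACKED;
	}
	return 0;
}
```

A page that is still flagged as tracked but is no longer on the list has
been picked up by the reclaimer, so the drop fails with -EBUSY and the
caller has to back off, as the call sites in this patch do.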

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/cpu/sgx/encl.c  | 10 +++++-----
 arch/x86/kernel/cpu/sgx/ioctl.c | 12 ++++++------
 arch/x86/kernel/cpu/sgx/main.c  | 22 ++++++++++++----------
 arch/x86/kernel/cpu/sgx/sgx.h   |  4 ++--
 4 files changed, 25 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 3bc2f95b1da2..f68af9e37daa 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -260,8 +260,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
 		return ERR_CAST(epc_page);
 
 	encl->secs_child_cnt++;
-	sgx_mark_page_reclaimable(entry->epc_page);
-	entry->epc_page->flags |= SGX_EPC_OWNER_ENCL_PAGE;
+	sgx_record_epc_page(epc_page, SGX_EPC_OWNER_ENCL_PAGE |
+			    SGX_EPC_PAGE_RECLAIMER_TRACKED);
 
 	return entry;
 }
@@ -380,8 +380,8 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
 	encl_page->type = SGX_PAGE_TYPE_REG;
 	encl->secs_child_cnt++;
 
-	sgx_mark_page_reclaimable(encl_page->epc_page);
-	encl_page->epc_page->flags |= SGX_EPC_OWNER_ENCL_PAGE;
+	sgx_record_epc_page(epc_page, SGX_EPC_OWNER_ENCL_PAGE |
+			    SGX_EPC_PAGE_RECLAIMER_TRACKED);
 
 	phys_addr = sgx_get_epc_phys_addr(epc_page);
 	/*
@@ -697,7 +697,7 @@ void sgx_encl_release(struct kref *ref)
 			 * The page and its radix tree entry cannot be freed
 			 * if the page is being held by the reclaimer.
 			 */
-			if (sgx_unmark_page_reclaimable(entry->epc_page))
+			if (sgx_drop_epc_page(entry->epc_page))
 				continue;
 
 			sgx_encl_free_epc_page(entry->epc_page);
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index fe3e89cf013f..dd7ab1c80db6 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -324,8 +324,8 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
 			goto err_out;
 	}
 
-	sgx_mark_page_reclaimable(encl_page->epc_page);
-	encl_page->epc_page->flags |= SGX_EPC_OWNER_ENCL_PAGE;
+	sgx_record_epc_page(epc_page, SGX_EPC_OWNER_ENCL_PAGE |
+			    SGX_EPC_PAGE_RECLAIMER_TRACKED);
 	mutex_unlock(&encl->lock);
 	mmap_read_unlock(current->mm);
 	return ret;
@@ -964,7 +964,7 @@ static long sgx_enclave_modify_types(struct sgx_encl *encl,
 			 * Prevent page from being reclaimed while mutex
 			 * is released.
 			 */
-			if (sgx_unmark_page_reclaimable(entry->epc_page)) {
+			if (sgx_drop_epc_page(entry->epc_page)) {
 				ret = -EAGAIN;
 				goto out_entry_changed;
 			}
@@ -979,8 +979,8 @@ static long sgx_enclave_modify_types(struct sgx_encl *encl,
 
 			mutex_lock(&encl->lock);
 
-			sgx_mark_page_reclaimable(entry->epc_page);
-			entry->epc_page->flags |= SGX_EPC_OWNER_ENCL_PAGE;
+			sgx_record_epc_page(entry->epc_page, SGX_EPC_OWNER_ENCL_PAGE |
+					    SGX_EPC_PAGE_RECLAIMER_TRACKED);
 		}
 
 		/* Change EPC type */
@@ -1137,7 +1137,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
 			goto out_unlock;
 		}
 
-		if (sgx_unmark_page_reclaimable(entry->epc_page)) {
+		if (sgx_drop_epc_page(entry->epc_page)) {
 			ret = -EBUSY;
 			goto out_unlock;
 		}
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 71c3386ccf23..371135665ff7 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -268,7 +268,6 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 			goto out;
 
 		sgx_encl_ewb(encl->secs.epc_page, &secs_backing);
-
 		sgx_encl_free_epc_page(encl->secs.epc_page);
 		encl->secs.epc_page = NULL;
 
@@ -498,31 +497,34 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
 }
 
 /**
- * sgx_mark_page_reclaimable() - Mark a page as reclaimable
+ * sgx_record_epc_page() - Add a page to the appropriate LRU list
  * @page:	EPC page
+ * @flags:	The type of page that is being recorded
  *
- * Mark a page as reclaimable and add it to the active page list. Pages
- * are automatically removed from the active list when freed.
+ * Mark a page with the specified flags and add it to the appropriate
+ * list.
  */
-void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
+void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
 {
 	spin_lock(&sgx_global_lru.lock);
-	page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-	list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
+	page->flags |= flags;
+	if (flags & SGX_EPC_PAGE_RECLAIMER_TRACKED)
+		list_add_tail(&page->list, &sgx_global_lru.reclaimable);
 	spin_unlock(&sgx_global_lru.lock);
 }
 
 /**
- * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list
+ * sgx_drop_epc_page() - Remove a page from a LRU list
  * @page:	EPC page
  *
- * Clear the reclaimable flag and remove the page from the active page list.
+ * Clear the reclaimable flag if set and remove the page from its LRU.
  *
  * Return:
  *   0 on success,
  *   -EBUSY if the page is in the process of being reclaimed
  */
-int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
+int sgx_drop_epc_page(struct sgx_epc_page *page)
 {
 	spin_lock(&sgx_global_lru.lock);
 	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 77fceba73a25..c60bbd995942 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -113,8 +113,8 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void);
 void sgx_free_epc_page(struct sgx_epc_page *page);
 
 void sgx_reclaim_direct(void);
-void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
-int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
+void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
+int sgx_drop_epc_page(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
 
 void sgx_ipi_cb(void *info);
-- 
2.25.1



* [PATCH v3 06/28] x86/sgx: store unreclaimable EPC pages in sgx_epc_lru_lists
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (4 preceding siblings ...)
  2023-07-12 23:01 ` [PATCH v3 05/28] x86/sgx: Store reclaimable epc pages in sgx_epc_lru_lists Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-12 23:01 ` [PATCH v3 07/28] x86/sgx: Introduce EPC page states Haitao Huang
                   ` (24 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, Kristen Carlson Accardi, zhiquan1.li, seanjc

From: Kristen Carlson Accardi <kristen@linux.intel.com>

When an OOM event occurs, all pages associated with an enclave will
need to be freed, including pages that are not currently tracked by
the reclaimer.

A previous patch converted the SGX code to use a pair of generic
functions, sgx_record_epc_page() and sgx_drop_epc_page(), for storing
the EPC pages that are tracked by the reclaimer. This patch uses those
functions to store the remaining untracked pages in a new
"unreclaimable" list in struct sgx_epc_lru_lists.
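
As a rough illustration of the idea, here is a minimal userspace sketch
of routing a page to one of two lists based on its flags. All names
below (list_node, epc_page, lru_lists, record_page, drop_page) are
hypothetical stand-ins rather than the kernel's own types, and locking
is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's struct list_head. */
struct list_node { struct list_node *prev, *next; };

static void list_init(struct list_node *h) { h->prev = h->next = h; }

static void list_add_tail(struct list_node *n, struct list_node *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

static size_t list_len(const struct list_node *h)
{
	size_t n = 0;

	for (const struct list_node *p = h->next; p != h; p = p->next)
		n++;
	return n;
}

/* Flag values mirror the patch: "untracked" is simply the absence of bit 0. */
#define PAGE_RECLAIMER_UNTRACKED 0UL
#define PAGE_RECLAIMER_TRACKED   (1UL << 0)

struct epc_page { unsigned long flags; struct list_node list; };
struct lru_lists { struct list_node reclaimable, unreclaimable; };

static void lru_init(struct lru_lists *lru)
{
	list_init(&lru->reclaimable);
	list_init(&lru->unreclaimable);
}

/* Every recorded page lands on exactly one of the two lists. */
static void record_page(struct lru_lists *lru, struct epc_page *page,
			unsigned long flags)
{
	assert(!(page->flags & PAGE_RECLAIMER_TRACKED));
	page->flags |= flags;
	if (flags & PAGE_RECLAIMER_TRACKED)
		list_add_tail(&page->list, &lru->reclaimable);
	else
		list_add_tail(&page->list, &lru->unreclaimable);
}

/* Remove the page from whichever list holds it. */
static void drop_page(struct epc_page *page)
{
	list_del(&page->list);
	page->flags &= ~PAGE_RECLAIMER_TRACKED;
}
```

In this model a SECS or VA page would be recorded with
PAGE_RECLAIMER_UNTRACKED and a regular enclave page with
PAGE_RECLAIMER_TRACKED, so a later OOM path could walk both lists.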

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>

V3:
- Removed tracking of virtual EPC pages in the unreclaimable list, as the
host kernel does not reclaim them. The EPC cgroup implemented later only
blocks allocation for a guest once the limit is reached, by returning
-ENOMEM from sgx_alloc_epc_page() when called by virt_epc, and does
nothing else. Therefore, there is no need to track those pages in LRU lists.
---
 arch/x86/kernel/cpu/sgx/encl.c  | 8 ++++++--
 arch/x86/kernel/cpu/sgx/ioctl.c | 4 +++-
 arch/x86/kernel/cpu/sgx/main.c  | 3 +++
 arch/x86/kernel/cpu/sgx/sgx.h   | 5 +++++
 4 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index f68af9e37daa..edb8d8c1c229 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -252,7 +252,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
 		epc_page = sgx_encl_eldu(&encl->secs, NULL);
 		if (IS_ERR(epc_page))
 			return ERR_CAST(epc_page);
-		epc_page->flags |= SGX_EPC_OWNER_ENCL_PAGE;
+		sgx_record_epc_page(epc_page, SGX_EPC_OWNER_ENCL_PAGE |
+				    SGX_EPC_PAGE_RECLAIMER_UNTRACKED);
 	}
 
 	epc_page = sgx_encl_eldu(entry, encl->secs.epc_page);
@@ -724,6 +725,7 @@ void sgx_encl_release(struct kref *ref)
 	xa_destroy(&encl->page_array);
 
 	if (!encl->secs_child_cnt && encl->secs.epc_page) {
+		sgx_drop_epc_page(encl->secs.epc_page);
 		sgx_encl_free_epc_page(encl->secs.epc_page);
 		encl->secs.epc_page = NULL;
 	}
@@ -732,6 +734,7 @@ void sgx_encl_release(struct kref *ref)
 		va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
 					   list);
 		list_del(&va_page->list);
+		sgx_drop_epc_page(va_page->epc_page);
 		sgx_encl_free_epc_page(va_page->epc_page);
 		kfree(va_page);
 	}
@@ -1238,7 +1241,8 @@ struct sgx_epc_page *sgx_alloc_va_page(struct sgx_encl *encl, bool reclaim)
 		sgx_encl_free_epc_page(epc_page);
 		return ERR_PTR(-EFAULT);
 	}
-	epc_page->flags |= SGX_EPC_OWNER_ENCL;
+	sgx_record_epc_page(epc_page, SGX_EPC_OWNER_ENCL |
+			    SGX_EPC_PAGE_RECLAIMER_UNTRACKED);
 
 	return epc_page;
 }
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index dd7ab1c80db6..4e6d0c9d043a 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -48,6 +48,7 @@ void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page)
 	encl->page_cnt--;
 
 	if (va_page) {
+		sgx_drop_epc_page(va_page->epc_page);
 		sgx_encl_free_epc_page(va_page->epc_page);
 		list_del(&va_page->list);
 		kfree(va_page);
@@ -113,7 +114,8 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
 	encl->attributes = secs->attributes;
 	encl->attributes_mask = SGX_ATTR_UNPRIV_MASK;
 
-	encl->secs.epc_page->flags |= SGX_EPC_OWNER_ENCL_PAGE;
+	sgx_record_epc_page(encl->secs.epc_page, SGX_EPC_OWNER_ENCL_PAGE |
+			    SGX_EPC_PAGE_RECLAIMER_UNTRACKED);
 
 	/* Set only after completion, as encl->lock has not been taken. */
 	set_bit(SGX_ENCL_CREATED, &encl->flags);
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 371135665ff7..9252728865fa 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -268,6 +268,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 			goto out;
 
 		sgx_encl_ewb(encl->secs.epc_page, &secs_backing);
+		sgx_drop_epc_page(encl->secs.epc_page);
 		sgx_encl_free_epc_page(encl->secs.epc_page);
 		encl->secs.epc_page = NULL;
 
@@ -511,6 +512,8 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
 	page->flags |= flags;
 	if (flags & SGX_EPC_PAGE_RECLAIMER_TRACKED)
 		list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+	else
+		list_add_tail(&page->list, &sgx_global_lru.unreclaimable);
 	spin_unlock(&sgx_global_lru.lock);
 }
 
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index c60bbd995942..9f780b2c4cfe 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -23,6 +23,9 @@
 #define SGX_NR_LOW_PAGES		32
 #define SGX_NR_HIGH_PAGES		64
 
+/* Pages, which are not tracked by the page reclaimer. */
+#define SGX_EPC_PAGE_RECLAIMER_UNTRACKED 0
+
 /* Pages, which are being tracked by the page reclaimer. */
 #define SGX_EPC_PAGE_RECLAIMER_TRACKED	BIT(0)
 
@@ -101,12 +104,14 @@ struct sgx_epc_lru_lists {
 	/* Must acquire this lock to access */
 	spinlock_t lock;
 	struct list_head reclaimable;
+	struct list_head unreclaimable;
 };
 
 static inline void sgx_lru_init(struct sgx_epc_lru_lists *lrus)
 {
 	spin_lock_init(&lrus->lock);
 	INIT_LIST_HEAD(&lrus->reclaimable);
+	INIT_LIST_HEAD(&lrus->unreclaimable);
 }
 
 struct sgx_epc_page *__sgx_alloc_epc_page(void);
-- 
2.25.1



* [PATCH v3 07/28] x86/sgx: Introduce EPC page states
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (5 preceding siblings ...)
  2023-07-12 23:01 ` [PATCH v3 06/28] x86/sgx: store unreclaimable EPC " Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-12 23:01 ` [PATCH v3 08/28] x86/sgx: Introduce RECLAIM_IN_PROGRESS state Haitao Huang
                   ` (23 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, zhiquan1.li, kristen, seanjc

Use the lower 3 bits of the flags field in struct sgx_epc_page to
track the state of an EPC page through its life cycle, and define an
enum for the possible states. More states will be added later.
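
A minimal userspace sketch of this encoding, assuming the state lives
in bits 0-2 and an owner flag sits at bit 3 as in the header below.
The helper names (set_state, reset_state, is_reclaimable) are
hypothetical and operate on a bare flags word rather than on a page
structure.

```c
#include <assert.h>
#include <stdbool.h>

/* State values mirror the enum in the patch; names are shortened here. */
enum page_state {
	NOT_TRACKED   = 0,
	FREE          = 1,
	RECLAIMABLE   = 2,
	UNRECLAIMABLE = 3,
};

#define STATE_MASK      0x7UL		/* bits 0-2, i.e. GENMASK(2, 0) */
#define OWNER_ENCL_PAGE (1UL << 3)	/* first bit above the state field */

/* Replace the state while leaving the upper flag bits untouched. */
static void set_state(unsigned long *flags, unsigned long state)
{
	assert(state <= STATE_MASK);
	*flags &= ~STATE_MASK;
	*flags |= state & STATE_MASK;
}

/* Clearing the field is the same as moving to NOT_TRACKED (0). */
static void reset_state(unsigned long *flags)
{
	*flags &= ~STATE_MASK;
}

/*
 * States are values, not independent bits: UNRECLAIMABLE (3) shares a
 * bit with RECLAIMABLE (2), so the check must compare the whole field.
 */
static bool is_reclaimable(unsigned long flags)
{
	return (flags & STATE_MASK) == RECLAIMABLE;
}
```

The comparison against the masked field is the part that differs from
the earlier bit-flag approach, where a single AND was enough.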

Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>

V3:
- This is new in V3 to replace the bit mask based approach (requested by Jarkko)
---
 arch/x86/kernel/cpu/sgx/encl.c  | 10 +++----
 arch/x86/kernel/cpu/sgx/ioctl.c |  6 ++--
 arch/x86/kernel/cpu/sgx/main.c  | 19 +++++++------
 arch/x86/kernel/cpu/sgx/sgx.h   | 50 +++++++++++++++++++++++++++++----
 4 files changed, 63 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index edb8d8c1c229..e7319209fc4a 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -253,7 +253,7 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
 		if (IS_ERR(epc_page))
 			return ERR_CAST(epc_page);
 		sgx_record_epc_page(epc_page, SGX_EPC_OWNER_ENCL_PAGE |
-				    SGX_EPC_PAGE_RECLAIMER_UNTRACKED);
+				    SGX_EPC_PAGE_UNRECLAIMABLE);
 	}
 
 	epc_page = sgx_encl_eldu(entry, encl->secs.epc_page);
@@ -262,7 +262,7 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
 
 	encl->secs_child_cnt++;
 	sgx_record_epc_page(epc_page, SGX_EPC_OWNER_ENCL_PAGE |
-			    SGX_EPC_PAGE_RECLAIMER_TRACKED);
+			    SGX_EPC_PAGE_RECLAIMABLE);
 
 	return entry;
 }
@@ -382,7 +382,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
 	encl->secs_child_cnt++;
 
 	sgx_record_epc_page(epc_page, SGX_EPC_OWNER_ENCL_PAGE |
-			    SGX_EPC_PAGE_RECLAIMER_TRACKED);
+			    SGX_EPC_PAGE_RECLAIMABLE);
 
 	phys_addr = sgx_get_epc_phys_addr(epc_page);
 	/*
@@ -1242,7 +1242,7 @@ struct sgx_epc_page *sgx_alloc_va_page(struct sgx_encl *encl, bool reclaim)
 		return ERR_PTR(-EFAULT);
 	}
 	sgx_record_epc_page(epc_page, SGX_EPC_OWNER_ENCL |
-			    SGX_EPC_PAGE_RECLAIMER_UNTRACKED);
+			    SGX_EPC_PAGE_UNRECLAIMABLE);
 
 	return epc_page;
 }
@@ -1302,7 +1302,7 @@ void sgx_encl_free_epc_page(struct sgx_epc_page *page)
 {
 	int ret;
 
-	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
+	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_STATE_MASK);
 
 	ret = __eremove(sgx_get_epc_virt_addr(page));
 	if (WARN_ONCE(ret, EREMOVE_ERROR_MESSAGE, ret, ret))
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 4e6d0c9d043a..4f95096c9786 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -115,7 +115,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
 	encl->attributes_mask = SGX_ATTR_UNPRIV_MASK;
 
 	sgx_record_epc_page(encl->secs.epc_page, SGX_EPC_OWNER_ENCL_PAGE |
-			    SGX_EPC_PAGE_RECLAIMER_UNTRACKED);
+			    SGX_EPC_PAGE_UNRECLAIMABLE);
 
 	/* Set only after completion, as encl->lock has not been taken. */
 	set_bit(SGX_ENCL_CREATED, &encl->flags);
@@ -327,7 +327,7 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
 	}
 
 	sgx_record_epc_page(epc_page, SGX_EPC_OWNER_ENCL_PAGE |
-			    SGX_EPC_PAGE_RECLAIMER_TRACKED);
+			    SGX_EPC_PAGE_RECLAIMABLE);
 	mutex_unlock(&encl->lock);
 	mmap_read_unlock(current->mm);
 	return ret;
@@ -982,7 +982,7 @@ static long sgx_enclave_modify_types(struct sgx_encl *encl,
 			mutex_lock(&encl->lock);
 
 			sgx_record_epc_page(entry->epc_page, SGX_EPC_OWNER_ENCL_PAGE |
-					    SGX_EPC_PAGE_RECLAIMER_TRACKED);
+					    SGX_EPC_PAGE_RECLAIMABLE);
 		}
 
 		/* Change EPC type */
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 9252728865fa..02c358f10383 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -319,7 +319,7 @@ static void sgx_reclaim_pages(void)
 			/* The owner is freeing the page. No need to add the
 			 * page back to the list of reclaimable pages.
 			 */
-			epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+			sgx_epc_page_reset_state(epc_page);
 	}
 	spin_unlock(&sgx_global_lru.lock);
 
@@ -345,6 +345,7 @@ static void sgx_reclaim_pages(void)
 
 skip:
 		spin_lock(&sgx_global_lru.lock);
+		sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
 		list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
 		spin_unlock(&sgx_global_lru.lock);
 
@@ -368,7 +369,7 @@ static void sgx_reclaim_pages(void)
 		sgx_reclaimer_write(epc_page, &backing[i]);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
-		epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+		sgx_epc_page_reset_state(epc_page);
 
 		sgx_free_epc_page(epc_page);
 	}
@@ -508,9 +509,9 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
 void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
 {
 	spin_lock(&sgx_global_lru.lock);
-	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
+	WARN_ON_ONCE(sgx_epc_page_reclaimable(page->flags));
 	page->flags |= flags;
-	if (flags & SGX_EPC_PAGE_RECLAIMER_TRACKED)
+	if (sgx_epc_page_reclaimable(flags))
 		list_add_tail(&page->list, &sgx_global_lru.reclaimable);
 	else
 		list_add_tail(&page->list, &sgx_global_lru.unreclaimable);
@@ -530,7 +531,7 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
 int sgx_drop_epc_page(struct sgx_epc_page *page)
 {
 	spin_lock(&sgx_global_lru.lock);
-	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
+	if (sgx_epc_page_reclaimable(page->flags)) {
 		/* The page is being reclaimed. */
 		if (list_empty(&page->list)) {
 			spin_unlock(&sgx_global_lru.lock);
@@ -538,7 +539,7 @@ int sgx_drop_epc_page(struct sgx_epc_page *page)
 		}
 
 		list_del(&page->list);
-		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+		sgx_epc_page_reset_state(page);
 	}
 	spin_unlock(&sgx_global_lru.lock);
 
@@ -610,6 +611,8 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 	struct sgx_epc_section *section = &sgx_epc_sections[page->section];
 	struct sgx_numa_node *node = section->node;
 
+	WARN_ON_ONCE(page->flags & (SGX_EPC_PAGE_STATE_MASK));
+
 	spin_lock(&node->lock);
 
 	page->encl_page = NULL;
@@ -617,7 +620,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 		list_add(&page->list, &node->sgx_poison_page_list);
 	else
 		list_add_tail(&page->list, &node->free_page_list);
-	page->flags = SGX_EPC_PAGE_IS_FREE;
+	page->flags = SGX_EPC_PAGE_FREE;
 
 	spin_unlock(&node->lock);
 	atomic_long_inc(&sgx_nr_free_pages);
@@ -718,7 +721,7 @@ int arch_memory_failure(unsigned long pfn, int flags)
 	 * If the page is on a free list, move it to the per-node
 	 * poison page list.
 	 */
-	if (page->flags & SGX_EPC_PAGE_IS_FREE) {
+	if (page->flags == SGX_EPC_PAGE_FREE) {
 		list_move(&page->list, &node->sgx_poison_page_list);
 		goto out;
 	}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 9f780b2c4cfe..057905eba466 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -23,14 +23,36 @@
 #define SGX_NR_LOW_PAGES		32
 #define SGX_NR_HIGH_PAGES		64
 
-/* Pages, which are not tracked by the page reclaimer. */
-#define SGX_EPC_PAGE_RECLAIMER_UNTRACKED 0
+enum sgx_epc_page_state {
+	/* Not tracked by the reclaimer:
+	 * Pages allocated for virtual EPC which are never tracked by the host
+	 * reclaimer; pages just allocated from free list but not yet put in
+	 * use; pages just reclaimed, but not yet returned to the free list.
+	 * Becomes FREE after sgx_free_epc()
+	 * Becomes RECLAIMABLE or UNRECLAIMABLE after sgx_record_epc()
+	 */
+	SGX_EPC_PAGE_NOT_TRACKED = 0,
+
+	/* Page is in the free list, ready for allocation
+	 * Becomes NOT_TRACKED after sgx_alloc_epc_page()
+	 */
+	SGX_EPC_PAGE_FREE = 1,
+
+	/* Page is in use and tracked in a reclaimable LRU list
+	 * Becomes NOT_TRACKED after sgx_drop_epc()
+	 */
+	SGX_EPC_PAGE_RECLAIMABLE = 2,
+
+	/* Page is in use but tracked in an unreclaimable LRU list. These are
+	 * only reclaimable when the whole enclave is OOM killed or the enclave
+	 * is released, e.g., VA, SECS pages
+	 * Becomes NOT_TRACKED after sgx_drop_epc()
+	 */
+	SGX_EPC_PAGE_UNRECLAIMABLE = 3,
 
-/* Pages, which are being tracked by the page reclaimer. */
-#define SGX_EPC_PAGE_RECLAIMER_TRACKED	BIT(0)
+};
 
-/* Pages on free list */
-#define SGX_EPC_PAGE_IS_FREE		BIT(1)
+#define SGX_EPC_PAGE_STATE_MASK GENMASK(2, 0)
 
 /* flag for pages owned by a sgx_encl_page */
 #define SGX_EPC_OWNER_ENCL_PAGE		BIT(3)
@@ -49,6 +71,22 @@ struct sgx_epc_page {
 	struct list_head list;
 };
 
+static inline void sgx_epc_page_reset_state(struct sgx_epc_page *page)
+{
+	page->flags &= ~SGX_EPC_PAGE_STATE_MASK;
+}
+
+static inline void sgx_epc_page_set_state(struct sgx_epc_page *page, unsigned long flags)
+{
+	page->flags &= ~SGX_EPC_PAGE_STATE_MASK;
+	page->flags |= (flags & SGX_EPC_PAGE_STATE_MASK);
+}
+
+static inline bool sgx_epc_page_reclaimable(unsigned long flags)
+{
+	return SGX_EPC_PAGE_RECLAIMABLE == (flags & SGX_EPC_PAGE_STATE_MASK);
+}
+
 /*
  * Contains the tracking data for NUMA nodes having EPC pages. Most importantly,
  * the free page list local to the node is stored here.
-- 
2.25.1



* [PATCH v3 08/28] x86/sgx: Introduce RECLAIM_IN_PROGRESS state
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (6 preceding siblings ...)
  2023-07-12 23:01 ` [PATCH v3 07/28] x86/sgx: Introduce EPC page states Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-12 23:01 ` [PATCH v3 09/28] x86/sgx: Use a list to track to-be-reclaimed pages Haitao Huang
                   ` (22 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, Sean Christopherson, zhiquan1.li,
	kristen, seanjc

From: Sean Christopherson <sean.j.christopherson@intel.com>

When a page is being reclaimed from the page pool (sgx_global_lru),
there is an intermediate stage where a page may have been identified
as a candidate for reclaiming, but has not yet been reclaimed.
Currently such pages are list_del_init()'d from the global LRU, and
stored in an array on the stack. To prevent another thread from dropping
the same page in the middle of reclaiming, sgx_drop_epc_page() checks
for list_empty(&page->list).

In future patches these pages need to be list_move()'d into a temporary
list that is shared with multiple cgroup reclaimers, so list_empty()
can no longer be used for this purpose. Add a RECLAIM_IN_PROGRESS
state to explicitly designate this intermediate state of an EPC page in
the reclaiming process. Do not drop any page in this state in
sgx_drop_epc_page().
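
To make the intent concrete, here is a small userspace sketch of a
drop path that refuses pages currently claimed by a reclaimer. The
model_page type, the on_list field and both helpers are hypothetical;
the real code holds sgx_global_lru.lock around these steps.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* State values mirror the enum in the patch; names are shortened here. */
enum page_state {
	PAGE_NOT_TRACKED         = 0,
	PAGE_FREE                = 1,
	PAGE_RECLAIMABLE         = 2,
	PAGE_UNRECLAIMABLE       = 3,
	PAGE_RECLAIM_IN_PROGRESS = 4,
};

#define PAGE_STATE_MASK 0x7UL

struct model_page {
	unsigned long flags;
	bool on_list;	/* stands in for membership of some list */
};

/* Reclaimer side: claim a reclaimable page. It stays on a (temporary) list. */
static void isolate_for_reclaim(struct model_page *p)
{
	assert((p->flags & PAGE_STATE_MASK) == PAGE_RECLAIMABLE);
	p->flags = (p->flags & ~PAGE_STATE_MASK) | PAGE_RECLAIM_IN_PROGRESS;
}

/*
 * Owner side: the state, not list emptiness, says whether a reclaimer
 * holds the page.
 */
static int drop_page(struct model_page *p)
{
	if ((p->flags & PAGE_STATE_MASK) == PAGE_RECLAIM_IN_PROGRESS)
		return -EBUSY;

	p->on_list = false;
	p->flags &= ~PAGE_STATE_MASK;
	return 0;
}
```

Because the isolated page is still linked into a list in this scheme,
a list_empty() test would no longer distinguish it from an ordinary
LRU entry, which is why the check moves to the state field.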

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>

V3:
- Extend the sgx_epc_page_state enum introduced earlier to replace the
flag based approach.
---
 arch/x86/kernel/cpu/sgx/main.c | 21 ++++++++++-----------
 arch/x86/kernel/cpu/sgx/sgx.h  | 16 ++++++++++++++++
 2 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 02c358f10383..9eea9038758f 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -313,13 +313,15 @@ static void sgx_reclaim_pages(void)
 		list_del_init(&epc_page->list);
 		encl_page = epc_page->encl_page;
 
-		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0)
+		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
+			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
 			chunk[cnt++] = epc_page;
-		else
+		} else {
 			/* The owner is freeing the page. No need to add the
 			 * page back to the list of reclaimable pages.
 			 */
 			sgx_epc_page_reset_state(epc_page);
+		}
 	}
 	spin_unlock(&sgx_global_lru.lock);
 
@@ -531,16 +533,13 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
 int sgx_drop_epc_page(struct sgx_epc_page *page)
 {
 	spin_lock(&sgx_global_lru.lock);
-	if (sgx_epc_page_reclaimable(page->flags)) {
-		/* The page is being reclaimed. */
-		if (list_empty(&page->list)) {
-			spin_unlock(&sgx_global_lru.lock);
-			return -EBUSY;
-		}
-
-		list_del(&page->list);
-		sgx_epc_page_reset_state(page);
+	if (sgx_epc_page_reclaim_in_progress(page->flags)) {
+		spin_unlock(&sgx_global_lru.lock);
+		return -EBUSY;
 	}
+
+	list_del(&page->list);
+	sgx_epc_page_reset_state(page);
 	spin_unlock(&sgx_global_lru.lock);
 
 	return 0;
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 057905eba466..f26ed4c0d12f 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -40,6 +40,8 @@ enum sgx_epc_page_state {
 
 	/* Page is in use and tracked in a reclaimable LRU list
 	 * Becomes NOT_TRACKED after sgx_drop_epc()
+	 * Becomes RECLAIM_IN_PROGRESS in sgx_reclaim_pages() when identified
+	 * for reclaiming
 	 */
 	SGX_EPC_PAGE_RECLAIMABLE = 2,
 
@@ -50,6 +52,14 @@ enum sgx_epc_page_state {
 	 */
 	SGX_EPC_PAGE_UNRECLAIMABLE = 3,
 
+	/* Page is being prepared for reclamation, tracked in a temporary
+	 * isolated list by the reclaimer.
+	 * Changes in sgx_reclaim_pages() back to RECLAIMABLE if preparation
+	 * fails for any reason.
+	 * Becomes NOT_TRACKED if reclaimed successfully in sgx_reclaim_pages()
+	 * and immediately sgx_free_epc() is called to make it FREE.
+	 */
+	SGX_EPC_PAGE_RECLAIM_IN_PROGRESS = 4,
 };
 
 #define SGX_EPC_PAGE_STATE_MASK GENMASK(2, 0)
@@ -82,6 +92,12 @@ static inline void sgx_epc_page_set_state(struct sgx_epc_page *page, unsigned lo
 	page->flags |= (flags & SGX_EPC_PAGE_STATE_MASK);
 }
 
+static inline bool sgx_epc_page_reclaim_in_progress(unsigned long flags)
+{
+	return SGX_EPC_PAGE_RECLAIM_IN_PROGRESS == (flags &
+						    SGX_EPC_PAGE_STATE_MASK);
+}
+
 static inline bool sgx_epc_page_reclaimable(unsigned long flags)
 {
 	return SGX_EPC_PAGE_RECLAIMABLE == (flags & SGX_EPC_PAGE_STATE_MASK);
-- 
2.25.1



* [PATCH v3 09/28] x86/sgx: Use a list to track to-be-reclaimed pages
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (7 preceding siblings ...)
  2023-07-12 23:01 ` [PATCH v3 08/28] x86/sgx: Introduce RECLAIM_IN_PROGRESS state Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-12 23:01 ` [PATCH v3 10/28] x86/sgx: Allow reclaiming up to 32 pages, but scan 16 by default Haitao Huang
                   ` (21 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, Kristen Carlson Accardi, zhiquan1.li, seanjc

From: Kristen Carlson Accardi <kristen@linux.intel.com>

Change sgx_reclaim_pages() to use a list rather than an array for
storing the epc_pages which will be reclaimed. This change is needed
to transition to the LRU implementation for EPC cgroup support, which
uses lists to store reclaimable and unreclaimable pages.
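
The list-based flow can be sketched in userspace roughly as follows.
The node type, its "young" field and the helpers are hypothetical
stand-ins for struct list_head, the page age check and the kernel list
API; locking and the actual write-back are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an EPC page linked through a list_head. */
struct node {
	struct node *prev, *next;
	int young;	/* nonzero: recently accessed, should be skipped */
};

static void node_init(struct node *h) { h->prev = h->next = h; }

static int list_is_empty(const struct node *h) { return h->next == h; }

/* Unlink n from wherever it is and append it to the tail of h. */
static void move_tail(struct node *n, struct node *h)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Move up to nr nodes from the head of lru onto iso; return the count. */
static int isolate(struct node *lru, struct node *iso, int nr)
{
	int cnt = 0;

	while (cnt < nr && !list_is_empty(lru)) {
		move_tail(lru->next, iso);
		cnt++;
	}
	return cnt;
}

/*
 * Walk iso and hand skipped ("young") nodes back to the LRU tail.
 * The next pointer is saved first because the current node may move,
 * which is what list_for_each_entry_safe() does in the kernel.
 */
static int filter(struct node *lru, struct node *iso)
{
	struct node *n, *tmp;
	int kept = 0;

	for (n = iso->next; n != iso; n = tmp) {
		tmp = n->next;
		if (n->young)
			move_tail(n, lru);
		else
			kept++;
	}
	return kept;
}
```

Compared with the array version, skipped pages simply leave the
isolated list, so later passes need no NULL checks.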

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>

V3:
- Removed list wrappers
---
 arch/x86/kernel/cpu/sgx/main.c | 40 +++++++++++++++-------------------
 1 file changed, 18 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 9eea9038758f..f3a3ed894616 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -294,12 +294,11 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
  */
 static void sgx_reclaim_pages(void)
 {
-	struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
 	struct sgx_backing backing[SGX_NR_TO_SCAN];
+	struct sgx_epc_page *epc_page, *tmp;
 	struct sgx_encl_page *encl_page;
-	struct sgx_epc_page *epc_page;
 	pgoff_t page_index;
-	int cnt = 0;
+	LIST_HEAD(iso);
 	int ret;
 	int i;
 
@@ -315,18 +314,22 @@ static void sgx_reclaim_pages(void)
 
 		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
 			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
-			chunk[cnt++] = epc_page;
+			list_move_tail(&epc_page->list, &iso);
 		} else {
-			/* The owner is freeing the page. No need to add the
-			 * page back to the list of reclaimable pages.
+			/* The owner is freeing the page, remove it from the
+			 * LRU list
 			 */
 			sgx_epc_page_reset_state(epc_page);
+			list_del_init(&epc_page->list);
 		}
 	}
 	spin_unlock(&sgx_global_lru.lock);
 
-	for (i = 0; i < cnt; i++) {
-		epc_page = chunk[i];
+	if (list_empty(&iso))
+		return;
+
+	i = 0;
+	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
 		encl_page = epc_page->encl_page;
 
 		if (!sgx_reclaimer_age(epc_page))
@@ -341,6 +344,7 @@ static void sgx_reclaim_pages(void)
 			goto skip;
 		}
 
+		i++;
 		encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED;
 		mutex_unlock(&encl_page->encl->lock);
 		continue;
@@ -348,27 +352,19 @@ static void sgx_reclaim_pages(void)
 skip:
 		spin_lock(&sgx_global_lru.lock);
 		sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
-		list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
+		list_move_tail(&epc_page->list, &sgx_global_lru.reclaimable);
 		spin_unlock(&sgx_global_lru.lock);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
-
-		chunk[i] = NULL;
-	}
-
-	for (i = 0; i < cnt; i++) {
-		epc_page = chunk[i];
-		if (epc_page)
-			sgx_reclaimer_block(epc_page);
 	}
 
-	for (i = 0; i < cnt; i++) {
-		epc_page = chunk[i];
-		if (!epc_page)
-			continue;
+	list_for_each_entry(epc_page, &iso, list)
+		sgx_reclaimer_block(epc_page);
 
+	i = 0;
+	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
 		encl_page = epc_page->encl_page;
-		sgx_reclaimer_write(epc_page, &backing[i]);
+		sgx_reclaimer_write(epc_page, &backing[i++]);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
 		sgx_epc_page_reset_state(epc_page);
-- 
2.25.1



* [PATCH v3 10/28] x86/sgx: Allow reclaiming up to 32 pages, but scan 16 by default
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (8 preceding siblings ...)
  2023-07-12 23:01 ` [PATCH v3 09/28] x86/sgx: Use a list to track to-be-reclaimed pages Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-12 23:01 ` [PATCH v3 11/28] x86/sgx: Return the number of EPC pages that were successfully reclaimed Haitao Huang
                   ` (20 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, Sean Christopherson, zhiquan1.li,
	kristen, seanjc

From: Sean Christopherson <sean.j.christopherson@intel.com>

Modify sgx_reclaim_pages() to take a parameter that specifies the
number of pages to scan for reclaiming. Specify a max value of
32, but scan 16 in the usual case. This allows the number of pages
sgx_reclaim_pages() scans to be specified by the caller, and adjusted
in future patches.
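
One way to picture the relationship between the two constants is the
userspace sketch below. It assumes the caller's request bounds the
scan while the on-stack array is sized for the maximum; the function
reserve_backing() and its parameters are hypothetical and only model
the index guard, not the reclaim itself.

```c
#include <assert.h>

#define NR_TO_SCAN	16	/* default request, as in the patch */
#define NR_TO_SCAN_MAX	32	/* size of the on-stack backing array */

/*
 * Hand out backing slots for up to nr_to_scan candidates, never
 * indexing past the fixed-size array. Returns the number of slots used.
 */
static int reserve_backing(int nr_to_scan, int available)
{
	int slots[NR_TO_SCAN_MAX];
	int i = 0;

	assert(nr_to_scan >= 0 && available >= 0);

	for (int n = 0; n < nr_to_scan && n < available; n++) {
		if (i == NR_TO_SCAN_MAX)
			break;	/* no backing slot left */
		slots[i++] = n;
	}

	(void)slots;	/* only the count matters in this sketch */
	return i;
}
```

With the default request of 16 the guard never triggers; it only
matters once a caller asks for more than the array can hold.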

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/cpu/sgx/main.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index f3a3ed894616..cd5e5517866a 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -17,6 +17,10 @@
 #include "driver.h"
 #include "encl.h"
 #include "encls.h"
+/**
+ * Maximum number of pages to scan for reclaiming.
+ */
+#define SGX_NR_TO_SCAN_MAX	32
 
 struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
 static int sgx_nr_epc_sections;
@@ -279,7 +283,10 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 	mutex_unlock(&encl->lock);
 }
 
-/*
+/**
+ * sgx_reclaim_pages() - Reclaim EPC pages from the consumers
+ * @nr_to_scan:		 Number of EPC pages to scan for reclaim
+ *
  * Take a fixed number of pages from the head of the active page pool and
  * reclaim them to the enclave's private shmem files. Skip the pages, which have
  * been accessed since the last scan. Move those pages to the tail of active
@@ -292,9 +299,9 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
  * problematic as it would increase the lock contention too much, which would
  * halt forward progress.
  */
-static void sgx_reclaim_pages(void)
+static void sgx_reclaim_pages(int nr_to_scan)
 {
-	struct sgx_backing backing[SGX_NR_TO_SCAN];
+	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
 	struct sgx_epc_page *epc_page, *tmp;
 	struct sgx_encl_page *encl_page;
 	pgoff_t page_index;
@@ -332,7 +339,7 @@ static void sgx_reclaim_pages(void)
 	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
 		encl_page = epc_page->encl_page;
 
-		if (!sgx_reclaimer_age(epc_page))
+		if (i == SGX_NR_TO_SCAN_MAX || !sgx_reclaimer_age(epc_page))
 			goto skip;
 
 		page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);
@@ -387,7 +394,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
 void sgx_reclaim_direct(void)
 {
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
-		sgx_reclaim_pages();
+		sgx_reclaim_pages(SGX_NR_TO_SCAN);
 }
 
 static int ksgxd(void *p)
@@ -410,7 +417,7 @@ static int ksgxd(void *p)
 				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
 
 		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
-			sgx_reclaim_pages();
+			sgx_reclaim_pages(SGX_NR_TO_SCAN);
 
 		cond_resched();
 	}
@@ -582,7 +589,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		sgx_reclaim_pages();
+		sgx_reclaim_pages(SGX_NR_TO_SCAN);
 		cond_resched();
 	}
 
-- 
2.25.1



* [PATCH v3 11/28] x86/sgx: Return the number of EPC pages that were successfully reclaimed
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (9 preceding siblings ...)
  2023-07-12 23:01 ` [PATCH v3 10/28] x86/sgx: Allow reclaiming up to 32 pages, but scan 16 by default Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-29 12:47   ` Pavel Machek
  2023-07-12 23:01 ` [PATCH v3 12/28] x86/sgx: Add option to ignore age of page during EPC reclaim Haitao Huang
                   ` (19 subsequent siblings)
  30 siblings, 1 reply; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, Sean Christopherson, zhiquan1.li,
	kristen, seanjc

From: Sean Christopherson <sean.j.christopherson@intel.com>

Return the number of reclaimed pages from sgx_reclaim_pages(). The EPC
cgroup will use the result to track the success rate of its reclaim
calls, e.g. to escalate to a more forceful reclaiming mode if necessary.
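
A minimal sketch of how a caller might act on the returned count is shown
below. It is not taken from this series: the enum, the function name
toy_next_mode() and the "less than half" threshold are all assumptions made
purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reclaim modes; the series does not define these names. */
enum toy_mode {
	TOY_NORMAL,	/* respect page age */
	TOY_FORCEFUL,	/* e.g. ignore page age on the next pass */
};

/*
 * Decide the mode for the next pass from the success rate of the last one.
 * The 50% threshold is an arbitrary assumption for this sketch.
 */
static enum toy_mode toy_next_mode(size_t requested, size_t reclaimed)
{
	assert(reclaimed <= requested);

	return (reclaimed * 2 < requested) ? TOY_FORCEFUL : TOY_NORMAL;
}
```

A pass that reclaims most of what was requested keeps the normal mode,
while a pass that reclaims only a few pages selects the forceful one.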

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/cpu/sgx/main.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index cd5e5517866a..4fc931156972 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -299,15 +299,15 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
  * problematic as it would increase the lock contention too much, which would
  * halt forward progress.
  */
-static void sgx_reclaim_pages(int nr_to_scan)
+static size_t sgx_reclaim_pages(size_t nr_to_scan)
 {
 	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
 	struct sgx_epc_page *epc_page, *tmp;
 	struct sgx_encl_page *encl_page;
 	pgoff_t page_index;
 	LIST_HEAD(iso);
-	int ret;
-	int i;
+	size_t ret;
+	size_t i;
 
 	spin_lock(&sgx_global_lru.lock);
 	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
@@ -333,7 +333,7 @@ static void sgx_reclaim_pages(int nr_to_scan)
 	spin_unlock(&sgx_global_lru.lock);
 
 	if (list_empty(&iso))
-		return;
+		return 0;
 
 	i = 0;
 	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
@@ -378,6 +378,7 @@ static void sgx_reclaim_pages(int nr_to_scan)
 
 		sgx_free_epc_page(epc_page);
 	}
+	return i;
 }
 
 static bool sgx_should_reclaim(unsigned long watermark)
-- 
2.25.1



* [PATCH v3 12/28] x86/sgx: Add option to ignore age of page during EPC reclaim
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (10 preceding siblings ...)
  2023-07-12 23:01 ` [PATCH v3 11/28] x86/sgx: Return the number of EPC pages that were successfully reclaimed Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-12 23:01 ` [PATCH v3 13/28] x86/sgx: Prepare for multiple LRUs Haitao Huang
                   ` (18 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, Sean Christopherson, zhiquan1.li,
	kristen, seanjc

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add a flag to sgx_reclaim_pages() to instruct it to ignore the age of a
page, i.e. reclaim the page even if it's young.  The EPC cgroup will use
the flag to enforce its limits by draining the reclaimable lists before
resorting to other measures, e.g. forcefully reclaiming "unreclaimable"
pages by killing enclaves.
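
The skip condition can be modelled in isolation as follows. This is a
userspace sketch under stated assumptions: toy_age() is a simplified
stand-in for sgx_reclaimer_age() (assumed here to test and clear an
"accessed" marker and report whether the page is old), and struct toy_page
and toy_should_skip() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page model: 'accessed' means "touched since the last scan". */
struct toy_page {
	bool accessed;
};

/* Stand-in for the age check: clear the marker, return true if the page is old. */
static bool toy_age(struct toy_page *page)
{
	bool was_accessed = page->accessed;

	page->accessed = false;
	return !was_accessed;
}

/*
 * Mirror of the skip test: skip when the backing array is full, or when the
 * page is young and the caller did not ask to ignore age. With ignore_age
 * set, the age check is short-circuited and never runs.
 */
static bool toy_should_skip(struct toy_page *page, size_t filled, size_t max,
			    bool ignore_age)
{
	assert(filled <= max);

	return filled == max || (!ignore_age && !toy_age(page));
}
```

Note that, in this model, passing ignore_age leaves the accessed marker
untouched, because the age helper is not called at all.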

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/cpu/sgx/main.c | 44 +++++++++++++++++++++-------------
 1 file changed, 28 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 4fc931156972..ea0698db8698 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -34,6 +34,11 @@ static DEFINE_XARRAY(sgx_epc_address_space);
  */
 static struct sgx_epc_lru_lists sgx_global_lru;
 
+static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct sgx_epc_page *epc_page)
+{
+	return &sgx_global_lru;
+}
+
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
 /* Nodes with one or more EPC sections. */
@@ -286,6 +291,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 /**
  * sgx_reclaim_pages() - Reclaim EPC pages from the consumers
  * @nr_to_scan:		 Number of EPC pages to scan for reclaim
+ * @ignore_age:		 Reclaim a page even if it is young
  *
  * Take a fixed number of pages from the head of the active page pool and
  * reclaim them to the enclave's private shmem files. Skip the pages, which have
@@ -299,11 +305,12 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
  * problematic as it would increase the lock contention too much, which would
  * halt forward progress.
  */
-static size_t sgx_reclaim_pages(size_t nr_to_scan)
+static size_t sgx_reclaim_pages(size_t nr_to_scan, bool ignore_age)
 {
 	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
 	struct sgx_epc_page *epc_page, *tmp;
 	struct sgx_encl_page *encl_page;
+	struct sgx_epc_lru_lists *lru;
 	pgoff_t page_index;
 	LIST_HEAD(iso);
 	size_t ret;
@@ -339,7 +346,8 @@ static size_t sgx_reclaim_pages(size_t nr_to_scan)
 	list_for_each_entry_safe(epc_page, tmp, &iso, list) {
 		encl_page = epc_page->encl_page;
 
-		if (i == SGX_NR_TO_SCAN_MAX || !sgx_reclaimer_age(epc_page))
+		if (i == SGX_NR_TO_SCAN_MAX ||
+		    (!ignore_age && !sgx_reclaimer_age(epc_page)))
 			goto skip;
 
 		page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);
@@ -357,10 +365,11 @@ static size_t sgx_reclaim_pages(size_t nr_to_scan)
 		continue;
 
 skip:
-		spin_lock(&sgx_global_lru.lock);
+		lru = sgx_lru_lists(epc_page);
+		spin_lock(&lru->lock);
 		sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
-		list_move_tail(&epc_page->list, &sgx_global_lru.reclaimable);
-		spin_unlock(&sgx_global_lru.lock);
+		list_move_tail(&epc_page->list, &lru->reclaimable);
+		spin_unlock(&lru->lock);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
 	}
@@ -395,7 +404,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
 void sgx_reclaim_direct(void)
 {
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
-		sgx_reclaim_pages(SGX_NR_TO_SCAN);
+		sgx_reclaim_pages(SGX_NR_TO_SCAN, false);
 }
 
 static int ksgxd(void *p)
@@ -418,7 +427,7 @@ static int ksgxd(void *p)
 				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
 
 		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
-			sgx_reclaim_pages(SGX_NR_TO_SCAN);
+			sgx_reclaim_pages(SGX_NR_TO_SCAN, false);
 
 		cond_resched();
 	}
@@ -514,14 +523,16 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
  */
 void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
 {
-	spin_lock(&sgx_global_lru.lock);
+	struct sgx_epc_lru_lists *lru = sgx_lru_lists(page);
+
+	spin_lock(&lru->lock);
 	WARN_ON_ONCE(sgx_epc_page_reclaimable(page->flags));
 	page->flags |= flags;
 	if (sgx_epc_page_reclaimable(flags))
-		list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+		list_add_tail(&page->list, &lru->reclaimable);
 	else
-		list_add_tail(&page->list, &sgx_global_lru.unreclaimable);
-	spin_unlock(&sgx_global_lru.lock);
+		list_add_tail(&page->list, &lru->unreclaimable);
+	spin_unlock(&lru->lock);
 }
 
 /**
@@ -536,15 +547,16 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
  */
 int sgx_drop_epc_page(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_global_lru.lock);
+	struct sgx_epc_lru_lists *lru = sgx_lru_lists(page);
+
+	spin_lock(&lru->lock);
 	if (sgx_epc_page_reclaim_in_progress(page->flags)) {
-		spin_unlock(&sgx_global_lru.lock);
+		spin_unlock(&lru->lock);
 		return -EBUSY;
 	}
-
 	list_del(&page->list);
 	sgx_epc_page_reset_state(page);
-	spin_unlock(&sgx_global_lru.lock);
+	spin_unlock(&lru->lock);
 
 	return 0;
 }
@@ -590,7 +602,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		sgx_reclaim_pages(SGX_NR_TO_SCAN);
+		sgx_reclaim_pages(SGX_NR_TO_SCAN, false);
 		cond_resched();
 	}
 
-- 
2.25.1



* [PATCH v3 13/28] x86/sgx: Prepare for multiple LRUs
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (11 preceding siblings ...)
  2023-07-12 23:01 ` [PATCH v3 12/28] x86/sgx: Add option to ignore age of page during EPC reclaim Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-12 23:01 ` [PATCH v3 14/28] x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup Haitao Huang
                   ` (17 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, Sean Christopherson, zhiquan1.li,
	kristen, seanjc

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add sgx_can_reclaim() wrapper so that in a subsequent patch, multiple LRUs
can be used cleanly.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/cpu/sgx/main.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index ea0698db8698..a829555b9675 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -390,10 +390,15 @@ static size_t sgx_reclaim_pages(size_t nr_to_scan, bool ignore_age)
 	return i;
 }
 
+static bool sgx_can_reclaim(void)
+{
+	return !list_empty(&sgx_global_lru.reclaimable);
+}
+
 static bool sgx_should_reclaim(unsigned long watermark)
 {
 	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
-	       !list_empty(&sgx_global_lru.reclaimable);
+		sgx_can_reclaim();
 }
 
 /*
@@ -589,7 +594,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		if (list_empty(&sgx_global_lru.reclaimable))
+		if (!sgx_can_reclaim())
 			return ERR_PTR(-ENOMEM);
 
 		if (!reclaim) {
-- 
2.25.1



* [PATCH v3 14/28] x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (12 preceding siblings ...)
  2023-07-12 23:01 ` [PATCH v3 13/28] x86/sgx: Prepare for multiple LRUs Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-12 23:01 ` [PATCH v3 15/28] x86/sgx: Add helper to grab pages from an arbitrary EPC LRU Haitao Huang
                   ` (16 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, Sean Christopherson, zhiquan1.li,
	kristen, seanjc

From: Sean Christopherson <sean.j.christopherson@intel.com>

Expose the top-level reclaim function as sgx_reclaim_epc_pages() for use
by the upcoming EPC cgroup, which will initiate reclaim to enforce
changes to high/max limits.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/cpu/sgx/main.c | 10 +++++-----
 arch/x86/kernel/cpu/sgx/sgx.h  |  1 +
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index a829555b9675..e9c9e0d97300 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -289,7 +289,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 }
 
 /**
- * sgx_reclaim_pages() - Reclaim EPC pages from the consumers
+ * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
  * @nr_to_scan:		 Number of EPC pages to scan for reclaim
  * @ignore_age:		 Reclaim a page even if it is young
  *
@@ -305,7 +305,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
  * problematic as it would increase the lock contention too much, which would
  * halt forward progress.
  */
-static size_t sgx_reclaim_pages(size_t nr_to_scan, bool ignore_age)
+size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
 {
 	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
 	struct sgx_epc_page *epc_page, *tmp;
@@ -409,7 +409,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
 void sgx_reclaim_direct(void)
 {
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
-		sgx_reclaim_pages(SGX_NR_TO_SCAN, false);
+		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
 }
 
 static int ksgxd(void *p)
@@ -432,7 +432,7 @@ static int ksgxd(void *p)
 				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
 
 		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
-			sgx_reclaim_pages(SGX_NR_TO_SCAN, false);
+			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
 
 		cond_resched();
 	}
@@ -607,7 +607,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		sgx_reclaim_pages(SGX_NR_TO_SCAN, false);
+		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
 		cond_resched();
 	}
 
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index f26ed4c0d12f..98d3b15341b1 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -175,6 +175,7 @@ void sgx_reclaim_direct(void);
 void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
 int sgx_drop_epc_page(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
+size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age);
 
 void sgx_ipi_cb(void *info);
 
-- 
2.25.1



* [PATCH v3 15/28] x86/sgx: Add helper to grab pages from an arbitrary EPC LRU
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (13 preceding siblings ...)
  2023-07-12 23:01 ` [PATCH v3 14/28] x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-12 23:01 ` [PATCH v3 16/28] x86/sgx: Add EPC OOM path to forcefully reclaim EPC Haitao Huang
                   ` (15 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, Sean Christopherson, zhiquan1.li,
	kristen, seanjc

From: Sean Christopherson <sean.j.christopherson@intel.com>

Move the isolation loop into a helper, sgx_isolate_epc_pages(), in
preparation for the existence of multiple LRUs. Expose the helper to other
SGX code so that it can be called from the EPC cgroup code, e.g. to
isolate pages from a single cgroup LRU. Exposing the isolation loop
allows the cgroup iteration logic to be wholly encapsulated within the
cgroup code.
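
The shape of the isolation loop can be sketched in userspace C as below.
This is an illustration only, with several assumptions: the toy_* types and
helpers are hypothetical, a simple singly linked queue replaces the kernel
list_head, locking is omitted, 'owner_alive' stands in for
kref_get_unless_zero() succeeding, and the sketch returns a count whereas
the helper in this patch returns void.

```c
#include <assert.h>
#include <stddef.h>

struct toy_page {
	struct toy_page *next;
	int owner_alive;	/* stand-in for a successful reference grab */
	int isolated;		/* stand-in for RECLAIM_IN_PROGRESS state */
};

struct toy_list {
	struct toy_page *head, *tail;
};

/* Remove and return the first page, or NULL if the list is empty. */
static struct toy_page *toy_pop(struct toy_list *l)
{
	struct toy_page *p = l->head;

	if (p) {
		l->head = p->next;
		if (!l->head)
			l->tail = NULL;
		p->next = NULL;
	}
	return p;
}

/* Append a page to the tail of the list. */
static void toy_push_tail(struct toy_list *l, struct toy_page *p)
{
	p->next = NULL;
	if (l->tail)
		l->tail->next = p;
	else
		l->head = p;
	l->tail = p;
}

/*
 * Scan up to nr_to_scan pages from the head of 'lru'. Pages whose owner is
 * still alive are marked isolated and moved to 'dst'; pages whose owner is
 * going away are simply dropped from the LRU. Returns the number moved.
 */
static size_t toy_isolate(struct toy_list *lru, size_t nr_to_scan,
			  struct toy_list *dst)
{
	size_t moved = 0;

	assert(lru && dst);

	for (; nr_to_scan > 0; --nr_to_scan) {
		struct toy_page *p = toy_pop(lru);

		if (!p)
			break;

		if (p->owner_alive) {
			p->isolated = 1;
			toy_push_tail(dst, p);
			moved++;
		}
	}
	return moved;
}
```

Because the loop takes the LRU as a parameter, the same routine can serve a
global list or a per-cgroup list without knowing which one it was given.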

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/cpu/sgx/main.c | 60 +++++++++++++++++++++-------------
 arch/x86/kernel/cpu/sgx/sgx.h  |  2 ++
 2 files changed, 40 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index e9c9e0d97300..883470062514 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -288,6 +288,43 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 	mutex_unlock(&encl->lock);
 }
 
+/**
+ * sgx_isolate_epc_pages() - Isolate pages from an LRU for reclaim
+ * @lru:	LRU from which to reclaim
+ * @nr_to_scan:	Number of pages to scan for reclaim
+ * @dst:	Destination list to hold the isolated pages
+ */
+void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
+			   struct list_head *dst)
+{
+	struct sgx_encl_page *encl_page;
+	struct sgx_epc_page *epc_page;
+
+	spin_lock(&lru->lock);
+	for (; nr_to_scan > 0; --nr_to_scan) {
+		epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
+		if (!epc_page)
+			break;
+
+		encl_page = epc_page->encl_page;
+
+		if (WARN_ON_ONCE(!(epc_page->flags & SGX_EPC_OWNER_ENCL_PAGE)))
+			continue;
+
+		if (kref_get_unless_zero(&encl_page->encl->refcount)) {
+			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
+			list_move_tail(&epc_page->list, dst);
+		} else {
+			/* The owner is freeing the page, remove it from the
+			 * LRU list
+			 */
+			sgx_epc_page_reset_state(epc_page);
+			list_del_init(&epc_page->list);
+		}
+	}
+	spin_unlock(&lru->lock);
+}
+
 /**
  * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
  * @nr_to_scan:		 Number of EPC pages to scan for reclaim
@@ -316,28 +353,7 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
 	size_t ret;
 	size_t i;
 
-	spin_lock(&sgx_global_lru.lock);
-	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
-		epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
-						    struct sgx_epc_page, list);
-		if (!epc_page)
-			break;
-
-		list_del_init(&epc_page->list);
-		encl_page = epc_page->encl_page;
-
-		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
-			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
-			list_move_tail(&epc_page->list, &iso);
-		} else {
-			/* The owner is freeing the page, remove it from the
-			 * LRU list
-			 */
-			sgx_epc_page_reset_state(epc_page);
-			list_del_init(&epc_page->list);
-		}
-	}
-	spin_unlock(&sgx_global_lru.lock);
+	sgx_isolate_epc_pages(&sgx_global_lru, nr_to_scan, &iso);
 
 	if (list_empty(&iso))
 		return 0;
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 98d3b15341b1..25db815f5add 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -176,6 +176,8 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
 int sgx_drop_epc_page(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
 size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age);
+void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lrus, size_t nr_to_scan,
+			   struct list_head *dst);
 
 void sgx_ipi_cb(void *info);
 
-- 
2.25.1



* [PATCH v3 16/28] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (14 preceding siblings ...)
  2023-07-12 23:01 ` [PATCH v3 15/28] x86/sgx: Add helper to grab pages from an arbitrary EPC LRU Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-12 23:01 ` [PATCH v3 17/28] x86/sgx: fix a NULL pointer Haitao Huang
                   ` (14 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, Sean Christopherson, zhiquan1.li,
	kristen, seanjc

From: Sean Christopherson <sean.j.christopherson@intel.com>

Introduce the OOM path for killing an enclave when the reclaimer is no
longer able to reclaim enough EPC pages. Find a victim enclave, which
will be an enclave with EPC pages remaining that are not accessible to
the reclaimer ("unreclaimable"). Once a victim is identified, mark the
enclave as OOM, zap the enclave's entire page range, and drain all mm
references in encl->mm_list. Block allocating any EPC pages in the #PF
handler, reloading any pages in all paths, and creating any new mappings.

The OOM killing path may race with the reclaimers: in some cases, the
victim enclave is in the process of reclaiming the last EPC pages when
OOM happens, that is, all pages other than SECS and VA pages are in
RECLAIM_IN_PROGRESS state. The reclaiming process requires access to
the enclave backing, VA pages as well as SECS. So the OOM killer does
not directly release those enclave resources. Instead, it lets all
in-progress reclaim finish, and relies (as currently done) on
kref_put() on encl->refcount to trigger sgx_encl_release() to do the
final cleanup.
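
The gating behaviour described above can be modelled with a small userspace
sketch. It is a simplification under stated assumptions: struct toy_encl,
toy_oom_encl() and toy_add_page() are hypothetical names, 'zap_count'
stands in for zapping mappings and draining mm_list, and no reference
counting or locking is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Flag bits chosen to match the positions used in encl.h in this patch. */
enum {
	TOY_ENCL_CREATED = 1u << 2,
	TOY_ENCL_OOM	 = 1u << 4,
};

struct toy_encl {
	unsigned int flags;
	unsigned int zap_count;	/* how many times the zap/drain step ran */
};

/*
 * Model of the victim handling: ignore enclaves that were never created,
 * do the zap/drain step exactly once, and still report success on repeat
 * calls so the caller can wait for in-progress reclaim to finish.
 */
static bool toy_oom_encl(struct toy_encl *encl)
{
	assert(encl);

	if (!(encl->flags & TOY_ENCL_CREATED))
		return false;

	if (encl->flags & TOY_ENCL_OOM)
		return true;

	encl->flags |= TOY_ENCL_OOM;
	encl->zap_count++;
	return true;
}

/* Model of an allocation path: refuse new pages once the enclave is OOM. */
static int toy_add_page(const struct toy_encl *encl)
{
	assert(encl);

	return (encl->flags & TOY_ENCL_OOM) ? -ENOMEM : 0;
}
```

In this model a second OOM call on the same enclave is a no-op that still
returns true, and every later page addition fails with -ENOMEM.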

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>

V3:
- Rebased to use the new VMA_ITERATOR to zap VMAs.
- Fixed the racing cases by blocking new page allocation/mapping and
reloading when the enclave is marked for OOM. Do not release any enclave
resources other than draining mm_list entries, and let pages in
RECLAIM_IN_PROGRESS be reaped by reclaimers.
- Due to the above changes, also removed the no-longer-needed encl->lock in
the OOM path, which was causing deadlocks reported by the lock prover.
---
 arch/x86/kernel/cpu/sgx/driver.c |  27 +-----
 arch/x86/kernel/cpu/sgx/encl.c   |  48 ++++++++++-
 arch/x86/kernel/cpu/sgx/encl.h   |   2 +
 arch/x86/kernel/cpu/sgx/ioctl.c  |   9 ++
 arch/x86/kernel/cpu/sgx/main.c   | 140 +++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/sgx.h    |   1 +
 6 files changed, 200 insertions(+), 27 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
index 262f5fb18d74..ff42d649c7b6 100644
--- a/arch/x86/kernel/cpu/sgx/driver.c
+++ b/arch/x86/kernel/cpu/sgx/driver.c
@@ -44,7 +44,6 @@ static int sgx_open(struct inode *inode, struct file *file)
 static int sgx_release(struct inode *inode, struct file *file)
 {
 	struct sgx_encl *encl = file->private_data;
-	struct sgx_encl_mm *encl_mm;
 
 	/*
 	 * Drain the remaining mm_list entries. At this point the list contains
@@ -52,31 +51,7 @@ static int sgx_release(struct inode *inode, struct file *file)
 	 * not exited yet. The processes, which have exited, are gone from the
 	 * list by sgx_mmu_notifier_release().
 	 */
-	for ( ; ; )  {
-		spin_lock(&encl->mm_lock);
-
-		if (list_empty(&encl->mm_list)) {
-			encl_mm = NULL;
-		} else {
-			encl_mm = list_first_entry(&encl->mm_list,
-						   struct sgx_encl_mm, list);
-			list_del_rcu(&encl_mm->list);
-		}
-
-		spin_unlock(&encl->mm_lock);
-
-		/* The enclave is no longer mapped by any mm. */
-		if (!encl_mm)
-			break;
-
-		synchronize_srcu(&encl->srcu);
-		mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
-		kfree(encl_mm);
-
-		/* 'encl_mm' is gone, put encl_mm->encl reference: */
-		kref_put(&encl->refcount, sgx_encl_release);
-	}
-
+	sgx_encl_mm_drain(encl);
 	kref_put(&encl->refcount, sgx_encl_release);
 	return 0;
 }
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index e7319209fc4a..c321c848baa9 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -430,6 +430,9 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
 	if (unlikely(!encl))
 		return VM_FAULT_SIGBUS;
 
+	if (test_bit(SGX_ENCL_OOM, &encl->flags))
+		return VM_FAULT_SIGBUS;
+
 	/*
 	 * The page_array keeps track of all enclave pages, whether they
 	 * are swapped out or not. If there is no entry for this page and
@@ -628,7 +631,8 @@ static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
 	if (!encl)
 		return -EFAULT;
 
-	if (!test_bit(SGX_ENCL_DEBUG, &encl->flags))
+	if (!test_bit(SGX_ENCL_DEBUG, &encl->flags) ||
+	    test_bit(SGX_ENCL_OOM, &encl->flags))
 		return -EFAULT;
 
 	for (i = 0; i < len; i += cnt) {
@@ -753,6 +757,45 @@ void sgx_encl_release(struct kref *ref)
 	kfree(encl);
 }
 
+/**
+ * sgx_encl_mm_drain - drain all mm_list entries
+ * @encl:	address of the sgx_encl to drain
+ *
+ * Used during oom kill to empty the mm_list entries after they have been
+ * zapped. Or used by sgx_release to drain the remaining mm_list entries when
+ * the enclave fd is closing. After this call, sgx_encl_release will be called
+ * with kref_put.
+ */
+void sgx_encl_mm_drain(struct sgx_encl *encl)
+{
+	struct sgx_encl_mm *encl_mm;
+
+	for ( ; ; )  {
+		spin_lock(&encl->mm_lock);
+
+		if (list_empty(&encl->mm_list)) {
+			encl_mm = NULL;
+		} else {
+			encl_mm = list_first_entry(&encl->mm_list,
+						   struct sgx_encl_mm, list);
+			list_del_rcu(&encl_mm->list);
+		}
+
+		spin_unlock(&encl->mm_lock);
+
+		/* The enclave is no longer mapped by any mm. */
+		if (!encl_mm)
+			break;
+
+		synchronize_srcu(&encl->srcu);
+		mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
+		kfree(encl_mm);
+
+		/* 'encl_mm' is gone, put encl_mm->encl reference: */
+		kref_put(&encl->refcount, sgx_encl_release);
+	}
+}
+
 /*
  * 'mm' is exiting and no longer needs mmu notifications.
  */
@@ -822,6 +865,9 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
 	struct sgx_encl_mm *encl_mm;
 	int ret;
 
+	if (test_bit(SGX_ENCL_OOM, &encl->flags))
+		return -ENOMEM;
+
 	/*
 	 * Even though a single enclave may be mapped into an mm more than once,
 	 * each 'mm' only appears once on encl->mm_list. This is guaranteed by
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index 831d63f80f5a..47792fb00cee 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -39,6 +39,7 @@ enum sgx_encl_flags {
 	SGX_ENCL_DEBUG		= BIT(1),
 	SGX_ENCL_CREATED	= BIT(2),
 	SGX_ENCL_INITIALIZED	= BIT(3),
+	SGX_ENCL_OOM		= BIT(4),
 };
 
 struct sgx_encl_mm {
@@ -125,5 +126,6 @@ struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
 					 unsigned long addr);
 struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim);
 void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page);
+void sgx_encl_mm_drain(struct sgx_encl *encl);
 
 #endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 4f95096c9786..2c159168f346 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -420,6 +420,9 @@ static long sgx_ioc_enclave_add_pages(struct sgx_encl *encl, void __user *arg)
 	    test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
 		return -EINVAL;
 
+	if (test_bit(SGX_ENCL_OOM, &encl->flags))
+		return -ENOMEM;
+
 	if (copy_from_user(&add_arg, arg, sizeof(add_arg)))
 		return -EFAULT;
 
@@ -605,6 +608,9 @@ static long sgx_ioc_enclave_init(struct sgx_encl *encl, void __user *arg)
 	    test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
 		return -EINVAL;
 
+	if (test_bit(SGX_ENCL_OOM, &encl->flags))
+		return -ENOMEM;
+
 	if (copy_from_user(&init_arg, arg, sizeof(init_arg)))
 		return -EFAULT;
 
@@ -681,6 +687,9 @@ static int sgx_ioc_sgx2_ready(struct sgx_encl *encl)
 	if (!test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
 		return -EINVAL;
 
+	if (test_bit(SGX_ENCL_OOM, &encl->flags))
+		return -ENOMEM;
+
 	return 0;
 }
 
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 883470062514..9ea487469e4c 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -662,6 +662,146 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 	atomic_long_inc(&sgx_nr_free_pages);
 }
 
+static bool sgx_oom_get_ref(struct sgx_epc_page *epc_page)
+{
+	struct sgx_encl *encl;
+
+	if (epc_page->flags & SGX_EPC_OWNER_ENCL_PAGE)
+		encl = epc_page->encl_page->encl;
+	else if (epc_page->flags & SGX_EPC_OWNER_ENCL)
+		encl = epc_page->encl;
+	else
+		return false;
+
+	return kref_get_unless_zero(&encl->refcount);
+}
+
+static struct sgx_epc_page *sgx_oom_get_victim(struct sgx_epc_lru_lists *lru)
+{
+	struct sgx_epc_page *epc_page, *tmp;
+
+	if (list_empty(&lru->unreclaimable))
+		return NULL;
+
+	list_for_each_entry_safe(epc_page, tmp, &lru->unreclaimable, list) {
+		list_del_init(&epc_page->list);
+
+		if (sgx_oom_get_ref(epc_page))
+			return epc_page;
+	}
+	return NULL;
+}
+
+static void sgx_epc_oom_zap(void *owner, struct mm_struct *mm, unsigned long start,
+			    unsigned long end, const struct vm_operations_struct *ops)
+{
+	VMA_ITERATOR(vmi, mm, start);
+	struct vm_area_struct *vma;
+
+	/*
+	 * Use end because start can be zero and not mapped into
+	 * enclave even if encl->base = 0.
+	 */
+	for_each_vma_range(vmi, vma, end) {
+		if (vma->vm_ops == ops && vma->vm_private_data == owner &&
+		    vma->vm_start < end) {
+			zap_vma_pages(vma);
+		}
+	}
+}
+
+static bool sgx_oom_encl(struct sgx_encl *encl)
+{
+	unsigned long mm_list_version;
+	struct sgx_encl_mm *encl_mm;
+	bool ret = false;
+	int idx;
+
+	if (!test_bit(SGX_ENCL_CREATED, &encl->flags))
+		goto out_put;
+
+	/* Done OOM on this enclave previously, do not redo it.
+	 * This may happen when the SECS page is still UNRECLAIMABLE because
+	 * another page is in RECLAIM_IN_PROGRESS. Still return true so the OOM
+	 * killer can wait until the reclaimer is done with the hold-up page and
+	 * SECS before it moves on to find another victim.
+	 */
+	if (test_bit(SGX_ENCL_OOM, &encl->flags))
+		goto out;
+
+	set_bit(SGX_ENCL_OOM, &encl->flags);
+
+	do {
+		mm_list_version = encl->mm_list_version;
+
+		/* Pairs with smp_rmb() in sgx_encl_mm_add(). */
+		smp_rmb();
+
+		idx = srcu_read_lock(&encl->srcu);
+
+		list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
+			if (!mmget_not_zero(encl_mm->mm))
+				continue;
+
+			mmap_read_lock(encl_mm->mm);
+
+			sgx_epc_oom_zap(encl, encl_mm->mm, encl->base,
+					encl->base + encl->size, &sgx_vm_ops);
+
+			mmap_read_unlock(encl_mm->mm);
+
+			mmput_async(encl_mm->mm);
+		}
+
+		srcu_read_unlock(&encl->srcu, idx);
+	} while (WARN_ON_ONCE(encl->mm_list_version != mm_list_version));
+
+	sgx_encl_mm_drain(encl);
+out:
+	ret = true;
+
+out_put:
+	/*
+	 * This puts the refcount we took when we identified this enclave as
+	 * an OOM victim.
+	 */
+	kref_put(&encl->refcount, sgx_encl_release);
+	return ret;
+}
+
+static inline bool sgx_oom_encl_page(struct sgx_encl_page *encl_page)
+{
+	return sgx_oom_encl(encl_page->encl);
+}
+
+/**
+ * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
+ * @lru:	LRU that is low
+ *
+ * Return:	%true if a victim was found and kicked.
+ */
+bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
+{
+	struct sgx_epc_page *victim;
+
+	spin_lock(&lru->lock);
+	victim = sgx_oom_get_victim(lru);
+	spin_unlock(&lru->lock);
+
+	if (!victim)
+		return false;
+
+	if (victim->flags & SGX_EPC_OWNER_ENCL_PAGE)
+		return sgx_oom_encl_page(victim->encl_page);
+
+	if (victim->flags & SGX_EPC_OWNER_ENCL)
+		return sgx_oom_encl(victim->encl);
+
+	/* Will never happen unless we add more owner types in future */
+	WARN_ON_ONCE(1);
+	return false;
+}
+
 static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 					 unsigned long index,
 					 struct sgx_epc_section *section)
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 25db815f5add..c6b3c90db0fa 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -178,6 +178,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
 size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age);
 void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lrus, size_t nr_to_scan,
 			   struct list_head *dst);
+bool sgx_epc_oom(struct sgx_epc_lru_lists *lrus);
 
 void sgx_ipi_cb(void *info);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v3 17/28] x86/sgx: fix a NULL pointer
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (15 preceding siblings ...)
  2023-07-12 23:01 ` [PATCH v3 16/28] x86/sgx: Add EPC OOM path to forcefully reclaim EPC Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-17 12:48   ` Jarkko Sakkinen
  2023-07-12 23:01 ` [PATCH v3 18/28] cgroup/misc: Fix an overflow Haitao Huang
                   ` (13 subsequent siblings)
  30 siblings, 1 reply; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, zhiquan1.li, kristen, seanjc

Under heavy load, the SGX EPC reclaimers (ksgxd or the future EPC cgroup
worker) may reclaim the SECS EPC page of an enclave and set
encl->secs.epc_page to NULL. However, the #PF handler needs the SECS EPC
page for EAUG and uses it without checking for NULL and reloading it.

Fix this by checking whether the SECS page is loaded before EAUG and
reloading it if it was reclaimed.
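
The check-then-reload pattern can be sketched in userspace C as below.
This is only an illustration, not the kernel code: struct page, struct
encl and load_secs() are hypothetical stand-ins, and unlike the real
reload path (which can fail and return an error pointer) this model
always succeeds.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the EPC page and enclave structures. */
struct page {
	int id;
};

struct encl {
	struct page *secs_page;	/* NULL once the reclaimer evicted it */
	struct page backing;	/* where the evicted contents live */
	int reloads;		/* how many times a reload was needed */
};

/* Return the SECS page, reloading it first if it was reclaimed. */
static struct page *load_secs(struct encl *e)
{
	if (!e->secs_page) {
		e->secs_page = &e->backing;
		e->reloads++;
	}
	return e->secs_page;
}
```

Every caller that needs the SECS page goes through the helper, so no
path can see a NULL pointer left behind by the reclaimer.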

Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
---
 arch/x86/kernel/cpu/sgx/encl.c | 30 +++++++++++++++++++++++-------
 arch/x86/kernel/cpu/sgx/main.c |  4 ++++
 2 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index c321c848baa9..028d1b9d6572 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -235,6 +235,19 @@ static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page,
 	return epc_page;
 }
 
+static struct sgx_epc_page *sgx_encl_load_secs(struct sgx_encl *encl)
+{
+	struct sgx_epc_page *epc_page = encl->secs.epc_page;
+
+	if (!epc_page) {
+		epc_page = sgx_encl_eldu(&encl->secs, NULL);
+		if (!IS_ERR(epc_page))
+			sgx_record_epc_page(epc_page, SGX_EPC_OWNER_ENCL_PAGE |
+					    SGX_EPC_PAGE_UNRECLAIMABLE);
+	}
+	return epc_page;
+}
+
 static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
 						  struct sgx_encl_page *entry)
 {
@@ -248,13 +261,9 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
 		return entry;
 	}
 
-	if (!(encl->secs.epc_page)) {
-		epc_page = sgx_encl_eldu(&encl->secs, NULL);
-		if (IS_ERR(epc_page))
-			return ERR_CAST(epc_page);
-		sgx_record_epc_page(epc_page, SGX_EPC_OWNER_ENCL_PAGE |
-				    SGX_EPC_PAGE_UNRECLAIMABLE);
-	}
+	epc_page = sgx_encl_load_secs(encl);
+	if (IS_ERR(epc_page))
+		return ERR_CAST(epc_page);
 
 	epc_page = sgx_encl_eldu(entry, encl->secs.epc_page);
 	if (IS_ERR(epc_page))
@@ -342,6 +351,13 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
 
 	mutex_lock(&encl->lock);
 
+	epc_page = sgx_encl_load_secs(encl);
+	if (IS_ERR(epc_page)) {
+		if (PTR_ERR(epc_page) == -EBUSY)
+			vmret = VM_FAULT_NOPAGE;
+		goto err_out_unlock;
+	}
+
 	epc_page = sgx_alloc_epc_page(encl_page, false);
 	if (IS_ERR(epc_page)) {
 		if (PTR_ERR(epc_page) == -EBUSY)
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 9ea487469e4c..68c89d575abc 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -265,6 +265,10 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 
 	mutex_lock(&encl->lock);
 
+	/* Should not be possible */
+	if (WARN_ON(!(encl->secs.epc_page)))
+		goto out;
+
 	sgx_encl_ewb(epc_page, backing);
 	encl_page->epc_page = NULL;
 	encl->secs_child_cnt--;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v3 18/28] cgroup/misc: Fix an overflow
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (16 preceding siblings ...)
  2023-07-12 23:01 ` [PATCH v3 17/28] x86/sgx: fix a NULL pointer Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-17 13:15   ` Jarkko Sakkinen
  2023-07-12 23:01 ` [PATCH v3 19/28] cgroup/misc: Add per resource callbacks for CSS events Haitao Huang
                   ` (12 subsequent siblings)
  30 siblings, 1 reply; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Zefan Li, Johannes Weiner
  Cc: kai.huang, reinette.chatre, zhiquan1.li, kristen

Overflow may happen in misc_cg_try_charge() if new_usage exceeds
INT_MAX, for example, on platforms with large SGX EPC sizes.

Change the type of new_usage from int to long and check for overflow.
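
A minimal userspace sketch of the check, assuming a two's-complement
target where converting an out-of-range unsigned value to long wraps
negative. try_charge_one() is a hypothetical stand-in for one level of
the charge walk; unlike the kernel code it is not atomic and only
commits the new usage when the charge succeeds.

```c
#include <errno.h>
#include <limits.h>

/*
 * Hypothetical model of one level of the charge walk. Returns 0 and
 * commits the new usage on success, -EBUSY otherwise.
 */
static int try_charge_one(long *usage, unsigned long amount,
			  unsigned long max, unsigned long capacity)
{
	/* Add as unsigned so the sketch itself has no signed-overflow UB. */
	unsigned long sum = (unsigned long)*usage + amount;
	long new_usage = (long)sum;	/* assumed to wrap negative above LONG_MAX */

	if (new_usage < 0 || sum > max || sum > capacity)
		return -EBUSY;

	*usage = new_usage;
	return 0;
}
```

With usage already at LONG_MAX, charging one more unit is rejected on
such targets instead of recording a negative usage.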

Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
---
 kernel/cgroup/misc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index fe3e8a0eb7ed..ff9f900981a3 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -143,7 +143,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg,
 	struct misc_cg *i, *j;
 	int ret;
 	struct misc_res *res;
-	int new_usage;
+	long new_usage;
 
 	if (!(valid_type(type) && cg && READ_ONCE(misc_res_capacity[type])))
 		return -EINVAL;
@@ -153,10 +153,10 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg,
 
 	for (i = cg; i; i = parent_misc(i)) {
 		res = &i->res[type];
-
 		new_usage = atomic_long_add_return(amount, &res->usage);
 		if (new_usage > READ_ONCE(res->max) ||
-		    new_usage > READ_ONCE(misc_res_capacity[type])) {
+		    new_usage > READ_ONCE(misc_res_capacity[type]) ||
+		    new_usage < 0) {
 			ret = -EBUSY;
 			goto err_charge;
 		}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v3 19/28] cgroup/misc: Add per resource callbacks for CSS events
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (17 preceding siblings ...)
  2023-07-12 23:01 ` [PATCH v3 18/28] cgroup/misc: Fix an overflow Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-17 13:16   ` Jarkko Sakkinen
  2023-07-12 23:01 ` [PATCH v3 20/28] cgroup/misc: Add SGX EPC resource type and export APIs for SGX driver Haitao Huang
                   ` (11 subsequent siblings)
  30 siblings, 1 reply; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Zefan Li, Johannes Weiner
  Cc: kai.huang, reinette.chatre, Kristen Carlson Accardi, zhiquan1.li

From: Kristen Carlson Accardi <kristen@linux.intel.com>

Consumers of the misc cgroup controller might need to perform separate
actions for cgroup subsystem state (CSS) events: cgroup alloc and free.
In addition, writes to the max value may also need separate action. Add
the ability for downstream users to set up callbacks for these
operations, and call the corresponding per-resource-type callback when
appropriate.

This code will be utilized by the SGX driver in a future patch.
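
The dispatch idea can be sketched in userspace C as below. All names
here (struct cg, struct res, the on_* hooks, cg_set_max(),
cg_free_hooks(), count_max_write()) are hypothetical and only model
the per-resource hooks; they are not the kernel interfaces.

```c
#include <stddef.h>

#define RES_TYPES 2	/* hypothetical number of resource types */

struct cg;

/* Hypothetical per-resource ops, one optional hook per event. */
struct res {
	unsigned long max;
	int (*on_alloc)(struct cg *cg);
	void (*on_free)(struct cg *cg);
	void (*on_max_write)(struct cg *cg);
};

struct cg {
	struct res res[RES_TYPES];
};

/* Example hook: count how often a limit was written. */
static int max_write_calls;

static void count_max_write(struct cg *cg)
{
	(void)cg;
	max_write_calls++;
}

/* Store the new limit, then notify the resource owner if it has a hook. */
static void cg_set_max(struct cg *cg, int type, unsigned long max)
{
	cg->res[type].max = max;
	if (cg->res[type].on_max_write)
		cg->res[type].on_max_write(cg);
}

/* Run every registered free hook; resources without one are skipped. */
static void cg_free_hooks(struct cg *cg)
{
	for (int i = 0; i < RES_TYPES; i++)
		if (cg->res[i].on_free)
			cg->res[i].on_free(cg);
}
```

Testing the hook on the same object that is about to be called through
keeps a resource type without callbacks from causing a NULL call.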

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>

Changes from V2:
- Removed the released() callback
---
 include/linux/misc_cgroup.h |  5 +++++
 kernel/cgroup/misc.c        | 32 +++++++++++++++++++++++++++++---
 2 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index c238207d1615..9962b870d382 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -37,6 +37,11 @@ struct misc_res {
 	unsigned long max;
 	atomic_long_t usage;
 	atomic_long_t events;
+
+	/* per resource callback ops */
+	int (*misc_cg_alloc)(struct misc_cg *cg);
+	void (*misc_cg_free)(struct misc_cg *cg);
+	void (*misc_cg_max_write)(struct misc_cg *cg);
 };
 
 /**
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index ff9f900981a3..4736db3cd418 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -278,10 +278,13 @@ static ssize_t misc_cg_max_write(struct kernfs_open_file *of, char *buf,
 
 	cg = css_misc(of_css(of));
 
-	if (READ_ONCE(misc_res_capacity[type]))
+	if (READ_ONCE(misc_res_capacity[type])) {
 		WRITE_ONCE(cg->res[type].max, max);
-	else
+		if (cg->res[type].misc_cg_max_write)
+			cg->res[type].misc_cg_max_write(cg);
+	} else {
 		ret = -EINVAL;
+	}
 
 	return ret ? ret : nbytes;
 }
@@ -385,23 +388,39 @@ static struct cftype misc_cg_files[] = {
 static struct cgroup_subsys_state *
 misc_cg_alloc(struct cgroup_subsys_state *parent_css)
 {
+	struct misc_cg *parent_cg;
 	enum misc_res_type i;
 	struct misc_cg *cg;
+	int ret;
 
 	if (!parent_css) {
 		cg = &root_cg;
+		parent_cg = &root_cg;
 	} else {
 		cg = kzalloc(sizeof(*cg), GFP_KERNEL);
 		if (!cg)
 			return ERR_PTR(-ENOMEM);
+		parent_cg = css_misc(parent_css);
 	}
 
 	for (i = 0; i < MISC_CG_RES_TYPES; i++) {
 		WRITE_ONCE(cg->res[i].max, MAX_NUM);
 		atomic_long_set(&cg->res[i].usage, 0);
+		if (parent_cg->res[i].misc_cg_alloc) {
+			ret = parent_cg->res[i].misc_cg_alloc(cg);
+			if (ret)
+				goto alloc_err;
+		}
 	}
 
 	return &cg->css;
+
+alloc_err:
+	for (i = 0; i < MISC_CG_RES_TYPES; i++)
+		if (cg->res[i].misc_cg_free)
+			cg->res[i].misc_cg_free(cg);
+	kfree(cg);
+	return ERR_PTR(ret);
 }
 
 /**
@@ -412,7 +431,14 @@ misc_cg_alloc(struct cgroup_subsys_state *parent_css)
  */
 static void misc_cg_free(struct cgroup_subsys_state *css)
 {
-	kfree(css_misc(css));
+	struct misc_cg *cg = css_misc(css);
+	enum misc_res_type i;
+
+	for (i = 0; i < MISC_CG_RES_TYPES; i++)
+		if (cg->res[i].misc_cg_free)
+			cg->res[i].misc_cg_free(cg);
+
+	kfree(cg);
 }
 
 /* Cgroup controller callbacks */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v3 20/28] cgroup/misc: Add SGX EPC resource type and export APIs for SGX driver
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (18 preceding siblings ...)
  2023-07-12 23:01 ` [PATCH v3 19/28] cgroup/misc: Add per resource callbacks for CSS events Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-12 23:01 ` [PATCH v3 21/28] x86/sgx: Limit process EPC usage with misc cgroup controller Haitao Huang
                   ` (10 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Zefan Li, Johannes Weiner
  Cc: kai.huang, reinette.chatre, Kristen Carlson Accardi, zhiquan1.li

From: Kristen Carlson Accardi <kristen@linux.intel.com>

The SGX driver will need to get access to the root misc_cg object
to do iterative walks and also determine if a charge will be
towards the root cgroup or not.

To manage the SGX EPC memory via the misc controller, the SGX
driver will also need to be able to iterate over the misc cgroup
hierarchy.

Move parent_misc() into misc_cgroup.h as an inline function so that it
is available to SGX, rename it to misc_cg_parent(), and update
misc.c to use the new name.

Add per resource type private data so that SGX can store additional
per cgroup data with the misc_cg struct.

Allow SGX EPC memory to be a valid resource type for the misc
controller.
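
As an illustration of the kind of walk a parent accessor enables, here
is a userspace sketch that computes the smallest limit between a node
and the root. struct node, node_parent() and effective_max() are
hypothetical names for a flattened model with a single resource; they
are not the misc controller API.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical model of a cgroup node with one resource limit. */
struct node {
	struct node *parent;	/* NULL for the root */
	unsigned long max;
};

/* NULL-safe parent accessor, same shape as the helper described above. */
static struct node *node_parent(struct node *n)
{
	return n ? n->parent : NULL;
}

/* Effective limit: the smallest max from this node up to the root. */
static unsigned long effective_max(struct node *n)
{
	unsigned long m = ULONG_MAX;

	for (; n; n = node_parent(n))
		if (n->max < m)
			m = n->max;
	return m;
}
```

A child may be configured with a larger limit than one of its
ancestors, so the binding limit has to be found by walking up.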

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
---
 include/linux/misc_cgroup.h | 29 +++++++++++++++++++++++++++++
 kernel/cgroup/misc.c        | 25 ++++++++++++-------------
 2 files changed, 41 insertions(+), 13 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 9962b870d382..8bef9d92e36a 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -17,6 +17,10 @@ enum misc_res_type {
 	MISC_CG_RES_SEV,
 	/* AMD SEV-ES ASIDs resource */
 	MISC_CG_RES_SEV_ES,
+#endif
+#ifdef CONFIG_CGROUP_SGX_EPC
+	/* SGX EPC memory resource */
+	MISC_CG_RES_SGX_EPC,
 #endif
 	MISC_CG_RES_TYPES
 };
@@ -37,6 +41,7 @@ struct misc_res {
 	unsigned long max;
 	atomic_long_t usage;
 	atomic_long_t events;
+	void *priv;
 
 	/* per resource callback ops */
 	int (*misc_cg_alloc)(struct misc_cg *cg);
@@ -58,6 +63,7 @@ struct misc_cg {
 	struct misc_res res[MISC_CG_RES_TYPES];
 };
 
+struct misc_cg *misc_cg_root(void);
 unsigned long misc_cg_res_total_usage(enum misc_res_type type);
 int misc_cg_set_capacity(enum misc_res_type type, unsigned long capacity);
 int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg,
@@ -79,6 +85,20 @@ static inline struct misc_cg *css_misc(struct cgroup_subsys_state *css)
 	return css ? container_of(css, struct misc_cg, css) : NULL;
 }
 
+/**
+ * misc_cg_parent() - Get the parent of the passed misc cgroup.
+ * @cgroup: cgroup whose parent needs to be fetched.
+ *
+ * Context: Any context.
+ * Return:
+ * * struct misc_cg* - Parent of the @cgroup.
+ * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ */
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cgroup)
+{
+	return cgroup ? css_misc(cgroup->css.parent) : NULL;
+}
+
 /*
  * get_current_misc_cg() - Find and get the misc cgroup of the current task.
  *
@@ -103,6 +123,15 @@ static inline void put_misc_cg(struct misc_cg *cg)
 }
 
 #else /* !CONFIG_CGROUP_MISC */
+static inline struct misc_cg *misc_cg_root(void)
+{
+	return NULL;
+}
+
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cg)
+{
+	return NULL;
+}
 
 static inline unsigned long misc_cg_res_total_usage(enum misc_res_type type)
 {
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 4736db3cd418..ea18eae862a4 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -24,6 +24,10 @@ static const char *const misc_res_name[] = {
 	/* AMD SEV-ES ASIDs resource */
 	"sev_es",
 #endif
+#ifdef CONFIG_CGROUP_SGX_EPC
+	/* Intel SGX EPC memory bytes */
+	"sgx_epc",
+#endif
 };
 
 /* Root misc cgroup */
@@ -40,18 +44,13 @@ static struct misc_cg root_cg;
 static unsigned long misc_res_capacity[MISC_CG_RES_TYPES];
 
 /**
- * parent_misc() - Get the parent of the passed misc cgroup.
- * @cgroup: cgroup whose parent needs to be fetched.
- *
- * Context: Any context.
- * Return:
- * * struct misc_cg* - Parent of the @cgroup.
- * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ * misc_cg_root() - Return the root misc cgroup.
  */
-static struct misc_cg *parent_misc(struct misc_cg *cgroup)
+struct misc_cg *misc_cg_root(void)
 {
-	return cgroup ? css_misc(cgroup->css.parent) : NULL;
+	return &root_cg;
 }
+EXPORT_SYMBOL_GPL(misc_cg_root);
 
 /**
  * valid_type() - Check if @type is valid or not.
@@ -151,7 +150,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg,
 	if (!amount)
 		return 0;
 
-	for (i = cg; i; i = parent_misc(i)) {
+	for (i = cg; i; i = misc_cg_parent(i)) {
 		res = &i->res[type];
 		new_usage = atomic_long_add_return(amount, &res->usage);
 		if (new_usage > READ_ONCE(res->max) ||
@@ -164,12 +163,12 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg,
 	return 0;
 
 err_charge:
-	for (j = i; j; j = parent_misc(j)) {
+	for (j = i; j; j = misc_cg_parent(j)) {
 		atomic_long_inc(&j->res[type].events);
 		cgroup_file_notify(&j->events_file);
 	}
 
-	for (j = cg; j != i; j = parent_misc(j))
+	for (j = cg; j != i; j = misc_cg_parent(j))
 		misc_cg_cancel_charge(type, j, amount);
 	misc_cg_cancel_charge(type, i, amount);
 	return ret;
@@ -192,7 +191,7 @@ void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg,
 	if (!(amount && valid_type(type) && cg))
 		return;
 
-	for (i = cg; i; i = parent_misc(i))
+	for (i = cg; i; i = misc_cg_parent(i))
 		misc_cg_cancel_charge(type, i, amount);
 }
 EXPORT_SYMBOL_GPL(misc_cg_uncharge);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v3 21/28] x86/sgx: Limit process EPC usage with misc cgroup controller
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (19 preceding siblings ...)
  2023-07-12 23:01 ` [PATCH v3 20/28] cgroup/misc: Add SGX EPC resource type and export APIs for SGX driver Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-13  0:03   ` Randy Dunlap
  2023-08-17 15:12   ` Mikko Ylinen
  2023-07-12 23:01 ` [PATCH v3 22/28] Docs/x86/sgx: Add description for cgroup support Haitao Huang
                   ` (9 subsequent siblings)
  30 siblings, 2 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, Kristen Carlson Accardi, zhiquan1.li, seanjc

From: Kristen Carlson Accardi <kristen@linux.intel.com>

Implement support for cgroup control of SGX Enclave Page Cache (EPC)
memory using the misc cgroup controller. EPC memory is independent
from normal system memory, e.g. must be reserved at boot from RAM and
cannot be converted between EPC and normal memory while the system is
running. EPC is managed by the SGX subsystem and is not accounted by
the memory controller.

Much like normal system memory, EPC memory can be overcommitted via
virtual memory techniques and pages can be swapped out of the EPC to
their backing store (normal system memory, e.g. shmem).  The SGX EPC
subsystem is analogous to the memory subsystem and the SGX EPC controller
is in turn analogous to the memory controller; it implements limit and
protection models for EPC memory.

The misc controller provides a mechanism to set a hard limit of EPC
usage via the "sgx_epc" resource in "misc.max". The total EPC memory
available on the system is reported via the "sgx_epc" resource in
"misc.capacity".

This patch was modified from its original version to use the misc cgroup
controller instead of a custom controller.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>

V3:

1) Use the same maximum number of reclaiming candidate pages to be
processed, SGX_NR_TO_SCAN_MAX, for each reclaiming iteration in both
cgroup worker function and ksgxd. This fixes an overflow in the
backing store buffer with the same fixed size allocated on stack in
sgx_reclaim_epc_pages().

2) Initialize max for root EPC cgroup. Otherwise, all
misc_cg_try_charge() calls would fail as it checks for all limits of
ancestors all the way to the root node.

3) Start reclaiming whenever misc_cg_try_charge() fails. Removed all
re-checks for limits and current usage. For all intents and purposes,
when misc_cg_try_charge() fails, reclaiming is needed. This also corrects
an error of not reclaiming when the child limit is larger than that of
one of its ancestors.

4) Handle failure on charging to the root EPC cgroup. Failure on charging
to root means we are at or above capacity, so start reclaiming or return
OOM error.

5) Removed the custom cgroup tree walking iterator with epoch tracking
logic. Replaced it with just the plain css_for_each_descendant_pre
iterator. The custom iterator implemented a rather complex epoch scheme
that I believe was intended to prevent extra reclaiming from multiple
worker threads doing the same walk, but it turned out not to matter much
as each thread would only reclaim when usage is above the limit. Using
the plain css_for_each_descendant_pre iterator simplified the code a bit.

6) Do not reclaim synchronously in the misc_cg_max_write callback, which
would block the user. Instead, queue an async work item to run the
reclaiming loop.

7) Other minor refactorings:
- Remove unused params in epc_cgroup APIs
- centralize uncharge into sgx_free_epc_page()
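
The bounding described in 1) can be sketched as follows. The minimum
of 16 pages matches SGX_EPC_RECLAIM_MIN_PAGES in this patch; the
maximum of 32 is an assumed value for illustration only, and
clamp_reclaim_batch() is a hypothetical helper name.

```c
/* 16 matches the minimum used in this patch; 32 is an assumed maximum. */
#define RECLAIM_MIN_PAGES	16UL
#define NR_TO_SCAN_MAX		32UL

/* Bound one reclaim batch to [RECLAIM_MIN_PAGES, NR_TO_SCAN_MAX]. */
static unsigned long clamp_reclaim_batch(unsigned long nr_pages)
{
	if (nr_pages < RECLAIM_MIN_PAGES)
		nr_pages = RECLAIM_MIN_PAGES;
	if (nr_pages > NR_TO_SCAN_MAX)
		nr_pages = NR_TO_SCAN_MAX;
	return nr_pages;
}
```

Using one upper bound in every caller keeps a fixed-size on-stack
buffer sized for that bound from being overrun.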
---
 arch/x86/Kconfig                     |  13 +
 arch/x86/kernel/cpu/sgx/Makefile     |   1 +
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 406 +++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  60 ++++
 arch/x86/kernel/cpu/sgx/main.c       |  79 ++++--
 arch/x86/kernel/cpu/sgx/sgx.h        |  14 +-
 6 files changed, 552 insertions(+), 21 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 53bab123a8ee..8a7378159e9e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1952,6 +1952,19 @@ config X86_SGX
 
 	  If unsure, say N.
 
+config CGROUP_SGX_EPC
+	bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX"
+	depends on X86_SGX && CGROUP_MISC
+	help
+	  Provides control over the EPC footprint of tasks in a cgroup via
+	  the Miscellaneous cgroup controller.
+
+	  EPC is a subset of regular memory that is usable only by SGX
+	  enclaves and is very limited in quantity, e.g. less than 1%
+	  of total DRAM.
+
+	  Say N if unsure.
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI
diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 9c1656779b2a..12901a488da7 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -4,3 +4,4 @@ obj-y += \
 	ioctl.o \
 	main.o
 obj-$(CONFIG_X86_SGX_KVM)	+= virt.o
+obj-$(CONFIG_CGROUP_SGX_EPC)	       += epc_cgroup.o
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
new file mode 100644
index 000000000000..de0833e5606b
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -0,0 +1,406 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2022 Intel Corporation.
+
+#include <linux/atomic.h>
+#include <linux/kernel.h>
+#include <linux/ratelimit.h>
+#include <linux/sched/signal.h>
+#include <linux/slab.h>
+#include <linux/threads.h>
+
+#include "epc_cgroup.h"
+
+#define SGX_EPC_RECLAIM_MIN_PAGES		16UL
+#define SGX_EPC_RECLAIM_IGNORE_AGE_THRESHOLD	5
+#define SGX_EPC_RECLAIM_OOM_THRESHOLD		5
+
+static struct workqueue_struct *sgx_epc_cg_wq;
+static bool sgx_epc_cgroup_oom(struct sgx_epc_cgroup *root);
+
+struct sgx_epc_reclaim_control {
+	struct sgx_epc_cgroup *epc_cg;
+	int nr_fails;
+	bool ignore_age;
+};
+
+static inline unsigned long sgx_epc_cgroup_page_counter_read(struct sgx_epc_cgroup *epc_cg)
+{
+	return atomic_long_read(&epc_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / PAGE_SIZE;
+}
+
+static inline unsigned long sgx_epc_cgroup_max_pages(struct sgx_epc_cgroup *epc_cg)
+{
+	return READ_ONCE(epc_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE;
+}
+
+static inline unsigned long sgx_epc_cgroup_max_pages_to_root(struct sgx_epc_cgroup *epc_cg)
+{
+	struct misc_cg *i = epc_cg->cg;
+	unsigned long m = ULONG_MAX;
+
+	while (i) {
+		m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
+		i = misc_cg_parent(i);
+	}
+	return m / PAGE_SIZE;
+}
+
+static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
+{
+	if (cg)
+		return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
+
+	return NULL;
+}
+
+static inline bool sgx_epc_cgroup_disabled(void)
+{
+	return !cgroup_subsys_enabled(misc_cgrp_subsys);
+}
+
+/**
+ * sgx_epc_cgroup_lru_empty - check if a cgroup tree has no pages on its lrus
+ * @root:	root of the tree to check
+ *
+ * Return: %true if all cgroups under the specified root have empty LRU lists.
+ * Used to avoid livelocks due to a cgroup having a non-zero charge count but
+ * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
+ * because all pages in the cgroup are unreclaimable.
+ */
+bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root)
+{
+	struct cgroup_subsys_state *css_root = NULL;
+	struct cgroup_subsys_state *pos = NULL;
+	struct sgx_epc_cgroup *epc_cg = NULL;
+	bool ret = true;
+
+	/*
+	 * Caller ensures a css_root reference has been acquired
+	 */
+	css_root = root ? &root->cg->css : &(misc_cg_root()->css);
+
+	rcu_read_lock();
+	css_for_each_descendant_pre(pos, css_root) {
+		if (!css_tryget(pos))
+			break;
+
+		rcu_read_unlock();
+
+		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
+
+		spin_lock(&epc_cg->lru.lock);
+		ret = list_empty(&epc_cg->lru.reclaimable);
+		spin_unlock(&epc_cg->lru.lock);
+
+		rcu_read_lock();
+		css_put(pos);
+		if (!ret)
+			break;
+	}
+	rcu_read_unlock();
+	return ret;
+}
+
+/**
+ * sgx_epc_cgroup_isolate_pages - walk a cgroup tree and separate pages
+ * @root:	root of the tree to start walking
+ * @nr_to_scan: The number of pages that need to be isolated
+ * @dst:	Destination list to hold the isolated pages
+ *
+ * Walk the cgroup tree and isolate the pages in the hierarchy
+ * for reclaiming.
+ */
+void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
+				  size_t *nr_to_scan, struct list_head *dst)
+{
+	struct cgroup_subsys_state *css_root = NULL;
+	struct cgroup_subsys_state *pos = NULL;
+	struct sgx_epc_cgroup *epc_cg = NULL;
+
+	if (!*nr_to_scan)
+		return;
+
+	/* Caller ensures a css_root reference has been acquired */
+	css_root = root ? &root->cg->css : &(misc_cg_root()->css);
+
+	rcu_read_lock();
+	css_for_each_descendant_pre(pos, css_root) {
+		if (!css_tryget(pos))
+			break;
+		rcu_read_unlock();
+
+		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
+		sgx_isolate_epc_pages(&epc_cg->lru, nr_to_scan, dst);
+
+		rcu_read_lock();
+		css_put(pos);
+		if (!*nr_to_scan)
+			break;
+	}
+	rcu_read_unlock();
+}
+
+static int sgx_epc_cgroup_reclaim_pages(unsigned long nr_pages,
+					struct sgx_epc_reclaim_control *rc)
+{
+	/*
+	 * Ensure sgx_reclaim_epc_pages() is called with a minimum and maximum
+	 * number of pages.  Attempting to reclaim only a few pages will
+	 * often fail and is inefficient, while reclaiming a huge number
+	 * of pages can result in soft lockups due to holding various
+	 * locks for an extended duration.  This also bounds nr_pages so
+	 */
+	nr_pages = max(nr_pages, SGX_EPC_RECLAIM_MIN_PAGES);
+	nr_pages = min(nr_pages, SGX_NR_TO_SCAN_MAX);
+
+	return sgx_reclaim_epc_pages(nr_pages, rc->ignore_age, rc->epc_cg);
+}
+
+static int sgx_epc_cgroup_reclaim_failed(struct sgx_epc_reclaim_control *rc)
+{
+	if (sgx_epc_cgroup_lru_empty(rc->epc_cg))
+		return -ENOMEM;
+
+	++rc->nr_fails;
+	if (rc->nr_fails > SGX_EPC_RECLAIM_IGNORE_AGE_THRESHOLD)
+		rc->ignore_age = true;
+
+	return 0;
+}
+
+static inline
+void sgx_epc_reclaim_control_init(struct sgx_epc_reclaim_control *rc,
+				  struct sgx_epc_cgroup *epc_cg)
+{
+	rc->epc_cg = epc_cg;
+	rc->nr_fails = 0;
+	rc->ignore_age = false;
+}
+
+/*
+ * Scheduled by sgx_epc_cgroup_try_charge() to reclaim pages from the
+ * cgroup when the cgroup is at/near its maximum capacity
+ */
+static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
+{
+	struct sgx_epc_reclaim_control rc;
+	struct sgx_epc_cgroup *epc_cg;
+	unsigned long cur, max;
+
+	epc_cg = container_of(work, struct sgx_epc_cgroup, reclaim_work);
+
+	sgx_epc_reclaim_control_init(&rc, epc_cg);
+
+	for (;;) {
+		max = sgx_epc_cgroup_max_pages_to_root(epc_cg);
+
+		/*
+		 * Adjust the limit down by one page, the goal is to free up
+		 * pages for fault allocations, not to simply obey the limit.
+		 * Conditionally decrementing max also means the cur vs. max
+		 * check will correctly handle the case where both are zero.
+		 */
+		if (max)
+			max--;
+
+		/*
+		 * Unless the limit is extremely low, in which case forcing
+		 * reclaim will likely cause thrashing, force the cgroup to
+		 * reclaim at least once if it's operating *near* its maximum
+		 * limit by adjusting @max down by half the min reclaim size.
+		 * This work func is scheduled by sgx_epc_cgroup_try_charge
+		 * when it cannot directly reclaim due to being in an atomic
+		 * context, e.g. EPC allocation in a fault handler.  Waiting
+		 * to reclaim until the cgroup is actually at its limit is less
+		 * performant as it means the faulting task is effectively
+		 * blocked until a worker makes its way through the global work
+		 * queue.
+		 */
+		if (max > SGX_NR_TO_SCAN_MAX)
+			max -= (SGX_EPC_RECLAIM_MIN_PAGES / 2);
+
+		max = min(max, sgx_epc_total_pages);
+		cur = sgx_epc_cgroup_page_counter_read(epc_cg);
+		if (cur <= max)
+			break;
+		/* Nothing reclaimable */
+		if (sgx_epc_cgroup_lru_empty(epc_cg)) {
+			if (!sgx_epc_cgroup_oom(epc_cg))
+				break;
+
+			continue;
+		}
+
+		if (!sgx_epc_cgroup_reclaim_pages(cur - max, &rc)) {
+			if (sgx_epc_cgroup_reclaim_failed(&rc))
+				break;
+		}
+	}
+}
+
+static int __sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg,
+				       bool reclaim)
+{
+	struct sgx_epc_reclaim_control rc;
+	unsigned int nr_empty = 0;
+
+	sgx_epc_reclaim_control_init(&rc, epc_cg);
+
+	for (;;) {
+		if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
+					PAGE_SIZE))
+			break;
+
+		if (sgx_epc_cgroup_lru_empty(epc_cg))
+			return -ENOMEM;
+
+		if (signal_pending(current))
+			return -ERESTARTSYS;
+
+		if (!reclaim) {
+			queue_work(sgx_epc_cg_wq, &rc.epc_cg->reclaim_work);
+			return -EBUSY;
+		}
+
+		if (!sgx_epc_cgroup_reclaim_pages(1, &rc)) {
+			if (sgx_epc_cgroup_reclaim_failed(&rc)) {
+				if (++nr_empty > SGX_EPC_RECLAIM_OOM_THRESHOLD)
+					return -ENOMEM;
+				schedule();
+			}
+		}
+	}
+	if (epc_cg->cg != misc_cg_root())
+		css_get(&epc_cg->cg->css);
+
+	return 0;
+}
+
+/**
+ * sgx_epc_cgroup_try_charge - hierarchically try to charge a single EPC page
+ * @mm:			the mm_struct of the process to charge
+ * @reclaim:		whether or not synchronous reclaim is allowed
+ *
+ * Returns EPC cgroup or NULL on success, -errno on failure.
+ */
+struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim)
+{
+	struct sgx_epc_cgroup *epc_cg;
+	int ret;
+
+	if (sgx_epc_cgroup_disabled())
+		return NULL;
+
+	epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
+	ret = __sgx_epc_cgroup_try_charge(epc_cg, reclaim);
+	put_misc_cg(epc_cg->cg);
+
+	if (ret)
+		return ERR_PTR(ret);
+
+	return epc_cg;
+}
+
+/**
+ * sgx_epc_cgroup_uncharge - hierarchically uncharge EPC pages
+ * @epc_cg:	the charged epc cgroup
+ */
+void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
+{
+	if (sgx_epc_cgroup_disabled())
+		return;
+
+	misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
+
+	if (epc_cg->cg != misc_cg_root())
+		put_misc_cg(epc_cg->cg);
+}
+
+static bool sgx_epc_cgroup_oom(struct sgx_epc_cgroup *root)
+{
+	struct cgroup_subsys_state *css_root = NULL;
+	struct cgroup_subsys_state *pos = NULL;
+	struct sgx_epc_cgroup *epc_cg = NULL;
+	bool oom = false;
+
+	/* Caller ensures that a css_root reference has been acquired */
+	css_root = root ? &root->cg->css : &(misc_cg_root()->css);
+
+	rcu_read_lock();
+	css_for_each_descendant_pre(pos, css_root) {
+		/* skip dead ones */
+		if (!css_tryget(pos))
+			continue;
+
+		rcu_read_unlock();
+
+		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
+		oom = sgx_epc_oom(&epc_cg->lru);
+
+		rcu_read_lock();
+		css_put(pos);
+		if (oom)
+			break;
+	}
+	rcu_read_unlock();
+	return oom;
+}
+
+static void sgx_epc_cgroup_free(struct misc_cg *cg)
+{
+	struct sgx_epc_cgroup *epc_cg;
+
+	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
+	cancel_work_sync(&epc_cg->reclaim_work);
+	kfree(epc_cg);
+}
+
+static void sgx_epc_cgroup_max_write(struct misc_cg *cg)
+{
+	struct sgx_epc_reclaim_control rc;
+	struct sgx_epc_cgroup *epc_cg;
+
+	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
+
+	sgx_epc_reclaim_control_init(&rc, epc_cg);
+	/* Let the reclaimer do the work so the user is not blocked */
+	queue_work(sgx_epc_cg_wq, &rc.epc_cg->reclaim_work);
+}
+
+static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
+{
+	struct sgx_epc_cgroup *epc_cg;
+
+	epc_cg = kzalloc(sizeof(*epc_cg), GFP_KERNEL);
+	if (!epc_cg)
+		return -ENOMEM;
+
+	sgx_lru_init(&epc_cg->lru);
+	INIT_WORK(&epc_cg->reclaim_work, sgx_epc_cgroup_reclaim_work_func);
+	cg->res[MISC_CG_RES_SGX_EPC].misc_cg_alloc = sgx_epc_cgroup_alloc;
+	cg->res[MISC_CG_RES_SGX_EPC].misc_cg_free = sgx_epc_cgroup_free;
+	cg->res[MISC_CG_RES_SGX_EPC].misc_cg_max_write = sgx_epc_cgroup_max_write;
+	cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
+	epc_cg->cg = cg;
+	return 0;
+}
+
+static int __init sgx_epc_cgroup_init(void)
+{
+	struct misc_cg *cg;
+
+	if (!boot_cpu_has(X86_FEATURE_SGX))
+		return 0;
+
+	sgx_epc_cg_wq = alloc_workqueue("sgx_epc_cg_wq",
+					WQ_UNBOUND | WQ_FREEZABLE,
+					WQ_UNBOUND_MAX_ACTIVE);
+	BUG_ON(!sgx_epc_cg_wq);
+
+	cg = misc_cg_root();
+	BUG_ON(!cg);
+	WRITE_ONCE(cg->res[MISC_CG_RES_SGX_EPC].max, ULONG_MAX);
+	atomic_long_set(&cg->res[MISC_CG_RES_SGX_EPC].usage, 0UL);
+	return sgx_epc_cgroup_alloc(cg);
+}
+subsys_initcall(sgx_epc_cgroup_init);
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
new file mode 100644
index 000000000000..03ac4dcea82b
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2022 Intel Corporation. */
+#ifndef _INTEL_SGX_EPC_CGROUP_H_
+#define _INTEL_SGX_EPC_CGROUP_H_
+
+#include <asm/sgx.h>
+#include <linux/cgroup.h>
+#include <linux/list.h>
+#include <linux/misc_cgroup.h>
+#include <linux/page_counter.h>
+#include <linux/workqueue.h>
+
+#include "sgx.h"
+
+#ifndef CONFIG_CGROUP_SGX_EPC
+#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
+struct sgx_epc_cgroup;
+
+static inline struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim)
+{
+	return NULL;
+}
+
+static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }
+
+static inline void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
+						size_t *nr_to_scan,
+						struct list_head *dst) { }
+
+static inline struct sgx_epc_lru_lists *epc_cg_lru(struct sgx_epc_cgroup *epc_cg)
+{
+	return NULL;
+}
+
+static inline bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root)
+{
+	return true;
+}
+#else
+struct sgx_epc_cgroup {
+	struct misc_cg *cg;
+	struct sgx_epc_lru_lists	lru;
+	struct work_struct	reclaim_work;
+	atomic_long_t		epoch;
+};
+
+struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim);
+void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
+bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root);
+void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
+				  size_t *nr_to_scan, struct list_head *dst);
+static inline struct sgx_epc_lru_lists *epc_cg_lru(struct sgx_epc_cgroup *epc_cg)
+{
+	if (epc_cg)
+		return &epc_cg->lru;
+	return NULL;
+}
+#endif
+
+#endif /* _INTEL_SGX_EPC_CGROUP_H_ */
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 68c89d575abc..1e5984b881a2 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -6,6 +6,7 @@
 #include <linux/highmem.h>
 #include <linux/kthread.h>
 #include <linux/miscdevice.h>
+#include <linux/misc_cgroup.h>
 #include <linux/node.h>
 #include <linux/pagemap.h>
 #include <linux/ratelimit.h>
@@ -17,11 +18,9 @@
 #include "driver.h"
 #include "encl.h"
 #include "encls.h"
-/**
- * Maximum number of pages to scan for reclaiming.
- */
-#define SGX_NR_TO_SCAN_MAX	32
+#include "epc_cgroup.h"
 
+unsigned long sgx_epc_total_pages;
 struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
 static int sgx_nr_epc_sections;
 static struct task_struct *ksgxd_tsk;
@@ -36,9 +35,20 @@ static struct sgx_epc_lru_lists sgx_global_lru;
 
 static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct sgx_epc_page *epc_page)
 {
+	if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
+		return epc_cg_lru(epc_page->epc_cg);
+
 	return &sgx_global_lru;
 }
 
+static inline bool sgx_can_reclaim(void)
+{
+	if (!IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
+		return !list_empty(&sgx_global_lru.reclaimable);
+
+	return !sgx_epc_cgroup_lru_empty(NULL);
+}
+
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
 /* Nodes with one or more EPC sections. */
@@ -298,14 +308,14 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
  * @nr_to_scan:	Number of pages to scan for reclaim
  * @dst:	Destination list to hold the isolated pages
  */
-void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
+void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t *nr_to_scan,
 			   struct list_head *dst)
 {
 	struct sgx_encl_page *encl_page;
 	struct sgx_epc_page *epc_page;
 
 	spin_lock(&lru->lock);
-	for (; nr_to_scan > 0; --nr_to_scan) {
+	for (; *nr_to_scan > 0; --(*nr_to_scan)) {
 		epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
 		if (!epc_page)
 			break;
@@ -330,9 +340,10 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
 }
 
 /**
  * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
  * @nr_to_scan:		 Number of EPC pages to scan for reclaim
  * @ignore_age:		 Reclaim a page even if it is young
+ * @epc_cg:		 EPC cgroup from which to reclaim
  *
  * Take a fixed number of pages from the head of the active page pool and
  * reclaim them to the enclave's private shmem files. Skip the pages, which have
@@ -346,7 +357,8 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
  * problematic as it would increase the lock contention too much, which would
  * halt forward progress.
  */
-size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
+size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age,
+			     struct sgx_epc_cgroup *epc_cg)
 {
 	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
 	struct sgx_epc_page *epc_page, *tmp;
@@ -357,7 +369,15 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
 	size_t ret;
 	size_t i;
 
-	sgx_isolate_epc_pages(&sgx_global_lru, nr_to_scan, &iso);
+	/*
+	 * If a specific cgroup is not being targeted, take from the global
+	 * list first, even when cgroups are enabled.  If there are
+	 * pages on the global LRU then they should get reclaimed asap.
+	 */
+	if (!IS_ENABLED(CONFIG_CGROUP_SGX_EPC) || !epc_cg)
+		sgx_isolate_epc_pages(&sgx_global_lru, &nr_to_scan, &iso);
+
+	sgx_epc_cgroup_isolate_pages(epc_cg, &nr_to_scan, &iso);
 
 	if (list_empty(&iso))
 		return 0;
@@ -410,11 +430,6 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
 	return i;
 }
 
-static bool sgx_can_reclaim(void)
-{
-	return !list_empty(&sgx_global_lru.reclaimable);
-}
-
 static bool sgx_should_reclaim(unsigned long watermark)
 {
 	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
@@ -429,7 +444,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
 void sgx_reclaim_direct(void)
 {
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
-		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
+		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
 }
 
 static int ksgxd(void *p)
@@ -452,7 +467,7 @@ static int ksgxd(void *p)
 				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
 
 		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
-			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
+			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
 
 		cond_resched();
 	}
@@ -606,6 +621,11 @@ int sgx_drop_epc_page(struct sgx_epc_page *page)
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 {
 	struct sgx_epc_page *page;
+	struct sgx_epc_cgroup *epc_cg;
+
+	epc_cg = sgx_epc_cgroup_try_charge(reclaim);
+	if (IS_ERR(epc_cg))
+		return ERR_CAST(epc_cg);
 
 	for ( ; ; ) {
 		page = __sgx_alloc_epc_page();
@@ -614,8 +634,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		if (!sgx_can_reclaim())
-			return ERR_PTR(-ENOMEM);
+		if (!sgx_can_reclaim()) {
+			page = ERR_PTR(-ENOMEM);
+			break;
+		}
 
 		if (!reclaim) {
 			page = ERR_PTR(-EBUSY);
@@ -627,10 +649,17 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 			break;
 		}
 
-		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
+		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
 		cond_resched();
 	}
 
+	if (!IS_ERR(page)) {
+		WARN_ON_ONCE(page->epc_cg);
+		page->epc_cg = epc_cg;
+	} else {
+		sgx_epc_cgroup_uncharge(epc_cg);
+	}
+
 	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
 		wake_up(&ksgxd_waitq);
 
@@ -653,6 +682,11 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 
 	WARN_ON_ONCE(page->flags & (SGX_EPC_PAGE_STATE_MASK));
 
+	if (page->epc_cg) {
+		sgx_epc_cgroup_uncharge(page->epc_cg);
+		page->epc_cg = NULL;
+	}
+
 	spin_lock(&node->lock);
 
 	page->encl_page = NULL;
@@ -663,6 +697,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 	page->flags = SGX_EPC_PAGE_FREE;
 
 	spin_unlock(&node->lock);
+
 	atomic_long_inc(&sgx_nr_free_pages);
 }
 
@@ -832,6 +867,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 		section->pages[i].flags = 0;
 		section->pages[i].encl_page = NULL;
 		section->pages[i].poison = 0;
+		section->pages[i].epc_cg = NULL;
 		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
 	}
 
@@ -976,6 +1012,7 @@ static void __init arch_update_sysfs_visibility(int nid) {}
 static bool __init sgx_page_cache_init(void)
 {
 	u32 eax, ebx, ecx, edx, type;
+	u64 capacity = 0;
 	u64 pa, size;
 	int nid;
 	int i;
@@ -1026,6 +1063,7 @@ static bool __init sgx_page_cache_init(void)
 
 		sgx_epc_sections[i].node =  &sgx_numa_nodes[nid];
 		sgx_numa_nodes[nid].size += size;
+		capacity += size;
 
 		sgx_nr_epc_sections++;
 	}
@@ -1035,6 +1073,9 @@ static bool __init sgx_page_cache_init(void)
 		return false;
 	}
 
+	misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
+	sgx_epc_total_pages = capacity >> PAGE_SHIFT;
+
 	return true;
 }
 
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index c6b3c90db0fa..36217032433b 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -19,6 +19,11 @@
 
 #define SGX_MAX_EPC_SECTIONS		8
 #define SGX_EEXTEND_BLOCK_SIZE		256
+
+/*
+ * Maximum number of pages to scan for reclaiming.
+ */
+#define SGX_NR_TO_SCAN_MAX		32UL
 #define SGX_NR_TO_SCAN			16
 #define SGX_NR_LOW_PAGES		32
 #define SGX_NR_HIGH_PAGES		64
@@ -70,6 +75,8 @@ enum sgx_epc_page_state {
 /* flag for pages owned by a sgx_encl struct */
 #define SGX_EPC_OWNER_ENCL		BIT(4)
 
+struct sgx_epc_cgroup;
+
 struct sgx_epc_page {
 	unsigned int section;
 	u16 flags;
@@ -79,6 +86,7 @@ struct sgx_epc_page {
 		struct sgx_encl *encl;
 	};
 	struct list_head list;
+	struct sgx_epc_cgroup *epc_cg;
 };
 
 static inline void sgx_epc_page_reset_state(struct sgx_epc_page *page)
@@ -127,6 +135,7 @@ struct sgx_epc_section {
 	struct sgx_numa_node *node;
 };
 
+extern unsigned long sgx_epc_total_pages;
 extern struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
 
 static inline unsigned long sgx_get_epc_phys_addr(struct sgx_epc_page *page)
@@ -175,8 +184,9 @@ void sgx_reclaim_direct(void);
 void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
 int sgx_drop_epc_page(struct sgx_epc_page *page);
 struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
-size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age);
-void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lrus, size_t nr_to_scan,
+size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age,
+			     struct sgx_epc_cgroup *epc_cg);
+void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lrus, size_t *nr_to_scan,
 			   struct list_head *dst);
 bool sgx_epc_oom(struct sgx_epc_lru_lists *lrus);
 
-- 
2.25.1


* [PATCH v3 22/28] Docs/x86/sgx: Add description for cgroup support
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (20 preceding siblings ...)
  2023-07-12 23:01 ` [PATCH v3 21/28] x86/sgx: Limit process EPC usage with misc cgroup controller Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-13  0:10   ` Randy Dunlap
                     ` (2 more replies)
  2023-07-12 23:01 ` [PATCH v3 23/28] selftests/sgx: Retry the ioctl()'s returned with EAGAIN Haitao Huang
                   ` (8 subsequent siblings)
  30 siblings, 3 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Jonathan Corbet
  Cc: kai.huang, reinette.chatre, Kristen Carlson Accardi, zhiquan1.li,
	seanjc, bagasdotme, linux-doc, zhanb, anakrish, mikko.ylinen

From: Kristen Carlson Accardi <kristen@linux.intel.com>

Add initial documentation of how to regulate the distribution of
SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup
controller.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
---
 Documentation/arch/x86/sgx.rst | 77 ++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)
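
Reviewer's note, not part of the patch: the documentation below states that
limits are in bytes and that a value which is not PAGE_SIZE aligned is rounded
down to the closest PAGE_SIZE multiple. A minimal sketch of that rule follows.
The helper names epc_limit_round_down() and epc_limit_format() are
hypothetical, the page size is assumed to be a power of two, and the
"sgx_epc <bytes>" line assumes the key/value layout of a flat-keyed file.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: round a byte count down to a multiple of page_size.
 * Assumes page_size is a non-zero power of two (e.g. 4096).
 */
static uint64_t epc_limit_round_down(uint64_t bytes, uint64_t page_size)
{
	return bytes & ~(page_size - 1);
}

/*
 * Hypothetical helper: build the line a user would write to misc.max for
 * the sgx_epc resource, using the value the controller would effectively
 * apply.  Returns the snprintf() result.
 */
static int epc_limit_format(char *buf, size_t len, uint64_t bytes,
			    uint64_t page_size)
{
	return snprintf(buf, len, "sgx_epc %" PRIu64 "\n",
			epc_limit_round_down(bytes, page_size));
}
```

With a 4096-byte page, a requested limit of 10000 bytes would take effect as
8192 bytes under this rule.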

diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst
index 2bcbffacbed5..f6ca5594dcf2 100644
--- a/Documentation/arch/x86/sgx.rst
+++ b/Documentation/arch/x86/sgx.rst
@@ -300,3 +300,80 @@ to expected failures and handle them as follows:
    first call.  It indicates a bug in the kernel or the userspace client
    if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
    a return code other than 0.
+
+
+Cgroup Support
+==============
+
+The "sgx_epc" resource within the Miscellaneous cgroup controller regulates
+distribution of SGX EPC memory, which is a subset of system RAM that
+is used to provide SGX-enabled applications with protected memory,
+and is otherwise inaccessible, i.e. shows up as reserved in
+/proc/iomem and cannot be read/written outside of an SGX enclave.
+
+Although current systems implement EPC by stealing memory from RAM,
+for all intents and purposes the EPC is independent from normal system
+memory, e.g. must be reserved at boot from RAM and cannot be converted
+between EPC and normal memory while the system is running.  The EPC is
+managed by the SGX subsystem and is not accounted by the memory
+controller.  Note that this is true only for EPC memory itself, i.e.
+normal memory allocations related to SGX and EPC memory, e.g. the
+backing memory for evicted EPC pages, are accounted, limited and
+protected by the memory controller.
+
+Much like normal system memory, EPC memory can be overcommitted via
+virtual memory techniques and pages can be swapped out of the EPC
+to their backing store (normal system memory allocated via shmem).
+The SGX EPC subsystem is analogous to the memory subsystem, and
+it implements limit and protection models for EPC memory.
+
+SGX EPC Interface Files
+-----------------------
+
+For a generic description of the Miscellaneous controller interface
+files, please see Documentation/admin-guide/cgroup-v2.rst.
+
+All SGX EPC memory amounts are in bytes unless explicitly stated
+otherwise.  If a value which is not PAGE_SIZE aligned is written,
+the actual value used by the controller will be rounded down to
+the closest PAGE_SIZE multiple.
+
+  misc.capacity
+        A read-only flat-keyed file shown only in the root cgroup.
+        The sgx_epc resource will show the total amount of EPC
+        memory available on the platform.
+
+  misc.current
+        A read-only flat-keyed file shown in the non-root cgroups.
+        The sgx_epc resource will show the current active EPC memory
+        usage of the cgroup and its descendants. EPC pages that are
+        swapped out to backing RAM are not included in the current count.
+
+  misc.max
+        A read-write flat-keyed file which exists on non-root
+        cgroups. The sgx_epc resource will show the EPC usage
+        hard limit. The default is "max".
+
+        If a cgroup's EPC usage reaches this limit, EPC allocations,
+        e.g. for page fault handling, will be blocked until EPC can
+        be reclaimed from the cgroup.  If EPC cannot be reclaimed in
+        a timely manner, reclaim will be forced, e.g. by ignoring LRU.
+
+  misc.events
+        A read-only flat-keyed file which exists on non-root cgroups.
+        It reports the event counters listed below.  A value change
+        in this file generates a file modified event.
+
+          max
+                The number of times the cgroup has triggered a reclaim
+                due to its EPC usage approaching (or exceeding) its max
+                EPC boundary.
+
+Migration
+---------
+
+Once an EPC page is charged to a cgroup (during allocation), it
+remains charged to the original cgroup until the page is released
+or reclaimed.  Migrating a process to a different cgroup doesn't
+move the EPC charges that it incurred while in the previous cgroup
+to its new cgroup.
-- 
2.25.1


* [PATCH v3 23/28] selftests/sgx: Retry the ioctl()'s returned with EAGAIN
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (21 preceding siblings ...)
  2023-07-12 23:01 ` [PATCH v3 22/28] Docs/x86/sgx: Add description for cgroup support Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-12 23:01 ` [PATCH v3 24/28] selftests/sgx: Move ENCL_HEAP_SIZE_DEFAULT to main.c Haitao Huang
                   ` (7 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, linux-sgx; +Cc: kai.huang, reinette.chatre

For EMODT and EREMOVE ioctl()s with a large range, the kernel may not
finish in one shot. In that case it returns the EAGAIN error code together
with the count of bytes of EPC pages for which the operation completed
successfully.

Change the unclobbered_vdso_oversubscribed_remove test
to rerun the ioctl()'s in a loop, updating offset and length
using the byte count returned in each iteration.

Fixes: 6507cce561b4 ("selftests/sgx: Page removal stress test")
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
---
 tools/testing/selftests/sgx/main.c | 42 ++++++++++++++++++++++++------
 1 file changed, 34 insertions(+), 8 deletions(-)
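
Reviewer's note, not part of the patch: the retry pattern described in the
commit message can be sketched independently of the SGX ioctl()s. Everything
named here (struct range_req, range_op_t, run_until_done(), mock_op()) is
hypothetical and only stands in for the offset/length/count fields and the
ioctl() call used by the selftest. The sketch assumes the operation never
reports more bytes in count than were requested in length.

```c
#include <errno.h>

/* Hypothetical stand-in for the offset/length/count part of the ioctl argument. */
struct range_req {
	unsigned long offset;	/* start of the remaining range */
	unsigned long length;	/* bytes still to process */
	unsigned long count;	/* bytes completed by the most recent call */
};

/* ioctl()-like operation: returns 0 on success, -1 with errno set on failure. */
typedef int (*range_op_t)(struct range_req *req, void *ctx);

/*
 * Re-issue op while it fails with EAGAIN, advancing offset/length by the
 * bytes already completed.  *total accumulates the bytes done over all calls.
 */
static int run_until_done(range_op_t op, struct range_req *req, void *ctx,
			  unsigned long *total)
{
	int ret;

	*total = 0;
	for (;;) {
		req->count = 0;
		ret = op(req, ctx);
		*total += req->count;

		/* Done, or failed for a reason other than "try again". */
		if (ret != -1 || errno != EAGAIN)
			return ret;

		req->offset += req->count;
		req->length -= req->count;
		if (req->length == 0)
			return ret;
	}
}

/* Mock operation for illustration: completes at most *ctx bytes per call. */
static int mock_op(struct range_req *req, void *ctx)
{
	unsigned long chunk = *(unsigned long *)ctx;

	if (req->length <= chunk) {
		req->count = req->length;
		return 0;
	}
	req->count = chunk;
	errno = EAGAIN;
	return -1;
}
```

As in the selftest change, the caller compares the accumulated total with the
size of the original range rather than the count from the final call alone.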

diff --git a/tools/testing/selftests/sgx/main.c b/tools/testing/selftests/sgx/main.c
index 9820b3809c69..59cca806eda1 100644
--- a/tools/testing/selftests/sgx/main.c
+++ b/tools/testing/selftests/sgx/main.c
@@ -390,6 +390,7 @@ TEST_F_TIMEOUT(enclave, unclobbered_vdso_oversubscribed_remove, 900)
 	struct encl_segment *heap;
 	unsigned long total_mem;
 	int ret, errno_save;
+	unsigned long count;
 	unsigned long addr;
 	unsigned long i;
 
@@ -453,16 +454,30 @@ TEST_F_TIMEOUT(enclave, unclobbered_vdso_oversubscribed_remove, 900)
 	modt_ioc.offset = heap->offset;
 	modt_ioc.length = heap->size;
 	modt_ioc.page_type = SGX_PAGE_TYPE_TRIM;
-
+	count = 0;
 	TH_LOG("Changing type of %zd bytes to trimmed may take a while ...",
 	       heap->size);
-	ret = ioctl(self->encl.fd, SGX_IOC_ENCLAVE_MODIFY_TYPES, &modt_ioc);
-	errno_save = ret == -1 ? errno : 0;
+	do {
+		ret = ioctl(self->encl.fd, SGX_IOC_ENCLAVE_MODIFY_TYPES, &modt_ioc);
+
+		errno_save = ret == -1 ? errno : 0;
+		if (errno_save != EAGAIN)
+			break;
+
+		EXPECT_EQ(modt_ioc.result, 0);
+
+		count += modt_ioc.count;
+		modt_ioc.offset += modt_ioc.count;
+		modt_ioc.length -= modt_ioc.count;
+		modt_ioc.result = 0;
+		modt_ioc.count = 0;
+	} while (modt_ioc.length != 0);
 
 	EXPECT_EQ(ret, 0);
 	EXPECT_EQ(errno_save, 0);
 	EXPECT_EQ(modt_ioc.result, 0);
-	EXPECT_EQ(modt_ioc.count, heap->size);
+	count += modt_ioc.count;
+	EXPECT_EQ(count, heap->size);
 
 	/* EACCEPT all removed pages. */
 	addr = self->encl.encl_base + heap->offset;
@@ -490,15 +505,26 @@ TEST_F_TIMEOUT(enclave, unclobbered_vdso_oversubscribed_remove, 900)
 
 	remove_ioc.offset = heap->offset;
 	remove_ioc.length = heap->size;
-
+	count = 0;
 	TH_LOG("Removing %zd bytes from enclave may take a while ...",
 	       heap->size);
-	ret = ioctl(self->encl.fd, SGX_IOC_ENCLAVE_REMOVE_PAGES, &remove_ioc);
-	errno_save = ret == -1 ? errno : 0;
+	do {
+		ret = ioctl(self->encl.fd, SGX_IOC_ENCLAVE_REMOVE_PAGES, &remove_ioc);
+
+		errno_save = ret == -1 ? errno : 0;
+		if (errno_save != EAGAIN)
+			break;
+
+		count += remove_ioc.count;
+		remove_ioc.offset += remove_ioc.count;
+		remove_ioc.length -= remove_ioc.count;
+		remove_ioc.count = 0;
+	} while (remove_ioc.length != 0);
 
 	EXPECT_EQ(ret, 0);
 	EXPECT_EQ(errno_save, 0);
-	EXPECT_EQ(remove_ioc.count, heap->size);
+	count += remove_ioc.count;
+	EXPECT_EQ(count, heap->size);
 }
 
 TEST_F(enclave, clobbered_vdso)
-- 
2.25.1


* [PATCH v3 24/28] selftests/sgx: Move ENCL_HEAP_SIZE_DEFAULT to main.c
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (22 preceding siblings ...)
  2023-07-12 23:01 ` [PATCH v3 23/28] selftests/sgx: Retry the ioctl()'s returned with EAGAIN Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-12 23:01 ` [PATCH v3 25/28] selftests/sgx: Use encl->encl_size in sigstruct.c Haitao Huang
                   ` (6 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, linux-sgx; +Cc: kai.huang, reinette.chatre

From: Jarkko Sakkinen <jarkko@kernel.org>

Move ENCL_HEAP_SIZE_DEFAULT to main.c because all the other constants
are also there, and it is only used locally there.

Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
---
 tools/testing/selftests/sgx/main.c | 1 +
 tools/testing/selftests/sgx/main.h | 2 --
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/tools/testing/selftests/sgx/main.c b/tools/testing/selftests/sgx/main.c
index 59cca806eda1..a1850e139c99 100644
--- a/tools/testing/selftests/sgx/main.c
+++ b/tools/testing/selftests/sgx/main.c
@@ -21,6 +21,7 @@
 #include "../kselftest_harness.h"
 #include "main.h"
 
+static const size_t ENCL_HEAP_SIZE_DEFAULT = PAGE_SIZE;
 static const uint64_t MAGIC = 0x1122334455667788ULL;
 static const uint64_t MAGIC2 = 0x8877665544332211ULL;
 vdso_sgx_enter_enclave_t vdso_sgx_enter_enclave;
diff --git a/tools/testing/selftests/sgx/main.h b/tools/testing/selftests/sgx/main.h
index fc585be97e2f..82b33f8db048 100644
--- a/tools/testing/selftests/sgx/main.h
+++ b/tools/testing/selftests/sgx/main.h
@@ -6,8 +6,6 @@
 #ifndef MAIN_H
 #define MAIN_H
 
-#define ENCL_HEAP_SIZE_DEFAULT	4096
-
 struct encl_segment {
 	void *src;
 	off_t offset;
-- 
2.25.1


* [PATCH v3 25/28] selftests/sgx: Use encl->encl_size in sigstruct.c
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (23 preceding siblings ...)
  2023-07-12 23:01 ` [PATCH v3 24/28] selftests/sgx: Move ENCL_HEAP_SIZE_DEFAULT to main.c Haitao Huang
@ 2023-07-12 23:01 ` Haitao Huang
  2023-07-12 23:02 ` [PATCH v3 26/28] selftests/sgx: Include the dynamic heap size to the ELRANGE calculation Haitao Huang
                   ` (5 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, linux-sgx; +Cc: kai.huang, reinette.chatre

From: Jarkko Sakkinen <jarkko@kernel.org>

The final enclave address range (referred to as ELRANGE in the Intel SDM)
calculation is a remnant of the signing tool being a separate command-line
utility, with the sigstruct being produced during compilation. Given that
nowadays the sigstruct is calculated on the fly, use the already calculated
encl->encl_size instead, in order to remove duplicate code.

Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
 tools/testing/selftests/sgx/load.c      | 5 +++--
 tools/testing/selftests/sgx/main.h      | 1 -
 tools/testing/selftests/sgx/sigstruct.c | 8 ++------
 3 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/tools/testing/selftests/sgx/load.c b/tools/testing/selftests/sgx/load.c
index 94bdeac1cf04..3b4e2422fb09 100644
--- a/tools/testing/selftests/sgx/load.c
+++ b/tools/testing/selftests/sgx/load.c
@@ -174,6 +174,7 @@ uint64_t encl_get_entry(struct encl *encl, const char *symbol)
 bool encl_load(const char *path, struct encl *encl, unsigned long heap_size)
 {
 	const char device_path[] = "/dev/sgx_enclave";
+	unsigned long contents_size;
 	struct encl_segment *seg;
 	Elf64_Phdr *phdr_tbl;
 	off_t src_offset;
@@ -298,9 +299,9 @@ bool encl_load(const char *path, struct encl *encl, unsigned long heap_size)
 	if (seg->src == MAP_FAILED)
 		goto err;
 
-	encl->src_size = encl->segment_tbl[j].offset + encl->segment_tbl[j].size;
+	contents_size = encl->segment_tbl[j].offset + encl->segment_tbl[j].size;
 
-	for (encl->encl_size = 4096; encl->encl_size < encl->src_size; )
+	for (encl->encl_size = 4096; encl->encl_size < contents_size; )
 		encl->encl_size <<= 1;
 
 	return true;
diff --git a/tools/testing/selftests/sgx/main.h b/tools/testing/selftests/sgx/main.h
index 82b33f8db048..9c1bc0d9b43c 100644
--- a/tools/testing/selftests/sgx/main.h
+++ b/tools/testing/selftests/sgx/main.h
@@ -20,7 +20,6 @@ struct encl {
 	void *bin;
 	off_t bin_size;
 	void *src;
-	size_t src_size;
 	size_t encl_size;
 	off_t encl_base;
 	unsigned int nr_segments;
diff --git a/tools/testing/selftests/sgx/sigstruct.c b/tools/testing/selftests/sgx/sigstruct.c
index a07896a46364..9a40c7966eda 100644
--- a/tools/testing/selftests/sgx/sigstruct.c
+++ b/tools/testing/selftests/sgx/sigstruct.c
@@ -218,13 +218,9 @@ struct mrecreate {
 } __attribute__((__packed__));
 
 
-static bool mrenclave_ecreate(EVP_MD_CTX *ctx, uint64_t blob_size)
+static bool mrenclave_ecreate(EVP_MD_CTX *ctx, uint64_t encl_size)
 {
 	struct mrecreate mrecreate;
-	uint64_t encl_size;
-
-	for (encl_size = 0x1000; encl_size < blob_size; )
-		encl_size <<= 1;
 
 	memset(&mrecreate, 0, sizeof(mrecreate));
 	mrecreate.tag = MRECREATE;
@@ -349,7 +345,7 @@ bool encl_measure(struct encl *encl)
 	if (!ctx)
 		goto err;
 
-	if (!mrenclave_ecreate(ctx, encl->src_size))
+	if (!mrenclave_ecreate(ctx, encl->encl_size))
 		goto err;
 
 	for (i = 0; i < encl->nr_segments; i++) {
-- 
2.25.1


* [PATCH v3 26/28] selftests/sgx: Include the dynamic heap size to the ELRANGE calculation
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (24 preceding siblings ...)
  2023-07-12 23:01 ` [PATCH v3 25/28] selftests/sgx: Use encl->encl_size in sigstruct.c Haitao Huang
@ 2023-07-12 23:02 ` Haitao Huang
  2023-07-12 23:02 ` [PATCH v3 27/28] selftests/sgx: Add SGX selftest augment_via_eaccept_long Haitao Huang
                   ` (4 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:02 UTC (permalink / raw)
  To: jarkko, dave.hansen, linux-sgx; +Cc: kai.huang, reinette.chatre

From: Jarkko Sakkinen <jarkko@kernel.org>

When calculating ELRANGE, i.e. the address range defined for an enclave
and represented by encl->encl_size, dynamic memory should also be taken
into account. Implement setup_test_encl_dynamic() with a dynamic_size
parameter for the dynamic heap size, and use it in the 'augment',
'augment_via_eaccept' and 'tcs_create' tests.

Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
---
 tools/testing/selftests/sgx/load.c |  5 +++--
 tools/testing/selftests/sgx/main.c | 22 +++++++++++++++-------
 tools/testing/selftests/sgx/main.h |  3 ++-
 3 files changed, 20 insertions(+), 10 deletions(-)
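
Reviewer's note, not part of the patch: the size calculation that encl_load()
performs after this change can be sketched in isolation. elrange_size() is a
hypothetical name. The sketch assumes that contents_size (end offset of the
last segment) plus dynamic_size stays well below 2^63, so that the doubling
cannot overflow.

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring the loop in encl_load(): start at one
 * 4096-byte page and double until the enclave contents plus the dynamic
 * heap fit.  The result is always a power of two.
 */
static uint64_t elrange_size(uint64_t contents_size, uint64_t dynamic_size)
{
	uint64_t needed = contents_size + dynamic_size;
	uint64_t size = 4096;

	while (size < needed)
		size <<= 1;

	return size;
}
```

For example, 8192 bytes of contents plus one dynamic page (4096 bytes) need
12288 bytes, so the range grows to the next power of two, 16384.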

diff --git a/tools/testing/selftests/sgx/load.c b/tools/testing/selftests/sgx/load.c
index 3b4e2422fb09..963a5c6bbbdc 100644
--- a/tools/testing/selftests/sgx/load.c
+++ b/tools/testing/selftests/sgx/load.c
@@ -171,7 +171,8 @@ uint64_t encl_get_entry(struct encl *encl, const char *symbol)
 	return 0;
 }
 
-bool encl_load(const char *path, struct encl *encl, unsigned long heap_size)
+bool encl_load(const char *path, struct encl *encl, unsigned long heap_size,
+	       unsigned long dynamic_size)
 {
 	const char device_path[] = "/dev/sgx_enclave";
 	unsigned long contents_size;
@@ -299,7 +300,7 @@ bool encl_load(const char *path, struct encl *encl, unsigned long heap_size)
 	if (seg->src == MAP_FAILED)
 		goto err;
 
-	contents_size = encl->segment_tbl[j].offset + encl->segment_tbl[j].size;
+	contents_size = encl->segment_tbl[j].offset + encl->segment_tbl[j].size + dynamic_size;
 
 	for (encl->encl_size = 4096; encl->encl_size < contents_size; )
 		encl->encl_size <<= 1;
diff --git a/tools/testing/selftests/sgx/main.c b/tools/testing/selftests/sgx/main.c
index a1850e139c99..78c3b913ce10 100644
--- a/tools/testing/selftests/sgx/main.c
+++ b/tools/testing/selftests/sgx/main.c
@@ -173,8 +173,8 @@ FIXTURE(enclave) {
 	struct sgx_enclave_run run;
 };
 
-static bool setup_test_encl(unsigned long heap_size, struct encl *encl,
-			    struct __test_metadata *_metadata)
+static bool setup_test_encl_dynamic(unsigned long heap_size, unsigned long dynamic_size,
+				    struct encl *encl, struct __test_metadata *_metadata)
 {
 	Elf64_Sym *sgx_enter_enclave_sym = NULL;
 	struct vdso_symtab symtab;
@@ -184,7 +184,7 @@ static bool setup_test_encl(unsigned long heap_size, struct encl *encl,
 	unsigned int i;
 	void *addr;
 
-	if (!encl_load("test_encl.elf", encl, heap_size)) {
+	if (!encl_load("test_encl.elf", encl, heap_size, dynamic_size)) {
 		encl_delete(encl);
 		TH_LOG("Failed to load the test enclave.");
 		return false;
@@ -251,6 +251,12 @@ static bool setup_test_encl(unsigned long heap_size, struct encl *encl,
 	return false;
 }
 
+static bool setup_test_encl(unsigned long heap_size, struct encl *encl,
+			    struct __test_metadata *_metadata)
+{
+	return setup_test_encl_dynamic(heap_size, 0, encl, _metadata);
+}
+
 FIXTURE_SETUP(enclave)
 {
 }
@@ -1013,7 +1019,8 @@ TEST_F(enclave, augment)
 	if (!sgx2_supported())
 		SKIP(return, "SGX2 not supported");
 
-	ASSERT_TRUE(setup_test_encl(ENCL_HEAP_SIZE_DEFAULT, &self->encl, _metadata));
+	ASSERT_TRUE(setup_test_encl_dynamic(ENCL_HEAP_SIZE_DEFAULT, PAGE_SIZE, &self->encl,
+					    _metadata));
 
 	memset(&self->run, 0, sizeof(self->run));
 	self->run.tcs = self->encl.encl_base;
@@ -1143,7 +1150,8 @@ TEST_F(enclave, augment_via_eaccept)
 	if (!sgx2_supported())
 		SKIP(return, "SGX2 not supported");
 
-	ASSERT_TRUE(setup_test_encl(ENCL_HEAP_SIZE_DEFAULT, &self->encl, _metadata));
+	ASSERT_TRUE(setup_test_encl_dynamic(ENCL_HEAP_SIZE_DEFAULT, PAGE_SIZE, &self->encl,
+					    _metadata));
 
 	memset(&self->run, 0, sizeof(self->run));
 	self->run.tcs = self->encl.encl_base;
@@ -1264,8 +1272,8 @@ TEST_F(enclave, tcs_create)
 	int errno_save;
 	int ret, i;
 
-	ASSERT_TRUE(setup_test_encl(ENCL_HEAP_SIZE_DEFAULT, &self->encl,
-				    _metadata));
+	ASSERT_TRUE(setup_test_encl_dynamic(ENCL_HEAP_SIZE_DEFAULT, 3 * PAGE_SIZE, &self->encl,
+					    _metadata));
 
 	memset(&self->run, 0, sizeof(self->run));
 	self->run.tcs = self->encl.encl_base;
diff --git a/tools/testing/selftests/sgx/main.h b/tools/testing/selftests/sgx/main.h
index 9c1bc0d9b43c..8f77ce56ad09 100644
--- a/tools/testing/selftests/sgx/main.h
+++ b/tools/testing/selftests/sgx/main.h
@@ -32,7 +32,8 @@ extern unsigned char sign_key[];
 extern unsigned char sign_key_end[];
 
 void encl_delete(struct encl *ctx);
-bool encl_load(const char *path, struct encl *encl, unsigned long heap_size);
+bool encl_load(const char *path, struct encl *encl, unsigned long heap_size,
+	       unsigned long dynamic_size);
 bool encl_measure(struct encl *encl);
 bool encl_build(struct encl *encl);
 uint64_t encl_get_entry(struct encl *encl, const char *symbol);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v3 27/28] selftests/sgx: Add SGX selftest augment_via_eaccept_long
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (25 preceding siblings ...)
  2023-07-12 23:02 ` [PATCH v3 26/28] selftests/sgx: Include the dynamic heap size to the ELRANGE calculation Haitao Huang
@ 2023-07-12 23:02 ` Haitao Huang
  2023-07-12 23:02 ` [PATCH v3 28/28] selftests/sgx: Add scripts for epc cgroup testing Haitao Huang
                   ` (3 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:02 UTC (permalink / raw)
  To: jarkko, dave.hansen, linux-sgx; +Cc: kai.huang, reinette.chatre, Vijay Dhanraj

From: Vijay Dhanraj <vijay.dhanraj@intel.com>

Add a new test case that is the same as augment_via_eaccept but adds a
larger number of EPC pages to stress test EAUG via EACCEPT.

Signed-off-by: Vijay Dhanraj <vijay.dhanraj@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
---
 tools/testing/selftests/sgx/main.c | 112 ++++++++++++++++++++++++++++-
 1 file changed, 111 insertions(+), 1 deletion(-)
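
For a sense of scale, the sizes involved in this test can be worked out
with a few lines of shell. This is only a sketch: it assumes 4 KiB pages,
and elrange_size is a hypothetical helper that mirrors the power-of-two
rounding loop in encl_load(); it is not part of the selftests.

```shell
#!/bin/bash
PAGE_SIZE=4096   # assumption: 4 KiB pages

# Round a content size up to the next power of two, starting from one page,
# in the same way as the "encl_size <<= 1" loop in encl_load().
elrange_size() {
  local contents_size=$1
  local size=$PAGE_SIZE
  while [ "$size" -lt "$contents_size" ]; do
    size=$((size << 1))
  done
  echo "$size"
}

# DYNAMIC_HEAP_SIZE from the test: 0x200000 pages.
dynamic_heap=$((0x200000 * PAGE_SIZE))

echo "dynamic heap bytes: $dynamic_heap"
echo "EACCEPT iterations: $((dynamic_heap / PAGE_SIZE))"
# Any non-empty enclave image pushes the total past 8 GiB, so the
# ELRANGE is rounded up to the next power of two.
echo "ELRANGE with one extra page: $(elrange_size $((dynamic_heap + PAGE_SIZE)))"
```

Under these assumptions the dynamic heap is 8 GiB, i.e. 2097152 EACCEPT
calls, which is why the test uses TEST_F_TIMEOUT rather than TEST_F.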

diff --git a/tools/testing/selftests/sgx/main.c b/tools/testing/selftests/sgx/main.c
index 78c3b913ce10..e596b45bc5f8 100644
--- a/tools/testing/selftests/sgx/main.c
+++ b/tools/testing/selftests/sgx/main.c
@@ -22,8 +22,10 @@
 #include "main.h"
 
 static const size_t ENCL_HEAP_SIZE_DEFAULT = PAGE_SIZE;
+static const unsigned long TIMEOUT_DEFAULT = 900;
 static const uint64_t MAGIC = 0x1122334455667788ULL;
 static const uint64_t MAGIC2 = 0x8877665544332211ULL;
+
 vdso_sgx_enter_enclave_t vdso_sgx_enter_enclave;
 
 /*
@@ -387,7 +389,7 @@ TEST_F(enclave, unclobbered_vdso_oversubscribed)
 	EXPECT_EQ(self->run.user_data, 0);
 }
 
-TEST_F_TIMEOUT(enclave, unclobbered_vdso_oversubscribed_remove, 900)
+TEST_F_TIMEOUT(enclave, unclobbered_vdso_oversubscribed_remove, TIMEOUT_DEFAULT)
 {
 	struct sgx_enclave_remove_pages remove_ioc;
 	struct sgx_enclave_modify_types modt_ioc;
@@ -1245,6 +1247,114 @@ TEST_F(enclave, augment_via_eaccept)
 	munmap(addr, PAGE_SIZE);
 }
 
+/*
+ * Test for the addition of a large number of pages to an initialized enclave
+ * via a pre-emptive run of EACCEPT on every page to be added.
+ */
+TEST_F_TIMEOUT(enclave, augment_via_eaccept_long, TIMEOUT_DEFAULT)
+{
+	/*
+	 * The dynamic heap size was chosen based on a bug report:
+	 * Message-ID:
+	 * <DM8PR11MB55912A7F47A84EC9913A6352F6999@DM8PR11MB5591.namprd11.prod.outlook.com>
+	 */
+	static const unsigned long DYNAMIC_HEAP_SIZE = 0x200000L * PAGE_SIZE;
+	struct encl_op_get_from_addr get_addr_op;
+	struct encl_op_put_to_addr put_addr_op;
+	struct encl_op_eaccept eaccept_op;
+	size_t total_size = 0;
+	unsigned long i;
+	void *addr;
+
+	if (!sgx2_supported())
+		SKIP(return, "SGX2 not supported");
+
+	ASSERT_TRUE(setup_test_encl_dynamic(ENCL_HEAP_SIZE_DEFAULT, DYNAMIC_HEAP_SIZE,
+					    &self->encl, _metadata));
+
+	memset(&self->run, 0, sizeof(self->run));
+	self->run.tcs = self->encl.encl_base;
+
+	for (i = 0; i < self->encl.nr_segments; i++) {
+		struct encl_segment *seg = &self->encl.segment_tbl[i];
+
+		total_size += seg->size;
+	}
+
+	/*
+	 * mmap() every page at the end of the existing enclave to be used
+	 * for EDMM.
+	 */
+	addr = mmap((void *)self->encl.encl_base + total_size, DYNAMIC_HEAP_SIZE,
+		    PROT_READ | PROT_WRITE | PROT_EXEC, MAP_SHARED | MAP_FIXED,
+		    self->encl.fd, 0);
+	EXPECT_NE(addr, MAP_FAILED);
+
+	self->run.exception_vector = 0;
+	self->run.exception_error_code = 0;
+	self->run.exception_addr = 0;
+
+	/*
+	 * Run EACCEPT on every page to trigger the #PF->EAUG->EACCEPT(again
+	 * without a #PF). All should be transparent to userspace.
+	 */
+	eaccept_op.flags = SGX_SECINFO_R | SGX_SECINFO_W | SGX_SECINFO_REG | SGX_SECINFO_PENDING;
+	eaccept_op.ret = 0;
+	eaccept_op.header.type = ENCL_OP_EACCEPT;
+
+	for (i = 0; i < DYNAMIC_HEAP_SIZE; i += PAGE_SIZE) {
+		eaccept_op.epc_addr = (uint64_t)(addr + i);
+
+		EXPECT_EQ(ENCL_CALL(&eaccept_op, &self->run, true), 0);
+		if (self->run.exception_vector == 14 &&
+		    self->run.exception_error_code == 4 &&
+		    self->run.exception_addr == self->encl.encl_base) {
+			munmap(addr, DYNAMIC_HEAP_SIZE);
+			SKIP(return, "Kernel does not support adding pages to initialized enclave");
+		}
+
+		EXPECT_EQ(self->run.exception_vector, 0);
+		EXPECT_EQ(self->run.exception_error_code, 0);
+		EXPECT_EQ(self->run.exception_addr, 0);
+		ASSERT_EQ(eaccept_op.ret, 0);
+		ASSERT_EQ(self->run.function, EEXIT);
+	}
+
+	/*
+	 * The pool of pages was successfully added to the enclave. Perform a
+	 * sanity check on the first page of the pool only, to ensure data can
+	 * be written to and read from a dynamically added enclave page.
+	 */
+	put_addr_op.value = MAGIC;
+	put_addr_op.addr = (unsigned long)addr;
+	put_addr_op.header.type = ENCL_OP_PUT_TO_ADDRESS;
+
+	EXPECT_EQ(ENCL_CALL(&put_addr_op, &self->run, true), 0);
+
+	EXPECT_EEXIT(&self->run);
+	EXPECT_EQ(self->run.exception_vector, 0);
+	EXPECT_EQ(self->run.exception_error_code, 0);
+	EXPECT_EQ(self->run.exception_addr, 0);
+
+	/*
+	 * Read memory from newly added page that was just written to,
+	 * confirming that data previously written (MAGIC) is present.
+	 */
+	get_addr_op.value = 0;
+	get_addr_op.addr = (unsigned long)addr;
+	get_addr_op.header.type = ENCL_OP_GET_FROM_ADDRESS;
+
+	EXPECT_EQ(ENCL_CALL(&get_addr_op, &self->run, true), 0);
+
+	EXPECT_EQ(get_addr_op.value, MAGIC);
+	EXPECT_EEXIT(&self->run);
+	EXPECT_EQ(self->run.exception_vector, 0);
+	EXPECT_EQ(self->run.exception_error_code, 0);
+	EXPECT_EQ(self->run.exception_addr, 0);
+
+	munmap(addr, DYNAMIC_HEAP_SIZE);
+}
+
 /*
  * SGX2 page type modification test in two phases:
  * Phase 1:
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v3 28/28] selftests/sgx: Add scripts for epc cgroup testing
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (26 preceding siblings ...)
  2023-07-12 23:02 ` [PATCH v3 27/28] selftests/sgx: Add SGX selftest augment_via_eaccept_long Haitao Huang
@ 2023-07-12 23:02 ` Haitao Huang
  2023-07-17 11:02 ` [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Jarkko Sakkinen
                   ` (2 subsequent siblings)
  30 siblings, 0 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-12 23:02 UTC (permalink / raw)
  To: jarkko, dave.hansen, linux-sgx
  Cc: kai.huang, reinette.chatre, zhiquan1.li, kristen, seanjc, zhanb,
	anakrish, mikko.ylinen

The scripts rely on the cgroup-tools package from libcgroup [1].

To test:
1) sudo ./setup_epc_cg.sh (optional one-time setup)
2) sudo ./run_tests_in_misc_cg.sh

To watch the misc.current values of the test cgroups:
./watch_misc_for_tests.sh current

[1] https://github.com/libcgroup/libcgroup/blob/main/README

Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
---
 .../selftests/sgx/run_tests_in_misc_cg.sh     | 68 +++++++++++++++++++
 tools/testing/selftests/sgx/setup_epc_cg.sh   | 29 ++++++++
 .../selftests/sgx/watch_misc_for_tests.sh     | 13 ++++
 3 files changed, 110 insertions(+)
 create mode 100755 tools/testing/selftests/sgx/run_tests_in_misc_cg.sh
 create mode 100755 tools/testing/selftests/sgx/setup_epc_cg.sh
 create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
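
When checking results by hand, the sgx_epc entry can be pulled out of a
misc cgroup file with a small helper. This is a sketch only: misc_value is
a hypothetical name, and it assumes the file holds one "<key> <value>" pair
per line, which is the format setup_epc_cg.sh writes to misc.max. The demo
uses a scratch file so that no cgroup mount is needed.

```shell
#!/bin/bash
# misc_value FILE KEY: print the value stored for KEY in a flat-keyed
# cgroup file such as misc.max or misc.current.
misc_value() {
  awk -v key="$2" '$1 == key { print $2 }' "$1"
}

# Demo on a scratch file that imitates test/test1/misc.max after setup.
demo=$(mktemp)
printf 'sgx_epc 4096000\n' > "$demo"
limit=$(misc_value "$demo" sgx_epc)
echo "sgx_epc limit: $limit bytes"
rm -f "$demo"
```

On a real system the first argument would be one of the misc.max paths
used in setup_epc_cg.sh.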

diff --git a/tools/testing/selftests/sgx/run_tests_in_misc_cg.sh b/tools/testing/selftests/sgx/run_tests_in_misc_cg.sh
new file mode 100755
index 000000000000..f9b691252b8d
--- /dev/null
+++ b/tools/testing/selftests/sgx/run_tests_in_misc_cg.sh
@@ -0,0 +1,68 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2023 Intel Corporation.
+
+if ! lscgroup | grep -q "test/test1/test3$"; then
+  echo "setting up cgroups for testing..."
+  ./setup_epc_cg.sh
+fi
+
+cmd='./test_sgx'
+default_test="augment_via_eaccept_long"
+
+# We use 'tail' to skip header lines and 'sed' to remove 'enclave' from the first non-header line.
+list=$($cmd -l 2>&1 | tail -n +4 | sed '0,/^enclave/ s/^enclave//' | sed 's/^ *//')
+
+IFS=$'\n' read -d '' -r -a lines <<< "$list"
+lines=("all" "${lines[@]}")
+
+echo "Available tests:"
+for i in "${!lines[@]}"; do
+  # Check if the current line is the default test
+  if [[ ${lines[$i]} == *"$default_test"* ]]; then
+    echo "$((i)). ${lines[$i]} (default)"
+  else
+    echo "$((i)). ${lines[$i]}"
+  fi
+done
+
+echo "Please enter the number of the test you want to run (or press enter for the default test):"
+read choice
+
+if [ -z "$choice" ]; then
+  testname="$default_test"
+else
+  testname="${lines[$choice]}"
+fi
+
+if [ "$testname" == "all" ]; then
+  test_cmd="$cmd"
+else
+  test_cmd="$cmd -t $testname"
+fi
+
+timestamp=$(date +%Y%m%d_%H%M%S)
+
+# Always use leaf nodes of the misc cgroups so this works for both v1 and v2.
+# These may fail on OOM.
+nohup bash -c "cgexec -g misc:test/test1/test3 $test_cmd" >test1_1_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test/test1/test3 $test_cmd" >test1_2_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test/test1/test3 $test_cmd" >test1_3_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test/test1/test3 $test_cmd" >test1_4_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test/test1/test3 $test_cmd" >test1_5_$timestamp.log 2>&1 &
+
+# These runs may time out in the oversubscribed tests on 4G EPC
+nohup bash -c "cgexec -g misc:test/test2 $test_cmd" >test2_1_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test/test2 $test_cmd" >test2_2_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test/test2 $test_cmd" >test2_3_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test/test2 $test_cmd" >test2_4_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test/test2 $test_cmd" >test2_5_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test/test2 $test_cmd" >test2_6_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test/test2 $test_cmd" >test2_7_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test/test2 $test_cmd" >test2_8_$timestamp.log 2>&1 &
+
+# this should work on 4G EPC
+nohup bash -c "cgexec -g misc:test4 $test_cmd" >test4_1_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test4 $test_cmd" >test4_2_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test4 $test_cmd" >test4_3_$timestamp.log 2>&1 &
+nohup bash -c "cgexec -g misc:test4 $test_cmd" >test4_4_$timestamp.log 2>&1 &
diff --git a/tools/testing/selftests/sgx/setup_epc_cg.sh b/tools/testing/selftests/sgx/setup_epc_cg.sh
new file mode 100755
index 000000000000..5fd137a66436
--- /dev/null
+++ b/tools/testing/selftests/sgx/setup_epc_cg.sh
@@ -0,0 +1,29 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2023 Intel Corporation.
+
+cgcreate -g misc:test
+if [ $? -ne 0 ]; then
+    echo "Please make sure cgroup-tools is installed, and misc cgroup is mounted."
+    exit 1
+fi
+cgcreate -g misc:test/test1
+cgcreate -g misc:test/test1/test3
+cgcreate -g misc:test/test2
+cgcreate -g misc:test4
+
+# Setup for a platform with 4G EPC
+LARGER=4096000000
+LARGE=409600000
+SMALL=4096000
+if [ ! -d "/sys/fs/cgroup/misc" ]; then
+    echo "cgroups v2 is in use. Only leaf nodes can run a process"
+    echo "sgx_epc $SMALL" | tee /sys/fs/cgroup/test/test1/misc.max
+    echo "sgx_epc $LARGE" | tee /sys/fs/cgroup/test/test2/misc.max
+    echo "sgx_epc $LARGER" | tee /sys/fs/cgroup/test4/misc.max
+else
+    echo "cgroups v1 is in use."
+    echo "sgx_epc $SMALL" | tee /sys/fs/cgroup/misc/test/test1/misc.max
+    echo "sgx_epc $LARGE" | tee /sys/fs/cgroup/misc/test/test2/misc.max
+    echo "sgx_epc $LARGER" | tee /sys/fs/cgroup/misc/test4/misc.max
+fi
diff --git a/tools/testing/selftests/sgx/watch_misc_for_tests.sh b/tools/testing/selftests/sgx/watch_misc_for_tests.sh
new file mode 100755
index 000000000000..dbd38f346e7b
--- /dev/null
+++ b/tools/testing/selftests/sgx/watch_misc_for_tests.sh
@@ -0,0 +1,13 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2023 Intel Corporation.
+
+if [ -z "$1" ]
+  then
+    echo "No argument supplied, please provide 'max', 'current' or 'events'"
+    exit 1
+fi
+
+watch -n 1 "find /sys/fs/cgroup -wholename */test*/misc.$1 -exec sh -c \
+    'echo \"\$1:\"; cat \"\$1\"' _ {} \;"
+
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 21/28] x86/sgx: Limit process EPC usage with misc cgroup controller
  2023-07-12 23:01 ` [PATCH v3 21/28] x86/sgx: Limit process EPC usage with misc cgroup controller Haitao Huang
@ 2023-07-13  0:03   ` Randy Dunlap
  2023-08-17 15:12   ` Mikko Ylinen
  1 sibling, 0 replies; 62+ messages in thread
From: Randy Dunlap @ 2023-07-13  0:03 UTC (permalink / raw)
  To: Haitao Huang, jarkko, dave.hansen, tj, linux-kernel, linux-sgx,
	cgroups, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, Kristen Carlson Accardi, zhiquan1.li, seanjc



On 7/12/23 16:01, Haitao Huang wrote:
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 53bab123a8ee..8a7378159e9e 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1952,6 +1952,19 @@ config X86_SGX
>  
>  	  If unsure, say N.
>  
> +config CGROUP_SGX_EPC
> +	bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX"
> +	depends on X86_SGX && CGROUP_MISC
> +	help
> +	  Provides control over the EPC footprint of tasks in a cgroup via
> +	  the Miscellaneous cgroup controller.
> +
> +	  EPC is a subset of regular memory that is usable only by SGX
> +	  enclaves and is very limited in quantity, e.g. less than 1%
> +	  of total DRAM.
> +
> +          Say N if unsure.

Use tab + 2 spaces above for indentation, please.

-- 
~Randy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 22/28] Docs/x86/sgx: Add description for cgroup support
  2023-07-12 23:01 ` [PATCH v3 22/28] Docs/x86/sgx: Add description for cgroup support Haitao Huang
@ 2023-07-13  0:10   ` Randy Dunlap
  2023-07-14 20:01     ` Haitao Huang
  2023-07-14 20:26   ` Haitao Huang
  2023-08-17 15:18   ` Mikko Ylinen
  2 siblings, 1 reply; 62+ messages in thread
From: Randy Dunlap @ 2023-07-13  0:10 UTC (permalink / raw)
  To: Haitao Huang, jarkko, dave.hansen, tj, linux-kernel, linux-sgx,
	cgroups, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Jonathan Corbet
  Cc: kai.huang, reinette.chatre, Kristen Carlson Accardi, zhiquan1.li,
	seanjc, bagasdotme, linux-doc, zhanb, anakrish, mikko.ylinen

Hi,

On 7/12/23 16:01, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@linux.intel.com>
> 
> Add initial documentation of how to regulate the distribution of
> SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup
> controller.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
> ---
>  Documentation/arch/x86/sgx.rst | 77 ++++++++++++++++++++++++++++++++++
>  1 file changed, 77 insertions(+)
> 
> diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst
> index 2bcbffacbed5..f6ca5594dcf2 100644
> --- a/Documentation/arch/x86/sgx.rst
> +++ b/Documentation/arch/x86/sgx.rst
> @@ -300,3 +300,80 @@ to expected failures and handle them as follows:
>     first call.  It indicates a bug in the kernel or the userspace client
>     if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
>     a return code other than 0.
> +
> +
> +Cgroup Support
> +==============
> +
> +The "sgx_epc" resource within the Miscellaneous cgroup controller regulates
> +distribution of SGX EPC memory, which is a subset of system RAM that
> +is used to provide SGX-enabled applications with protected memory,
> +and is otherwise inaccessible, i.e. shows up as reserved in
> +/proc/iomem and cannot be read/written outside of an SGX enclave.
> +
> +Although current systems implement EPC by stealing memory from RAM,
> +for all intents and purposes the EPC is independent from normal system
> +memory, e.g. must be reserved at boot from RAM and cannot be converted
> +between EPC and normal memory while the system is running.  The EPC is
> +managed by the SGX subsystem and is not accounted by the memory
> +controller.  Note that this is true only for EPC memory itself, i.e.
> +normal memory allocations related to SGX and EPC memory, e.g. the
> +backing memory for evicted EPC pages, are accounted, limited and
> +protected by the memory controller.
> +
> +Much like normal system memory, EPC memory can be overcommitted via
> +virtual memory techniques and pages can be swapped out of the EPC
> +to their backing store (normal system memory allocated via shmem).
> +The SGX EPC subsystem is analogous to the memory subsytem, and
> +it implements limit and protection models for EPC memory.
> +
> +SGX EPC Interface Files
> +-----------------------
> +
> +For a generic description of the Miscellaneous controller interface
> +files, please see Documentation/admin-guide/cgroup-v2.rst
> +
> +All SGX EPC memory amounts are in bytes unless explicitly stated
> +otherwise.  If a value which is not PAGE_SIZE aligned is written,
> +the actual value used by the controller will be rounded down to
> +the closest PAGE_SIZE multiple.
> +
> +  misc.capacity
> +        A read-only flat-keyed file shown only in the root cgroup.
> +        The sgx_epc resource will show the total amount of EPC
> +        memory available on the platform.
> +
> +  misc.current
> +        A read-only flat-keyed file shown in the non-root cgroups.
> +        The sgx_epc resource will show the current active EPC memory
> +        usage of the cgroup and its descendants. EPC pages that are
> +        swapped out to backing RAM are not included in the current count.
> +
> +  misc.max
> +        A read-write single value file which exists on non-root
> +        cgroups. The sgx_epc resource will show the EPC usage
> +        hard limit. The default is "max".
> +
> +        If a cgroup's EPC usage reaches this limit, EPC allocations,
> +        e.g. for page fault handling, will be blocked until EPC can
> +        be reclaimed from the cgroup.  If EPC cannot be reclaimed in
> +        a timely manner, reclaim will be forced, e.g. by ignoring LRU.
> +
> +  misc.events
> +	A read-write flat-keyed file which exists on non-root cgroups.
> +	Writes to the file reset the event counters to zero.  A value
> +	change in this file generates a file modified event.
> +
> +	  max
> +		The number of times the cgroup has triggered a reclaim
> +		due to its EPC usage approaching (or exceeding) its max
> +		EPC boundary.

The indentation here (above) is a little confusing.
Is this formatted the way that is intended?

> +
> +Migration
> +---------
> +
> +Once an EPC page is charged to a cgroup (during allocation), it
> +remains charged to the original cgroup until the page is released
> +or reclaimed.  Migrating a process to a different cgroup doesn't
> +move the EPC charges that it incurred while in the previous cgroup
> +to its new cgroup.
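
As a side note on the rounding rule in the quoted "SGX EPC Interface Files"
text, the arithmetic can be sketched as follows. It assumes 4 KiB pages,
and round_down_to_page is a hypothetical name used only for illustration;
it is not taken from the patch.

```shell
#!/bin/bash
PAGE_SIZE=4096   # assumption: 4 KiB pages

# Round a byte count down to the nearest PAGE_SIZE multiple, which is what
# the quoted text says happens to an unaligned value written to misc.max.
round_down_to_page() {
  echo $(( $1 / PAGE_SIZE * PAGE_SIZE ))
}

round_down_to_page 4096000   # already page aligned, printed unchanged
round_down_to_page 5000      # less than two pages, rounds down to one page
```

With these inputs the limit of 4096000 bytes used for test/test1 in
setup_epc_cg.sh is already aligned (1000 pages), so it would be applied
as written.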

-- 
~Randy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 22/28] Docs/x86/sgx: Add description for cgroup support
  2023-07-13  0:10   ` Randy Dunlap
@ 2023-07-14 20:01     ` Haitao Huang
  0 siblings, 0 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-14 20:01 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Jonathan Corbet, Randy Dunlap
  Cc: kai.huang, reinette.chatre, Kristen Carlson Accardi, zhiquan1.li,
	seanjc, bagasdotme, linux-doc, zhanb, anakrish, mikko.ylinen

Hi

On Wed, 12 Jul 2023 19:10:59 -0500, Randy Dunlap <rdunlap@infradead.org> wrote:

>> +
>> +
>> +Cgroup Support
>> +==============
>> +
>> +The "sgx_epc" resource within the Miscellaneous cgroup controller regulates
>> +distribution of SGX EPC memory, which is a subset of system RAM that
>> +is used to provide SGX-enabled applications with protected memory,
>> +and is otherwise inaccessible, i.e. shows up as reserved in
>> +/proc/iomem and cannot be read/written outside of an SGX enclave.
>> +
>> +Although current systems implement EPC by stealing memory from RAM,
>> +for all intents and purposes the EPC is independent from normal system
>> +memory, e.g. must be reserved at boot from RAM and cannot be converted
>> +between EPC and normal memory while the system is running.  The EPC is
>> +managed by the SGX subsystem and is not accounted by the memory
>> +controller.  Note that this is true only for EPC memory itself, i.e.
>> +normal memory allocations related to SGX and EPC memory, e.g. the
>> +backing memory for evicted EPC pages, are accounted, limited and
>> +protected by the memory controller.
>> +
>> +Much like normal system memory, EPC memory can be overcommitted via
>> +virtual memory techniques and pages can be swapped out of the EPC
>> +to their backing store (normal system memory allocated via shmem).
>> +The SGX EPC subsystem is analogous to the memory subsytem, and
>> +it implements limit and protection models for EPC memory.
>> +
>> +SGX EPC Interface Files
>> +-----------------------
>> +
>> +For a generic description of the Miscellaneous controller interface
>> +files, please see Documentation/admin-guide/cgroup-v2.rst
>> +
>> +All SGX EPC memory amounts are in bytes unless explicitly stated
>> +otherwise.  If a value which is not PAGE_SIZE aligned is written,
>> +the actual value used by the controller will be rounded down to
>> +the closest PAGE_SIZE multiple.
>> +
>> +  misc.capacity
>> +        A read-only flat-keyed file shown only in the root cgroup.
>> +        The sgx_epc resource will show the total amount of EPC
>> +        memory available on the platform.
>> +
>> +  misc.current
>> +        A read-only flat-keyed file shown in the non-root cgroups.
>> +        The sgx_epc resource will show the current active EPC memory
>> +        usage of the cgroup and its descendants. EPC pages that are
>> +        swapped out to backing RAM are not included in the current count.
>> +
>> +  misc.max
>> +        A read-write single value file which exists on non-root
>> +        cgroups. The sgx_epc resource will show the EPC usage
>> +        hard limit. The default is "max".
>> +
>> +        If a cgroup's EPC usage reaches this limit, EPC allocations,
>> +        e.g. for page fault handling, will be blocked until EPC can
>> +        be reclaimed from the cgroup.  If EPC cannot be reclaimed in
>> +        a timely manner, reclaim will be forced, e.g. by ignoring LRU.
>> +
>> +  misc.events
>> +	A read-write flat-keyed file which exists on non-root cgroups.
>> +	Writes to the file reset the event counters to zero.  A value
>> +	change in this file generates a file modified event.
>> +
>> +	  max
>> +		The number of times the cgroup has triggered a reclaim
>> +		due to its EPC usage approaching (or exceeding) its max
>> +		EPC boundary.
>
> The indentation here (above) is a little confusing.
> Is this formatted the way that is intended?
>
max here is an entry in the misc.events file, so it needs to be indented as
a subsection. But I see that spaces are used for indentation in the sections
above (misc.max, misc.current and misc.capacity), while tabs are used in this
section, so I think that may be what is causing the confusion.
I'll fix them to use tabs throughout.

Thanks
Haitao

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 22/28] Docs/x86/sgx: Add description for cgroup support
  2023-07-12 23:01 ` [PATCH v3 22/28] Docs/x86/sgx: Add description for cgroup support Haitao Huang
  2023-07-13  0:10   ` Randy Dunlap
@ 2023-07-14 20:26   ` Haitao Huang
  2023-08-17 15:18   ` Mikko Ylinen
  2 siblings, 0 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-14 20:26 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Jonathan Corbet, Haitao Huang
  Cc: kai.huang, reinette.chatre, Kristen Carlson Accardi, zhiquan1.li,
	seanjc, bagasdotme, linux-doc, zhanb, anakrish, mikko.ylinen


> +
> +  misc.events
> +	A read-write flat-keyed file which exists on non-root cgroups.

This file is actually read-only. Will fix.

Haitao

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v3 00/28] Add Cgroup support for SGX EPC memory
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (27 preceding siblings ...)
  2023-07-12 23:02 ` [PATCH v3 28/28] selftests/sgx: Add scripts for epc cgroup testing Haitao Huang
@ 2023-07-17 11:02 ` Jarkko Sakkinen
  2023-07-24 19:09 ` Sohil Mehta
  2023-08-17 15:04 ` Mikko Ylinen
  30 siblings, 0 replies; 62+ messages in thread
From: Jarkko Sakkinen @ 2023-07-17 11:02 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, tj, linux-kernel, linux-sgx, cgroups
  Cc: kai.huang, reinette.chatre, zhiquan1.li, kristen, seanjc, zhanb,
	anakrish, mikko.ylinen

On Wed Jul 12, 2023 at 11:01 PM UTC, Haitao Huang wrote:
> SGX EPC memory allocations are separate from normal RAM allocations, and is
> managed solely by the SGX subsystem. The existing cgroup memory controller
> cannot be used to limit or account for SGX EPC memory, which is a desirable
> feature in some environments, e.g., support for pod level control in a
> Kubernates cluster on a VM or baremetal host [1,2] in those environments.
>
> This patchset implements the support for sgx_epc memory within the misc
> cgroup controller. The user can use the misc cgroup controller to set and
> enforce a max limit on total EPC usage per cgroup. The implementation
> reports current usage and events of reaching the limit per cgroup as well
> as the total system capacity.
>
> This work was originally authored by Sean Christopherson a few years ago,
> and previously modified by Kristen C. Accardi to work with more recent
> kernels, and to utilize the misc cgroup controller rather than a custom
> controller. Now I updated the patches based on review comments on the V2
> series[3], simplified a few aspects of the implementation/design and fixed
> some stability issues found from testing, while keeping the same user space
> facing interfaces.
>
> The patchset adds support for multiple LRUs to track both reclaimable EPC
> pages (i.e. pages the reclaimer knows about), as well as unreclaimable EPC
> pages (i.e.  pages which the reclaimer isn't aware of, such as VA pages).
> These pages are assigned to an LRU, as well as an enclave, so that an
> enclave's full EPC usage can be tracked, and limited to a max value. During
> OOM events, an enclave can be have its memory zapped, and all the EPC pages
> not tracked by the reclaimer can be freed.
>
> I appreciate your comments and feedback.
>
> Summary of changes from v2: (more details in commit logs)
>
> * Added EPC states to replace flags in sgx_epc_page struct. (Jarkko)
> * Unrolled wrappers for cond_resched, list (Dave)
> * Separate patches for adding reclaimable and unreclaimable lists. (Dave)
> * Other improvments on patch flow, commit messages, styles. (Dave, Jarkko)
> * Simplified the cgroup tree walking with plain
>   css_for_each_descendant_pre.
> * Fixed race conditions and crashes.
> * OOM killer to wait for the victim enclave pages being reclaimed.
> * Unblock the user by handling misc_max_write callback asynchronously.
> * Rebased onto 6.4 and no longer base this series on the MCA patchset.
> * Fix an overflow in misc_try_charge.
> * Fix a NULL pointer in SGX PF handler.
> * Updated and included the SGX selftest patches previously reviewed. Those
>   patches fix issues triggered in high EPC pressure required for cgroup
>   testing.
> * Added test scripts to help setup and test SGX EPC cgroups.
>
> [1]https://lore.kernel.org/all/DM6PR21MB11772A6ED915825854B419D6C4989@DM6PR21MB1177.namprd21.prod.outlook.com/
> [2]https://lore.kernel.org/all/ZD7Iutppjj+muH4p@himmelriiki/
> [3]https://lore.kernel.org/all/20221202183655.3767674-1-kristen@linux.intel.com/
> [4]Documentation/arch/x86/sgx.rst, Section "Virtual EPC"
>
> Haitao Huang (6):
>   x86/sgx: Store struct sgx_encl when allocating new VA pages
>   x86/sgx: Introduce EPC page states
>   x86/sgx: fix a NULL pointer
>   cgroup/misc: Fix an overflow
>   selftests/sgx: Retry the ioctl()'s returned with EAGAIN
>   selftests/sgx: Add scripts for epc cgroup testing
>
> Jarkko Sakkinen (3):
>   selftests/sgx: Move ENCL_HEAP_SIZE_DEFAULT to main.c
>   selftests/sgx: Use encl->encl_size in sigstruct.c
>   selftests/sgx: Include the dynamic heap size to the ELRANGE
>     calculation
>
> Kristen Carlson Accardi (9):
>   x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s)
>   x86/sgx: Use sgx_epc_lru_lists for existing active page list
>   x86/sgx: Store reclaimable epc pages in sgx_epc_lru_lists
>   x86/sgx: store unreclaimable EPC pages in sgx_epc_lru_lists
>   x86/sgx: Use a list to track to-be-reclaimed pages
>   cgroup/misc: Add per resource callbacks for CSS events
>   cgroup/misc: Add SGX EPC resource type and export APIs for SGX driver
>   x86/sgx: Limit process EPC usage with misc cgroup controller
>   Docs/x86/sgx: Add description for cgroup support
>
> Sean Christopherson (9):
>   x86/sgx: Add EPC page flags to identify owner type
>   x86/sgx: Introduce RECLAIM_IN_PROGRESS state
>   x86/sgx: Allow reclaiming up to 32 pages, but scan 16 by default
>   x85/sgx: Return the number of EPC pages that were successfully
>     reclaimed
>   x86/sgx: Add option to ignore age of page during EPC reclaim
>   x86/sgx: Prepare for multiple LRUs
>   x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup
>   x86/sgx: Add helper to grab pages from an arbitrary EPC LRU
>   x86/sgx: Add EPC OOM path to forcefully reclaim EPC
>
> Vijay Dhanraj (1):
>   selftests/sgx: Add SGX selftest augment_via_eaccept_long
>
>  Documentation/arch/x86/sgx.rst                |  77 ++++
>  arch/x86/Kconfig                              |  13 +
>  arch/x86/kernel/cpu/sgx/Makefile              |   1 +
>  arch/x86/kernel/cpu/sgx/driver.c              |  27 +-
>  arch/x86/kernel/cpu/sgx/encl.c                |  95 +++-
>  arch/x86/kernel/cpu/sgx/encl.h                |   4 +-
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c          | 406 ++++++++++++++++++
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h          |  60 +++
>  arch/x86/kernel/cpu/sgx/ioctl.c               |  25 +-
>  arch/x86/kernel/cpu/sgx/main.c                | 406 ++++++++++++++----
>  arch/x86/kernel/cpu/sgx/sgx.h                 | 113 ++++-
>  include/linux/misc_cgroup.h                   |  34 ++
>  kernel/cgroup/misc.c                          |  63 ++-
>  tools/testing/selftests/sgx/load.c            |   8 +-
>  tools/testing/selftests/sgx/main.c            | 177 +++++++-
>  tools/testing/selftests/sgx/main.h            |   6 +-
>  .../selftests/sgx/run_tests_in_misc_cg.sh     |  68 +++
>  tools/testing/selftests/sgx/setup_epc_cg.sh   |  29 ++
>  tools/testing/selftests/sgx/sigstruct.c       |   8 +-
>  .../selftests/sgx/watch_misc_for_tests.sh     |  13 +
>  20 files changed, 1446 insertions(+), 187 deletions(-)
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
>  create mode 100755 tools/testing/selftests/sgx/run_tests_in_misc_cg.sh
>  create mode 100755 tools/testing/selftests/sgx/setup_epc_cg.sh
>  create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
>
> -- 
> 2.25.1

Thanks for taking the effort, must have been tedious!

BR, Jarkko


* Re: [PATCH v3 01/28] x86/sgx: Store struct sgx_encl when allocating new VA pages
  2023-07-12 23:01 ` [PATCH v3 01/28] x86/sgx: Store struct sgx_encl when allocating new VA pages Haitao Huang
@ 2023-07-17 11:14   ` Jarkko Sakkinen
  0 siblings, 0 replies; 62+ messages in thread
From: Jarkko Sakkinen @ 2023-07-17 11:14 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, zhiquan1.li, kristen, seanjc

On Wed Jul 12, 2023 at 11:01 PM UTC, Haitao Huang wrote:
> In a later patch, when a cgroup has exceeded the max capacity for EPC pages
> and there are no more Enclave EPC pages associated with the cgroup that can
> be reclaimed, the only pages still associated with an enclave will be the
> unreclaimable Version Array (VA) pages or SECS pages, and the entire
> enclave will need to be killed to free up those pages.
>
> Currently, given an enclave pointer it is easy to find the associated VA
> pages and free them, however, OOM killing an enclave based on cgroup limits
> will require examining a cgroup's unreclaimable page list, and finding an
> enclave given a SECS page or a VA page. This will require a backpointer
> from a page to an enclave, including for VA pages.
>
> When allocating new Version Array (VA) pages, pass the struct sgx_encl of
> the enclave that is allocating the page. sgx_alloc_epc_page() will store
> this value in the owner field of the struct sgx_epc_page.  In a later
> patch, VA pages will be placed in an unreclaimable queue, and then when the
> cgroup max limit is reached and there are no more reclaimable pages and the
> enclave must be OOM killed, all the VA pages associated with that enclave
> can be uncharged and freed.
>
> To avoid casting needed to access the two types of owners: sgx_encl for VA
> pages, sgx_encl_page for other pages, replace 'owner' field in sgx_epc_page
> with a union of the two types.

I think the action taken is correct but the reasoning is a bit
convoluted.

Why not instead put something like:

"Because struct sgx_epc_page instances of VA pages are not owned by an
sgx_encl_page instance in the first place, mark their owner as sgx_encl,
in order to make it reachable from the unreclaimable list."

The code change itself, and rest of the paragraphs do look reasonable.

BR, Jarkko
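
The owner change discussed above can be illustrated with a minimal
userspace sketch. The struct layouts are simplified stand-ins and
va_page_set_owner() is a hypothetical helper, not a function from the
patch; only the idea of a union holding either owner type comes from the
commit message.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; fields are illustrative. */
struct sgx_encl { int id; };
struct sgx_encl_page { struct sgx_encl *encl; };

struct sgx_epc_page {
	union {
		struct sgx_encl_page *encl_page; /* owner of a regular enclave page */
		struct sgx_encl *encl;           /* owner of a VA page */
	};
};

/* Hypothetical helper: record the enclave as the owner of a new VA page. */
static void va_page_set_owner(struct sgx_epc_page *page, struct sgx_encl *encl)
{
	page->encl = encl;
}
```

With the backpointer stored, code walking an unreclaimable list can reach
the enclave from a VA page directly through page->encl, without a cast.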


* Re: [PATCH v3 02/28] x86/sgx: Add EPC page flags to identify owner type
  2023-07-12 23:01 ` [PATCH v3 02/28] x86/sgx: Add EPC page flags to identify owner type Haitao Huang
@ 2023-07-17 12:41   ` Jarkko Sakkinen
  2023-07-17 12:43     ` Jarkko Sakkinen
  0 siblings, 1 reply; 62+ messages in thread
From: Jarkko Sakkinen @ 2023-07-17 12:41 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, Sean Christopherson, zhiquan1.li,
	kristen, seanjc

On Wed Jul 12, 2023 at 11:01 PM UTC, Haitao Huang wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Two types of owners, 'sgx_encl' for VA pages and 'sgx_encl_page' for other,
> can be stored in the union field in sgx_epc_page struct introduced in the
> previous patch.

This would be easier to follow:

"Two types of owners of struct sgx_epc_page, 'sgx_encl' for VA pages and
'sgx_encl_page' for other pages, can be stored in the previously
introduced union field."

> When cgroup OOM support is added in a later patch, the owning enclave of a
> page will need to be identified. Retrieving the sgx_encl struct from a
> sgx_epc_page will be different if the page is a VA page vs. other enclave
> pages.
>
> Add 2 flags which will identify the type of the owner and apply them
> accordingly to newly allocated pages.

This would be easier to follow:

"OOM support for cgroups requires that the owner needs to be identified
when selecting pages from the unreclaimable list. Address this by adding
flags for identifying the owner type."

It is better to carry the story a little bit forward than say that a
subsequent patch will require this :-) I.e. enough to get at least a
rough idea what is going on.

BR, Jarkko


* Re: [PATCH v3 02/28] x86/sgx: Add EPC page flags to identify owner type
  2023-07-17 12:41   ` Jarkko Sakkinen
@ 2023-07-17 12:43     ` Jarkko Sakkinen
  0 siblings, 0 replies; 62+ messages in thread
From: Jarkko Sakkinen @ 2023-07-17 12:43 UTC (permalink / raw)
  To: Jarkko Sakkinen, Haitao Huang, dave.hansen, tj, linux-kernel,
	linux-sgx, cgroups, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin
  Cc: kai.huang, reinette.chatre, Sean Christopherson, zhiquan1.li,
	kristen, seanjc

On Mon Jul 17, 2023 at 12:41 PM UTC, Jarkko Sakkinen wrote:
> On Wed Jul 12, 2023 at 11:01 PM UTC, Haitao Huang wrote:
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> >
> > Two types of owners, 'sgx_encl' for VA pages and 'sgx_encl_page' for other,
> > can be stored in the union field in sgx_epc_page struct introduced in the
> > previous patch.
>
> This would be easier to follow:
>
> "Two types of owners of struct sgx_epc_page, 'sgx_encl' for VA pages and
> 'sgx_encl_page' for other pages, can be stored in the previously
> introduced union field."
>
> > When cgroup OOM support is added in a later patch, the owning enclave of a
> > page will need to be identified. Retrieving the sgx_encl struct from a
> > sgx_epc_page will be different if the page is a VA page vs. other enclave
> > pages.
> >
> > Add 2 flags which will identify the type of the owner and apply them
> > accordingly to newly allocated pages.
>
> This would be easier to follow:
>
> "OOM support for cgroups requires that the owner needs to be identified
> when selecting pages from the unreclaimable list. Address this by adding
> flags for identifying the owner type."
>
> It is better to carry the story a little bit forward than say that a
> subsequent patch will require this :-) I.e. enough to get at least a
> rough idea what is going on.

Oops, sent by mistake. I was going to say that the flag would be better
named simply as SGX_EPC_OWNER_PAGE instead of SGX_EPC_OWNER_ENCL_PAGE.

BR, Jarkko
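
As a rough illustration of why the owner-type flags are needed, the sketch
below resolves the owning enclave from a page by checking a flag first.
SGX_EPC_OWNER_PAGE is the name suggested above; the SGX_EPC_OWNER_ENCL
name, the bit values, and epc_page_to_encl() are assumptions made for this
example and may not match the patch.

```c
#include <stddef.h>

/* Assumed flag names and bit positions, for illustration only. */
#define SGX_EPC_OWNER_PAGE (1u << 0) /* owner is a struct sgx_encl_page */
#define SGX_EPC_OWNER_ENCL (1u << 1) /* owner is a struct sgx_encl (VA page) */

struct sgx_encl { int id; };
struct sgx_encl_page { struct sgx_encl *encl; };

struct sgx_epc_page {
	unsigned int flags;
	union {
		struct sgx_encl_page *encl_page;
		struct sgx_encl *encl;
	};
};

/* Hypothetical lookup: pick the union member that the flags say is valid. */
static struct sgx_encl *epc_page_to_encl(const struct sgx_epc_page *page)
{
	if (page->flags & SGX_EPC_OWNER_ENCL)
		return page->encl;
	if (page->flags & SGX_EPC_OWNER_PAGE)
		return page->encl_page->encl;
	return NULL; /* owner type unknown */
}
```

Without the flag, the reader of the union has no way to tell which member
was last written, so the OOM path could not safely find the enclave.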


* Re: [PATCH v3 03/28] x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s)
  2023-07-12 23:01 ` [PATCH v3 03/28] x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s) Haitao Huang
@ 2023-07-17 12:45   ` Jarkko Sakkinen
  2023-07-17 13:23     ` Haitao Huang
  0 siblings, 1 reply; 62+ messages in thread
From: Jarkko Sakkinen @ 2023-07-17 12:45 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, Kristen Carlson Accardi, zhiquan1.li, seanjc

On Wed Jul 12, 2023 at 11:01 PM UTC, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@linux.intel.com>
>
> Introduce a data structure to wrap the existing reclaimable list
> and its spinlock in a struct to minimize the code changes needed
> to handle multiple LRUs as well as reclaimable and non-reclaimable
> lists. The new structure will be used in a following set of patches to
> implement SGX EPC cgroups.
>
> The changes to the structure needed for unreclaimable lists will be
> added in later patches.
>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
>
> V3:
> Removed the helper functions and revised commit messages
> ---
>  arch/x86/kernel/cpu/sgx/sgx.h | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index f6e3c5810eef..77fceba73a25 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -92,6 +92,23 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
>  	return section->virt_addr + index * PAGE_SIZE;
>  }
>  
> +/*
> + * This data structure wraps a list of reclaimable EPC pages, and a list of
> + * non-reclaimable EPC pages and is used to implement a LRU policy during
> + * reclamation.
> + */
> +struct sgx_epc_lru_lists {
> +	/* Must acquire this lock to access */
> +	spinlock_t lock;

Isn't this self-explanatory, why the inline comment?

> +	struct list_head reclaimable;
> +};
> +
> +static inline void sgx_lru_init(struct sgx_epc_lru_lists *lrus)
> +{
> +	spin_lock_init(&lrus->lock);
> +	INIT_LIST_HEAD(&lrus->reclaimable);
> +}
> +
>  struct sgx_epc_page *__sgx_alloc_epc_page(void);
>  void sgx_free_epc_page(struct sgx_epc_page *page);
>  
> -- 
> 2.25.1


BR, Jarkko
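
For readers outside the kernel tree, the wrapper introduced by this patch
can be approximated in userspace as below. This is only a sketch: the
list helpers are simplified re-implementations, a C11 atomic_flag stands
in for the kernel spinlock, and lru_push()/lru_pop_oldest() are
hypothetical names that do not appear in the patch.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = n;
}

/* One lock protecting one list of reclaimable pages. */
struct epc_lru_lists {
	atomic_flag lock; /* stand-in for spinlock_t */
	struct list_head reclaimable;
};

static void lru_init(struct epc_lru_lists *lrus)
{
	atomic_flag_clear(&lrus->lock);
	list_init(&lrus->reclaimable);
}

static void lru_lock(struct epc_lru_lists *lrus)
{
	while (atomic_flag_test_and_set_explicit(&lrus->lock, memory_order_acquire))
		; /* spin until the holder clears the flag */
}

static void lru_unlock(struct epc_lru_lists *lrus)
{
	atomic_flag_clear_explicit(&lrus->lock, memory_order_release);
}

/* Newest pages go to the tail, so the head is the oldest entry. */
static void lru_push(struct epc_lru_lists *lrus, struct list_head *node)
{
	lru_lock(lrus);
	list_add_tail(node, &lrus->reclaimable);
	lru_unlock(lrus);
}

/* Remove and return the oldest entry, or NULL if the list is empty. */
static struct list_head *lru_pop_oldest(struct epc_lru_lists *lrus)
{
	struct list_head *node = NULL;

	lru_lock(lrus);
	if (lrus->reclaimable.next != &lrus->reclaimable) {
		node = lrus->reclaimable.next;
		list_del(node);
	}
	lru_unlock(lrus);
	return node;
}
```

Bundling the lock with the list is what lets later code instantiate one
such structure per cgroup instead of sharing a single global list.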


* Re: [PATCH v3 04/28] x86/sgx: Use sgx_epc_lru_lists for existing active page list
  2023-07-12 23:01 ` [PATCH v3 04/28] x86/sgx: Use sgx_epc_lru_lists for existing active page list Haitao Huang
@ 2023-07-17 12:47   ` Jarkko Sakkinen
  2023-07-31 20:43     ` Haitao Huang
  0 siblings, 1 reply; 62+ messages in thread
From: Jarkko Sakkinen @ 2023-07-17 12:47 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, Kristen Carlson Accardi, zhiquan1.li, seanjc

On Wed Jul 12, 2023 at 11:01 PM UTC, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@linux.intel.com>
>
> Replace the existing sgx_active_page_list and its spinlock with
> a global sgx_epc_lru_lists struct.

As with the previous patch, I would extend this story a tiny
bit forward to show the connection with the follow-up patches.

BR, Jarkko


* Re: [PATCH v3 17/28] x86/sgx: fix a NULL pointer
  2023-07-12 23:01 ` [PATCH v3 17/28] x86/sgx: fix a NULL pointer Haitao Huang
@ 2023-07-17 12:48   ` Jarkko Sakkinen
  2023-07-17 12:49     ` Jarkko Sakkinen
  2023-07-17 15:49     ` Dave Hansen
  0 siblings, 2 replies; 62+ messages in thread
From: Jarkko Sakkinen @ 2023-07-17 12:48 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, zhiquan1.li, kristen, seanjc

On Wed Jul 12, 2023 at 11:01 PM UTC, Haitao Huang wrote:
> Under heavy load, the SGX EPC reclaimers (ksgxd or future EPC cgroup
> worker) may reclaim SECS EPC page for an enclave and set
> encl->secs.epc_page to NULL. But the SECS EPC page is required for EAUG
> in #PF handler and is used without checking for NULL and reloading.
>
> Fix this by checking if SECS is loaded before EAUG and load it if it was
> reclaimed.
>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>

A bug fix should be 1/*.

BR, Jarkko



* Re: [PATCH v3 17/28] x86/sgx: fix a NULL pointer
  2023-07-17 12:48   ` Jarkko Sakkinen
@ 2023-07-17 12:49     ` Jarkko Sakkinen
  2023-07-17 13:14       ` Haitao Huang
  2023-07-17 15:49     ` Dave Hansen
  1 sibling, 1 reply; 62+ messages in thread
From: Jarkko Sakkinen @ 2023-07-17 12:49 UTC (permalink / raw)
  To: Jarkko Sakkinen, Haitao Huang, dave.hansen, tj, linux-kernel,
	linux-sgx, cgroups, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin
  Cc: kai.huang, reinette.chatre, zhiquan1.li, kristen, seanjc

On Mon Jul 17, 2023 at 12:48 PM UTC, Jarkko Sakkinen wrote:
> On Wed Jul 12, 2023 at 11:01 PM UTC, Haitao Huang wrote:
> > Under heavy load, the SGX EPC reclaimers (ksgxd or future EPC cgroup
> > worker) may reclaim SECS EPC page for an enclave and set
> > encl->secs.epc_page to NULL. But the SECS EPC page is required for EAUG
> > in #PF handler and is used without checking for NULL and reloading.
> >
> > Fix this by checking if SECS is loaded before EAUG and load it if it was
> > reclaimed.
> >
> > Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
>
> A bug fix should be 1/*.

And a fixes tag.

Or is there a bug that is momentized by the earlier patches? This patch
feels confusing to say the least.

BR, Jarkko


* Re: [PATCH v3 17/28] x86/sgx: fix a NULL pointer
  2023-07-17 12:49     ` Jarkko Sakkinen
@ 2023-07-17 13:14       ` Haitao Huang
  2023-07-17 14:33         ` Jarkko Sakkinen
  0 siblings, 1 reply; 62+ messages in thread
From: Haitao Huang @ 2023-07-17 13:14 UTC (permalink / raw)
  To: Jarkko Sakkinen, dave.hansen, tj, linux-kernel, linux-sgx,
	cgroups, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, zhiquan1.li, kristen, seanjc

On Mon, 17 Jul 2023 07:49:27 -0500, Jarkko Sakkinen <jarkko@kernel.org>  
wrote:

> On Mon Jul 17, 2023 at 12:48 PM UTC, Jarkko Sakkinen wrote:
>> On Wed Jul 12, 2023 at 11:01 PM UTC, Haitao Huang wrote:
>> > Under heavy load, the SGX EPC reclaimers (ksgxd or future EPC cgroup
>> > worker) may reclaim SECS EPC page for an enclave and set
>> > encl->secs.epc_page to NULL. But the SECS EPC page is required for  
>> EAUG
>> > in #PF handler and is used without checking for NULL and reloading.
>> >
>> > Fix this by checking if SECS is loaded before EAUG and load it if it  
>> was
>> > reclaimed.
>> >
>> > Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
>>
>> A bug fix should be 1/*.
>
> And a fixes tag.
>
> Or is there a bug that is momentized by the earlier patches? This patch
> feels confusing to say the least.
>

It happens in heavy reclaiming cases, just extremely rare when EPC  
accounting is not partitioned into cgroups. Will add fix tag with the  
related EDMM patch. And move this as the first patch.

Thanks
Haitao
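
The shape of the fix described in the commit message (check whether SECS
is loaded before EAUG, reload it if it was reclaimed) can be sketched as
follows. Everything here is a simplified userspace model: the struct
fields, reload_secs(), and augment_page() are hypothetical and do not
reflect the actual driver code.

```c
#include <stddef.h>

struct epc_page { int in_use; };

struct encl {
	struct epc_page *secs_epc_page; /* NULL once the reclaimer evicted SECS */
	int secs_reloads;               /* counts reloads, for illustration only */
};

static struct epc_page secs_storage;

/* Hypothetical stand-in for reloading the SECS page from backing storage. */
static struct epc_page *reload_secs(struct encl *encl)
{
	encl->secs_reloads++;
	secs_storage.in_use = 1;
	return &secs_storage;
}

/* Returns 0 on success, -1 if SECS could not be brought back. */
static int augment_page(struct encl *encl)
{
	if (!encl->secs_epc_page) {
		struct epc_page *secs = reload_secs(encl);

		if (!secs)
			return -1;
		encl->secs_epc_page = secs;
	}

	/* SECS is non-NULL here; the EAUG step would be issued at this point. */
	return 0;
}
```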



* Re: [PATCH v3 18/28] cgroup/misc: Fix an overflow
  2023-07-12 23:01 ` [PATCH v3 18/28] cgroup/misc: Fix an overflow Haitao Huang
@ 2023-07-17 13:15   ` Jarkko Sakkinen
  0 siblings, 0 replies; 62+ messages in thread
From: Jarkko Sakkinen @ 2023-07-17 13:15 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Zefan Li, Johannes Weiner
  Cc: kai.huang, reinette.chatre, zhiquan1.li, kristen

On Wed Jul 12, 2023 at 11:01 PM UTC, Haitao Huang wrote:
> Overflow may happen in misc_cg_try_charge if new_usage becomes above
> INT_MAX, for example, on platforms with large SGX EPC sizes.
>
> Change type of new_usage to long from int and check overflow.
>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>

Like the bug fixes, this is also preparatory work that the SGX cgroups
patches should build on top of. Therefore, it should be at the very
beginning, right after any bug fixes to the existing code.

BR, Jarkko
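
To make the overflow concrete, here is a small sketch of a charge helper
that tracks usage in a long and rejects a charge that would overflow. It
is not the kernel's misc_cg_try_charge(); the function name, signature,
and checks are assumptions for illustration, and the example relies on a
64-bit long as on x86-64 Linux.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Illustrative charge helper. Usage is kept in a long so that totals
 * above INT_MAX are representable (assumes a 64-bit long).
 */
static bool try_charge(long *usage, long max, unsigned long amount)
{
	long new_usage;

	/* Reject amounts that would push the signed total past LONG_MAX. */
	if (*usage < 0 || amount > (unsigned long)(LONG_MAX - *usage))
		return false;

	new_usage = *usage + (long)amount;
	if (new_usage > max)
		return false;

	*usage = new_usage;
	return true;
}
```

With an int accumulator, charging one more unit at INT_MAX would wrap to
a negative value and could slip past a "new_usage > max" comparison.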




* Re: [PATCH v3 19/28] cgroup/misc: Add per resource callbacks for CSS events
  2023-07-12 23:01 ` [PATCH v3 19/28] cgroup/misc: Add per resource callbacks for CSS events Haitao Huang
@ 2023-07-17 13:16   ` Jarkko Sakkinen
  0 siblings, 0 replies; 62+ messages in thread
From: Jarkko Sakkinen @ 2023-07-17 13:16 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Zefan Li, Johannes Weiner
  Cc: kai.huang, reinette.chatre, Kristen Carlson Accardi, zhiquan1.li

On Wed Jul 12, 2023 at 11:01 PM UTC, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@linux.intel.com>
>
> Consumers of the misc cgroup controller might need to perform separate
> actions for Cgroups Subsystem State(CSS) events: cgroup alloc and free.
> In addition, writes to the max value may also need separate action. Add
> the ability to allow downstream users to setup callbacks for these
> operations, and call the corresponding per-resource-type callback when
> appropriate.
>
> This code will be utilized by the SGX driver in a future patch.
>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>

Ditto. Belongs to the head of the patch set.

BR, Jarkko
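
The callback mechanism described in the commit message (per-resource hooks
for cgroup alloc, free, and writes to max) could look roughly like the
sketch below. All names here (struct res_callbacks, cg_alloc(),
cg_write_max(), the epc_* consumer) are hypothetical and chosen for the
example; the real interface is defined by the patch itself.

```c
#include <stddef.h>

enum res_type { RES_SGX_EPC, RES_TYPES };

struct misc_cg {
	unsigned long max[RES_TYPES];
};

/* Hypothetical per-resource callback table; any entry may be NULL. */
struct res_callbacks {
	int (*on_alloc)(struct misc_cg *cg);
	void (*on_free)(struct misc_cg *cg);
	int (*on_max_write)(struct misc_cg *cg, unsigned long max);
};

static const struct res_callbacks *res_cbs[RES_TYPES];

/* Called when a cgroup is created: give each resource a chance to set up. */
static int cg_alloc(struct misc_cg *cg)
{
	for (int i = 0; i < RES_TYPES; i++) {
		if (res_cbs[i] && res_cbs[i]->on_alloc) {
			int ret = res_cbs[i]->on_alloc(cg);

			if (ret)
				return ret;
		}
	}
	return 0;
}

/* Called when a cgroup is destroyed. */
static void cg_free(struct misc_cg *cg)
{
	for (int i = 0; i < RES_TYPES; i++)
		if (res_cbs[i] && res_cbs[i]->on_free)
			res_cbs[i]->on_free(cg);
}

/* Store the new limit, then notify the resource owner if it registered. */
static int cg_write_max(struct misc_cg *cg, enum res_type type, unsigned long max)
{
	cg->max[type] = max;
	if (res_cbs[type] && res_cbs[type]->on_max_write)
		return res_cbs[type]->on_max_write(cg, max);
	return 0;
}

/* Example consumer that only records what it was told. */
static int epc_allocs, epc_frees;
static unsigned long epc_last_max;

static int epc_alloc(struct misc_cg *cg) { (void)cg; epc_allocs++; return 0; }
static void epc_free(struct misc_cg *cg) { (void)cg; epc_frees++; }
static int epc_max_write(struct misc_cg *cg, unsigned long max)
{
	(void)cg;
	epc_last_max = max;
	return 0;
}

static const struct res_callbacks epc_cbs = {
	.on_alloc = epc_alloc,
	.on_free = epc_free,
	.on_max_write = epc_max_write,
};
```

A consumer would register its table once (here by assigning
res_cbs[RES_SGX_EPC] = &epc_cbs) and the generic controller code then
stays unaware of what the resource does in response to each event.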


* Re: [PATCH v3 03/28] x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s)
  2023-07-17 12:45   ` Jarkko Sakkinen
@ 2023-07-17 13:23     ` Haitao Huang
  2023-07-17 14:39       ` Jarkko Sakkinen
  2023-07-24 10:04       ` Huang, Kai
  0 siblings, 2 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-17 13:23 UTC (permalink / raw)
  To: dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Jarkko Sakkinen
  Cc: kai.huang, reinette.chatre, Kristen Carlson Accardi, zhiquan1.li, seanjc

On Mon, 17 Jul 2023 07:45:36 -0500, Jarkko Sakkinen <jarkko@kernel.org>  
wrote:

> On Wed Jul 12, 2023 at 11:01 PM UTC, Haitao Huang wrote:
>> From: Kristen Carlson Accardi <kristen@linux.intel.com>
>>
>> Introduce a data structure to wrap the existing reclaimable list
>> and its spinlock in a struct to minimize the code changes needed
>> to handle multiple LRUs as well as reclaimable and non-reclaimable
>> lists. The new structure will be used in a following set of patches to
>> implement SGX EPC cgroups.
>>
>> The changes to the structure needed for unreclaimable lists will be
>> added in later patches.
>>
>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
>> Cc: Sean Christopherson <seanjc@google.com>
>>
>> V3:
>> Removed the helper functions and revised commit messages
>> ---
>>  arch/x86/kernel/cpu/sgx/sgx.h | 17 +++++++++++++++++
>>  1 file changed, 17 insertions(+)
>>
>> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h  
>> b/arch/x86/kernel/cpu/sgx/sgx.h
>> index f6e3c5810eef..77fceba73a25 100644
>> --- a/arch/x86/kernel/cpu/sgx/sgx.h
>> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
>> @@ -92,6 +92,23 @@ static inline void *sgx_get_epc_virt_addr(struct  
>> sgx_epc_page *page)
>>  	return section->virt_addr + index * PAGE_SIZE;
>>  }
>>
>> +/*
>> + * This data structure wraps a list of reclaimable EPC pages, and a  
>> list of
>> + * non-reclaimable EPC pages and is used to implement a LRU policy  
>> during
>> + * reclamation.
>> + */
>> +struct sgx_epc_lru_lists {
>> +	/* Must acquire this lock to access */
>> +	spinlock_t lock;
>
> Isn't this self-explanatory, why the inline comment?

I got a warning from the checkpatch script complaining this lock needs  
comments.

Thanks
Haitao


* Re: [PATCH v3 17/28] x86/sgx: fix a NULL pointer
  2023-07-17 13:14       ` Haitao Huang
@ 2023-07-17 14:33         ` Jarkko Sakkinen
  0 siblings, 0 replies; 62+ messages in thread
From: Jarkko Sakkinen @ 2023-07-17 14:33 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, zhiquan1.li, kristen, seanjc

On Mon Jul 17, 2023 at 1:14 PM UTC, Haitao Huang wrote:
> On Mon, 17 Jul 2023 07:49:27 -0500, Jarkko Sakkinen <jarkko@kernel.org>  
> wrote:
>
> > On Mon Jul 17, 2023 at 12:48 PM UTC, Jarkko Sakkinen wrote:
> >> On Wed Jul 12, 2023 at 11:01 PM UTC, Haitao Huang wrote:
> >> > Under heavy load, the SGX EPC reclaimers (ksgxd or future EPC cgroup
> >> > worker) may reclaim SECS EPC page for an enclave and set
> >> > encl->secs.epc_page to NULL. But the SECS EPC page is required for  
> >> EAUG
> >> > in #PF handler and is used without checking for NULL and reloading.
> >> >
> >> > Fix this by checking if SECS is loaded before EAUG and load it if it  
> >> was
> >> > reclaimed.
> >> >
> >> > Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> >>
> >> A bug fix should be 1/*.
> >
> > And a fixes tag.
> >
> > Or is there a bug that is momentized by the earlier patches? This patch
> > feels confusing to say the least.
> >
>
> It happens in heavy reclaiming cases, just extremely rare when EPC  
> accounting is not partitioned into cgroups. Will add fix tag with the  
> related EDMM patch. And move this as the first patch.

I understand, it is just a good practice to follow, i.e. have prelude
and then the "real" changes :-)

BR, Jarkko


* Re: [PATCH v3 03/28] x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s)
  2023-07-17 13:23     ` Haitao Huang
@ 2023-07-17 14:39       ` Jarkko Sakkinen
  2023-07-24 10:04       ` Huang, Kai
  1 sibling, 0 replies; 62+ messages in thread
From: Jarkko Sakkinen @ 2023-07-17 14:39 UTC (permalink / raw)
  To: Haitao Huang, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin
  Cc: kai.huang, reinette.chatre, Kristen Carlson Accardi, zhiquan1.li, seanjc

On Mon Jul 17, 2023 at 1:23 PM UTC, Haitao Huang wrote:
> On Mon, 17 Jul 2023 07:45:36 -0500, Jarkko Sakkinen <jarkko@kernel.org>  
> wrote:
>
> > On Wed Jul 12, 2023 at 11:01 PM UTC, Haitao Huang wrote:
> >> From: Kristen Carlson Accardi <kristen@linux.intel.com>
> >>
> >> Introduce a data structure to wrap the existing reclaimable list
> >> and its spinlock in a struct to minimize the code changes needed
> >> to handle multiple LRUs as well as reclaimable and non-reclaimable
> >> lists. The new structure will be used in a following set of patches to
> >> implement SGX EPC cgroups.
> >>
> >> The changes to the structure needed for unreclaimable lists will be
> >> added in later patches.
> >>
> >> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> >> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> >> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> >> Cc: Sean Christopherson <seanjc@google.com>
> >>
> >> V3:
> >> Removed the helper functions and revised commit messages
> >> ---
> >>  arch/x86/kernel/cpu/sgx/sgx.h | 17 +++++++++++++++++
> >>  1 file changed, 17 insertions(+)
> >>
> >> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h  
> >> b/arch/x86/kernel/cpu/sgx/sgx.h
> >> index f6e3c5810eef..77fceba73a25 100644
> >> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> >> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> >> @@ -92,6 +92,23 @@ static inline void *sgx_get_epc_virt_addr(struct  
> >> sgx_epc_page *page)
> >>  	return section->virt_addr + index * PAGE_SIZE;
> >>  }
> >>
> >> +/*
> >> + * This data structure wraps a list of reclaimable EPC pages, and a  
> >> list of
> >> + * non-reclaimable EPC pages and is used to implement a LRU policy  
> >> during
> >> + * reclamation.
> >> + */
> >> +struct sgx_epc_lru_lists {
> >> +	/* Must acquire this lock to access */
> >> +	spinlock_t lock;
> >
> > Isn't this self-explanatory, why the inline comment?
>
> I got a warning from the checkpatch script complaining this lock needs  
> comments.

OK, cool, not a big deal.

BR, Jarkko


* Re: [PATCH v3 17/28] x86/sgx: fix a NULL pointer
  2023-07-17 12:48   ` Jarkko Sakkinen
  2023-07-17 12:49     ` Jarkko Sakkinen
@ 2023-07-17 15:49     ` Dave Hansen
  2023-07-17 18:49       ` Haitao Huang
  2023-07-17 18:52       ` Jarkko Sakkinen
  1 sibling, 2 replies; 62+ messages in thread
From: Dave Hansen @ 2023-07-17 15:49 UTC (permalink / raw)
  To: Jarkko Sakkinen, Haitao Huang, dave.hansen, tj, linux-kernel,
	linux-sgx, cgroups, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin
  Cc: kai.huang, reinette.chatre, zhiquan1.li, kristen, seanjc

On 7/17/23 05:48, Jarkko Sakkinen wrote:
> On Wed Jul 12, 2023 at 11:01 PM UTC, Haitao Huang wrote:
>> Under heavy load, the SGX EPC reclaimers (ksgxd or future EPC cgroup
>> worker) may reclaim SECS EPC page for an enclave and set
>> encl->secs.epc_page to NULL. But the SECS EPC page is required for EAUG
>> in #PF handler and is used without checking for NULL and reloading.
>>
>> Fix this by checking if SECS is loaded before EAUG and load it if it was
>> reclaimed.
>>
>> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> A bug fix should be 1/*.

No, bug fixes should not even be _part_ of another series.  Send bug
fixes separately, please.


* Re: [PATCH v3 17/28] x86/sgx: fix a NULL pointer
  2023-07-17 15:49     ` Dave Hansen
@ 2023-07-17 18:49       ` Haitao Huang
  2023-07-17 18:52       ` Jarkko Sakkinen
  1 sibling, 0 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-17 18:49 UTC (permalink / raw)
  To: Jarkko Sakkinen, dave.hansen, tj, linux-kernel, linux-sgx,
	cgroups, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Dave Hansen
  Cc: kai.huang, reinette.chatre, zhiquan1.li, kristen, seanjc

On Mon, 17 Jul 2023 10:49:03 -0500, Dave Hansen <dave.hansen@intel.com>  
wrote:

> On 7/17/23 05:48, Jarkko Sakkinen wrote:
>> On Wed Jul 12, 2023 at 11:01 PM UTC, Haitao Huang wrote:
>>> Under heavy load, the SGX EPC reclaimers (ksgxd or future EPC cgroup
>>> worker) may reclaim SECS EPC page for an enclave and set
>>> encl->secs.epc_page to NULL. But the SECS EPC page is required for EAUG
>>> in #PF handler and is used without checking for NULL and reloading.
>>>
>>> Fix this by checking if SECS is loaded before EAUG and load it if it  
>>> was
>>> reclaimed.
>>>
>>> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
>> A bug fix should be 1/*.
>
> No, bug fixes should not even be _part_ of another series.  Send bug
> fixes separately, please.


I have now sent the two bug fixes separately. Do you want me to resend this
series without them?
Thanks
Haitao


* Re: [PATCH v3 17/28] x86/sgx: fix a NULL pointer
  2023-07-17 15:49     ` Dave Hansen
  2023-07-17 18:49       ` Haitao Huang
@ 2023-07-17 18:52       ` Jarkko Sakkinen
  1 sibling, 0 replies; 62+ messages in thread
From: Jarkko Sakkinen @ 2023-07-17 18:52 UTC (permalink / raw)
  To: Dave Hansen, Haitao Huang, dave.hansen, tj, linux-kernel,
	linux-sgx, cgroups, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin
  Cc: kai.huang, reinette.chatre, zhiquan1.li, kristen, seanjc

On Mon Jul 17, 2023 at 3:49 PM UTC, Dave Hansen wrote:
> On 7/17/23 05:48, Jarkko Sakkinen wrote:
> > On Wed Jul 12, 2023 at 11:01 PM UTC, Haitao Huang wrote:
> >> Under heavy load, the SGX EPC reclaimers (ksgxd or future EPC cgroup
> >> worker) may reclaim SECS EPC page for an enclave and set
> >> encl->secs.epc_page to NULL. But the SECS EPC page is required for EAUG
> >> in #PF handler and is used without checking for NULL and reloading.
> >>
> >> Fix this by checking if SECS is loaded before EAUG and load it if it was
> >> reclaimed.
> >>
> >> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> > A bug fix should be 1/*.
>
> No, bug fixes should not even be _part_ of another series.  Send bug
> fixes separately, please.

Yes, that would be of course a better option.

BR, Jarkko


* Re: [PATCH v3 03/28] x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s)
  2023-07-17 13:23     ` Haitao Huang
  2023-07-17 14:39       ` Jarkko Sakkinen
@ 2023-07-24 10:04       ` Huang, Kai
  2023-07-24 14:55         ` Haitao Huang
  1 sibling, 1 reply; 62+ messages in thread
From: Huang, Kai @ 2023-07-24 10:04 UTC (permalink / raw)
  To: hpa, linux-sgx, jarkko, dave.hansen, cgroups, bp, linux-kernel,
	tglx, x86, haitao.huang, tj, mingo
  Cc: kristen, Chatre, Reinette, Li, Zhiquan1, Christopherson,, Sean

On Mon, 2023-07-17 at 08:23 -0500, Haitao Huang wrote:
> On Mon, 17 Jul 2023 07:45:36 -0500, Jarkko Sakkinen <jarkko@kernel.org>  
> wrote:
> 
> > On Wed Jul 12, 2023 at 11:01 PM UTC, Haitao Huang wrote:
> > > From: Kristen Carlson Accardi <kristen@linux.intel.com>
> > > 
> > > Introduce a data structure to wrap the existing reclaimable list
> > > and its spinlock in a struct to minimize the code changes needed
> > > to handle multiple LRUs as well as reclaimable and non-reclaimable
> > > lists. The new structure will be used in a following set of patches to
> > > implement SGX EPC cgroups.

Although briefly mentioned in the first patch, it would be better to put more
background about the "reclaimable" and "non-reclaimable" thing here, focusing on
_why_ we need multiple LRUs (presumably you mean two lists: reclaimable and non-
reclaimable).

> > > 
> > > The changes to the structure needed for unreclaimable lists will be
> > > added in later patches.
> > > 
> > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > > Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> > > Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> > > Cc: Sean Christopherson <seanjc@google.com>
> > > 
> > > V3:
> > > Removed the helper functions and revised commit messages

Please put change history into:

---
  change history
---

So it can be stripped away when applying the patch.

> > > ---
> > >  arch/x86/kernel/cpu/sgx/sgx.h | 17 +++++++++++++++++
> > >  1 file changed, 17 insertions(+)
> > > 
> > > diff --git a/arch/x86/kernel/cpu/sgx/sgx.h  
> > > b/arch/x86/kernel/cpu/sgx/sgx.h
> > > index f6e3c5810eef..77fceba73a25 100644
> > > --- a/arch/x86/kernel/cpu/sgx/sgx.h
> > > +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> > > @@ -92,6 +92,23 @@ static inline void *sgx_get_epc_virt_addr(struct  
> > > sgx_epc_page *page)
> > >  	return section->virt_addr + index * PAGE_SIZE;
> > >  }
> > > 
> > > +/*
> > > + * This data structure wraps a list of reclaimable EPC pages, and a  
> > > list of
> > > + * non-reclaimable EPC pages and is used to implement a LRU policy  
> > > during
> > > + * reclamation.
> > > + */

I'd prefer not to mention the "non-reclaimable" thing in this patch, but defer
it to the one that actually introduces the "non-reclaimable" list.  Actually, I
don't think we even need this comment, given you have this in the structure:

	struct list_head reclaimable;

Which already explains the "reclaimable" list.  I suppose the non-reclaimable
list would be named similarly and thus need no comment either.

Also, I am wondering why you need to split this out as a separate patch.  It
basically does nothing.  To me you should just merge this into the next patch,
which actually does what you claimed in the changelog:

	Introduce a data structure to wrap the existing reclaimable list and 
	its spinlock ...

Then this can be an infrastructure change patch, which doesn't bring any
functional change, to support the non-reclaimable list.


> > > +struct sgx_epc_lru_lists {
> > > +	/* Must acquire this lock to access */
> > > +	spinlock_t lock;
> > 
> > Isn't this self-explanatory, why the inline comment?
> 
> I got a warning from the checkpatch script complaining this lock needs  
> comments.

I suspected this, so I applied this patch, removed the comment, generated a new
patch, and ran checkpatch.pl on it.  It didn't report any warning/error in my
testing.

Are you sure you got a warning?

* Re: [PATCH v3 03/28] x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s)
  2023-07-24 10:04       ` Huang, Kai
@ 2023-07-24 14:55         ` Haitao Huang
  2023-07-24 23:31           ` Huang, Kai
  0 siblings, 1 reply; 62+ messages in thread
From: Haitao Huang @ 2023-07-24 14:55 UTC (permalink / raw)
  To: hpa, linux-sgx, jarkko, dave.hansen, cgroups, bp, linux-kernel,
	tglx, x86, tj, mingo, Huang, Kai
  Cc: kristen, Chatre, Reinette, Li, Zhiquan1, Christopherson,, Sean

Hi Kai
On Mon, 24 Jul 2023 05:04:48 -0500, Huang, Kai <kai.huang@intel.com> wrote:

> On Mon, 2023-07-17 at 08:23 -0500, Haitao Huang wrote:
>> On Mon, 17 Jul 2023 07:45:36 -0500, Jarkko Sakkinen <jarkko@kernel.org>
>> wrote:
>>
>> > On Wed Jul 12, 2023 at 11:01 PM UTC, Haitao Huang wrote:
>> > > From: Kristen Carlson Accardi <kristen@linux.intel.com>
>> > >
>> > > Introduce a data structure to wrap the existing reclaimable list
>> > > and its spinlock in a struct to minimize the code changes needed
>> > > to handle multiple LRUs as well as reclaimable and non-reclaimable
>> > > lists. The new structure will be used in a following set of patches  
>> to
>> > > implement SGX EPC cgroups.
>
> Although briefly mentioned in the first patch, it would be better to put  
> more
> background about the "reclaimable" and "non-reclaimable" thing here,  
> focusing on
> _why_ we need multiple LRUs (presumably you mean two lists: reclaimable  
> and non-
> reclaimable).
>
Sure, I can add a little more background to introduce the
reclaimable/unreclaimable concept. But why we need multiple LRUs will be
self-evident in later patches, so I'm not sure I will add details here.

>> > >
>> > > The changes to the structure needed for unreclaimable lists will be
>> > > added in later patches.
>> > >
>> > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> > > Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> > > Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
>> > > Cc: Sean Christopherson <seanjc@google.com>
>> > >
>> > > V3:
>> > > Removed the helper functions and revised commit messages
>
> Please put change history into:
>
> ---
>   change history
> ---
>
> So it can be stripped away when applying the patch.
>
Will do that.

>> > > ---
>> > >  arch/x86/kernel/cpu/sgx/sgx.h | 17 +++++++++++++++++
>> > >  1 file changed, 17 insertions(+)
>> > >
>> > > diff --git a/arch/x86/kernel/cpu/sgx/sgx.h
>> > > b/arch/x86/kernel/cpu/sgx/sgx.h
>> > > index f6e3c5810eef..77fceba73a25 100644
>> > > --- a/arch/x86/kernel/cpu/sgx/sgx.h
>> > > +++ b/arch/x86/kernel/cpu/sgx/sgx.h
>> > > @@ -92,6 +92,23 @@ static inline void *sgx_get_epc_virt_addr(struct
>> > > sgx_epc_page *page)
>> > >  	return section->virt_addr + index * PAGE_SIZE;
>> > >  }
>> > >
>> > > +/*
>> > > + * This data structure wraps a list of reclaimable EPC pages, and a
>> > > list of
>> > > + * non-reclaimable EPC pages and is used to implement a LRU policy
>> > > during
>> > > + * reclamation.
>> > > + */
>
> I'd prefer to not mention the "non-reclaimable" thing in this patch, but  
> defer
> to the one actually introduces the "non-reclaimable" list.  Actually, I  
> don't
> think we even need this comment, given you have this in the structure:
>
> 	struct list_head reclaimable;
>

Agreed.

> Which already explains the "reclaimable" list.  I suppose the  
> non-reclaimable
> list would be named similarly thus need no comment either.
>
> Also, I am wondering why you need to split this out as a separate  
> patch.  It
> basically does nothing.  To me you should just merge this to the next  
> patch,

I think Kristen split the original patch based on Dave's comments:

https://lore.kernel.org/all/e71d76b2-4368-4627-abd4-2163e6786a20@intel.com/

> which actually does what you claimed in the changelog:
>
> 	Introduce a data structure to wrap the existing reclaimable list and 
> 	its spinlock ...
>
> Then this can be an infrastructure change patch, which doesn't bring any
> functional change, to support the non-reclaimable list.
>
>
>> > > +struct sgx_epc_lru_lists {
>> > > +	/* Must acquire this lock to access */
>> > > +	spinlock_t lock;
>> >
>> > Isn't this self-explanatory, why the inline comment?
>>
>> I got a warning from the checkpatch script complaining this lock needs
>> comments.
>
> I suspected this, so I applied this patch, removed the comment,  
> generated a new
> patch, and run checkpatch.pl for it.  It didn't report any warning/error  
> in my
> testing.
>
> Are you sure you got a warning?

I reran it, and what I got is actually a "CHECK":

$ ./scripts/checkpatch.pl --strict 0001-x86-sgx-Add-struct-sgx_epc_lru_lists-to-encapsulate-.patch
CHECK: spinlock_t definition without comment
#41: FILE: arch/x86/kernel/cpu/sgx/sgx.h:101:
+       spinlock_t lock;

total: 0 errors, 0 warnings, 1 checks, 22 lines checked

Thanks
Haitao

* Re: [PATCH v3 00/28] Add Cgroup support for SGX EPC memory
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (28 preceding siblings ...)
  2023-07-17 11:02 ` [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Jarkko Sakkinen
@ 2023-07-24 19:09 ` Sohil Mehta
  2023-07-25 17:16   ` Haitao Huang
  2023-08-17 15:04 ` Mikko Ylinen
  30 siblings, 1 reply; 62+ messages in thread
From: Sohil Mehta @ 2023-07-24 19:09 UTC (permalink / raw)
  To: Haitao Huang, jarkko, dave.hansen, tj, linux-kernel, linux-sgx,
	cgroups, X86-kernel
  Cc: kai.huang, reinette.chatre, zhiquan1.li, kristen, seanjc, zhanb,
	anakrish, mikko.ylinen

Hi Haitao,

On 7/12/2023 4:01 PM, Haitao Huang wrote:

> I appreciate your comments and feedback.
> 

Nit: You missed emailing the cover letter to x86@kernel.org. I think a
few other people included in the individual patches are also missing from
the cover letter.

In general, it might be useful to keep the email list consistent across
the cover letter and the individual patches.

-Sohil

* Re: [PATCH v3 03/28] x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s)
  2023-07-24 14:55         ` Haitao Huang
@ 2023-07-24 23:31           ` Huang, Kai
  2023-07-31 20:35             ` Haitao Huang
  0 siblings, 1 reply; 62+ messages in thread
From: Huang, Kai @ 2023-07-24 23:31 UTC (permalink / raw)
  To: mingo, jarkko, dave.hansen, bp, cgroups, hpa, linux-kernel,
	linux-sgx, tglx, haitao.huang, tj, x86
  Cc: kristen, Chatre, Reinette, Li, Zhiquan1, Christopherson,, Sean

On Mon, 2023-07-24 at 09:55 -0500, Haitao Huang wrote:
> Hi Kai
> On Mon, 24 Jul 2023 05:04:48 -0500, Huang, Kai <kai.huang@intel.com> wrote:
> 
> > On Mon, 2023-07-17 at 08:23 -0500, Haitao Huang wrote:
> > > On Mon, 17 Jul 2023 07:45:36 -0500, Jarkko Sakkinen <jarkko@kernel.org>
> > > wrote:
> > > 
> > > > On Wed Jul 12, 2023 at 11:01 PM UTC, Haitao Huang wrote:
> > > > > From: Kristen Carlson Accardi <kristen@linux.intel.com>
> > > > > 
> > > > > Introduce a data structure to wrap the existing reclaimable list
> > > > > and its spinlock in a struct to minimize the code changes needed
> > > > > to handle multiple LRUs as well as reclaimable and non-reclaimable
> > > > > lists. The new structure will be used in a following set of patches  
> > > to
> > > > > implement SGX EPC cgroups.
> > 
> > Although briefly mentioned in the first patch, it would be better to put  
> > more
> > background about the "reclaimable" and "non-reclaimable" thing here,  
> > focusing on
> > _why_ we need multiple LRUs (presumably you mean two lists: reclaimable  
> > and non-
> > reclaimable).
> > 
> Sure I can add a little more background to introduce the  
> reclaimable/unreclaimable concept. But why we need multiple LRUs would be  
> self-evident in later patches, not sure I will add details here.

In this case people will need to go to that patch to get some idea first.  It
doesn't seem to hurt to explain why you need multiple LRUs here first.

[...]

> > > > > +struct sgx_epc_lru_lists {
> > > > > +	/* Must acquire this lock to access */
> > > > > +	spinlock_t lock;
> > > > 
> > > > Isn't this self-explanatory, why the inline comment?
> > > 
> > > I got a warning from the checkpatch script complaining this lock needs
> > > comments.
> > 
> > I suspected this, so I applied this patch, removed the comment,  
> > generated a new
> > patch, and run checkpatch.pl for it.  It didn't report any warning/error  
> > in my
> > testing.
> > 
> > Are you sure you got a warning?
> 
> I did a reran and it's actually a "CHECK" I got:
> 
> $ ./scripts/checkpatch.pl --strict 0001-x86-sgx-Add-struct-sgx_epc_lru_lists-to-encapsulate-.patch
> CHECK: spinlock_t definition without comment
> #41: FILE: arch/x86/kernel/cpu/sgx/sgx.h:101:
> +       spinlock_t lock;
> 
> total: 0 errors, 0 warnings, 1 checks, 22 lines checked
> 

I didn't get the CHECK in my testing.  Not sure why.

Anyway, I guess the comment can be useful if it is to explain why we need to use
spinlock or whatever lock.  But

	/* Must acquire this lock to access */

doesn't explain why at all, thus doesn't look helpful to me.

I guess you either need a better comment, or just remove it (it's obvious that a
lot of kernel code doesn't have a comment around spinlock_t).
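
As an illustration of a lock comment that says what the lock protects, here is
a minimal user-space sketch. It is not the kernel code: a pthread mutex stands
in for spinlock_t, the small list helpers stand in for <linux/list.h>, and the
names epc_page, epc_lru_lists, epc_lru_init(), epc_lru_add() and
epc_lru_empty() are hypothetical. The wording of the comment assumes that the
list is modified from more than one context.

```c
#include <pthread.h>
#include <stdbool.h>

/* Minimal stand-in for the kernel's struct list_head (assumption). */
struct list_node {
	struct list_node *next, *prev;
};

/* Hypothetical user-space analogue of an EPC page. */
struct epc_page {
	struct list_node list;
};

/* Hypothetical analogue of struct sgx_epc_lru_lists. */
struct epc_lru_lists {
	/*
	 * Protects @reclaimable.  Assumed here: pages are added and removed
	 * from more than one thread, so every list access takes this lock.
	 */
	pthread_mutex_t lock;
	struct list_node reclaimable;
};

static void epc_lru_init(struct epc_lru_lists *lru)
{
	pthread_mutex_init(&lru->lock, NULL);
	lru->reclaimable.next = &lru->reclaimable;
	lru->reclaimable.prev = &lru->reclaimable;
}

/* Append a page at the tail, i.e. the most recently used end. */
static void epc_lru_add(struct epc_lru_lists *lru, struct epc_page *page)
{
	struct list_node *head = &lru->reclaimable;

	pthread_mutex_lock(&lru->lock);
	page->list.prev = head->prev;
	page->list.next = head;
	head->prev->next = &page->list;
	head->prev = &page->list;
	pthread_mutex_unlock(&lru->lock);
}

/* Report whether the reclaimable list currently holds no pages. */
static bool epc_lru_empty(struct epc_lru_lists *lru)
{
	bool empty;

	pthread_mutex_lock(&lru->lock);
	empty = lru->reclaimable.next == &lru->reclaimable;
	pthread_mutex_unlock(&lru->lock);
	return empty;
}
```

A comment of this kind names the protected member, which the original
"Must acquire this lock to access" does not.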


* Re: [PATCH v3 00/28] Add Cgroup support for SGX EPC memory
  2023-07-24 19:09 ` Sohil Mehta
@ 2023-07-25 17:16   ` Haitao Huang
  0 siblings, 0 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-25 17:16 UTC (permalink / raw)
  To: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	X86-kernel, Sohil Mehta
  Cc: kai.huang, reinette.chatre, zhiquan1.li, kristen, seanjc, zhanb,
	anakrish, mikko.ylinen

On Mon, 24 Jul 2023 14:09:21 -0500, Sohil Mehta <sohil.mehta@intel.com>  
wrote:

> Hi Haitao,
>
> On 7/12/2023 4:01 PM, Haitao Huang wrote:
>
>> I appreciate your comments and feedback.
>>
>
> Nit: You missed emailing the cover letter to x86@kernel.org. I think a
> few other people included in the individual patches are also missing in
> the cover letter.
>
> In general, it might be useful to keep the email list consistent across
> the cover letter and the individual patches.
>
> -Sohil

Thanks

I'll use the same set of lists and addresses for all patches and the
cover letter in the next version.
BR
Haitao

* Re: [PATCH v3 11/28] x85/sgx: Return the number of EPC pages that were successfully reclaimed
  2023-07-12 23:01 ` [PATCH v3 11/28] x85/sgx: Return the number of EPC pages that were successfully reclaimed Haitao Huang
@ 2023-07-29 12:47   ` Pavel Machek
  2023-07-31 11:10     ` Jarkko Sakkinen
  0 siblings, 1 reply; 62+ messages in thread
From: Pavel Machek @ 2023-07-29 12:47 UTC (permalink / raw)
  To: Haitao Huang
  Cc: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, kai.huang, reinette.chatre, Sean Christopherson,
	zhiquan1.li, kristen, seanjc

Hi!

> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Return the number of reclaimed pages from sgx_reclaim_pages(), the EPC
> cgroup will use the result to track the success rate of its reclaim
> calls, e.g. to escalate to a more forceful reclaiming mode if
> necessary.

Subject says x85. While some would love to see support of Linux on
Intel 8085, I guess it should be x86.

Best regards,
								Pavel
								
-- 
People of Russia, stop Putin before his war on Ukraine escalates.

* Re: [PATCH v3 11/28] x85/sgx: Return the number of EPC pages that were successfully reclaimed
  2023-07-29 12:47   ` Pavel Machek
@ 2023-07-31 11:10     ` Jarkko Sakkinen
  0 siblings, 0 replies; 62+ messages in thread
From: Jarkko Sakkinen @ 2023-07-31 11:10 UTC (permalink / raw)
  To: Pavel Machek, Haitao Huang
  Cc: dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, kai.huang, reinette.chatre, Sean Christopherson,
	zhiquan1.li, kristen, seanjc

On Sat Jul 29, 2023 at 12:47 PM UTC, Pavel Machek wrote:
> Hi!
>
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > 
> > Return the number of reclaimed pages from sgx_reclaim_pages(), the EPC
> > cgroup will use the result to track the success rate of its reclaim
> > calls, e.g. to escalate to a more forceful reclaiming mode if
> > necessary.
>
> Subject says x85. While some would love to see support of Linux on
> Intel 8085, I guess it should be x86.

hmm... that could potentially also be a step towards Zilog Z80
compatibility :-)

BR, Jarkko

* Re: [PATCH v3 03/28] x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s)
  2023-07-24 23:31           ` Huang, Kai
@ 2023-07-31 20:35             ` Haitao Huang
  0 siblings, 0 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-31 20:35 UTC (permalink / raw)
  To: mingo, jarkko, dave.hansen, bp, cgroups, hpa, linux-kernel,
	linux-sgx, tglx, tj, x86, Huang, Kai
  Cc: kristen, Chatre, Reinette, Li, Zhiquan1, Christopherson,, Sean

On Mon, 24 Jul 2023 18:31:58 -0500, Huang, Kai <kai.huang@intel.com> wrote:

...
>> > Although briefly mentioned in the first patch, it would be better to  
>> put
>> > more
>> > background about the "reclaimable" and "non-reclaimable" thing here,
>> > focusing on
>> > _why_ we need multiple LRUs (presumably you mean two lists:  
>> reclaimable
>> > and non-
>> > reclaimable).
>> >
>> Sure I can add a little more background to introduce the
>> reclaimable/unreclaimable concept. But why we need multiple LRUs would  
>> be
>> self-evident in later patches, not sure I will add details here.
>
> In this case people will need to go to that patch to get some idea  
> first.  It
> doesn't seem hurt if you can explain why you need multiple LRUs here  
> first.
>
Will add.

...
>
> I didn't get the CHECK in my testing.  Not sure why.
>
> Anyway, I guess the comment can be useful if it is to explain why we  
> need to use
> spinlock or whatever lock.  But
>
> 	/* Must acquire this lock to access */
>
> doesn't explain why at all, thus doesn't look helpful to me.
>
> I guess you either need a better comment, or just remove it (it's  
> obvious that a
> lot of kernel code doesn't have a comment around spinlock_t).
>

I'll remove the comments.
Thanks
Haitao

* Re: [PATCH v3 04/28] x86/sgx: Use sgx_epc_lru_lists for existing active page list
  2023-07-17 12:47   ` Jarkko Sakkinen
@ 2023-07-31 20:43     ` Haitao Huang
  0 siblings, 0 replies; 62+ messages in thread
From: Haitao Huang @ 2023-07-31 20:43 UTC (permalink / raw)
  To: dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Jarkko Sakkinen
  Cc: kai.huang, reinette.chatre, Kristen Carlson Accardi, zhiquan1.li, seanjc

On Mon, 17 Jul 2023 07:47:01 -0500, Jarkko Sakkinen <jarkko@kernel.org>  
wrote:

> On Wed Jul 12, 2023 at 11:01 PM UTC, Haitao Huang wrote:
>> From: Kristen Carlson Accardi <kristen@linux.intel.com>
>>
>> Replace the existing sgx_active_page_list and its spinlock with
>> a global sgx_epc_lru_lists struct.
>
> Similarly as the previous patch, I would extend this story a tiny
> bit forward to see the connection with the follow-up patches.
>
Sure

I also feel the series may flow better if all changes related to
'unreclaimable' pages (the owner field for VA pages, flags for owner types,
storing unreclaimables on an LRU, etc.) are moved later, after all changes
dealing with reclaimables are introduced. The unreclaimables are only of
concern when OOM is involved, so it'd be better to do them right before the
OOM changes.

Thanks
Haitao

* Re: [PATCH v3 00/28]  Add Cgroup support for SGX EPC memory
  2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
                   ` (29 preceding siblings ...)
  2023-07-24 19:09 ` Sohil Mehta
@ 2023-08-17 15:04 ` Mikko Ylinen
  30 siblings, 0 replies; 62+ messages in thread
From: Mikko Ylinen @ 2023-08-17 15:04 UTC (permalink / raw)
  To: Haitao Huang
  Cc: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	kai.huang, reinette.chatre, zhiquan1.li, kristen, seanjc, zhanb,
	anakrish

On Wed, Jul 12, 2023 at 04:01:34PM -0700, Haitao Huang wrote:
> SGX EPC memory allocations are separate from normal RAM allocations, and are
> managed solely by the SGX subsystem. The existing cgroup memory controller
> cannot be used to limit or account for SGX EPC memory, which is a desirable
> feature in some environments, e.g., support for pod level control in a
> Kubernetes cluster on a VM or baremetal host [1,2].
> 
> This patchset implements the support for sgx_epc memory within the misc
> cgroup controller. The user can use the misc cgroup controller to set and
> enforce a max limit on total EPC usage per cgroup. The implementation
> reports current usage and events of reaching the limit per cgroup as well
> as the total system capacity.
> 
> This work was originally authored by Sean Christopherson a few years ago,
> and previously modified by Kristen C. Accardi to work with more recent
> kernels, and to utilize the misc cgroup controller rather than a custom
> controller. Now I updated the patches based on review comments on the V2
> series[3], simplified a few aspects of the implementation/design and fixed
> some stability issues found from testing, while keeping the same user space
> facing interfaces.
> 
> The patchset adds support for multiple LRUs to track both reclaimable EPC
> pages (i.e. pages the reclaimer knows about), as well as unreclaimable EPC
> pages (i.e.  pages which the reclaimer isn't aware of, such as VA pages).
> These pages are assigned to an LRU, as well as an enclave, so that an
> enclave's full EPC usage can be tracked, and limited to a max value. During
> OOM events, an enclave can have its memory zapped, and all the EPC pages
> not tracked by the reclaimer can be freed.
> 
> I appreciate your comments and feedback.

I've been stressing this patch set in my Kubernetes cluster with a few
simultaneous replicas, each running two Gramine containers (one with EDMM and
one without), and with per-container misc.max limits set.

I've not observed any issues and everything seems to be as expected: per
container EPC usage is capped to misc.max and its memory.limit triggers OOM
for the reclaimed EPC.

Tested-by: Mikko Ylinen <mikko.ylinen@linux.intel.com>
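
For readers who want to set a per-container limit like the one used in this
test, the sketch below writes an sgx_epc entry to a cgroup's misc.max file. It
is an illustration under assumptions, not part of the patchset: the helper
names format_misc_max() and write_epc_limit() are hypothetical, the cgroup
path is whatever the caller supplies, and the value is taken to be in bytes
because patch 21/28 divides the misc usage and max values by PAGE_SIZE to get
page counts.

```c
#include <stdio.h>

/*
 * Build the "sgx_epc <bytes>" line that the misc controller parses.
 * Returns the line length, or -1 if the buffer is too small.
 */
static int format_misc_max(char *buf, size_t len, unsigned long long bytes)
{
	int n = snprintf(buf, len, "sgx_epc %llu\n", bytes);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Write the limit to the given misc.max file (path supplied by the caller).
 * Returns 0 on success, -1 on any formatting or I/O failure.
 */
static int write_epc_limit(const char *misc_max_path, unsigned long long bytes)
{
	char line[64];
	FILE *f;
	int ok;

	if (format_misc_max(line, sizeof(line), bytes) < 0)
		return -1;

	f = fopen(misc_max_path, "w");
	if (!f)
		return -1;

	ok = fputs(line, f) >= 0;
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

A caller would pass the misc.max file of the target cgroup, for example a
path of the form /sys/fs/cgroup/<group>/misc.max, where <group> is a
placeholder.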

> 
> Summary of changes from v2: (more details in commit logs)
> 
> * Added EPC states to replace flags in sgx_epc_page struct. (Jarkko)
> * Unrolled wrappers for cond_resched, list (Dave)
> * Separate patches for adding reclaimable and unreclaimable lists. (Dave)
> * Other improvements on patch flow, commit messages, styles. (Dave, Jarkko)
> * Simplified the cgroup tree walking with plain
>   css_for_each_descendant_pre.
> * Fixed race conditions and crashes.
> * OOM killer to wait for the victim enclave pages being reclaimed.
> * Unblock the user by handling misc_max_write callback asynchronously.
> * Rebased onto 6.4 and no longer base this series on the MCA patchset.
> * Fix an overflow in misc_try_charge.
> * Fix a NULL pointer in SGX PF handler.
> * Updated and included the SGX selftest patches previously reviewed. Those
>   patches fix issues triggered in high EPC pressure required for cgroup
>   testing.
> * Added test scripts to help setup and test SGX EPC cgroups.
> 
> [1]https://lore.kernel.org/all/DM6PR21MB11772A6ED915825854B419D6C4989@DM6PR21MB1177.namprd21.prod.outlook.com/
> [2]https://lore.kernel.org/all/ZD7Iutppjj+muH4p@himmelriiki/
> [3]https://lore.kernel.org/all/20221202183655.3767674-1-kristen@linux.intel.com/
> [4]Documentation/arch/x86/sgx.rst, Section "Virtual EPC"
> 
> Haitao Huang (6):
>   x86/sgx: Store struct sgx_encl when allocating new VA pages
>   x86/sgx: Introduce EPC page states
>   x86/sgx: fix a NULL pointer
>   cgroup/misc: Fix an overflow
>   selftests/sgx: Retry the ioctl()'s returned with EAGAIN
>   selftests/sgx: Add scripts for epc cgroup testing
> 
> Jarkko Sakkinen (3):
>   selftests/sgx: Move ENCL_HEAP_SIZE_DEFAULT to main.c
>   selftests/sgx: Use encl->encl_size in sigstruct.c
>   selftests/sgx: Include the dynamic heap size to the ELRANGE
>     calculation
> 
> Kristen Carlson Accardi (9):
>   x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s)
>   x86/sgx: Use sgx_epc_lru_lists for existing active page list
>   x86/sgx: Store reclaimable epc pages in sgx_epc_lru_lists
>   x86/sgx: store unreclaimable EPC pages in sgx_epc_lru_lists
>   x86/sgx: Use a list to track to-be-reclaimed pages
>   cgroup/misc: Add per resource callbacks for CSS events
>   cgroup/misc: Add SGX EPC resource type and export APIs for SGX driver
>   x86/sgx: Limit process EPC usage with misc cgroup controller
>   Docs/x86/sgx: Add description for cgroup support
> 
> Sean Christopherson (9):
>   x86/sgx: Add EPC page flags to identify owner type
>   x86/sgx: Introduce RECLAIM_IN_PROGRESS state
>   x86/sgx: Allow reclaiming up to 32 pages, but scan 16 by default
>   x85/sgx: Return the number of EPC pages that were successfully
>     reclaimed
>   x86/sgx: Add option to ignore age of page during EPC reclaim
>   x86/sgx: Prepare for multiple LRUs
>   x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup
>   x86/sgx: Add helper to grab pages from an arbitrary EPC LRU
>   x86/sgx: Add EPC OOM path to forcefully reclaim EPC
> 
> Vijay Dhanraj (1):
>   selftests/sgx: Add SGX selftest augment_via_eaccept_long
> 
>  Documentation/arch/x86/sgx.rst                |  77 ++++
>  arch/x86/Kconfig                              |  13 +
>  arch/x86/kernel/cpu/sgx/Makefile              |   1 +
>  arch/x86/kernel/cpu/sgx/driver.c              |  27 +-
>  arch/x86/kernel/cpu/sgx/encl.c                |  95 +++-
>  arch/x86/kernel/cpu/sgx/encl.h                |   4 +-
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c          | 406 ++++++++++++++++++
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h          |  60 +++
>  arch/x86/kernel/cpu/sgx/ioctl.c               |  25 +-
>  arch/x86/kernel/cpu/sgx/main.c                | 406 ++++++++++++++----
>  arch/x86/kernel/cpu/sgx/sgx.h                 | 113 ++++-
>  include/linux/misc_cgroup.h                   |  34 ++
>  kernel/cgroup/misc.c                          |  63 ++-
>  tools/testing/selftests/sgx/load.c            |   8 +-
>  tools/testing/selftests/sgx/main.c            | 177 +++++++-
>  tools/testing/selftests/sgx/main.h            |   6 +-
>  .../selftests/sgx/run_tests_in_misc_cg.sh     |  68 +++
>  tools/testing/selftests/sgx/setup_epc_cg.sh   |  29 ++
>  tools/testing/selftests/sgx/sigstruct.c       |   8 +-
>  .../selftests/sgx/watch_misc_for_tests.sh     |  13 +
>  20 files changed, 1446 insertions(+), 187 deletions(-)
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
>  create mode 100755 tools/testing/selftests/sgx/run_tests_in_misc_cg.sh
>  create mode 100755 tools/testing/selftests/sgx/setup_epc_cg.sh
>  create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
> 
> -- 
> 2.25.1
> 

-- Regards, Mikko

* Re: [PATCH v3 21/28] x86/sgx: Limit process EPC usage with misc cgroup controller
  2023-07-12 23:01 ` [PATCH v3 21/28] x86/sgx: Limit process EPC usage with misc cgroup controller Haitao Huang
  2023-07-13  0:03   ` Randy Dunlap
@ 2023-08-17 15:12   ` Mikko Ylinen
  1 sibling, 0 replies; 62+ messages in thread
From: Mikko Ylinen @ 2023-08-17 15:12 UTC (permalink / raw)
  To: Haitao Huang
  Cc: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, kai.huang, reinette.chatre,
	Kristen Carlson Accardi, zhiquan1.li, seanjc

On Wed, Jul 12, 2023 at 04:01:55PM -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@linux.intel.com>
> 
> Implement support for cgroup control of SGX Enclave Page Cache (EPC)
> memory using the misc cgroup controller. EPC memory is independent
> from normal system memory, e.g. must be reserved at boot from RAM and
> cannot be converted between EPC and normal memory while the system is
> running. EPC is managed by the SGX subsystem and is not accounted by
> the memory controller.
> 
> Much like normal system memory, EPC memory can be overcommitted via
> virtual memory techniques and pages can be swapped out of the EPC to
> their backing store (normal system memory, e.g. shmem).  The SGX EPC
> subsystem is analogous to the memory subsystem and the SGX EPC controller
> is in turn analogous to the memory controller; it implements limit and
> protection models for EPC memory.
> 
> The misc controller provides a mechanism to set a hard limit of EPC
> usage via the "sgx_epc" resource in "misc.max". The total EPC memory
> available on the system is reported via the "sgx_epc" resource in
> "misc.capacity".
> 
> This patch was modified from its original version to use the misc cgroup
> controller instead of a custom controller.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> 
> V3:
> 
> 1) Use the same maximum number of reclaiming candidate pages to be
> processed, SGX_NR_TO_SCAN_MAX, for each reclaiming iteration in both
> cgroup worker function and ksgxd. This fixes an overflow in the
> backing store buffer with the same fixed size allocated on stack in
> sgx_reclaim_epc_pages().
> 
> 2) Initialize max for root EPC cgroup. Otherwise, all
> misc_cg_try_charge() calls would fail as it checks for all limits of
> ancestors all the way to the root node.
> 
> 3) Start reclaiming whenever misc_cg_try_charge fails. Removed all
> re-checks for limits and current usage. For all purposes and intent,
> when misc_try_charge() fails, reclaiming is needed. This also corrects
> an error of not reclaiming when the child limit is larger than one of
> its ancestors.
> 
> 4) Handle failure on charging to the root EPC cgroup. Failure on charging
> to root means we are at or above capacity, so start reclaiming or return
> OOM error.
> 
> 5) Removed the custom cgroup tree walking iterator with epoch tracking
> logic. Replaced it with just the plain css_for_each_descendant_pre
> iterator. The custom iterator implemented a rather complex epoch scheme
> I believe was intended to prevent extra reclaiming from multiple worker
> threads doing the same walk but it turned out not to matter much as each
> thread would only reclaim when usage is above limit. Using the plain
> css_for_each_descendant_pre iterator simplified code a bit.
> 
> 6) Do not reclaim synchronously in misc_max_write callback which would
> block the user. Instead queue an async work item to run the reclaiming
> loop.
> 
> 7) Other minor refactorings:
> - Remove unused params in epc_cgroup APIs
> - centralize uncharge into sgx_free_epc_page()
> ---
>  arch/x86/Kconfig                     |  13 +
>  arch/x86/kernel/cpu/sgx/Makefile     |   1 +
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 406 +++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  60 ++++
>  arch/x86/kernel/cpu/sgx/main.c       |  79 ++++--
>  arch/x86/kernel/cpu/sgx/sgx.h        |  14 +-
>  6 files changed, 552 insertions(+), 21 deletions(-)
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 53bab123a8ee..8a7378159e9e 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1952,6 +1952,19 @@ config X86_SGX
>  
>  	  If unsure, say N.
>  
> +config CGROUP_SGX_EPC
> +	bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX"
> +	depends on X86_SGX && CGROUP_MISC
> +	help
> +	  Provides control over the EPC footprint of tasks in a cgroup via
> +	  the Miscellaneous cgroup controller.
> +
> +	  EPC is a subset of regular memory that is usable only by SGX
> +	  enclaves and is very limited in quantity, e.g. less than 1%
> +	  of total DRAM.
> +
> +          Say N if unsure.
> +
>  config EFI
>  	bool "EFI runtime service support"
>  	depends on ACPI
> diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
> index 9c1656779b2a..12901a488da7 100644
> --- a/arch/x86/kernel/cpu/sgx/Makefile
> +++ b/arch/x86/kernel/cpu/sgx/Makefile
> @@ -4,3 +4,4 @@ obj-y += \
>  	ioctl.o \
>  	main.o
>  obj-$(CONFIG_X86_SGX_KVM)	+= virt.o
> +obj-$(CONFIG_CGROUP_SGX_EPC)	       += epc_cgroup.o
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> new file mode 100644
> index 000000000000..de0833e5606b
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -0,0 +1,406 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2022 Intel Corporation.
> +
> +#include <linux/atomic.h>
> +#include <linux/kernel.h>
> +#include <linux/ratelimit.h>
> +#include <linux/sched/signal.h>
> +#include <linux/slab.h>
> +#include <linux/threads.h>
> +
> +#include "epc_cgroup.h"
> +
> +#define SGX_EPC_RECLAIM_MIN_PAGES		16UL
> +#define SGX_EPC_RECLAIM_IGNORE_AGE_THRESHOLD	5
> +#define SGX_EPC_RECLAIM_OOM_THRESHOLD		5
> +
> +static struct workqueue_struct *sgx_epc_cg_wq;
> +static bool sgx_epc_cgroup_oom(struct sgx_epc_cgroup *root);
> +
> +struct sgx_epc_reclaim_control {
> +	struct sgx_epc_cgroup *epc_cg;
> +	int nr_fails;
> +	bool ignore_age;
> +};
> +
> +static inline unsigned long sgx_epc_cgroup_page_counter_read(struct sgx_epc_cgroup *epc_cg)
> +{
> +	return atomic_long_read(&epc_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / PAGE_SIZE;
> +}
> +
> +static inline unsigned long sgx_epc_cgroup_max_pages(struct sgx_epc_cgroup *epc_cg)
> +{
> +	return READ_ONCE(epc_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE;
> +}
> +
> +static inline unsigned long sgx_epc_cgroup_max_pages_to_root(struct sgx_epc_cgroup *epc_cg)
> +{
> +	struct misc_cg *i = epc_cg->cg;
> +	unsigned long m = ULONG_MAX;
> +
> +	while (i) {
> +		m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
> +		i = misc_cg_parent(i);
> +	}
> +	return m / PAGE_SIZE;
> +}
> +
> +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
> +{
> +	if (cg)
> +		return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
> +
> +	return NULL;
> +}
> +
> +static inline bool sgx_epc_cgroup_disabled(void)
> +{
> +	return !cgroup_subsys_enabled(misc_cgrp_subsys);
> +}
> +
> +/**
> + * sgx_epc_cgroup_lru_empty - check if a cgroup tree has no pages on its lrus
> + * @root:	root of the tree to check
> + *
> + * Return: %true if all cgroups under the specified root have empty LRU lists.
> + * Used to avoid livelocks due to a cgroup having a non-zero charge count but
> + * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
> + * because all pages in the cgroup are unreclaimable.
> + */
> +bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root)
> +{
> +	struct cgroup_subsys_state *css_root = NULL;
> +	struct cgroup_subsys_state *pos = NULL;
> +	struct sgx_epc_cgroup *epc_cg = NULL;
> +	bool ret = true;
> +
> +	/*
> +	 * Caller ensure css_root ref acquired
> +	 */
> +	css_root = root ? &root->cg->css : &(misc_cg_root()->css);
> +
> +	rcu_read_lock();
> +	css_for_each_descendant_pre(pos, css_root) {
> +		if (!css_tryget(pos))
> +			break;
> +
> +		rcu_read_unlock();
> +
> +		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> +
> +		spin_lock(&epc_cg->lru.lock);
> +		ret = list_empty(&epc_cg->lru.reclaimable);
> +		spin_unlock(&epc_cg->lru.lock);
> +
> +		rcu_read_lock();
> +		css_put(pos);
> +		if (!ret)
> +			break;
> +	}
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
> +/**
> + * sgx_epc_cgroup_isolate_pages - walk a cgroup tree and separate pages
> + * @root:	root of the tree to start walking
> + * @nr_to_scan: The number of pages that need to be isolated
> + * @dst:	Destination list to hold the isolated pages
> + *
> + * Walk the cgroup tree and isolate the pages in the hierarchy
> + * for reclaiming.
> + */
> +void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
> +				  size_t *nr_to_scan, struct list_head *dst)
> +{
> +	struct cgroup_subsys_state *css_root = NULL;
> +	struct cgroup_subsys_state *pos = NULL;
> +	struct sgx_epc_cgroup *epc_cg = NULL;
> +
> +	if (!*nr_to_scan)
> +		return;
> +
> +	 /* Caller ensure css_root ref acquired */
> +	css_root = root ? &root->cg->css : &(misc_cg_root()->css);
> +
> +	rcu_read_lock();
> +	css_for_each_descendant_pre(pos, css_root) {
> +		if (!css_tryget(pos))
> +			break;
> +		rcu_read_unlock();
> +
> +		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> +		sgx_isolate_epc_pages(&epc_cg->lru, nr_to_scan, dst);
> +
> +		rcu_read_lock();
> +		css_put(pos);
> +		if (!*nr_to_scan)
> +			break;
> +	}
> +	rcu_read_unlock();
> +}
> +
> +static int sgx_epc_cgroup_reclaim_pages(unsigned long nr_pages,
> +					struct sgx_epc_reclaim_control *rc)
> +{
> +	/*
> +	 * Ensure sgx_reclaim_pages is called with a minimum and maximum
> +	 * number of pages.  Attempting to reclaim only a few pages will
> +	 * often fail and is inefficient, while reclaiming a huge number
> +	 * of pages can result in soft lockups due to holding various
> +	 * locks for an extended duration.  This also bounds nr_pages so
> +	 */

The comment above ends mid-sentence ("This also bounds nr_pages so").
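
If the missing clause is about the backing[] array in
sgx_reclaim_epc_pages(), which is sized by SGX_NR_TO_SCAN_MAX, then the
clamp is what keeps a request inside it. That is my guess at the intent,
not something the comment states. A minimal userspace sketch of the
clamp, with local stand-ins for the two constants (the helper name is
made up):

```c
/* Local stand-ins mirroring the values used in this patch. */
#define RECLAIM_MIN_PAGES 16UL /* SGX_EPC_RECLAIM_MIN_PAGES */
#define NR_TO_SCAN_MAX    32UL /* SGX_NR_TO_SCAN_MAX */

/*
 * Hypothetical helper: bound a reclaim request to
 * [RECLAIM_MIN_PAGES, NR_TO_SCAN_MAX], as the max()/min() pair does.
 */
static unsigned long clamp_reclaim_request(unsigned long nr_pages)
{
	if (nr_pages < RECLAIM_MIN_PAGES)
		nr_pages = RECLAIM_MIN_PAGES;
	if (nr_pages > NR_TO_SCAN_MAX)
		nr_pages = NR_TO_SCAN_MAX;
	return nr_pages;
}
```

A request for 1 page becomes 16, a request for 100 becomes 32, and
anything in between passes through unchanged.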

> +	nr_pages = max(nr_pages, SGX_EPC_RECLAIM_MIN_PAGES);
> +	nr_pages = min(nr_pages, SGX_NR_TO_SCAN_MAX);
> +
> +	return sgx_reclaim_epc_pages(nr_pages, rc->ignore_age, rc->epc_cg);
> +}
> +
> +static int sgx_epc_cgroup_reclaim_failed(struct sgx_epc_reclaim_control *rc)
> +{
> +	if (sgx_epc_cgroup_lru_empty(rc->epc_cg))
> +		return -ENOMEM;
> +
> +	++rc->nr_fails;
> +	if (rc->nr_fails > SGX_EPC_RECLAIM_IGNORE_AGE_THRESHOLD)
> +		rc->ignore_age = true;
> +
> +	return 0;
> +}
> +
> +static inline
> +void sgx_epc_reclaim_control_init(struct sgx_epc_reclaim_control *rc,
> +				  struct sgx_epc_cgroup *epc_cg)
> +{
> +	rc->epc_cg = epc_cg;
> +	rc->nr_fails = 0;
> +	rc->ignore_age = false;
> +}
> +
> +/*
> + * Scheduled by sgx_epc_cgroup_try_charge() to reclaim pages from the
> + * cgroup when the cgroup is at/near its maximum capacity
> + */
> +static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
> +{
> +	struct sgx_epc_reclaim_control rc;
> +	struct sgx_epc_cgroup *epc_cg;
> +	unsigned long cur, max;
> +
> +	epc_cg = container_of(work, struct sgx_epc_cgroup, reclaim_work);
> +
> +	sgx_epc_reclaim_control_init(&rc, epc_cg);
> +
> +	for (;;) {
> +		max = sgx_epc_cgroup_max_pages_to_root(epc_cg);
> +
> +		/*
> +		 * Adjust the limit down by one page, the goal is to free up
> +		 * pages for fault allocations, not to simply obey the limit.
> +		 * Conditionally decrementing max also means the cur vs. max
> +		 * check will correctly handle the case where both are zero.
> +		 */
> +		if (max)
> +			max--;
> +
> +		/*
> +		 * Unless the limit is extremely low, in which case forcing
> +		 * reclaim will likely cause thrashing, force the cgroup to
> +		 * reclaim at least once if it's operating *near* its maximum
> +		 * limit by adjusting @max down by half the min reclaim size.
> +		 * This work func is scheduled by sgx_epc_cgroup_try_charge
> +		 * when it cannot directly reclaim due to being in an atomic
> +		 * context, e.g. EPC allocation in a fault handler.  Waiting
> +		 * to reclaim until the cgroup is actually at its limit is less
> +		 * performant as it means the faulting task is effectively
> +		 * blocked until a worker makes its way through the global work
> +		 * queue.
> +		 */
> +		if (max > SGX_NR_TO_SCAN_MAX)
> +			max -= (SGX_EPC_RECLAIM_MIN_PAGES / 2);
> +
> +		max = min(max, sgx_epc_total_pages);
> +		cur = sgx_epc_cgroup_page_counter_read(epc_cg);
> +		if (cur <= max)
> +			break;
> +		/* Nothing reclaimable */
> +		if (sgx_epc_cgroup_lru_empty(epc_cg)) {
> +			if (!sgx_epc_cgroup_oom(epc_cg))
> +				break;
> +
> +			continue;
> +		}
> +
> +		if (!sgx_epc_cgroup_reclaim_pages(cur - max, &rc)) {
> +			if (sgx_epc_cgroup_reclaim_failed(&rc))
> +				break;
> +		}
> +	}
> +}
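
To check my reading of the threshold arithmetic above, here is a small
userspace model of how the loop derives the value it compares usage
against. The function and macro names are stand-ins, not kernel symbols;
the constants copy SGX_NR_TO_SCAN_MAX (32) and SGX_EPC_RECLAIM_MIN_PAGES
(16) from this patch.

```c
/* Local stand-ins mirroring the values used in this patch. */
#define RECLAIM_MIN_PAGES 16UL /* SGX_EPC_RECLAIM_MIN_PAGES */
#define NR_TO_SCAN_MAX    32UL /* SGX_NR_TO_SCAN_MAX */

/*
 * Hypothetical model of the per-iteration target in the work function.
 * max_pages:   smallest limit on the path to the root, in pages
 * total_pages: total EPC pages on the platform
 */
static unsigned long reclaim_target(unsigned long max_pages,
				    unsigned long total_pages)
{
	/* Leave room for one fault allocation; zero stays zero. */
	if (max_pages)
		max_pages--;

	/* Only back off further when the limit is not tiny. */
	if (max_pages > NR_TO_SCAN_MAX)
		max_pages -= RECLAIM_MIN_PAGES / 2;

	return max_pages < total_pages ? max_pages : total_pages;
}
```

With a limit of 34 pages and a large EPC this gives 25: one page off for
the fault allocation, then 8 more because 33 > 32. A limit of 33 stays
at 32, so the "near the limit" adjustment only applies to limits of 34
pages or more.
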
> +
> +static int __sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg,
> +				       bool reclaim)
> +{
> +	struct sgx_epc_reclaim_control rc;
> +	unsigned int nr_empty = 0;
> +
> +	sgx_epc_reclaim_control_init(&rc, epc_cg);
> +
> +	for (;;) {
> +		if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
> +					PAGE_SIZE))
> +			break;
> +
> +		if (sgx_epc_cgroup_lru_empty(epc_cg))
> +			return -ENOMEM;
> +
> +		if (signal_pending(current))
> +			return -ERESTARTSYS;
> +
> +		if (!reclaim) {
> +			queue_work(sgx_epc_cg_wq, &rc.epc_cg->reclaim_work);
> +			return -EBUSY;
> +		}
> +
> +		if (!sgx_epc_cgroup_reclaim_pages(1, &rc)) {
> +			if (sgx_epc_cgroup_reclaim_failed(&rc)) {
> +				if (++nr_empty > SGX_EPC_RECLAIM_OOM_THRESHOLD)
> +					return -ENOMEM;
> +				schedule();
> +			}
> +		}
> +	}
> +	if (epc_cg->cg != misc_cg_root())
> +		css_get(&epc_cg->cg->css);
> +
> +	return 0;
> +}
> +
> +/**
> + * sgx_epc_cgroup_try_charge - hierarchically try to charge a single EPC page
> + * @mm:			the mm_struct of the process to charge
> + * @reclaim:		whether or not synchronous reclaim is allowed
> + *
> + * Returns EPC cgroup or NULL on success, -errno on failure.
> + */
> +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim)
> +{
> +	struct sgx_epc_cgroup *epc_cg;
> +	int ret;
> +
> +	if (sgx_epc_cgroup_disabled())
> +		return NULL;
> +
> +	epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
> +	ret = __sgx_epc_cgroup_try_charge(epc_cg, reclaim);
> +	put_misc_cg(epc_cg->cg);
> +
> +	if (ret)
> +		return ERR_PTR(ret);
> +
> +	return epc_cg;
> +}
> +
> +/**
> + * sgx_epc_cgroup_uncharge - hierarchically uncharge EPC pages
> + * @epc_cg:	the charged epc cgroup
> + */
> +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
> +{
> +	if (sgx_epc_cgroup_disabled())
> +		return;
> +
> +	misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> +
> +	if (epc_cg->cg != misc_cg_root())
> +		put_misc_cg(epc_cg->cg);
> +}
> +
> +static bool sgx_epc_cgroup_oom(struct sgx_epc_cgroup *root)
> +{
> +	struct cgroup_subsys_state *css_root = NULL;
> +	struct cgroup_subsys_state *pos = NULL;
> +	struct sgx_epc_cgroup *epc_cg = NULL;
> +	bool oom = false;
> +
> +	 /* Caller ensure css_root ref acquired */
> +	css_root = root ? &root->cg->css : &(misc_cg_root()->css);
> +
> +	rcu_read_lock();
> +	css_for_each_descendant_pre(pos, css_root) {
> +		/* skip dead ones */
> +		if (!css_tryget(pos))
> +			continue;
> +
> +		rcu_read_unlock();
> +
> +		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> +		oom = sgx_epc_oom(&epc_cg->lru);
> +
> +		rcu_read_lock();
> +		css_put(pos);
> +		if (oom)
> +			break;
> +	}
> +	rcu_read_unlock();
> +	return oom;
> +}
> +
> +static void sgx_epc_cgroup_free(struct misc_cg *cg)
> +{
> +	struct sgx_epc_cgroup *epc_cg;
> +
> +	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
> +	cancel_work_sync(&epc_cg->reclaim_work);
> +	kfree(epc_cg);
> +}
> +
> +static void sgx_epc_cgroup_max_write(struct misc_cg *cg)
> +{
> +	struct sgx_epc_reclaim_control rc;
> +	struct sgx_epc_cgroup *epc_cg;
> +
> +	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
> +
> +	sgx_epc_reclaim_control_init(&rc, epc_cg);
> +	/* Let the reclaimer to do the work so user is not blocked */
> +	queue_work(sgx_epc_cg_wq, &rc.epc_cg->reclaim_work);
> +}
> +
> +static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
> +{
> +	struct sgx_epc_cgroup *epc_cg;
> +
> +	epc_cg = kzalloc(sizeof(*epc_cg), GFP_KERNEL);
> +	if (!epc_cg)
> +		return -ENOMEM;
> +
> +	sgx_lru_init(&epc_cg->lru);
> +	INIT_WORK(&epc_cg->reclaim_work, sgx_epc_cgroup_reclaim_work_func);
> +	cg->res[MISC_CG_RES_SGX_EPC].misc_cg_alloc = sgx_epc_cgroup_alloc;
> +	cg->res[MISC_CG_RES_SGX_EPC].misc_cg_free = sgx_epc_cgroup_free;
> +	cg->res[MISC_CG_RES_SGX_EPC].misc_cg_max_write = sgx_epc_cgroup_max_write;
> +	cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
> +	epc_cg->cg = cg;
> +	return 0;
> +}
> +
> +static int __init sgx_epc_cgroup_init(void)
> +{
> +	struct misc_cg *cg;
> +
> +	if (!boot_cpu_has(X86_FEATURE_SGX))
> +		return 0;
> +
> +	sgx_epc_cg_wq = alloc_workqueue("sgx_epc_cg_wq",
> +					WQ_UNBOUND | WQ_FREEZABLE,
> +					WQ_UNBOUND_MAX_ACTIVE);
> +	BUG_ON(!sgx_epc_cg_wq);
> +
> +	cg = misc_cg_root();
> +	BUG_ON(!cg);
> +	WRITE_ONCE(cg->res[MISC_CG_RES_SGX_EPC].max, ULONG_MAX);
> +	atomic_long_set(&cg->res[MISC_CG_RES_SGX_EPC].usage, 0UL);
> +	return sgx_epc_cgroup_alloc(cg);
> +}
> +subsys_initcall(sgx_epc_cgroup_init);
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> new file mode 100644
> index 000000000000..03ac4dcea82b
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -0,0 +1,60 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright(c) 2022 Intel Corporation. */
> +#ifndef _INTEL_SGX_EPC_CGROUP_H_
> +#define _INTEL_SGX_EPC_CGROUP_H_
> +
> +#include <asm/sgx.h>
> +#include <linux/cgroup.h>
> +#include <linux/list.h>
> +#include <linux/misc_cgroup.h>
> +#include <linux/page_counter.h>
> +#include <linux/workqueue.h>
> +
> +#include "sgx.h"
> +
> +#ifndef CONFIG_CGROUP_SGX_EPC
> +#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
> +struct sgx_epc_cgroup;
> +
> +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim)
> +{
> +	return NULL;
> +}
> +
> +static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }
> +
> +static inline void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
> +						size_t *nr_to_scan,
> +						struct list_head *dst) { }
> +
> +static inline struct sgx_epc_lru_lists *epc_cg_lru(struct sgx_epc_cgroup *epc_cg)
> +{
> +	return NULL;
> +}
> +
> +static bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root)
> +{
> +	return true;
> +}
> +#else
> +struct sgx_epc_cgroup {
> +	struct misc_cg *cg;
> +	struct sgx_epc_lru_lists	lru;
> +	struct work_struct	reclaim_work;
> +	atomic_long_t		epoch;
> +};
> +
> +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim);
> +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
> +bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root);
> +void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
> +				  size_t *nr_to_scan, struct list_head *dst);
> +static inline struct sgx_epc_lru_lists *epc_cg_lru(struct sgx_epc_cgroup *epc_cg)
> +{
> +	if (epc_cg)
> +		return &epc_cg->lru;
> +	return NULL;
> +}
> +#endif
> +
> +#endif /* _INTEL_SGX_EPC_CGROUP_H_ */
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 68c89d575abc..1e5984b881a2 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -6,6 +6,7 @@
>  #include <linux/highmem.h>
>  #include <linux/kthread.h>
>  #include <linux/miscdevice.h>
> +#include <linux/misc_cgroup.h>
>  #include <linux/node.h>
>  #include <linux/pagemap.h>
>  #include <linux/ratelimit.h>
> @@ -17,11 +18,9 @@
>  #include "driver.h"
>  #include "encl.h"
>  #include "encls.h"
> -/**
> - * Maximum number of pages to scan for reclaiming.
> - */
> -#define SGX_NR_TO_SCAN_MAX	32
> +#include "epc_cgroup.h"
>  
> +unsigned long sgx_epc_total_pages;
>  struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
>  static int sgx_nr_epc_sections;
>  static struct task_struct *ksgxd_tsk;
> @@ -36,9 +35,20 @@ static struct sgx_epc_lru_lists sgx_global_lru;
>  
>  static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct sgx_epc_page *epc_page)
>  {
> +	if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
> +		return epc_cg_lru(epc_page->epc_cg);
> +
>  	return &sgx_global_lru;
>  }
>  
> +static inline bool sgx_can_reclaim(void)
> +{
> +	if (!IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
> +		return !list_empty(&sgx_global_lru.reclaimable);
> +
> +	return !sgx_epc_cgroup_lru_empty(NULL);
> +}
> +

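Could the IS_ENABLED() checks in these two helpers use the same polarity?
sgx_lru_lists() tests IS_ENABLED() while sgx_can_reclaim() tests
!IS_ENABLED().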
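
Something along these lines is what I have in mind. This is only a
userspace model: the bool stands in for IS_ENABLED(CONFIG_CGROUP_SGX_EPC),
and the struct and function names are invented for the illustration.
Both helpers test the "enabled" case first and fall through to the
global LRU.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for IS_ENABLED(CONFIG_CGROUP_SGX_EPC); hypothetical knob. */
static bool cgroup_epc_enabled = true;

/* Simplified models of an LRU and an EPC page (not the kernel structs). */
struct lru_model { size_t nr_reclaimable; };
struct page_model { struct lru_model *cg_lru; };

static struct lru_model global_lru;

/* Same shape as sgx_lru_lists(): enabled case first. */
static struct lru_model *lru_for_page(const struct page_model *page)
{
	if (cgroup_epc_enabled)
		return page->cg_lru;

	return &global_lru;
}

/* sgx_can_reclaim() rewritten to match: enabled case first. */
static bool can_reclaim(const struct lru_model *cg_tree_lru)
{
	if (cgroup_epc_enabled)
		return cg_tree_lru->nr_reclaimable != 0;

	return global_lru.nr_reclaimable != 0;
}
```

The behavior is unchanged; only the order of the branches is aligned so
the two helpers read the same way.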

>  static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
>  
>  /* Nodes with one or more EPC sections. */
> @@ -298,14 +308,14 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
>   * @nr_to_scan:	Number of pages to scan for reclaim
>   * @dst:	Destination list to hold the isolated pages
>   */
> -void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
> +void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t *nr_to_scan,
>  			   struct list_head *dst)
>  {
>  	struct sgx_encl_page *encl_page;
>  	struct sgx_epc_page *epc_page;
>  
>  	spin_lock(&lru->lock);
> -	for (; nr_to_scan > 0; --nr_to_scan) {
> +	for (; *nr_to_scan > 0; --(*nr_to_scan)) {
>  		epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
>  		if (!epc_page)
>  			break;
> @@ -330,9 +340,10 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
>  }
>  
>  /**
> - * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
> + * __sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers

This rename can be dropped: the function below is still called
sgx_reclaim_epc_pages().

>   * @nr_to_scan:		 Number of EPC pages to scan for reclaim
>   * @ignore_age:		 Reclaim a page even if it is young
> + * @epc_cg:		 EPC cgroup from which to reclaim
>   *
>   * Take a fixed number of pages from the head of the active page pool and
>   * reclaim them to the enclave's private shmem files. Skip the pages, which have
> @@ -346,7 +357,8 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
>   * problematic as it would increase the lock contention too much, which would
>   * halt forward progress.
>   */
> -size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
> +size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age,
> +			     struct sgx_epc_cgroup *epc_cg)
>  {
>  	struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
>  	struct sgx_epc_page *epc_page, *tmp;
> @@ -357,7 +369,15 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
>  	size_t ret;
>  	size_t i;
>  
> -	sgx_isolate_epc_pages(&sgx_global_lru, nr_to_scan, &iso);
> +	/*
> +	 * If a specific cgroup is not being targeted, take from the global
> +	 * list first, even when cgroups are enabled.  If there are
> +	 * pages on the global LRU then they should get reclaimed asap.
> +	 */
> +	if (!IS_ENABLED(CONFIG_CGROUP_SGX_EPC) || !epc_cg)
> +		sgx_isolate_epc_pages(&sgx_global_lru, &nr_to_scan, &iso);
> +
> +	sgx_epc_cgroup_isolate_pages(epc_cg, &nr_to_scan, &iso);
>  
>  	if (list_empty(&iso))
>  		return 0;
> @@ -410,11 +430,6 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
>  	return i;
>  }
>  
> -static bool sgx_can_reclaim(void)
> -{
> -	return !list_empty(&sgx_global_lru.reclaimable);
> -}
> -
>  static bool sgx_should_reclaim(unsigned long watermark)
>  {
>  	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
> @@ -429,7 +444,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
>  void sgx_reclaim_direct(void)
>  {
>  	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> -		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
> +		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
>  }
>  
>  static int ksgxd(void *p)
> @@ -452,7 +467,7 @@ static int ksgxd(void *p)
>  				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
>  
>  		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
> -			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
> +			sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
>  
>  		cond_resched();
>  	}
> @@ -606,6 +621,11 @@ int sgx_drop_epc_page(struct sgx_epc_page *page)
>  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  {
>  	struct sgx_epc_page *page;
> +	struct sgx_epc_cgroup *epc_cg;
> +
> +	epc_cg = sgx_epc_cgroup_try_charge(reclaim);
> +	if (IS_ERR(epc_cg))
> +		return ERR_CAST(epc_cg);
>  
>  	for ( ; ; ) {
>  		page = __sgx_alloc_epc_page();
> @@ -614,8 +634,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  			break;
>  		}
>  
> -		if (!sgx_can_reclaim())
> -			return ERR_PTR(-ENOMEM);
> +		if (!sgx_can_reclaim()) {
> +			page = ERR_PTR(-ENOMEM);
> +			break;
> +		}
>  
>  		if (!reclaim) {
>  			page = ERR_PTR(-EBUSY);
> @@ -627,10 +649,17 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  			break;
>  		}
>  
> -		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
> +		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
>  		cond_resched();
>  	}
>  
> +	if (!IS_ERR(page)) {
> +		WARN_ON_ONCE(page->epc_cg);
> +		page->epc_cg = epc_cg;
> +	} else {
> +		sgx_epc_cgroup_uncharge(epc_cg);
> +	}
> +
>  	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
>  		wake_up(&ksgxd_waitq);
>  
> @@ -653,6 +682,11 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>  
>  	WARN_ON_ONCE(page->flags & (SGX_EPC_PAGE_STATE_MASK));
>  
> +	if (page->epc_cg) {
> +		sgx_epc_cgroup_uncharge(page->epc_cg);
> +		page->epc_cg = NULL;
> +	}
> +
>  	spin_lock(&node->lock);
>  
>  	page->encl_page = NULL;
> @@ -663,6 +697,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>  	page->flags = SGX_EPC_PAGE_FREE;
>  
>  	spin_unlock(&node->lock);
> +
>  	atomic_long_inc(&sgx_nr_free_pages);
>  }
>  
> @@ -832,6 +867,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
>  		section->pages[i].flags = 0;
>  		section->pages[i].encl_page = NULL;
>  		section->pages[i].poison = 0;
> +		section->pages[i].epc_cg = NULL;
>  		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
>  	}
>  
> @@ -976,6 +1012,7 @@ static void __init arch_update_sysfs_visibility(int nid) {}
>  static bool __init sgx_page_cache_init(void)
>  {
>  	u32 eax, ebx, ecx, edx, type;
> +	u64 capacity = 0;
>  	u64 pa, size;
>  	int nid;
>  	int i;
> @@ -1026,6 +1063,7 @@ static bool __init sgx_page_cache_init(void)
>  
>  		sgx_epc_sections[i].node =  &sgx_numa_nodes[nid];
>  		sgx_numa_nodes[nid].size += size;
> +		capacity += size;
>  
>  		sgx_nr_epc_sections++;
>  	}
> @@ -1035,6 +1073,9 @@ static bool __init sgx_page_cache_init(void)
>  		return false;
>  	}
>  
> +	misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
> +	sgx_epc_total_pages = capacity >> PAGE_SHIFT;
> +
>  	return true;
>  }
>  
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index c6b3c90db0fa..36217032433b 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -19,6 +19,11 @@
>  
>  #define SGX_MAX_EPC_SECTIONS		8
>  #define SGX_EEXTEND_BLOCK_SIZE		256
> +
> +/*
> + * Maximum number of pages to scan for reclaiming.
> + */
> +#define SGX_NR_TO_SCAN_MAX		32UL
>  #define SGX_NR_TO_SCAN			16
>  #define SGX_NR_LOW_PAGES		32
>  #define SGX_NR_HIGH_PAGES		64
> @@ -70,6 +75,8 @@ enum sgx_epc_page_state {
>  /* flag for pages owned by a sgx_encl struct */
>  #define SGX_EPC_OWNER_ENCL		BIT(4)
>  
> +struct sgx_epc_cgroup;
> +
>  struct sgx_epc_page {
>  	unsigned int section;
>  	u16 flags;
> @@ -79,6 +86,7 @@ struct sgx_epc_page {
>  		struct sgx_encl *encl;
>  	};
>  	struct list_head list;
> +	struct sgx_epc_cgroup *epc_cg;
>  };
>  
>  static inline void sgx_epc_page_reset_state(struct sgx_epc_page *page)
> @@ -127,6 +135,7 @@ struct sgx_epc_section {
>  	struct sgx_numa_node *node;
>  };
>  
> +extern unsigned long sgx_epc_total_pages;
>  extern struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
>  
>  static inline unsigned long sgx_get_epc_phys_addr(struct sgx_epc_page *page)
> @@ -175,8 +184,9 @@ void sgx_reclaim_direct(void);
>  void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
>  int sgx_drop_epc_page(struct sgx_epc_page *page);
>  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
> -size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age);
> -void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lrus, size_t nr_to_scan,
> +size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age,
> +			     struct sgx_epc_cgroup *epc_cg);
> +void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lrus, size_t *nr_to_scan,
>  			   struct list_head *dst);
>  bool sgx_epc_oom(struct sgx_epc_lru_lists *lrus);
>  
> -- 
> 2.25.1
> 


* Re: [PATCH v3 22/28] Docs/x86/sgx: Add description for cgroup support
  2023-07-12 23:01 ` [PATCH v3 22/28] Docs/x86/sgx: Add description for cgroup support Haitao Huang
  2023-07-13  0:10   ` Randy Dunlap
  2023-07-14 20:26   ` Haitao Huang
@ 2023-08-17 15:18   ` Mikko Ylinen
  2 siblings, 0 replies; 62+ messages in thread
From: Mikko Ylinen @ 2023-08-17 15:18 UTC (permalink / raw)
  To: Haitao Huang
  Cc: jarkko, dave.hansen, tj, linux-kernel, linux-sgx, cgroups,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Jonathan Corbet, kai.huang, reinette.chatre,
	Kristen Carlson Accardi, zhiquan1.li, seanjc, bagasdotme,
	linux-doc, zhanb, anakrish

On Wed, Jul 12, 2023 at 04:01:56PM -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@linux.intel.com>
> 
> Add initial documentation of how to regulate the distribution of
> SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup
> controller.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> Cc: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
> ---
>  Documentation/arch/x86/sgx.rst | 77 ++++++++++++++++++++++++++++++++++
>  1 file changed, 77 insertions(+)
> 
> diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst
> index 2bcbffacbed5..f6ca5594dcf2 100644
> --- a/Documentation/arch/x86/sgx.rst
> +++ b/Documentation/arch/x86/sgx.rst
> @@ -300,3 +300,80 @@ to expected failures and handle them as follows:
>     first call.  It indicates a bug in the kernel or the userspace client
>     if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
>     a return code other than 0.
> +
> +
> +Cgroup Support
> +==============
> +
> +The "sgx_epc" resource within the Miscellaneous cgroup controller regulates
> +distribution of SGX EPC memory, which is a subset of system RAM that
> +is used to provide SGX-enabled applications with protected memory,
> +and is otherwise inaccessible, i.e. shows up as reserved in
> +/proc/iomem and cannot be read/written outside of an SGX enclave.
> +
> +Although current systems implement EPC by stealing memory from RAM,
> +for all intents and purposes the EPC is independent from normal system
> +memory, e.g. must be reserved at boot from RAM and cannot be converted
> +between EPC and normal memory while the system is running.  The EPC is
> +managed by the SGX subsystem and is not accounted by the memory
> +controller.  Note that this is true only for EPC memory itself, i.e.
> +normal memory allocations related to SGX and EPC memory, e.g. the
> +backing memory for evicted EPC pages, are accounted, limited and
> +protected by the memory controller.
> +
> +Much like normal system memory, EPC memory can be overcommitted via
> +virtual memory techniques and pages can be swapped out of the EPC
> +to their backing store (normal system memory allocated via shmem).
> +The SGX EPC subsystem is analogous to the memory subsystem, and
> +it implements limit and protection models for EPC memory.
> +
> +SGX EPC Interface Files
> +-----------------------
> +
> +For a generic description of the Miscellaneous controller interface
> +files, please see Documentation/admin-guide/cgroup-v2.rst
> +
> +All SGX EPC memory amounts are in bytes unless explicitly stated
> +otherwise.  If a value which is not PAGE_SIZE aligned is written,
> +the actual value used by the controller will be rounded down to
> +the closest PAGE_SIZE multiple.
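
The round-down rule in this paragraph might be easier to follow with a
concrete case. A userspace sketch, assuming a 4096-byte page (the helper
name is hypothetical and this is not the controller's code):

```c
/* Assumed page size for the illustration; x86 base pages are 4 KiB. */
#define EPC_PAGE_SIZE 4096UL

/* Hypothetical helper: round a byte limit down to a page multiple. */
static unsigned long epc_limit_round_down(unsigned long bytes)
{
	/* Valid because EPC_PAGE_SIZE is a power of two. */
	return bytes & ~(EPC_PAGE_SIZE - 1);
}
```

So a write of 8191 would be treated as 4096, and anything below one
page as 0.
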
> +
> +  misc.capacity
> +        A read-only flat-keyed file shown only in the root cgroup.
> +        The sgx_epc resource will show the total amount of EPC
> +        memory available on the platform.
> +
> +  misc.current
> +        A read-only flat-keyed file shown in the non-root cgroups.
> +        The sgx_epc resource will show the current active EPC memory
> +        usage of the cgroup and its descendants. EPC pages that are
> +        swapped out to backing RAM are not included in the current count.
> +
> +  misc.max
> +        A read-write single value file which exists on non-root
> +        cgroups. The sgx_epc resource will show the EPC usage
> +        hard limit. The default is "max".
> +
> +        If a cgroup's EPC usage reaches this limit, EPC allocations,
> +        e.g. for page fault handling, will be blocked until EPC can
> +        be reclaimed from the cgroup.  If EPC cannot be reclaimed in
> +        a timely manner, reclaim will be forced, e.g. by ignoring LRU.

Document the behavior when reclaim cannot happen, e.g., for the vEPC
pages when a VMM tries to allocate more than misc.max.
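
For reference, this is how I read one pass of the charge loop in
patch 21, written as a userspace model. The function name and the retry
marker are made up, and ERESTARTSYS is defined locally because it is a
kernel-internal value. If this reading is right, a cgroup whose pages
are all unreclaimable (which I assume includes vEPC pages) gets -ENOMEM
as soon as a charge fails, and that is the case the text should spell
out.

```c
#include <errno.h>
#include <stdbool.h>

/* ERESTARTSYS is kernel-internal (512 in include/linux/errno.h). */
#define MODEL_ERESTARTSYS 512

/* Hypothetical marker: caller reclaims synchronously, then retries. */
#define MODEL_RETRY 1

/*
 * Model of one iteration of the charge loop, in the order the checks
 * appear in __sgx_epc_cgroup_try_charge().
 */
static int charge_step(bool charge_ok, bool lru_empty,
		       bool sig_pending, bool may_reclaim)
{
	if (charge_ok)
		return 0;			/* charged, done */
	if (lru_empty)
		return -ENOMEM;			/* nothing to reclaim */
	if (sig_pending)
		return -MODEL_ERESTARTSYS;
	if (!may_reclaim)
		return -EBUSY;			/* real code also queues work */

	return MODEL_RETRY;
}
```

Note that the empty-LRU check comes before the signal and reclaim
checks, so the result does not depend on whether the caller is allowed
to reclaim.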

> +
> +  misc.events
> +	A read-write flat-keyed file which exists on non-root cgroups.
> +	Writes to the file reset the event counters to zero.  A value
> +	change in this file generates a file modified event.
> +
> +	  max
> +		The number of times the cgroup has triggered a reclaim
> +		due to its EPC usage approaching (or exceeding) its max
> +		EPC boundary.
> +
> +Migration
> +---------
> +
> +Once an EPC page is charged to a cgroup (during allocation), it
> +remains charged to the original cgroup until the page is released
> +or reclaimed.  Migrating a process to a different cgroup doesn't
> +move the EPC charges that it incurred while in the previous cgroup
> +to its new cgroup.
> -- 
> 2.25.1
> 


end of thread (last message ~2023-08-17 15:19 UTC)

Thread overview: 62+ messages
2023-07-12 23:01 [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Haitao Huang
2023-07-12 23:01 ` [PATCH v3 01/28] x86/sgx: Store struct sgx_encl when allocating new VA pages Haitao Huang
2023-07-17 11:14   ` Jarkko Sakkinen
2023-07-12 23:01 ` [PATCH v3 02/28] x86/sgx: Add EPC page flags to identify owner type Haitao Huang
2023-07-17 12:41   ` Jarkko Sakkinen
2023-07-17 12:43     ` Jarkko Sakkinen
2023-07-12 23:01 ` [PATCH v3 03/28] x86/sgx: Add 'struct sgx_epc_lru_lists' to encapsulate lru list(s) Haitao Huang
2023-07-17 12:45   ` Jarkko Sakkinen
2023-07-17 13:23     ` Haitao Huang
2023-07-17 14:39       ` Jarkko Sakkinen
2023-07-24 10:04       ` Huang, Kai
2023-07-24 14:55         ` Haitao Huang
2023-07-24 23:31           ` Huang, Kai
2023-07-31 20:35             ` Haitao Huang
2023-07-12 23:01 ` [PATCH v3 04/28] x86/sgx: Use sgx_epc_lru_lists for existing active page list Haitao Huang
2023-07-17 12:47   ` Jarkko Sakkinen
2023-07-31 20:43     ` Haitao Huang
2023-07-12 23:01 ` [PATCH v3 05/28] x86/sgx: Store reclaimable epc pages in sgx_epc_lru_lists Haitao Huang
2023-07-12 23:01 ` [PATCH v3 06/28] x86/sgx: store unreclaimable EPC " Haitao Huang
2023-07-12 23:01 ` [PATCH v3 07/28] x86/sgx: Introduce EPC page states Haitao Huang
2023-07-12 23:01 ` [PATCH v3 08/28] x86/sgx: Introduce RECLAIM_IN_PROGRESS state Haitao Huang
2023-07-12 23:01 ` [PATCH v3 09/28] x86/sgx: Use a list to track to-be-reclaimed pages Haitao Huang
2023-07-12 23:01 ` [PATCH v3 10/28] x86/sgx: Allow reclaiming up to 32 pages, but scan 16 by default Haitao Huang
2023-07-12 23:01 ` [PATCH v3 11/28] x85/sgx: Return the number of EPC pages that were successfully reclaimed Haitao Huang
2023-07-29 12:47   ` Pavel Machek
2023-07-31 11:10     ` Jarkko Sakkinen
2023-07-12 23:01 ` [PATCH v3 12/28] x86/sgx: Add option to ignore age of page during EPC reclaim Haitao Huang
2023-07-12 23:01 ` [PATCH v3 13/28] x86/sgx: Prepare for multiple LRUs Haitao Huang
2023-07-12 23:01 ` [PATCH v3 14/28] x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup Haitao Huang
2023-07-12 23:01 ` [PATCH v3 15/28] x86/sgx: Add helper to grab pages from an arbitrary EPC LRU Haitao Huang
2023-07-12 23:01 ` [PATCH v3 16/28] x86/sgx: Add EPC OOM path to forcefully reclaim EPC Haitao Huang
2023-07-12 23:01 ` [PATCH v3 17/28] x86/sgx: fix a NULL pointer Haitao Huang
2023-07-17 12:48   ` Jarkko Sakkinen
2023-07-17 12:49     ` Jarkko Sakkinen
2023-07-17 13:14       ` Haitao Huang
2023-07-17 14:33         ` Jarkko Sakkinen
2023-07-17 15:49     ` Dave Hansen
2023-07-17 18:49       ` Haitao Huang
2023-07-17 18:52       ` Jarkko Sakkinen
2023-07-12 23:01 ` [PATCH v3 18/28] cgroup/misc: Fix an overflow Haitao Huang
2023-07-17 13:15   ` Jarkko Sakkinen
2023-07-12 23:01 ` [PATCH v3 19/28] cgroup/misc: Add per resource callbacks for CSS events Haitao Huang
2023-07-17 13:16   ` Jarkko Sakkinen
2023-07-12 23:01 ` [PATCH v3 20/28] cgroup/misc: Add SGX EPC resource type and export APIs for SGX driver Haitao Huang
2023-07-12 23:01 ` [PATCH v3 21/28] x86/sgx: Limit process EPC usage with misc cgroup controller Haitao Huang
2023-07-13  0:03   ` Randy Dunlap
2023-08-17 15:12   ` Mikko Ylinen
2023-07-12 23:01 ` [PATCH v3 22/28] Docs/x86/sgx: Add description for cgroup support Haitao Huang
2023-07-13  0:10   ` Randy Dunlap
2023-07-14 20:01     ` Haitao Huang
2023-07-14 20:26   ` Haitao Huang
2023-08-17 15:18   ` Mikko Ylinen
2023-07-12 23:01 ` [PATCH v3 23/28] selftests/sgx: Retry the ioctl()'s returned with EAGAIN Haitao Huang
2023-07-12 23:01 ` [PATCH v3 24/28] selftests/sgx: Move ENCL_HEAP_SIZE_DEFAULT to main.c Haitao Huang
2023-07-12 23:01 ` [PATCH v3 25/28] selftests/sgx: Use encl->encl_size in sigstruct.c Haitao Huang
2023-07-12 23:02 ` [PATCH v3 26/28] selftests/sgx: Include the dynamic heap size to the ELRANGE calculation Haitao Huang
2023-07-12 23:02 ` [PATCH v3 27/28] selftests/sgx: Add SGX selftest augment_via_eaccept_long Haitao Huang
2023-07-12 23:02 ` [PATCH v3 28/28] selftests/sgx: Add scripts for epc cgroup testing Haitao Huang
2023-07-17 11:02 ` [PATCH v3 00/28] Add Cgroup support for SGX EPC memory Jarkko Sakkinen
2023-07-24 19:09 ` Sohil Mehta
2023-07-25 17:16   ` Haitao Huang
2023-08-17 15:04 ` Mikko Ylinen
